Understanding String, str, and UTF-8 byte arrays in Rust

Hey there! Ever wonder how we’ve managed to squeeze every language from the intricate scripts of Mandarin to the hieroglyphs of ancient Egypt onto our digital screens? Or how a simple emoji can travel unscathed from a phone in Tokyo to a laptop in Buenos Aires? Well, the hero behind this linguistic harmony is something called Unicode. It’s like the Rosetta Stone of the digital age, and it’s pretty darn cool.

Now, if you’re dipping your toes into the ocean of coding, especially in a language like Rust, you’ll find that handling text data isn’t just about stringing letters together. We’ve got to talk about String, str, and those UTF-8 byte arrays—believe me, they're the backbone of text manipulation. It might sound like alphabet soup right now, but hang tight. We're about to unravel this mystery together, making it as easy as pie (or should I say, as simple as 'println!' in Rust?).

So, get your geek hat on, and let’s decode these concepts, pun intended!

Understanding UTF-8

UTF-8 stands for “Unicode Transformation Format — 8 bits”. It is a method for encoding Unicode characters as a sequence of bytes that is both space-efficient and backward compatible with ASCII. Here’s a detailed look at UTF-8 and how it represents data.

Unicode is a comprehensive character encoding system designed to represent text from languages around the world. It’s a universal standard that includes characters, symbols, and emojis, ensuring consistent encoding, representation, and handling of text across different digital platforms and systems. Here’s an in-depth look at Unicode:

Goals of Unicode

  1. Universality: To provide a unique number (code point) for every character, regardless of platform, program, or language.
  2. Efficiency: To support the efficient storage and transmission of text.
  3. Unification: To unify different language encoding schemes, which helps to avoid confusion and errors in text processing.

Code Points and Planes

  • Code Points: In Unicode, each character (including letters, symbols, control characters, etc.) is assigned a unique “code point”. A code point is essentially an integer value that maps to a particular character. For example, the character ‘A’ has a code point of U+0041, where “U+” signifies a Unicode code point, and “0041” is a hexadecimal number representing the character.
  • Planes: Unicode characters are divided into 17 “planes”, each containing 65,536 code points. The first plane (Plane 0), known as the Basic Multilingual Plane (BMP), contains the most commonly used characters. The other planes (1 through 16) are called “supplementary planes” and include less common, historical, and specialized characters.
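To make code points and planes concrete, here is a minimal sketch in Rust. The helper name `plane_of` is made up for this illustration; it relies only on the fact stated above that each plane holds 65,536 (0x10000) code points.

```rust
/// Returns the Unicode plane (0 through 16) that a character belongs to.
/// Each plane holds 0x10000 code points, so the plane number is the
/// code point shifted right by 16 bits.
fn plane_of(c: char) -> u32 {
    (c as u32) >> 16
}

fn main() {
    for c in ['A', '€', '😊'] {
        // `c as u32` gives the numeric code point of the character.
        println!("{} = U+{:04X}, plane {}", c, c as u32, plane_of(c));
    }
}
```

'A' and '€' land in Plane 0 (the BMP), while the emoji lives in Plane 1, one of the supplementary planes.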

Encoding Forms

Unicode defines several encoding forms that determine how code points are mapped into byte sequences:

  • UTF-32/UCS-4: A fixed-length encoding using 32 bits for each Unicode code point. It’s simple but not space-efficient because it always uses four bytes, even for ASCII characters that need only one byte.
  • UTF-16: A variable-length encoding that uses 2 bytes for characters in the BMP and 4 bytes for characters in the supplementary planes. It’s more space-efficient than UTF-32 but still uses more space than necessary for ASCII characters.
  • UTF-8: As explained earlier, UTF-8 is a variable-width encoding that uses 1 to 4 bytes per code point. It’s the most space-efficient for texts primarily composed of ASCII characters, which is why it’s widely used on the internet.
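If you want to see the size difference between the three encoding forms for yourself, a small sketch like the one below can help. The function name `encoded_sizes` is an assumption for this example; it uses only standard library methods.

```rust
/// Size in bytes of `s` in UTF-8, UTF-16, and UTF-32 (in that order).
fn encoded_sizes(s: &str) -> (usize, usize, usize) {
    let utf8 = s.len(); // str::len counts UTF-8 bytes
    let utf16 = s.encode_utf16().count() * 2; // 2 bytes per UTF-16 code unit
    let utf32 = s.chars().count() * 4; // 4 bytes per code point
    (utf8, utf16, utf32)
}

fn main() {
    for s in ["hello", "你好", "😊"] {
        let (utf8, utf16, utf32) = encoded_sizes(s);
        println!("{:?}: UTF-8 {} B, UTF-16 {} B, UTF-32 {} B", s, utf8, utf16, utf32);
    }
}
```

ASCII text is smallest in UTF-8, the two Chinese characters are smaller in UTF-16, and the emoji takes four bytes in all three.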

Why do we use UTF-8 and not UTF-16 or UTF-32?

The choice between UTF-8, UTF-16, and UTF-32 often boils down to a trade-off between the size of the data and the complexity of processing it. Here’s why UTF-8 has become the dominant encoding:

Size Efficiency

UTF-8 is incredibly efficient for texts that are primarily in English or consist of ASCII characters, as it represents these characters in just one byte. Given that a significant amount of computer data (especially code) is in English, UTF-8 saves a lot of space compared to UTF-16 and UTF-32, where the smallest unit is two and four bytes, respectively.

Compatibility with ASCII

UTF-8 is backward compatible with ASCII. This means that any ASCII text is also valid UTF-8 without any conversion, making it easy to work with legacy systems and software that was originally designed for ASCII.

Network Transmission

For data transmission, especially over the internet, bandwidth can be a concern. UTF-8 tends to use less data to represent the same characters compared to UTF-16 and UTF-32, particularly for Western languages. It’s been a crucial factor in its adoption for web pages, APIs, and data interchange formats like JSON and XML.

Incremental Processing

UTF-8 can be read and written as a stream of bytes because it is self-synchronizing: the high bits of every byte show whether it starts a character (0xxxxxxx, 110xxxxx, 1110xxxx, or 11110xxx) or continues one (10xxxxxx). This means that you can start reading at any point in a UTF-8 stream and find the next character boundary after skipping at most three bytes, which is helpful for robustness in transmission and storage systems.
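Here is a rough sketch of that resynchronization in Rust. The helpers `is_continuation` and `next_boundary` are hypothetical names for this example, and the sketch assumes the input is valid UTF-8. The standard library offers `str::is_char_boundary` for the same check on a `&str`.

```rust
/// Returns true if `byte` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
fn is_continuation(byte: u8) -> bool {
    (byte & 0b1100_0000) == 0b1000_0000
}

/// Starting from an arbitrary offset, skips forward to the next byte that
/// begins a character. Assumes `bytes` holds valid UTF-8.
fn next_boundary(bytes: &[u8], mut pos: usize) -> usize {
    while pos < bytes.len() && is_continuation(bytes[pos]) {
        pos += 1;
    }
    pos
}

fn main() {
    let text = "a€b"; // bytes: 61 | E2 82 AC | 62
    let start = next_boundary(text.as_bytes(), 2); // offset 2 is inside '€'
    println!("resynchronized at byte {}", start);
    println!("rest of the text: {}", &text[start..]);
}
```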

Endianness

Endianness refers to the order of byte serialization and is an issue for UTF-16 and UTF-32 because they are multi-byte encodings that can be written in both big-endian and little-endian formats. This requires a mechanism (like a Byte Order Mark — BOM) to indicate which order the bytes are in. UTF-8 does not have this problem, which simplifies its use across different platforms.
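A quick way to see the byte-order issue is to split one UTF-16 code unit into bytes both ways. This is only a sketch; the helper name `utf16_unit_bytes` is invented for the example.

```rust
/// Splits one UTF-16 code unit into its big-endian and little-endian byte pairs.
fn utf16_unit_bytes(unit: u16) -> ([u8; 2], [u8; 2]) {
    (unit.to_be_bytes(), unit.to_le_bytes())
}

fn main() {
    // '€' is U+20AC, which fits in a single UTF-16 code unit.
    let unit = "€".encode_utf16().next().expect("one code unit");
    let (be, le) = utf16_unit_bytes(unit);
    println!("UTF-16BE: {:02X?}", be);
    println!("UTF-16LE: {:02X?}", le);
    // UTF-8 has a single byte order, so there is only one possible sequence.
    println!("UTF-8:    {:02X?}", "€".as_bytes());
}
```

The same character yields two different byte sequences in UTF-16 depending on byte order, while its UTF-8 bytes are always the same.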

Wide Adoption and Support

The combination of these factors has led to UTF-8 being widely adopted and supported across many operating systems, programming languages, libraries, and applications. This widespread adoption creates a positive feedback loop — since everyone else is using UTF-8, it becomes the default choice for new systems and software.

However, UTF-8 is not always the best choice. For texts that consist mostly of East Asian scripts such as Chinese, Japanese, or Korean, UTF-16 may be more compact because it represents most of those characters in two bytes, where UTF-8 needs three. And UTF-32 can be preferable where memory is not an issue and fixed-width code points simplify indexing, although such situations are less common, and a single user-perceived character can still span several code points.

In conclusion, UTF-8 strikes a good balance between space efficiency for ASCII characters, compatibility, and simplicity for network transmission, which has led to its prevalence in many applications, especially on the web.

Normalization

Unicode normalization is the process of converting text into a consistent format. It’s essential because some characters can be represented in multiple ways. For example, the letter “é” can be represented as a single code point U+00E9 or as a combination of “e” (U+0065) and an acute accent (U+0301). Normalization ensures that these equivalent sequences are treated consistently in applications.
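The "é" example is easy to reproduce in Rust. The standard library does not normalize strings (a third-party crate is typically used for that), so this sketch only shows why the two forms are a problem; the helper name `measure` is made up for the illustration.

```rust
/// Returns (UTF-8 byte count, code point count) for a string.
fn measure(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let precomposed = "\u{00E9}"; // é as a single code point
    let decomposed = "e\u{0301}"; // e followed by a combining acute accent

    // Both render as "é", yet they are different sequences of code points.
    println!("{} {:?}", precomposed, measure(precomposed));
    println!("{} {:?}", decomposed, measure(decomposed));
    println!("equal without normalization: {}", precomposed == decomposed);
}
```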

Collation

Collation refers to the ordering of characters in a way that aligns with the conventions and expectations of human languages. Unicode provides guidelines for collation, which can be complex due to differences in how various languages handle sorting.
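To see why collation needs dedicated rules, consider what happens with Rust's default string ordering, which compares UTF-8 bytes (equivalent to comparing code points). The function name `sort_by_code_point` is an assumption for this sketch, and it is not a collation implementation.

```rust
/// Sorts words with the default `str` ordering, which compares UTF-8 bytes
/// rather than applying any language-aware collation rules.
fn sort_by_code_point(mut words: Vec<&str>) -> Vec<&str> {
    words.sort();
    words
}

fn main() {
    let words = vec!["apple", "Zebra", "\u{e9}clair"]; // the last word is "éclair"
    // Uppercase 'Z' (0x5A) sorts before lowercase 'a' (0x61), and 'é' sorts
    // after every ASCII letter, which is not what a dictionary would do.
    println!("{:?}", sort_by_code_point(words));
}
```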

Case Folding

Unicode also specifies case folding rules, which are similar to lowercase conversion but are designed for case-insensitive comparisons. Case folding maps characters in a way that disregards case, providing a consistent way to compare strings in a case-insensitive manner.
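Rust's standard library has lowercase conversion but, as far as I know, no full case folding, so the sketch below is only an approximation. The helper `eq_lowercased` is a hypothetical name, and the second comparison shows where lowercasing and case folding differ: folding maps "ß" to "ss", lowercasing does not.

```rust
/// Compares two strings after lowercasing both. This approximates a
/// case-insensitive comparison but is NOT full Unicode case folding.
fn eq_lowercased(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    println!("Rust vs RUST: {}", eq_lowercased("Rust", "RUST"));
    // Lowercasing keeps 'ß' as it is, so these two do not match here.
    println!("Straße vs STRASSE: {}", eq_lowercased("Straße", "STRASSE"));
    // For ASCII-only text, the standard library has a dedicated method.
    println!("ASCII only: {}", "rust".eq_ignore_ascii_case("RUST"));
}
```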

Analogies to Understand Unicode

  1. Unicode as a Library: Imagine Unicode as a vast library, where every book represents a different language or set of symbols, and every character in those books is a page with a unique page number (the code point).
  2. Planes as Floors in a Building: The Unicode planes can be likened to different floors in a large building. The ground floor (BMP) has the rooms (characters) we use every day, while the upper floors (supplementary planes) have more specialized suites (characters) that are used less frequently.
  3. Normalization as Standardizing Recipes: Different chefs might have their unique way of writing down a recipe for the same dish. Normalization is like creating a standard recipe format so that no matter who writes it, the ingredients and steps are presented consistently.

Understanding Unicode is key to developing software that is culturally and linguistically inclusive, ensuring that it can be used and appreciated by a global audience.

UTF-8’s Variable Width

The key feature of UTF-8 is that it is a variable-width encoding. This means that it uses only as many bytes as necessary for each character. This efficiency makes UTF-8 very popular for storing and transmitting text, especially for languages where many characters can be represented with 1-byte sequences.

Examples

  • The ASCII character ‘A’ (U+0041) is represented in UTF-8 simply as 0x41 (in hexadecimal notation), which is the same as its ASCII representation.
  • The Euro symbol ‘€’ (U+20AC) requires three bytes in UTF-8: 0xE2 0x82 0xAC.
  • An emoji like ‘😊’ (U+1F60A) is encoded with four bytes: 0xF0 0x9F 0x98 0x8A.
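The byte sequences above can be derived by hand. The following is a minimal sketch of the UTF-8 bit layout, written for illustration only; in real code you would use `char::encode_utf8` from the standard library, which the `main` function uses to cross-check the hand-rolled version. The function name `encode_utf8` here is my own.

```rust
/// Encodes one character as UTF-8 by hand, following the standard bit layout.
fn encode_utf8(c: char) -> Vec<u8> {
    let cp = c as u32;
    match cp {
        // 1 byte: 0xxxxxxx
        0x0000..=0x007F => vec![cp as u8],
        // 2 bytes: 110xxxxx 10xxxxxx
        0x0080..=0x07FF => vec![
            0b1100_0000 | (cp >> 6) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        0x0800..=0xFFFF => vec![
            0b1110_0000 | (cp >> 12) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        _ => vec![
            0b1111_0000 | (cp >> 18) as u8,
            0b1000_0000 | ((cp >> 12) & 0x3F) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
    }
}

fn main() {
    for c in ['A', '€', '😊'] {
        let mut buf = [0u8; 4];
        let expected: &str = c.encode_utf8(&mut buf); // standard library encoder
        let ours = encode_utf8(c);
        assert_eq!(ours, expected.as_bytes());
        println!("{} -> {:02X?}", c, ours);
    }
}
```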

Analogies

  1. Variable-width encoding as a train: Think of UTF-8 encoding like a train that can change its length depending on the number of passengers (characters). For ASCII characters, a small one-car train suffices. As the characters become more complex, the train adds more cars (bytes) to accommodate them.
  2. 1-byte sequences as postcards: ASCII characters in UTF-8 can be thought of as postcards that require minimal space (a single byte) and are simple enough to send as-is, without extra packaging.
  3. Multibyte sequences as parcels: Characters beyond ASCII are like parcels that require extra packaging (additional bytes). The more unusual the item (character), the more packaging layers (bytes) are needed.
  4. Compatibility with ASCII as a bilingual person: UTF-8’s compatibility with ASCII is like a bilingual person who speaks both English and another complex language. They can communicate easily in English (ASCII) using short, simple words (1-byte sequences). But for more nuanced concepts (non-ASCII characters), they switch to the complex language, using longer phrases (multi-byte sequences).

The String Type

In Rust, String is a growable, mutable, owned, UTF-8 encoded string type. When you want to create a string that can change at runtime, you use a String. You can think of String as a vector of bytes (Vec<u8>), but with a twist: it ensures that its contents are always valid UTF-8 sequences.

Creating a String

let mut s = String::new(); // create an empty String 
s.push_str("hello"); // push a &str onto the String

Analogy

Think of String as a bookshelf that you own. You can add books (push characters or strings), take them away, or rearrange them (mutate the String) as much as you like.
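To go with the bookshelf picture, here is a short sketch of a few common ways to mutate a String. The function name `build_greeting` is invented for the example; every method used is part of the standard String API.

```rust
/// Builds a greeting step by step to show in-place mutation of a String.
fn build_greeting() -> String {
    let mut s = String::new();
    s.push_str("hello"); // append a &str
    s.push(','); // append a single char
    s.push_str(" world");
    s.replace_range(0..1, "H"); // replace the byte range holding 'h'
    s
}

fn main() {
    let mut s = build_greeting();
    println!("{}", s);

    let last = s.pop(); // remove and return the last char
    println!("popped {:?}, left with {:?}", last, s);

    s.clear(); // remove all contents
    println!("empty: {}", s.is_empty());
}
```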

The str Type

The str type, often seen in its borrowed form &str, is an immutable sequence of UTF-8 bytes. It is commonly referred to as a "string slice". A &str is a borrowed view of string data that lives elsewhere, and it is the preferred way to pass strings to functions that only need to read them: callers can hand over a literal, a String, or a substring without allocating or giving up ownership.

Creating a &str

let s = "hello"; // this is a &str

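This &str points into the binary's read-only data, so it stays valid for the whole run of the program (its full type is &'static str). More generally, a &str is a shared reference, and that is why you cannot modify the text through it, wherever the bytes live.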

Analogy

You can think of &str as a bookmark. It doesn’t own the book (String); it just marks a place in it, referring to a specific passage or the whole text.

Converting Between String and &str

You can easily convert between a String and a &str:

let s = String::from("hello"); // Convert a &str to a String
let slice: &str = &s; // Borrow the String as a &str (without the annotation, &s is a &String)

Analogy

Imagine going to the library (borrowing a &str) vs. buying the book (String). When you borrow it, you can't change it and have to give it back, reflecting the borrowing and immutability concepts in Rust.
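There are several equivalent spellings for each direction of the conversion. The sketch below collects the common ones; the helper names `owned_copies` and `borrowed_views` are made up for this illustration.

```rust
/// Four equivalent ways to turn a borrowed &str into an owned String.
/// Each one allocates a new heap buffer and copies the bytes.
fn owned_copies(s: &str) -> [String; 4] {
    [String::from(s), s.to_string(), s.to_owned(), s.into()]
}

/// Three equivalent ways to borrow a String as a &str. None of them allocate.
fn borrowed_views(s: &String) -> [&str; 3] {
    let coerced: &str = s; // deref coercion: &String becomes &str
    [s.as_str(), &s[..], coerced]
}

fn main() {
    let owned = String::from("hello");
    for copy in owned_copies("hello") {
        println!("owned: {}", copy);
    }
    for view in borrowed_views(&owned) {
        println!("borrowed: {}", view);
    }
}
```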




UTF-8 Byte Arrays

Sometimes you need to interact with raw bytes. In Rust, a UTF-8 encoded String or &str can be viewed as a byte array, which is useful when you need to interface with systems or libraries that don't understand Rust strings but do understand bytes.

Example: String to Bytes

let s = String::from("hello"); 
let bytes: &[u8] = s.as_bytes(); // Borrow the String's UTF-8 bytes as a byte slice (nothing is copied)

Analogy

This is like getting the numeric codes behind each letter in your book (String). For ASCII text there is one byte per letter, and that byte is the letter's ASCII code; characters outside ASCII take two to four bytes each.
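Going from bytes back to text is the direction that can fail, because not every byte sequence is valid UTF-8. Here is a small sketch of the checked conversion; the helper name `decode` is an assumption for the example.

```rust
/// Checked conversion from raw bytes to text: returns None unless the
/// bytes are valid UTF-8.
fn decode(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    let s = String::from("h\u{e9}llo"); // "héllo"
    let bytes = s.as_bytes();
    println!("{} chars, {} bytes: {:02X?}", s.chars().count(), bytes.len(), bytes);
    println!("round trip: {:?}", decode(bytes));

    let broken: [u8; 3] = [0x68, 0xFF, 0x6C]; // 0xFF never appears in valid UTF-8
    println!("invalid input: {:?}", decode(&broken));
    // The lossy variant substitutes U+FFFD for the invalid byte instead of failing.
    println!("lossy: {}", String::from_utf8_lossy(&broken));
}
```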

When to Use Each

  • Use String when you need owned, mutable data. For example, when you're building a string or modifying it at runtime.
  • Use &str when you just need to read or pass around string data without ownership. This is common in function arguments and for static strings.
  • Use byte slices and vectors (&[u8] or Vec<u8>) when you need to operate at the byte level, such as when dealing with files or network data, or when interfacing with non-Rust codebases or libraries.

In Practice

fn greet(name: &str) -> String { 
    format!("Hello, {}!", name) // format! macro returns a String 
} 
 
fn main() { 
    let name = "Alice"; 
    let greeting = greet(name); 
    println!("{}", greeting); 
 
    let bytes = greeting.as_bytes(); 
    for byte in bytes { 
        println!("{}", byte); // prints the UTF-8 bytes of the string 
    } 
}

In the example above:

  1. We define a function greet that takes a &str and returns a String.
  2. Inside main, we call greet with a &str literal and receive a String in return.
  3. We then print out each byte of the String as a UTF-8 byte array.

Rust Memory Allocation for Different String Types

Allocation of String Types

String

A String is a heap-allocated data structure. It is essentially a wrapper around a Vec<u8>, which represents a buffer of UTF-8 bytes. Since the size of the string can change, its contents need to be allocated on the heap. The String value itself (a pointer, a length, and a capacity) typically lives on the stack, while the bytes it points to are on the heap.

let mut s = String::from("hello"); // 's' is on the stack, its data is on the heap

When you mutate the string, for example by using push_str to append more characters, Rust may need to allocate more space on the heap to accommodate the changes.
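You can watch this happen by comparing length and capacity as a String grows. This is a minimal sketch; the helper `appends_in_place` is a hypothetical name, and the exact capacity after a reallocation depends on the allocator and the growth strategy, so the sketch only prints it.

```rust
/// Appends `text` to a String created with `capacity` bytes reserved and
/// reports whether the heap buffer stayed at the same address.
fn appends_in_place(capacity: usize, text: &str) -> bool {
    let mut s = String::with_capacity(capacity);
    let before = s.as_ptr();
    s.push_str(text);
    s.as_ptr() == before
}

fn main() {
    let mut s = String::with_capacity(16);
    println!("len {}, capacity {}", s.len(), s.capacity());

    s.push_str("hello"); // fits in the reserved space
    println!("len {}, capacity {}", s.len(), s.capacity());

    s.push_str(&"x".repeat(100)); // exceeds the reserved space, so the buffer must grow
    println!("len {}, capacity {}", s.len(), s.capacity());

    println!("short append reused the buffer: {}", appends_in_place(16, "hello"));
}
```

If you know roughly how long a string will get, reserving space up front with `String::with_capacity` avoids repeated reallocations.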

Allocation of str Types

&str

The &str type is an immutable slice that references a sequence of UTF-8 bytes. The str itself doesn’t have a size known at compile time—it’s a dynamically-sized type (DST). Thus, you can’t have a plain str on the stack. Instead, you use it as &str, which is a reference to a str.

Here’s how &str can be allocated:

  • Static memory: When you have a string literal in your Rust program, the actual bytes of that string are embedded directly in the final binary, in its read-only data section. A &str can be a reference to this static data, and it has the 'static lifetime.
let s = "hello"; // 's' is a reference on the stack to data in the binary's read-only memory
  • Heap: If you take a slice of a String, you get a &str that points to the data on the heap. Here, the &str itself (the reference) is on the stack, but it points to the heap-allocated buffer of the String.
let s = String::from("hello");
let slice = &s[..]; // 'slice' is on the stack, pointing to data on the heap
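A &str is just a pointer plus a length, wherever the bytes live: a literal's bytes sit in the binary's static data, and a slice of a String points into that String's heap buffer. The sketch below checks the second case; `offset_within` is a made-up helper that assumes the slice really was taken from the given owner. The printed size of `&str` is what current compilers produce and is not something this article's source states.

```rust
use std::mem::size_of;

/// Byte offset of `slice` from the start of `owner`.
/// Assumes `slice` was taken from `owner`.
fn offset_within(owner: &str, slice: &str) -> usize {
    slice.as_ptr() as usize - owner.as_ptr() as usize
}

fn main() {
    let literal: &'static str = "hello"; // bytes live in the binary's static data
    let owned = String::from("hello"); // bytes live on the heap
    let slice: &str = &owned[1..4]; // "ell", a view into owned's buffer

    println!("{} / {}", literal, slice);
    println!(
        "slice starts {} byte(s) into the String's buffer and is {} bytes long",
        offset_within(&owned, slice),
        slice.len()
    );
    println!("&str: {} bytes, usize: {} bytes", size_of::<&str>(), size_of::<usize>());
}
```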

Allocation of UTF-8 Byte Arrays

When dealing with raw bytes, you may work with [u8] or Vec<u8>. Here's how they are allocated:

Stack: A fixed-size array of bytes, like [u8; 5], is allocated on the stack.

let bytes: [u8; 5] = [104, 101, 108, 108, 111]; // stack-allocated fixed-size array

Heap: If you have a Vec<u8>, the Vec structure is on the stack but the data it points to is on the heap, much like a String.

let bytes: Vec<u8> = vec![104, 101, 108, 108, 111]; // 'bytes' is on the stack, its data is on the heap (without the annotation this would be a Vec<i32>)
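The practical difference between the two is that a fixed-size array can never grow, while a Vec<u8> can. Here is a small sketch; the helper name `extend_with` is invented for this example.

```rust
use std::mem::size_of_val;

/// Copies a borrowed byte slice into a heap-backed Vec<u8> and appends
/// one more byte, which a fixed-size array could not do.
fn extend_with(bytes: &[u8], extra: u8) -> Vec<u8> {
    let mut v = bytes.to_vec();
    v.push(extra);
    v
}

fn main() {
    let array: [u8; 5] = [104, 101, 108, 108, 111];
    let literal: &[u8; 5] = b"hello"; // byte string literal with the same five bytes
    assert_eq!(&array, literal);
    println!("the array occupies {} bytes inline", size_of_val(&array));

    let grown = extend_with(&array, b'!');
    println!("{}", String::from_utf8_lossy(&grown));
}
```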

Rust's ownership system ensures that heap-allocated memory is automatically freed when the owner of the data goes out of scope, so you never free it by hand and leaks are rare in ordinary code.

When to use String versus str

Choosing between String and &str in Rust depends on several factors such as ownership, lifetime, and mutability of the data you are working with. Here's a guideline on when to use each:

Use String when:

You need ownership: If your data needs to be owned by a particular variable, for instance when you’re returning a string from a function and want to transfer ownership outside that function, you should use String.

fn create_welcome_message(name: &str) -> String {
    format!("Welcome, {}!", name) // Returns an owned String
}

You need to modify or mutate the string: If you’re appending characters, concatenating strings, or otherwise changing the content, String is your go-to since &str is immutable.

let mut s = String::from("hello");
s.push_str(", world!"); // Mutates the string by appending

The size is unknown or variable at compile time: Whenever you build a string dynamically, such as reading from a file or user input, you cannot know the size at compile time, so you use a String.

use std::io;

let mut s = String::new();
io::stdin().read_line(&mut s).expect("Failed to read line"); // Reads user input into a String

Use &str when:

You are dealing with string literals or fixed strings: Since string literals are known at compile time and are immutable, they are naturally &str. They are fast and efficient because they are embedded in the binary and don’t require allocation on the heap.

let s = "This is a fixed string"; // This is a string slice (&str)

You need to borrow a string: If you just need to read or inspect the string without taking ownership, use a &str. This is very common in function arguments that don't need to mutate or keep the string.

fn print_message(message: &str) {
    println!("{}", message); // Borrowing a string slice, not taking ownership
}

Performance considerations: Passing a &str only copies a pointer and a length, so the callee can read the text without the caller allocating or cloning a String. If a function can work with a borrowed slice, it's usually a good default choice.

fn string_length(s: &str) -> usize {
    s.len() // Just borrows and returns the length of the string slice in bytes
}

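You are slicing strings: When you take a substring of another string (which can be a String or &str), you are creating a &str. The range is given in bytes and must fall on character boundaries, otherwise the slice operation panics.

let s = String::from("hello");
let slice = &s[0..2]; // slice is a &str containing "he"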
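Because slice ranges count bytes rather than characters, slicing text that contains multi-byte characters needs a little care. One way to handle it is sketched below; `first_chars` is a hypothetical helper, and `str::get` is the non-panicking counterpart to the `&s[a..b]` syntax.

```rust
/// Returns a slice holding the first `n` characters of `s`, or all of `s`
/// if it has fewer than `n` characters.
fn first_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `byte_idx` is where character number `n` starts, so it is a safe cut point.
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

fn main() {
    let s = "h\u{e9}llo"; // "héllo": 'é' occupies bytes 1 and 2

    // `get` returns None instead of panicking when a range splits a character.
    println!("{:?}", s.get(0..2));
    println!("{:?}", s.get(0..3));
    println!("{}", first_chars(s, 2));
}
```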

For flexible function parameters: A function that takes a &str argument accepts string literals directly and Strings by reference, because a &String automatically coerces to &str (deref coercion).

fn takes_slice(s: &str) {
    // ...
}

let owned_string = String::from("hello");
let string_literal = "hello";
takes_slice(&owned_string); // Works with &String
takes_slice(string_literal); // Works with string literals
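If you want a function that is generic in the literal sense and also accepts an owned String by value, one common option is an `AsRef<str>` bound. This is a sketch of that alternative, not something the &str version requires; the function name `byte_len` is made up.

```rust
/// Accepts anything that can be viewed as a &str: string literals,
/// &String, and String passed by value.
fn byte_len<S: AsRef<str>>(text: S) -> usize {
    text.as_ref().len()
}

fn main() {
    let owned = String::from("hello");
    println!("{}", byte_len("hello")); // &str
    println!("{}", byte_len(&owned)); // &String
    println!("{}", byte_len(owned)); // String, moved into the function
}
```

For most functions a plain &str parameter is simpler and compiles to a single version of the function, so the generic form is worth it mainly when accepting owned values is convenient for callers.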

Remember, using &str whenever possible can improve the performance of your application, as it avoids unnecessary memory allocation. However, String becomes necessary when you need to own or modify the string data. Rust's type system and borrowing rules help to enforce the proper use of these string types, guiding you towards writing safe and efficient code.

Conclusion

Phew! That was quite the journey, wasn’t it? From understanding the different flavors of strings in Rust to decoding the intricacies of UTF-8, we’ve covered a lot of ground. Think of String and str as the yin and yang of text in Rust, each with its place and purpose. And let's not forget our byte-sized buddy, UTF-8, who keeps our text consistent across platforms worldwide.

Remember, Unicode is like the DNA of digital text — it’s complex, but without it, we wouldn’t have the rich, diverse communication we enjoy across our global village of gadgets and gizmos. With the knowledge you’ve gained, you’re now equipped to venture forth and craft some truly universal code that speaks in every language under the sun. So go ahead, make your mark in this polyglot world of programming, and let your Rust code sing in perfect harmony with Unicode!

And when you next send a smiley to a friend or write that polyglot application, tip your hat to the silent, steadfast standard that made it all possible — Unicode, with a little help from its trusty sidekick, UTF-8. Happy coding!


All the best,

Luis Soares

CTO | Tech Lead | Senior Software Engineer | Cloud Solutions Architect | Rust 🦀 | Golang | Java | ML AI & Statistics | Web3 & Blockchain
