Understanding String, str, and UTF-8 byte arrays in Rust
Hey there! Ever wonder how we’ve managed to squeeze every language from the intricate scripts of Mandarin to the hieroglyphs of ancient Egypt onto our digital screens? Or how a simple emoji can travel unscathed from a phone in Tokyo to a laptop in Buenos Aires? Well, the hero behind this linguistic harmony is something called Unicode. It’s like the Rosetta Stone of the digital age, and it’s pretty darn cool.
Now, if you’re dipping your toes into the ocean of coding, especially in a language like Rust, you’ll find that handling text data isn’t just about stringing letters together. We’ve got to talk about String, str, and those UTF-8 byte arrays. Believe me, they're the backbone of text manipulation. It might sound like alphabet soup right now, but hang tight. We're about to unravel this mystery together, making it as easy as pie (or should I say, as simple as 'println!' in Rust?).
So, get your geek hat on, and let’s decode these concepts, pun intended!
Understanding UTF-8
UTF-8 stands for “Unicode Transformation Format — 8 bits”. It is a method for encoding Unicode characters as a sequence of bytes that is both space-efficient and backward compatible with ASCII. Here’s a detailed look at UTF-8 and how it represents data.
Unicode is a comprehensive character encoding system designed to represent text from languages around the world. It’s a universal standard that includes characters, symbols, and emojis, ensuring consistent encoding, representation, and handling of text across different digital platforms and systems. Here’s an in-depth look at Unicode:
Goals of Unicode
- Universality: To provide a unique number (code point) for every character, regardless of platform, program, or language.
- Efficiency: To support the efficient storage and transmission of text.
- Unification: To unify different language encoding schemes, which helps to avoid confusion and errors in text processing.
Code Points and Planes
- Code Points: In Unicode, each character (including letters, symbols, control characters, etc.) is assigned a unique “code point”. A code point is essentially an integer value that maps to a particular character. For example, the character ‘A’ has a code point of U+0041, where “U+” signifies a Unicode code point, and “0041” is a hexadecimal number representing the character.
- Planes: Unicode characters are divided into 17 “planes”, each containing 65,536 code points. The first plane (Plane 0), known as the Basic Multilingual Plane (BMP), contains the most commonly used characters. The other planes (1 through 16) are called “supplementary planes” and include less common, historical, and specialized characters.
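To make code points and planes concrete, here is a small Rust sketch. The helper name plane_of is made up for this illustration; it relies only on the fact that each plane holds 0x10000 code points, so the plane number is the code point shifted right by 16 bits.

```rust
/// Returns the Unicode plane (0 through 16) that a character belongs to.
/// `plane_of` is an illustrative helper, not a standard library function.
fn plane_of(c: char) -> u32 {
    // Each plane spans 0x10000 (65,536) code points.
    (c as u32) >> 16
}

fn main() {
    for c in "A€😊".chars() {
        // Casting a char to u32 yields its code point.
        println!("{} is U+{:04X} in plane {}", c, c as u32, plane_of(c));
    }
}
```

The letter and the Euro sign land in the BMP (plane 0), while the emoji lives in plane 1.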
Encoding Forms
Unicode defines several encoding forms that determine how code points are mapped into byte sequences:
- UTF-32/UCS-4: A fixed-length encoding using 32 bits for each Unicode code point. It’s simple but not space-efficient because it always uses four bytes, even for ASCII characters that need only one byte.
- UTF-16: A variable-length encoding that uses 2 bytes for characters in the BMP and 4 bytes for characters in the supplementary planes. It’s more space-efficient than UTF-32 but still uses more space than necessary for ASCII characters.
- UTF-8: As explained earlier, UTF-8 is a variable-width encoding that uses 1 to 4 bytes per code point. It’s the most space-efficient for texts primarily composed of ASCII characters, which is why it’s widely used on the internet.
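The size trade-offs between the three encoding forms are easy to measure. The sketch below uses a hypothetical helper, encoded_sizes, that counts how many bytes the same text would occupy in each form; it only computes sizes and does not produce UTF-32 data.

```rust
/// Returns the size in bytes of `s` when encoded as (UTF-8, UTF-16, UTF-32).
/// `encoded_sizes` is an illustrative helper name.
fn encoded_sizes(s: &str) -> (usize, usize, usize) {
    let utf8 = s.len(); // str::len counts UTF-8 bytes
    let utf16 = s.encode_utf16().count() * 2; // 2 bytes per UTF-16 code unit
    let utf32 = s.chars().count() * 4; // 4 bytes per code point
    (utf8, utf16, utf32)
}

fn main() {
    for text in ["hello", "こんにちは", "😊"].iter() {
        let (utf8, utf16, utf32) = encoded_sizes(text);
        println!("{}: UTF-8 {} B, UTF-16 {} B, UTF-32 {} B", text, utf8, utf16, utf32);
    }
}
```

ASCII text is smallest in UTF-8, the Japanese greeting is smallest in UTF-16, and the emoji costs four bytes everywhere.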
Why do we use UTF-8 and not UTF-16 or UTF-32?
The choice between UTF-8, UTF-16, and UTF-32 often boils down to a trade-off between the size of the data and the complexity of processing it. Here’s why UTF-8 has become the dominant encoding:
Size Efficiency
UTF-8 is incredibly efficient for texts that are primarily in English or consist of ASCII characters, as it represents these characters in just one byte. Given that a significant amount of computer data (especially code) is in English, UTF-8 saves a lot of space compared to UTF-16 and UTF-32, where the smallest unit is two and four bytes, respectively.
Compatibility with ASCII
UTF-8 is backward compatible with ASCII. This means that any ASCII text is also valid UTF-8 without any conversion, making it easy to work with legacy systems and software that was originally designed for ASCII.
Network Transmission
For data transmission, especially over the internet, bandwidth can be a concern. UTF-8 tends to use less data to represent the same characters compared to UTF-16 and UTF-32, particularly for Western languages. It’s been a crucial factor in its adoption for web pages, APIs, and data interchange formats like JSON and XML.
Incremental Processing
UTF-8 can be read and written as a plain stream of bytes, and it is self-synchronizing: lead bytes and continuation bytes use distinct bit patterns (continuation bytes always look like 10xxxxxx), so a single byte tells you whether it starts a character. This means that you can start reading at any point in a UTF-8 stream and quickly synchronize with character boundaries, which is helpful for robustness in transmission and storage systems.
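Here is a rough sketch of that resynchronization, assuming we have been dropped at an arbitrary byte offset. The helper names is_continuation and sync_to_boundary are invented for the example; the standard library's str::is_char_boundary is used only to double-check the result.

```rust
/// Returns true if `byte` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
fn is_continuation(byte: u8) -> bool {
    (byte & 0b1100_0000) == 0b1000_0000
}

/// Steps backwards from `index` to the nearest byte that starts a character.
/// `sync_to_boundary` is an illustrative helper name.
fn sync_to_boundary(bytes: &[u8], mut index: usize) -> usize {
    while index > 0 && index < bytes.len() && is_continuation(bytes[index]) {
        index -= 1;
    }
    index
}

fn main() {
    let text = "a€b"; // bytes: 61 | E2 82 AC | 62
    let bytes = text.as_bytes();
    // Offset 2 lands in the middle of '€'; stepping back finds its lead byte.
    let start = sync_to_boundary(bytes, 2);
    println!("resynchronized at byte {}", start);
    assert!(text.is_char_boundary(start));
}
```

Because a character is at most four bytes long, the loop never has to step back more than three positions in valid UTF-8.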
Endianness
Endianness refers to the order of byte serialization and is an issue for UTF-16 and UTF-32 because their code units are two and four bytes wide, and those bytes can be written in either big-endian or little-endian order. This requires a mechanism (like a Byte Order Mark, or BOM) to indicate which order the bytes are in. UTF-8 does not have this problem because its code unit is a single byte, which simplifies its use across different platforms.
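To see the byte-order question in practice, the sketch below serializes the same text as UTF-16 in both orders and compares it with UTF-8. The helper utf16_bytes is a made-up name for this example, and it deliberately writes no BOM.

```rust
/// Serializes `s` as UTF-16 bytes in the requested byte order (no BOM is written).
/// `utf16_bytes` is an illustrative helper name.
fn utf16_bytes(s: &str, big_endian: bool) -> Vec<u8> {
    let mut out = Vec::new();
    for unit in s.encode_utf16() {
        let pair = if big_endian {
            unit.to_be_bytes()
        } else {
            unit.to_le_bytes()
        };
        out.extend_from_slice(&pair);
    }
    out
}

fn main() {
    let euro = "€"; // U+20AC
    println!("UTF-16 BE: {:02X?}", utf16_bytes(euro, true));
    println!("UTF-16 LE: {:02X?}", utf16_bytes(euro, false));
    // UTF-8 has exactly one byte sequence for the same character.
    println!("UTF-8:     {:02X?}", euro.as_bytes());
}
```

A reader that guesses the wrong order for the UTF-16 data would decode U+AC20 instead of U+20AC, which is the kind of mix-up the BOM exists to prevent.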
Wide Adoption and Support
The combination of these factors has led to UTF-8 being widely adopted and supported across many operating systems, programming languages, libraries, and applications. This widespread adoption creates a positive feedback loop — since everyone else is using UTF-8, it becomes the default choice for new systems and software.
However, UTF-8 is not always the best choice. For texts that consist heavily of East Asian scripts such as Chinese, Japanese, or Korean, UTF-16 may be more efficient because it represents most of those characters in two bytes, where UTF-8 needs three. And UTF-32 can be preferable in situations where memory is not an issue and a fixed width per code point simplifies text processing, although such situations are less common.
In conclusion, UTF-8 strikes a good balance between space efficiency for ASCII characters, compatibility, and simplicity for network transmission, which has led to its prevalence in many applications, especially on the web.
Normalization
Unicode normalization is the process of converting text into a consistent format. It’s essential because some characters can be represented in multiple ways. For example, the letter “é” can be represented as a single code point U+00E9 or as a combination of “e” (U+0065) and an acute accent (U+0301). Normalization ensures that these equivalent sequences are treated consistently in applications.
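The two spellings of “é” are easy to reproduce in Rust. Note that the standard library compares strings byte by byte and does not normalize them; doing the actual normalization would need an external crate (unicode-normalization is one option), which this sketch leaves out. The helper name shape is invented for the example.

```rust
/// Returns (number of code points, number of UTF-8 bytes) for `s`.
/// `shape` is an illustrative helper name.
fn shape(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

fn main() {
    let precomposed = "\u{00E9}"; // é as a single code point
    let decomposed = "e\u{0301}"; // e followed by a combining acute accent

    // Both render as "é", but the underlying data differs.
    println!("{} -> {:?}", precomposed, shape(precomposed));
    println!("{} -> {:?}", decomposed, shape(decomposed));
    println!("equal without normalization: {}", precomposed == decomposed);
}
```

This is why user-facing comparisons, such as matching a search term or a username, usually normalize both sides first.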
Collation
Collation refers to the ordering of characters in a way that aligns with the conventions and expectations of human languages. Unicode provides guidelines for collation, which can be complex due to differences in how various languages handle sorting.
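For contrast, here is what sorting looks like without any collation rules. Rust's default ordering for strings compares their UTF-8 bytes, which is fast but does not match what a dictionary would do. The helper sort_default is a made-up name; language-aware sorting would need a collation library, which is outside this sketch.

```rust
/// Sorts words with Rust's default string ordering, which compares UTF-8 bytes.
/// `sort_default` is an illustrative helper name.
fn sort_default(words: &[&str]) -> Vec<String> {
    let mut sorted: Vec<String> = words.iter().map(|w| w.to_string()).collect();
    sorted.sort();
    sorted
}

fn main() {
    let sorted = sort_default(&["zebra", "éclair", "apple", "Banana"]);
    // Uppercase sorts before lowercase, and the accented word sorts last.
    println!("{:?}", sorted);
}
```

A French or English reader would expect “éclair” between “Banana” and “zebra”, which is exactly the gap collation rules are meant to close.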
Case Folding
Unicode also specifies case folding rules, which are similar to lowercase conversion but are designed for case-insensitive comparisons. Case folding maps characters in a way that disregards case, providing a consistent way to compare strings in a case-insensitive manner.
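Rust's standard library offers lowercase conversion but not full Unicode case folding, so a common approximation is to lowercase both sides before comparing. The sketch below shows where that approximation and true case folding part ways; eq_lowercase is a hypothetical helper name.

```rust
/// Approximates a case-insensitive comparison by lowercasing both sides.
/// This is simpler than full Unicode case folding and can disagree with it.
/// `eq_lowercase` is an illustrative helper name.
fn eq_lowercase(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    println!("{}", eq_lowercase("Rust", "RUST"));
    println!("{}", eq_lowercase("ÉCOLE", "école"));
    // Lowercasing keeps 'ß' as it is, while full case folding maps it to "ss",
    // so this pair does not match here even though folding would match it.
    println!("{}", eq_lowercase("Straße", "STRASSE"));
    // For ASCII-only data the standard library has a cheaper built-in check.
    println!("{}", "Rust".eq_ignore_ascii_case("RUST"));
}
```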
Analogies to Understand Unicode
- Unicode as a Library: Imagine Unicode as a vast library, where every book represents a different language or set of symbols, and every character in those books is a page with a unique page number (the code point).
- Planes as Floors in a Building: The Unicode planes can be likened to different floors in a large building. The ground floor (BMP) has the rooms (characters) we use every day, while the upper floors (supplementary planes) have more specialized suites (characters) that are used less frequently.
- Normalization as Standardizing Recipes: Different chefs might have their unique way of writing down a recipe for the same dish. Normalization is like creating a standard recipe format so that no matter who writes it, the ingredients and steps are presented consistently.
Understanding Unicode is key to developing software that is culturally and linguistically inclusive, ensuring that it can be used and appreciated by a global audience.
UTF-8’s Variable Width
The key feature of UTF-8 is that it is a variable-width encoding. This means that it uses only as many bytes as necessary for each character. This efficiency makes UTF-8 very popular for storing and transmitting text, especially for languages where many characters can be represented with 1-byte sequences.
Examples
- The ASCII character ‘A’ (U+0041) is represented in UTF-8 simply as 0x41 (in hexadecimal notation), which is the same as its ASCII representation.
- The Euro symbol ‘€’ (U+20AC) requires three bytes in UTF-8: 0xE2 0x82 0xAC.
- An emoji like ‘😊’ (U+1F60A) is encoded with four bytes: 0xF0 0x9F 0x98 0x8A.
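You can check these byte sequences yourself. The sketch below encodes single characters with char::encode_utf8 from the standard library; the wrapper name utf8_bytes is made up for the example.

```rust
/// Returns the UTF-8 bytes of a single character.
/// `utf8_bytes` is an illustrative helper name.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // a char never needs more than 4 bytes in UTF-8
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    for c in "A€😊".chars() {
        println!("{} (U+{:04X}): {:02X?}", c, c as u32, utf8_bytes(c));
    }
}
```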
Analogies
- Variable-width encoding as a train: Think of UTF-8 encoding like a train that can change its length depending on the number of passengers (characters). For ASCII characters, a small one-car train suffices. As the characters become more complex, the train adds more cars (bytes) to accommodate them.
- 1-byte sequences as postcards: ASCII characters in UTF-8 can be thought of as postcards that require minimal space (a single byte) and are simple enough to send as-is, without extra packaging.
- Multibyte sequences as parcels: Characters beyond ASCII are like parcels that require extra packaging (additional bytes). The more unusual the item (character), the more packaging layers (bytes) are needed.
- Compatibility with ASCII as a bilingual person: UTF-8’s compatibility with ASCII is like a bilingual person who speaks both English and another complex language. They can communicate easily in English (ASCII) using short, simple words (1-byte sequences). But for more nuanced concepts (non-ASCII characters), they switch to the complex language, using longer phrases (multi-byte sequences).
The String Type
In Rust, String is a growable, mutable, owned, UTF-8 encoded string type. When you want to create a string that can change at runtime, you use a String. You can think of String as a vector of bytes (Vec<u8>), but with a twist: it ensures that its contents are always valid UTF-8 sequences.
Creating a String
let mut s = String::new(); // create an empty String
s.push_str("hello"); // push a &str onto the String
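Building on the snippet above, here is a slightly longer sketch of growing and shrinking a String. The function name build_greeting is invented for the example. It also shows that len() counts bytes rather than characters, which matters as soon as the text leaves ASCII.

```rust
/// Builds a small String step by step.
/// `build_greeting` is an illustrative helper name.
fn build_greeting() -> String {
    let mut s = String::new();
    s.push_str("hello"); // append a &str
    s.push('!'); // append a single char
    s.insert(0, '¡'); // insert a char at byte index 0
    s
}

fn main() {
    let mut s = build_greeting();
    // '¡' takes two bytes in UTF-8, so the byte count exceeds the char count.
    println!("{} has {} bytes and {} chars", s, s.len(), s.chars().count());
    let last = s.pop(); // removes and returns the last char, if any
    println!("popped {:?}, left with {}", last, s);
}
```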
Analogy
Think of String as a bookshelf that you own. You can add books (push characters or strings), take them away, or rearrange them (mutate the String) as much as you like.
The str Type
The str type, often seen in its borrowed form &str, is an immutable sequence of UTF-8 bytes. It is commonly referred to as a "string slice". A &str is a reference to string data that lives somewhere else, and it is the preferred way to pass strings around in Rust: the caller keeps ownership and no text is copied, which makes it cheaper than cloning an owned String.
Creating a &str
let s = "hello"; // this is a &str
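This &str is a slice pointing into the binary's read-only memory, so it stays valid for the whole run of the program (its lifetime is 'static). Like any shared reference in Rust, a &str does not let you modify the data it points to.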
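Because a &str must always hold valid UTF-8, a slice can only start and end on character boundaries. The sketch below uses str::get, which returns None instead of panicking when a range would cut a character in half; the wrapper name prefix is made up for the example.

```rust
/// Returns the first `n` bytes of `s` as a slice, or None if the range is
/// out of bounds or would split a character. `prefix` is an illustrative name.
fn prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

fn main() {
    // String literals have the 'static lifetime.
    let literal: &'static str = "héllo";
    println!("{:?}", prefix(literal, 1)); // the single byte of 'h'
    println!("{:?}", prefix(literal, 2)); // None: byte 2 is inside 'é'
    println!("{:?}", prefix(literal, 3)); // 'h' plus the two bytes of 'é'
}
```

Indexing with &literal[..2] would panic in the same situation, so get is the safer choice when the offset comes from outside.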
Analogy
You can think of &str as a bookmark. It doesn’t own the book (String); it just marks a place in it, referring to a specific passage or the whole text.
Converting Between String and &str
You can easily convert between a String and a &str:
let s = String::from("hello"); // Convert a &str to a String
let slice: &str = &s; // Borrow the String as a &str (a &String coerces to &str)
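There are several equivalent spellings for each direction, and it helps to recognize them all. The sketch below lists the common ones; shout is a hypothetical function that exists only to show a &String being accepted where a &str is expected.

```rust
/// Takes a borrowed slice and returns a new owned String.
/// `shout` is an illustrative helper name.
fn shout(text: &str) -> String {
    text.to_uppercase()
}

fn main() {
    // &str -> String: three spellings that all allocate a new owned copy.
    let a = String::from("hello");
    let b = "hello".to_string();
    let c = "hello".to_owned();
    assert!(a == b && b == c);

    // String -> &str: borrow explicitly, or slice the whole string.
    let explicit: &str = a.as_str();
    let full_slice: &str = &a[..];
    println!("{} {}", explicit, full_slice);

    // At a call site, &String coerces to &str automatically.
    println!("{}", shout(&a));
}
```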
Analogy
Imagine going to the library (borrowing a &str) vs. buying the book (String). When you borrow it, you can't change it and have to give it back, reflecting the borrowing and immutability concepts in Rust.
UTF-8 Byte Arrays
Sometimes you need to interact with raw bytes. In Rust, a UTF-8 encoded String or &str can be viewed as a byte array, which is useful when you need to interface with systems or libraries that don't understand Rust strings but do understand bytes.
Example: String to Bytes
let s = String::from("hello");
let bytes = s.as_bytes(); // Borrow the String's contents as a UTF-8 byte slice (&[u8])
Analogy
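This is like reading off the numeric code for each letter in your book (String), giving you a numerical representation of your text. For plain ASCII letters those are the familiar ASCII codes; any other character shows up as two to four bytes.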
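Going the other way, from bytes back to text, has to be checked, because not every byte sequence is valid UTF-8. The sketch below uses std::str::from_utf8 for that check; the wrapper name decode is made up for the example.

```rust
/// Tries to view raw bytes as UTF-8 text, returning None if they are invalid.
/// `decode` is an illustrative helper name.
fn decode(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    let s = String::from("héllo");
    let bytes: &[u8] = s.as_bytes(); // borrows the bytes; nothing is copied
    println!("{:?}", bytes);
    println!("{:?}", decode(bytes)); // round-trips back to the same text
    println!("{:?}", decode(&[0x68, 0xFF])); // None: 0xFF never appears in UTF-8
}
```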
When to Use Each
- Use String when you need owned, mutable data. For example, when you're building a string or modifying it at runtime.
- Use &str when you just need to read or pass around string data without ownership. This is common in function arguments and for static strings.
- Use byte arrays ([u8] or Vec<u8>) when you need to operate at the byte level, such as when dealing with files or network data, or when interfacing with non-Rust codebases or libraries.
In Practice
fn greet(name: &str) -> String {
    format!("Hello, {}!", name) // format! macro returns a String
}

fn main() {
    let name = "Alice";
    let greeting = greet(name);
    println!("{}", greeting);
    let bytes = greeting.as_bytes();
    for byte in bytes {
        println!("{}", byte); // prints the UTF-8 bytes of the string
    }
}
In the example above:
- We define a function greet that takes a &str and returns a String.
- Inside main, we call greet with a &str literal and receive a String in return.
- We then print out each byte of the String as a UTF-8 byte array.
Rust Memory Allocation for Different String Types
Allocation of String Types
String
A String is a heap-allocated data structure. It is essentially a wrapper around a Vec<u8>, which represents a buffer of UTF-8 bytes. Since the size of the string can change, it needs to be allocated on the heap. The String value itself (a pointer, a length, and a capacity) is typically stored on the stack, but the data it points to is on the heap.
let mut s = String::from("hello"); // 's' is on the stack, its data is on the heap
When you mutate the string, for example by using push_str to append more characters, Rust may need to allocate more space on the heap to accommodate the changes.
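You can watch this happen by tracking capacity(), which reports how many bytes the current heap buffer can hold. In the sketch below, count_capacity_changes is a made-up helper; the exact growth pattern is an implementation detail of the standard library, so the code only counts changes and does not predict them.

```rust
/// Appends `piece` to an empty String `times` times and counts how often
/// the capacity changed, which signals that a new buffer was allocated.
/// `count_capacity_changes` is an illustrative helper name.
fn count_capacity_changes(times: usize, piece: &str) -> usize {
    let mut s = String::new();
    let mut last_capacity = s.capacity();
    let mut changes = 0;
    for _ in 0..times {
        s.push_str(piece);
        if s.capacity() != last_capacity {
            changes += 1;
            last_capacity = s.capacity();
        }
    }
    changes
}

fn main() {
    let changes = count_capacity_changes(1000, "ab");
    println!("capacity changed {} times over 1000 appends", changes);

    // Reserving space up front avoids reallocation while the text still fits.
    let mut s = String::with_capacity(16);
    let before = s.capacity();
    s.push_str("hello");
    assert_eq!(s.capacity(), before);
}
```

When you know roughly how long the final string will be, String::with_capacity is a cheap way to skip the intermediate allocations.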
Allocation of str Types
&str
The &str type is an immutable slice that references a sequence of UTF-8 bytes. The str itself doesn’t have a size known at compile time; it’s a dynamically-sized type (DST). Thus, you can’t have a plain str on the stack. Instead, you use it as &str, which is a reference to a str.
Here’s where the data behind a &str can live:
- Static memory: When you have a string literal in your Rust program, the actual bytes of that string are embedded directly in the final binary, in read-only static memory (neither the stack nor the heap). A &str can be a reference to this static data.
let s = "hello"; // 's' is a reference on the stack to data in the binary's read-only memory
- Heap: If you take a slice of a String, you get a &str that points to the data on the heap. Here, the &str itself (the reference) is on the stack, but it points to the heap-allocated buffer of the String.
let s = String::from("hello");
let slice = &s[..]; // 'slice' is on the stack, pointing to data on the heap
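A quick way to convince yourself that a &str is just a pointer and a length is to measure it. The sketch below prints the sizes of the handle types and shows that a slice points into its parent's buffer rather than copying it. The helper offset_in is invented for the example, and the printed size of String reflects the current standard library layout, which is not a documented guarantee.

```rust
use std::mem::size_of;

/// Returns how many bytes into `parent` the slice `part` begins.
/// Only meaningful when `part` really is a slice of `parent`.
/// `offset_in` is an illustrative helper name.
fn offset_in(parent: &str, part: &str) -> usize {
    part.as_ptr() as usize - parent.as_ptr() as usize
}

fn main() {
    // A &str is a pointer plus a length: two machine words.
    println!("&str is {} bytes", size_of::<&str>());
    // A String also tracks its capacity (exact layout is an implementation detail).
    println!("String is {} bytes", size_of::<String>());

    let s = String::from("hello world");
    let world = &s[6..];
    // The slice points into the String's heap buffer; no text was copied.
    println!("'{}' starts {} bytes into the buffer", world, offset_in(&s, world));
}
```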
Allocation of UTF-8 Byte Arrays
When dealing with raw bytes, you may work with [u8] or Vec<u8>. Here's how they are allocated:
- Stack: A fixed-size array of bytes, like [u8; 5], is allocated on the stack.
let bytes: [u8; 5] = [104, 101, 108, 108, 111]; // stack-allocated fixed-size array
- Heap: If you have a Vec<u8>, the Vec structure is on the stack but the data it points to is on the heap, much like a String.
let bytes: Vec<u8> = vec![104, 101, 108, 108, 111]; // 'bytes' is on the stack, its data is on the heap
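Bytes that arrive from a file or a socket are not guaranteed to be valid UTF-8, so turning a Vec<u8> into text needs a policy for bad input. The sketch below shows two standard library options: String::from_utf8_lossy, which substitutes the replacement character U+FFFD, and String::from_utf8, which fails instead. The wrapper name to_text is made up for the example.

```rust
/// Converts bytes received from outside the program into text,
/// replacing invalid sequences with U+FFFD.
/// `to_text` is an illustrative helper name.
fn to_text(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let good: Vec<u8> = vec![104, 101, 108, 108, 111];
    let bad: Vec<u8> = vec![104, 105, 0xFF];

    println!("{}", to_text(&good));
    println!("{}", to_text(&bad)); // "hi" followed by the replacement character

    // String::from_utf8 takes ownership of the Vec and reuses its buffer,
    // returning an error if the bytes are not valid UTF-8.
    let owned = String::from_utf8(good).expect("bytes were valid UTF-8");
    println!("{}", owned);
    println!("{}", String::from_utf8(bad).is_err());
}
```

Which one to pick depends on the application: lossy conversion suits logs and display, while strict conversion suits data you intend to store or compare.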
Rust’s ownership system ensures that heap-allocated memory is automatically freed when the owner of the data goes out of scope, so ordinary code does not leak memory or need manual cleanup.
When to use String versus str
Choosing between String and &str in Rust depends on several factors such as ownership, lifetime, and mutability of the data you are working with. Here's a guideline on when to use each:
Use String when:
You need ownership: If your data needs to be owned by a particular variable, for instance when you’re returning a string from a function and want to transfer ownership outside that function, you should use String.
fn create_welcome_message(name: &str) -> String {
    format!("Welcome, {}!", name) // Returns an owned String
}
You need to modify or mutate the string: If you’re appending characters, concatenating strings, or otherwise changing the content, String is your go-to since &str is immutable.
let mut s = String::from("hello");
s.push_str(", world!"); // Mutates the string by appending
The size is unknown or variable at compile time: Whenever you build a string dynamically, such as reading from a file or user input, you cannot know the size at compile time, so you use a String.
use std::io;
let mut s = String::new();
io::stdin().read_line(&mut s).expect("Failed to read line"); // Reads user input into a String
Use &str when:
You are dealing with string literals or fixed strings: Since string literals are known at compile time and are immutable, they are naturally &str. They are fast and efficient because they are embedded in the binary and don’t require allocation on the heap.
let s = "This is a fixed string"; // This is a string slice (&str)
You need to borrow a string: If you just need to read or inspect the string without taking ownership, use a &str. This is very common in function arguments that don't need to mutate or keep the string.
fn print_message(message: &str) {
    println!("{}", message); // Borrowing a string slice, not taking ownership
}
Performance considerations: Passing a &str is cheap because only a pointer and a length are copied; there is no heap allocation and the text itself is not duplicated, as it would be if you cloned a String. If a function can work with a borrowed slice, it’s usually a good default choice.
fn string_length(s: &str) -> usize {
    s.len() // Just borrows and checks the length (in bytes) of the string slice
}
You are slicing strings: When you take a substring of another string (which can be a String or &str), you are creating a &str.
let s = String::from("hello");
let slice = &s[0..2]; // slice is a &str
For flexible function signatures: A function that takes a &str argument can be called with both a borrowed String and a &str, which makes it more flexible than one that demands a String.
fn takes_slice(s: &str) {
    // ...
}
let owned_string = String::from("hello");
let string_literal = "hello";
takes_slice(&owned_string); // Works with &String
takes_slice(string_literal); // Works with string literals
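If you want a function that is generic in the Rust sense, trait bounds are one option. This is a sketch of two common patterns, not the only way to do it: AsRef<str> for functions that only read the text, and Into<String> for functions that need to end up owning it. The function names char_count and make_label are made up for the example.

```rust
/// Counts characters in anything that can be viewed as a &str.
/// `char_count` is an illustrative helper name.
fn char_count<S: AsRef<str>>(s: S) -> usize {
    s.as_ref().chars().count()
}

/// Builds a label from anything that can become a String.
/// Passing an owned String moves it in without copying the text.
/// `make_label` is an illustrative helper name.
fn make_label<S: Into<String>>(name: S) -> String {
    let mut label: String = name.into();
    label.push(':');
    label
}

fn main() {
    let owned = String::from("héllo");
    println!("{}", char_count("héllo")); // called with a &str
    println!("{}", char_count(&owned)); // called with a &String
    println!("{}", make_label("name")); // the &str is copied into a new String
    println!("{}", make_label(owned)); // the String is moved in
}
```

For most functions a plain &str parameter is still the simplest choice; the generic versions pay off mainly in library APIs.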
Remember, using &str whenever possible can improve the performance of your application, as it avoids unnecessary memory allocation. However, String becomes necessary when you need to own or modify the string data. Rust's type system and borrowing rules help to enforce the proper use of these string types, guiding you towards writing safe and efficient code.
Conclusion
Phew! That was quite the journey, wasn’t it? From understanding the different flavors of strings in Rust to decoding the intricacies of UTF-8, we’ve covered a lot of ground. Think of String and str as the yin and yang of text in Rust, each with its place and purpose. And let's not forget our byte-sized buddy, UTF-8, who keeps our text consistent across platforms worldwide.
Remember, Unicode is like the DNA of digital text — it’s complex, but without it, we wouldn’t have the rich, diverse communication we enjoy across our global village of gadgets and gizmos. With the knowledge you’ve gained, you’re now equipped to venture forth and craft some truly universal code that speaks in every language under the sun. So go ahead, make your mark in this polyglot world of programming, and let your Rust code sing in perfect harmony with Unicode!
And when you next send a smiley to a friend or write that polyglot application, tip your hat to the silent, steadfast standard that made it all possible — Unicode, with a little help from its trusty sidekick, UTF-8. Happy coding!
All the best,
Luis Soares
CTO | Tech Lead | Senior Software Engineer | Cloud Solutions Architect | Rust 🦀 | Golang | Java | ML AI & Statistics | Web3 & Blockchain