Inline Assembly in Rust

Inline assembly in Rust, specifically the asm! macro, lets developers insert assembly language instructions directly into Rust code. This gives finer control over the hardware and makes it possible to use specific CPU instructions, which can be critical in systems programming and performance-critical code.

In this article, we’ll cover the basics of using the asm! macro in Rust, highlight its syntax, and showcase a working example.

Understanding Inline Assembly with asm!

The asm! macro allows Rust programs to embed assembly instructions inline with Rust code. It first appeared as a nightly-only feature, replacing the older llvm_asm! syntax with a more robust and flexible approach to inline assembly, and it has been stable since Rust 1.59 on the major architectures (including x86, x86-64, ARM, AArch64, and RISC-V).

Note: On Rust 1.59 or later, asm! needs neither the nightly compiler nor a feature gate.

Enabling Inline Assembly in Rust

Import the asm! macro at the beginning of your Rust file:

use std::arch::asm;

Basic Syntax of asm!

The basic structure of asm! in Rust is:

asm!(
    "<assembly code>",
    <operands>,
    options(<options>)
);

  • Assembly Code: One or more string literals containing the assembly instructions. Placeholders such as {0} or {name} in the template are replaced with the registers chosen for the operands.
  • Operands: Inputs and outputs such as in(reg) x or out(reg) y, which connect Rust variables to registers.
  • Options: Optional flags that describe how the asm! block behaves, such as preserves_flags, nomem, or nostack.
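
To make the three parts concrete, here is a minimal sketch, assuming an x86-64 target and Rust 1.59 or later. The function name copy_via_asm is hypothetical and exists only for illustration; on other architectures the sketch falls back to plain Rust so that it still compiles.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::asm;

/// Hypothetical helper: copies `src` into a new value with a single `mov`.
#[cfg(target_arch = "x86_64")]
pub fn copy_via_asm(src: u64) -> u64 {
    let dst: u64;
    unsafe {
        asm!(
            // Template: {dst} and {src} are replaced with register names.
            "mov {dst}, {src}",
            // Operands: one output and one input, both in compiler-chosen registers.
            dst = out(reg) dst,
            src = in(reg) src,
            // Options: no side effects, no memory access, no stack use, flags untouched.
            options(pure, nomem, nostack, preserves_flags),
        );
    }
    dst
}

/// Fallback for non-x86-64 targets (assumption: plain Rust is acceptable there).
#[cfg(not(target_arch = "x86_64"))]
pub fn copy_via_asm(src: u64) -> u64 {
    src
}
```

Calling copy_via_asm(42) returns 42; the point of the sketch is only to show where the template, operands, and options go.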

A Simple asm! Example

Let’s create a simple example where we use asm! to add two numbers. The goal here is to use the add assembly instruction directly to demonstrate how inline assembly can interact with Rust variables.

use std::arch::asm;

// Note: the two-operand `add` below is x86-64 syntax.
// Other architectures need a different assembly template.
fn main() {
    let x: u32 = 10;
    let y: u32 = 20;
    let result: u32;
    unsafe {
        asm!(
            // add dst, src: dst = dst + src
            "add {0:e}, {1:e}",
            inout(reg) x => result,
            in(reg) y,
        );
    }
    println!("The result of {} + {} is {}", x, y, result);
}

  • inout(reg) x => result: Loads x into a compiler-chosen register before the instruction runs and writes that register’s final value to result afterwards.
  • in(reg) y: Specifies y as an input register for the add operation.
  • add {0:e}, {1:e}: On x86, add takes two operands and adds the second into the first, so the register that started out holding x ends up holding x + y. The :e modifier selects the 32-bit register name, matching the u32 operands.

Compiling and Running the Code

To compile and run this code on an x86-64 machine, follow these steps:

Check your toolchain: asm! has been stable since Rust 1.59, so a reasonably recent stable compiler is enough and no nightly override is needed.

rustc --version

Run the Program:

cargo run

If everything works correctly, the output should be:

The result of 10 + 20 is 30

A More Advanced Example: Bitwise Operation with asm!

For a more complex example, let’s implement a bitwise XOR operation using inline assembly:

use std::arch::asm;

// Note: the two-operand `xor` below is x86-64 syntax.
fn main() {
    let a: u32 = 0b1100;
    let b: u32 = 0b1010;
    let result: u32;
    unsafe {
        asm!(
            // xor dst, src: dst = dst ^ src
            "xor {0:e}, {1:e}",
            inout(reg) a => result,
            in(reg) b,
        );
    }
    println!("The result of {} XOR {} is {:b}", a, b, result);
}

This example demonstrates the xor instruction, which performs a bitwise XOR operation. The output should be:

The result of 12 XOR 10 is 110

Let’s explore a real-world example that showcases both the utility and performance benefits of inline assembly. In this case, we’ll use inline assembly to implement a CPU cycle counter to measure the execution time of specific code segments in Rust. Measuring CPU cycles is crucial for profiling performance in embedded systems, high-frequency trading, cryptography, and other performance-critical applications.

Real-World Use Case: CPU Cycle Counter

Counting CPU cycles is essential for precise performance profiling, as it provides a direct measurement of how long code takes to execute at the CPU level. This approach is particularly useful in real-time systems where nanosecond precision is required, or in embedded systems where power and processing resources are limited.

Benefits of Using Inline Assembly

  1. Precision: Inline assembly allows us to read the CPU’s time-stamp counter directly, which gives a finer-grained measurement than standard functions like std::time::Instant that go through the operating system’s clock.
  2. Efficiency: RDTSC (Read Time-Stamp Counter) is a single instruction, so reading the counter avoids the call overhead of higher-level timing APIs.
  3. Platform-Specific Optimization: With inline assembly, we can leverage platform-specific instructions optimized for certain CPU architectures.
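
For comparison, the sketch below times the same closure with std::time::Instant and with the time-stamp counter. It assumes an x86-64 target and reads the counter through the standard library’s core::arch::x86_64::_rdtsc intrinsic rather than asm!, which emits the same RDTSC instruction. The names read_tsc and time_both are hypothetical, and the non-x86-64 fallback simply returns 0.

```rust
use std::time::{Duration, Instant};

/// Reads the time-stamp counter via the standard-library intrinsic.
#[cfg(target_arch = "x86_64")]
pub fn read_tsc() -> u64 {
    // Reading the TSC has no memory-safety preconditions on x86-64.
    unsafe { core::arch::x86_64::_rdtsc() }
}

/// Fallback for other targets (assumption: no counter available, so report 0).
#[cfg(not(target_arch = "x86_64"))]
pub fn read_tsc() -> u64 {
    0
}

/// Hypothetical helper: runs `work` once and reports both the wall-clock
/// duration and the number of TSC ticks that elapsed around it.
pub fn time_both<F: FnOnce()>(work: F) -> (Duration, u64) {
    let wall_start = Instant::now();
    let tsc_start = read_tsc();
    work();
    let tsc_ticks = read_tsc().wrapping_sub(tsc_start);
    (wall_start.elapsed(), tsc_ticks)
}
```

Dividing the tick count by the elapsed nanoseconds gives a rough estimate of the counter’s frequency on the machine at hand.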

Implementing a Cycle Counter with Inline Assembly

On x86 and x86-64 architectures, we can use the RDTSC instruction to access the CPU’s time-stamp counter directly. Let’s see how to implement this using asm! in Rust.

use std::arch::asm;
use std::hint::black_box;

/// Function to get the current CPU cycle count using the RDTSC instruction (x86-64 only)
fn get_cpu_cycles() -> u64 {
    let high: u32;
    let low: u32;
    unsafe {
        // Read the time-stamp counter into two 32-bit registers
        asm!(
            "rdtsc",
            out("eax") low,  // Lower 32 bits go into `low`
            out("edx") high, // Higher 32 bits go into `high`
            options(nostack, preserves_flags),
        );
    }
    // Combine the high and low parts to get the full 64-bit counter
    ((high as u64) << 32) | (low as u64)
}

fn main() {
    // Measure CPU cycles taken for a sample code block
    let start = get_cpu_cycles();
    // Sample code block (e.g., complex calculation, simulation, etc.)
    // u64 avoids the overflow an inferred i32 would hit;
    // black_box keeps the compiler from optimizing the loop away.
    let mut sum: u64 = 0;
    for i in 0..1_000_000u64 {
        sum += black_box(i);
    }
    let end = get_cpu_cycles();
    println!("The sum is: {}", sum);
    println!("CPU cycles taken: {}", end.wrapping_sub(start));
}

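get_cpu_cycles(): This function uses the RDTSC (Read Time-Stamp Counter) instruction to retrieve the CPU's time-stamp counter, which has been incrementing since the last reset. On most modern CPUs the counter ticks at a constant rate (an "invariant TSC") rather than following the core's current clock frequency.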

out("eax") low and out("edx") high specify output registers. In x86 assembly, RDTSC places the low 32 bits of the cycle count in EAX and the high 32 bits in EDX.

The high and low parts are combined into a single 64-bit value to represent the full cycle count.
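
The combining step is ordinary integer arithmetic and can be checked without any assembly. As a small worked sketch (the function name combine_tsc is hypothetical): with high = 0x1 and low = 0x2, shifting high left by 32 bits gives 0x1_0000_0000, and OR-ing in low gives 0x1_0000_0002.

```rust
/// Hypothetical helper: joins the EDX (high) and EAX (low) halves
/// returned by RDTSC into one 64-bit value.
pub fn combine_tsc(high: u32, low: u32) -> u64 {
    // Widen before shifting so the upper bits are not lost.
    ((high as u64) << 32) | (low as u64)
}
```

Note that the cast to u64 has to happen before the shift; shifting a u32 by 32 bits would overflow.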

Performance Measurement: The main function demonstrates a simple way to measure CPU cycles for a block of code. We capture the start cycle count before a loop and the end count after, allowing us to calculate the cycles taken for the loop.

Benefits of This Approach

  1. High Precision: Using RDTSC provides a high-precision, low-overhead way to measure cycles, as it avoids the typical delays of OS-level timing functions.
  2. Minimal Overhead: Reading the time-stamp counter directly costs very little compared to higher-level abstractions, making it well suited to profiling short code blocks where every cycle counts.
  3. Direct Hardware Reading: RDTSC reads the counter straight from the CPU without involving the OS. The measured interval can still be inflated if the thread is preempted or migrated between the two reads, so benchmarks should be repeated and, ideally, pinned to one core.

Enhanced Profiling: Using Inline Assembly for More Robust Performance Timing

In the following example, we’ll use both the RDTSC and RDTSCP instructions to count CPU cycles. RDTSC alone can give misleading results on modern out-of-order processors because it is not a serializing instruction: the CPU may execute it before earlier instructions have finished. RDTSCP addresses part of this by waiting until all previous instructions have executed before it reads the counter. Later instructions may still begin before the read completes, which is why the example pairs the counter reads with cpuid.

Improved CPU Cycle Counter Example

The example below shows a cycle counter that uses both RDTSC at the beginning and RDTSCP at the end, ensuring a precise and isolated cycle count of a critical code block.

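use std::arch::asm;
use std::hint::black_box;

/// Function to retrieve CPU cycle count using `RDTSC` at the start and `RDTSCP` at the end (x86-64 only)
fn get_cpu_cycles_pair() -> (u64, u64) {
    let start_high: u32;
    let start_low: u32;
    let end_high: u32;
    let end_low: u32;
    unsafe {
        // Start cycle count
        asm!(
            "mov {tmp:r}, rbx", // Save rbx: cpuid overwrites it and the compiler reserves it
            "xor eax, eax",     // Select cpuid leaf 0
            "cpuid",            // Serialize to prevent out-of-order execution
            "rdtsc",            // Read time-stamp counter
            "mov rbx, {tmp:r}", // Restore rbx
            tmp = out(reg) _,
            out("eax") start_low,  // Lower 32 bits
            out("edx") start_high, // Higher 32 bits
            out("ecx") _,          // Clobbered by cpuid
            options(nostack)       // The asm does not use the stack
        );
        // Critical code section goes here
        // (simulate work with a lightweight loop or function call)
        let mut sum: u64 = 0; // u64 avoids overflowing an inferred i32
        for i in 0..1_000_000u64 {
            sum += black_box(i);
        }
        black_box(sum); // Keep the loop from being optimized away
        // End cycle count
        asm!(
            "rdtscp",           // Read time-stamp counter with ordering
            "mov {lo:e}, eax",  // Copy the result out before cpuid overwrites eax/edx
            "mov {hi:e}, edx",
            "mov {tmp:r}, rbx", // Save rbx around cpuid
            "xor eax, eax",
            "cpuid",            // Serialize to prevent out-of-order execution
            "mov rbx, {tmp:r}",
            lo = out(reg) end_low,  // Lower 32 bits
            hi = out(reg) end_high, // Higher 32 bits
            tmp = out(reg) _,
            out("eax") _, // Clobbered by rdtscp and cpuid
            out("ecx") _,
            out("edx") _,
            options(nostack)
        );
    }
    // Combine high and low parts into a single 64-bit value
    let start_cycles = ((start_high as u64) << 32) | (start_low as u64);
    let end_cycles = ((end_high as u64) << 32) | (end_low as u64);
    (start_cycles, end_cycles)
}

fn main() {
    let (start, end) = get_cpu_cycles_pair();
    println!("CPU cycles taken: {}", end.wrapping_sub(start));
}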

  1. Serializing with cpuid: The cpuid instruction is used to prevent out-of-order execution, ensuring that all instructions before the RDTSC or RDTSCP call have completed. This is crucial in multi-core and high-performance environments to maintain accuracy.
  2. Start and End Counters:
     • Start Counter (RDTSC): We call RDTSC at the beginning of the code block to capture the start cycle count.
     • End Counter (RDTSCP): RDTSCP at the end reads the counter with an inherent ordering, providing a more accurate end cycle count.
  3. Critical Code Section: The code you want to measure (e.g., a loop) is placed between the two cycle counter instructions. In practice, this might be a cryptographic function, a data processing loop, or another performance-critical task.
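
As a lighter-weight alternative to cpuid, a commonly used pattern is to fence the counter read with LFENCE. This is a different technique from the cpuid-based one described above, and the sketch below is only an illustration under stated assumptions: an x86-64 target, the standard-library intrinsics _mm_lfence and _rdtsc from core::arch::x86_64, and a processor on which LFENCE orders instruction execution (this is documented for Intel processors; behavior on other vendors may depend on configuration). The name fenced_tsc is hypothetical.

```rust
/// Hypothetical helper: reads the time-stamp counter between two LFENCEs.
#[cfg(target_arch = "x86_64")]
pub fn fenced_tsc() -> u64 {
    use core::arch::x86_64::{_mm_lfence, _rdtsc};
    unsafe {
        // Wait until earlier instructions have completed locally.
        _mm_lfence();
        let ticks = _rdtsc();
        // Keep later instructions from starting before the counter is read.
        _mm_lfence();
        ticks
    }
}

/// Fallback for other targets (assumption: no counter available, so report 0).
#[cfg(not(target_arch = "x86_64"))]
pub fn fenced_tsc() -> u64 {
    0
}
```

Unlike cpuid, LFENCE does not overwrite any general-purpose registers, so no register saving is needed around it.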

Real-World Scenarios and Benefits

This method is highly beneficial in specific contexts:

  • Embedded Systems and Real-Time Applications: Precise cycle counting helps developers ensure that code execution times meet strict timing requirements, especially in systems where every microsecond counts, like automotive control units or medical devices.
  • Cryptographic Algorithms: Cycle-level profiling is valuable in cryptography, where timing differences can leak information about secret data. Precise measurement helps reveal unexpected performance bottlenecks and input-dependent timing.
  • High-Performance Trading: In financial systems, even minor delays can affect profitability. Cycle counting helps optimize latency-sensitive functions, like order matching or risk calculations.
  • Performance Optimization: For any CPU-intensive application, cycle-level measurement can reveal exactly which parts of the code consume the most resources, guiding targeted optimizations.

Extending Inline Assembly Usage with RDTSC and RDTSCP in Rust

Measuring the Impact of Different Code Blocks

Let’s extend our example by measuring two different code blocks to see how they compare in terms of CPU cycles. This technique is common in performance engineering, where you might want to assess the relative cost of different implementations or functions.

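use std::arch::asm;
use std::hint::black_box;

/// Function to retrieve CPU cycle count using `RDTSC` and `RDTSCP` (x86-64 only)
fn measure_code_cycles<F>(func: F) -> u64
where
    F: FnOnce(),
{
    let start_high: u32;
    let start_low: u32;
    let end_high: u32;
    let end_low: u32;
    unsafe {
        // Start cycle count
        asm!(
            "mov {tmp:r}, rbx", // Save rbx: cpuid overwrites it and the compiler reserves it
            "xor eax, eax",     // Select cpuid leaf 0
            "cpuid",
            "rdtsc",
            "mov rbx, {tmp:r}", // Restore rbx
            tmp = out(reg) _,
            out("eax") start_low,
            out("edx") start_high,
            out("ecx") _, // Clobbered by cpuid
            options(nostack)
        );
    }
    // Run the passed function
    func();
    unsafe {
        // End cycle count
        asm!(
            "rdtscp",
            "mov {lo:e}, eax", // Copy the result out before cpuid overwrites eax/edx
            "mov {hi:e}, edx",
            "mov {tmp:r}, rbx",
            "xor eax, eax",
            "cpuid",
            "mov rbx, {tmp:r}",
            lo = out(reg) end_low,
            hi = out(reg) end_high,
            tmp = out(reg) _,
            out("eax") _,
            out("ecx") _,
            out("edx") _,
            options(nostack)
        );
    }
    // Combine high and low parts into a single 64-bit value
    let start_cycles = ((start_high as u64) << 32) | (start_low as u64);
    let end_cycles = ((end_high as u64) << 32) | (end_low as u64);
    end_cycles.wrapping_sub(start_cycles)
}

fn main() {
    // Define two different code blocks to profile
    let cycles_block1 = measure_code_cycles(|| {
        // Block 1: A simple for loop (u64 avoids overflowing an inferred i32)
        let mut sum: u64 = 0;
        for i in 0..1_000_000u64 {
            sum += black_box(i);
        }
        black_box(sum);
    });
    let cycles_block2 = measure_code_cycles(|| {
        // Block 2: Simulating more complex work
        // (wrapping_mul because the running product overflows quickly)
        let mut product: u64 = 1;
        for i in 1..1000u64 {
            product = product.wrapping_mul(black_box(i));
        }
        black_box(product);
    });
    println!("Cycles for Block 1: {}", cycles_block1);
    println!("Cycles for Block 2: {}", cycles_block2);
}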

  • Function as a Parameter: We use a generic function measure_code_cycles that takes a closure, func, allowing any code block to be passed for profiling.
  • Reusability: This setup allows you to measure any function or block of code, making it easy to compare different algorithms, implementations, or optimizations in a structured and repeatable manner.
  • Precision and Comparisons: By measuring different blocks, you can directly compare cycle costs and make informed decisions on optimizations.

Output

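Running this code prints the number of CPU cycles taken by each code block, allowing you to compare their performance. The exact figures depend on the CPU, the build profile, and system load; the values below are only illustrative: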

Cycles for Block 1: 52345678
Cycles for Block 2: 12567890
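
Because a single run is noisy, comparisons are usually based on several runs. The sketch below repeats a measurement and keeps the smallest tick count, on the assumption that the minimum is the run least disturbed by interrupts and cache misses. The names read_tsc and min_ticks are hypothetical, the counter is read through the standard-library _rdtsc intrinsic on x86-64, and other targets fall back to a counter that always returns 0.

```rust
/// Reads the time-stamp counter (x86-64) via the standard-library intrinsic.
#[cfg(target_arch = "x86_64")]
pub fn read_tsc() -> u64 {
    unsafe { core::arch::x86_64::_rdtsc() }
}

/// Fallback for other targets (assumption: no counter available, so report 0).
#[cfg(not(target_arch = "x86_64"))]
pub fn read_tsc() -> u64 {
    0
}

/// Hypothetical helper: runs `work` `runs` times and returns the smallest
/// number of ticks observed, or None when `runs` is zero.
pub fn min_ticks<F: FnMut()>(runs: usize, mut work: F) -> Option<u64> {
    (0..runs)
        .map(|_| {
            let start = read_tsc();
            work();
            read_tsc().wrapping_sub(start)
        })
        .min()
}
```

A call such as min_ticks(100, || my_function()) then gives one figure per candidate implementation, where my_function stands for whatever code is being profiled.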

Limitations and Considerations

  • Multi-Core and Hyper-Threaded CPUs: Due to variability across CPU cores and threads, RDTSC and RDTSCP might show inconsistent results on multi-threaded systems. Affinity settings or single-threaded execution can help mitigate this.
  • Dynamic Frequency Scaling (DFS): Modern CPUs often adjust their frequency dynamically, which can skew cycle counts. Where the time-stamp counter ticks at a constant rate, its readings reflect elapsed reference time rather than actual core cycles. Running on a high-performance setting or disabling frequency scaling (if possible) can improve accuracy.
  • Platform-Specific: This approach is currently limited to x86 and x86-64 platforms, though similar mechanisms exist for other architectures like ARM (e.g., PMCCNTR for ARM CPUs).
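
One way to notice the multi-core problem is to use the extra value that RDTSCP returns: alongside the counter it reports the contents of the IA32_TSC_AUX register, which operating systems typically set to a per-CPU identifier. The sketch below is a hedged illustration, assuming an x86-64 target and the standard-library intrinsics __rdtscp and __cpuid; the name ticks_on_one_core is hypothetical. It discards a measurement when the identifier changed between the two reads. A matching identifier does not prove the thread never moved in between, so this complements CPU pinning rather than replacing it.

```rust
/// Hypothetical helper: measures `work` with RDTSCP and returns None if the
/// CPU lacks RDTSCP or the reported core identifier changed during the run.
#[cfg(target_arch = "x86_64")]
pub fn ticks_on_one_core<F: FnOnce()>(work: F) -> Option<u64> {
    use core::arch::x86_64::{__cpuid, __rdtscp};

    // CPUID leaf 0x8000_0001, EDX bit 27 reports RDTSCP support.
    #[allow(unused_unsafe)]
    let has_rdtscp = unsafe { __cpuid(0x8000_0001) }.edx & (1 << 27) != 0;
    if !has_rdtscp {
        return None;
    }

    let mut aux_start: u32 = 0;
    let mut aux_end: u32 = 0;
    // __rdtscp writes IA32_TSC_AUX through the pointer and returns the counter.
    let start = unsafe { __rdtscp(&mut aux_start) };
    work();
    let end = unsafe { __rdtscp(&mut aux_end) };

    if aux_start == aux_end {
        Some(end.wrapping_sub(start))
    } else {
        None
    }
}

/// Fallback for other targets (assumption: nothing to measure with).
#[cfg(not(target_arch = "x86_64"))]
pub fn ticks_on_one_core<F: FnOnce()>(work: F) -> Option<u64> {
    work();
    None
}
```

A caller can simply retry the measurement whenever the function returns None.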

Tips for Using asm!

  • Safety: Inline assembly is inherently unsafe. Wrapping asm! in unsafe blocks is required.
  • Registers: Use reg to let the compiler choose the best available general-purpose register. Specify const instead of in to pass a constant to assembly.
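  • Options: The options argument can specify flags such as pure, nomem, preserves_flags, or nostack, giving the compiler more information about what the assembly does. Unlike the old llvm_asm!, there is no volatile option: an asm! block is treated as having side effects unless it is marked pure.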
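
The difference between reg and an explicit register shows up with instructions that use fixed registers. The sketch below follows the well-known widening-multiply pattern, assuming an x86-64 target: the x86 mul instruction always multiplies by rax and writes the 128-bit result to rdx:rax, so those two operands name their registers explicitly while the other factor uses reg. The name wide_mul is hypothetical, and non-x86-64 targets fall back to plain Rust arithmetic.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::asm;

/// Hypothetical helper: 64 x 64 -> 128-bit multiplication using `mul`.
#[cfg(target_arch = "x86_64")]
pub fn wide_mul(a: u64, b: u64) -> u128 {
    let lo: u64;
    let hi: u64;
    unsafe {
        asm!(
            // mul r/m64: rdx:rax = rax * operand
            "mul {}",
            in(reg) a,                // any general-purpose register
            inlateout("rax") b => lo, // rax holds b on entry, low half on exit
            lateout("rdx") hi,        // rdx receives the high half
            options(pure, nomem, nostack),
        );
    }
    ((hi as u128) << 64) | (lo as u128)
}

/// Fallback for other targets (assumption: plain Rust is acceptable there).
#[cfg(not(target_arch = "x86_64"))]
pub fn wide_mul(a: u64, b: u64) -> u128 {
    (a as u128) * (b as u128)
}
```

The lateout and inlateout forms tell the compiler that the outputs are written only after all inputs have been read, which lets it reuse an input register for an output.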

When to Use asm!

Inline assembly is powerful, but it’s essential to consider when it’s appropriate:

  • Low-Level Hardware Interaction: Directly interface with hardware where specific CPU instructions are needed.
  • Performance-Critical Code: Optimize particular code paths by controlling CPU instructions.
  • Operating Systems and Embedded Programming: When interacting with the OS or low-level hardware, inline assembly provides precise control.

All the best,

Luis Soares
luis@luissoares.dev

Lead Software Engineer | Blockchain & ZKP Protocol Engineer | 🦀 Rust | Web3 | Solidity | Golang | Cryptography | Author
