Implementing a Vector Database in Rust

Unlike traditional databases that store scalar data (like integers, strings, etc.), vector databases are designed to efficiently store and…

Luis Soares

21 Jan 2024 — 5 min read

Unlike traditional databases that store scalar data (like integers, strings, etc.), vector databases are designed to efficiently store and retrieve vector data — collections of numerical values representing points in a multi-dimensional space.

This article will explore how to implement a basic vector database in Rust.

Let’s dive right in! 🦀

What is a Vector Database?

A vector database is a type of database that is optimized for storing and querying vectors, which are arrays of numbers representing points in a high-dimensional space. These databases are essential in applications where similarity search in large datasets is a key operation, such as in recommendation systems, image retrieval, and natural language processing.

Key concepts in vector databases include:

Vector Representation: Vectors in these databases represent data points. For instance, in image recognition, an image might be represented as a high-dimensional vector where each dimension corresponds to a feature of the image.
Distance Metrics: To retrieve similar vectors, the database needs a way to quantify how ‘close’ or ‘similar’ two vectors are. Common metrics include Euclidean distance, Manhattan distance, and cosine similarity.
Indexing and Search Algorithms: Efficient search in high-dimensional spaces is a challenging problem. Vector databases often employ specialized indexing strategies to speed up query times, such as KD-trees, R-trees, or hashing-based approaches.

Implementing a Vector Database in Rust

Step 1: Setting Up the Rust Environment

Before we start coding, ensure you have Rust installed. Rust’s package manager, Cargo, makes it easy to set up a new project:

cargo new vector_db 
cd vector_db

Step 2: Defining the Vector Type

In Rust, we can define a vector as a fixed-size array or use dynamic arrays from the standard library. For simplicity, let’s use fixed-size arrays ([f32; N]) where N is the dimension of the vector:

type Vector = [f32; 3]; // Example for 3D vectors

Step 3: Creating the Database Structure

We’ll create a struct VectorDB that will act as our database:

struct VectorDB { 
    vectors: Vec<Vector>, 
}

Step 4: Implementing Basic Operations

Now, let’s add methods to add and retrieve vectors:

impl VectorDB { 
    fn new() -> Self { 
        VectorDB { vectors: Vec::new() } 
    } 
 
   fn add_vector(&mut self, vector: Vector) { 
          self.vectors.push(vector); 
      } 
  
   fn get_vector(&self, index: usize) -> Option<&Vector> { 
        self.vectors.get(index) 
   } 
}

Step 5: Adding a Search Function

To find the vector closest to a given query vector, we’ll implement a simple linear search based on Euclidean distance:

impl VectorDB { 
    // Existing methods... 
 
fn find_closest(&self, query: Vector) -> Option<&Vector> { 
        self.vectors.iter().min_by(|&a, &b| { 
            let distance_a = VectorDB::euclidean_distance(&query, a); 
            let distance_b = VectorDB::euclidean_distance(&query, b); 
            distance_a.partial_cmp(&distance_b).unwrap() 
        }) 
    } 
    fn euclidean_distance(a: &Vector, b: &Vector) -> f32 { 
        a.iter().zip(b.iter()).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt() 
    } 
}

Step 6: Testing Our Vector Database

Finally, let’s test our database in the main function:

fn main() { 
    let mut db = VectorDB::new(); 
    db.add_vector([1.0, 2.0, 3.0]); 
    db.add_vector([4.0, 5.0, 6.0]); 
 
// Retrieving and printing a vector 
    if let Some(vector) = db.get_vector(0) { 
        println!("Vector at index 0: {:?}", vector); 
    } 
    // Finding and printing the closest vector 
    if let Some(closest) = db.find_closest([2.0, 3.0, 4.0]) { 
        println!("Closest vector: {:?}", closest); 
    } 
}

Implementing more efficient indexing

We’ll focus on a basic yet effective indexing strategy known as KD-Tree (K-dimensional Tree), a space-partitioning data structure used for organizing points in a K-dimensional space. KD-Trees are particularly useful for efficient nearest neighbor searches in multi-dimensional keys.

Step 1: Understanding KD-Trees

A KD-Tree is a binary tree in which every node is a K-dimensional point. Every non-leaf node generates a splitting hyperplane that divides the space into two halves. Points to the left of this hyperplane are represented by the left subtree, while points to the right are represented by the right subtree.

Key Concepts:

Splitting Dimension: At each level of the tree, a different dimension is chosen for splitting the data. The choice of dimension typically cycles through all dimensions.
Median Finding: To split the points at each node, the median along the chosen dimension is selected. This approach helps balance the tree.
Recursive Partitioning: The process continues recursively, resulting in a tree where each leaf node represents a point in the space.

Step 2: Defining the KD-Tree Structure in Rust

First, define the structure of the KD-Tree. You’ll need to represent both internal nodes (with splitting information) and leaf nodes (with actual vector data):

enum KDTreeNode { 
    Leaf(Vector), 
    Internal { 
        left: Box<KDTreeNode>, 
        right: Box<KDTreeNode>, 
        split_value: f32, 
        split_dimension: usize, 
    }, 
} 
 
struct KDTree { 
    root: KDTreeNode, 
}

Step 3: Building the KD-Tree

The process involves sorting points based on the splitting dimension and recursively building the tree:

impl KDTree { 
    fn build(points: Vec<Vector>, depth: usize) -> KDTreeNode { 
        if points.len() == 1 { 
            return KDTreeNode::Leaf(points[0]); 
        } 
 
    let dim = depth % K; // K is the number of dimensions 
        let mut sorted_points = points; 
        sorted_points.sort_by(|a, b| a[dim].partial_cmp(&b[dim]).unwrap()); 
        let median_idx = sorted_points.len() / 2; 
        let median_value = sorted_points[median_idx][dim]; 
        KDTreeNode::Internal { 
            left: Box::new(KDTree::build(sorted_points[..median_idx].to_vec(), depth + 1)), 
            right: Box::new(KDTree::build(sorted_points[median_idx..].to_vec(), depth + 1)), 
            split_value: median_value, 
            split_dimension: dim, 
        } 
    } 
}

Step 4: Implementing Search in the KD-Tree

Searching for the nearest neighbor involves traversing the tree, checking distances, and possibly backtracking:

impl KDTree { 
    fn nearest_neighbor(&self, query: &Vector) -> Option<&Vector> { 
        self.nearest(query, &self.root, None, f32::MAX) 
    } 
 
fn nearest<'a>(&'a self, query: &Vector, node: &'a KDTreeNode, best: Option<&'a Vector>, best_dist: f32) -> Option<&'a Vector> { 
        match node { 
            KDTreeNode::Leaf(point) => { 
                let dist = VectorDB::euclidean_distance(query, point); 
                if dist < best_dist { 
                    Some(point) 
                } else { 
                    best 
                } 
            }, 
            KDTreeNode::Internal { left, right, split_value, split_dimension } => { 
                let next_node = if query[*split_dimension] < *split_value { left } else { right }; 
                let other_node = if query[*split_dimension] < *split_value { right } else { left }; 
                let updated_best = self.nearest(query, next_node, best, best_dist); 
                let updated_best_dist = updated_best.map_or(best_dist, |v| VectorDB::euclidean_distance(query, v)); 
                if (query[*split_dimension] - split_value).abs() < updated_best_dist { 
                    self.nearest(query, other_node, updated_best, updated_best_dist) 
                } else { 
                    updated_best 
                } 
            } 
        } 
    } 
}

Step 5: Testing the KD-Tree

Finally, integrate the KD-Tree into your vector database and test it:

fn main() { 
    let points = vec![[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [2.0, 3.0, 4.0], ...]; 
    let kd_tree = KDTree::build(points, 0); 
 
if let Some(nearest) = kd_tree.nearest_neighbor(&[3.0, 3.0, 3.0]) { 
        println!("Nearest neighbor: {:?}", nearest); 
    } 
}

That’s all, fellow Rustaceans! 🦀

🚀 Explore a Wealth of Resources in Software Development and More by Luis Soares

📚 Learning Hub: Expand your knowledge in various tech domains, including Rust, Software Development, Cloud Computing, Cyber Security, Blockchain, and Linux, through my extensive resource collection:

Hands-On Tutorials with GitHub Repos: Gain practical skills across different technologies with step-by-step tutorials, complemented by dedicated GitHub repositories. Access Tutorials
In-Depth Guides & Articles: Deep dive into core concepts of Rust, Software Development, Cloud Computing, and more, with detailed guides and articles filled with practical examples. Read More
E-Books Collection: Enhance your understanding of various tech fields with a series of free e-Books, including titles like “Mastering Rust Ownership” and “Application Security Guide” Download eBook
Project Showcases: Discover a range of fully functional projects across different domains, such as an API Gateway, Blockchain Network, Cyber Security Tools, Cloud Services, and more. View Projects
LinkedIn Newsletter: Stay ahead in the fast-evolving tech landscape with regular updates and insights on Rust, Software Development, and emerging technologies by subscribing to my newsletter on LinkedIn. Subscribe Here

🔗 Connect with Me:

Medium: Read my articles on Medium and give claps if you find them helpful. It motivates me to keep writing and sharing Rust content. Follow on Medium
Personal Blog: Discover more on my personal blog, a hub for all my Rust-related content. Visit Blog
LinkedIn: Join my professional network for more insightful discussions and updates. Connect on LinkedIn
Twitter: Follow me on Twitter for quick updates and thoughts on Rust programming. Follow on Twitter

Wanna talk? Leave a comment or drop me a message!

All the best,

Luis Soares
luis.soares@linux.com

Implementing a Vector Database in Rust

Luis Soares

What is a Vector Database?

Implementing a Vector Database in Rust

Step 1: Setting Up the Rust Environment

Step 2: Defining the Vector Type

Step 3: Creating the Database Structure

Step 4: Implementing Basic Operations

Step 5: Adding a Search Function

Step 6: Testing Our Vector Database

Implementing more efficient indexing

Step 1: Understanding KD-Trees

Key Concepts:

Step 2: Defining the KD-Tree Structure in Rust

Step 3: Building the KD-Tree

Step 4: Implementing Search in the KD-Tree

Step 5: Testing the KD-Tree

🚀 Explore a Wealth of Resources in Software Development and More by Luis Soares

Read more

Dynamic Linking and Memory Relocations in Rust

Building an Error Correction System in Rust

Rust Lifetimes Made Simple

Rust: From Simple Functions to Advanced Abstractions