1 unstable release
0.1.0 | Dec 28, 2024 |
---|
#204 in Concurrency
107 downloads per month
39KB
724 lines
DiskANN Implementation in Rust
A Rust implementation of DiskANN (Disk-based Approximate Nearest Neighbor search) featuring a 3-layer index architecture and parallel query processing. This project provides an efficient and scalable solution for large-scale vector similarity search with single-file storage.
Overview
This implementation provides a memory-efficient approach to similarity search by:
- Using a 3-layer hierarchical index structure for faster search
- Storing all data in a single file using memory mapping
- Supporting both Euclidean distance and Cosine similarity
- Managing adjacency lists for graph-based search
- Implementing parallel query processing
- Supporting large-scale datasets that don't fit in RAM
Features
- Three-layer hierarchical index structure:
- Top layer (L0): Smallest, most selective layer
- Middle layer (L1): Intermediate connectivity
- Base layer (L2): Complete dataset
- Single-file storage format for simplified deployment
- Choice of distance metrics:
- Euclidean distance
- Cosine similarity
- Cluster-based graph construction for meaningful adjacency
- Parallel query processing using rayon
- Memory-mapped file access for handling large datasets
- Comprehensive error handling with custom error types
Usage
Building a New Index
use diskannrs::{SingleFileDiskANN, DistanceMetric};
let index = SingleFileDiskANN::build_index_singlefile(
1_000_000, // number of vectors
128, // dimension
32, // max neighbors per node
0.01, // fraction of vectors in top layer
0.1, // fraction of vectors in middle layer
DistanceMetric::Euclidean, // or DistanceMetric::Cosine
"index.db" // single file to store everything
)?;
Opening an Existing Index
let index = SingleFileDiskANN::open_index_singlefile("index.db")?;
Searching the Index
// Prepare your query vector
let query = vec![0.1, 0.2, ...; 128]; // must match index dimension
// Search for nearest neighbors
let k = 10; // number of neighbors to return
let beam_width = 64; // search beam width
let neighbors = index.search(&query, k, beam_width);
Parallel Search
use rayon::prelude::*;
use std::sync::Arc;
// Create shared index reference
let index = Arc::new(index);
// Perform parallel queries
let results: Vec<Vec<u32>> = query_batch
.par_iter()
.map(|query| index.search(query, k, beam_width))
.collect();
Performance Characteristics
- Memory Usage: O(1) for vector storage due to memory mapping
- Disk Space: Single file containing:
- Vectors: num_vectors * dimension * 4 bytes
- Adjacency Lists: Varies by layer size and max_degree
- Metadata: Small overhead
- Search Time: Logarithmic due to hierarchical structure
- Parallel Processing: Scales with available CPU cores
Building and Testing
# Build the library
cargo build --release
# Run tests
cargo test
# Run with example
cargo run --release --example simple_search
Current Status
This implementation features:
- Single-file storage format
- 3-layer hierarchical index structure
- Multiple distance metrics support
- Cluster-based graph construction
- Parallel query processing
- Memory-mapped I/O
- Comprehensive test suite
Future Improvements
- Add more distance metrics
- Implement dynamic index updates
- Add parameter auto-tuning
- Expand benchmarking suite
- Add more examples
- Improve documentation
Contributing
Contributions are welcome! Please feel free to:
- Open issues for bugs or feature requests
- Submit PRs for improvements
- Share ideas for optimization
License
This project is licensed under the MIT License - see the LICENSE file for details.
References
- Original DiskANN paper: DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
Dependencies
~2.2–3MB
~62K SLoC