2 releases

0.1.1 May 7, 2024
0.1.0 Apr 22, 2024

#748 in Database interfaces

MIT/Apache

54KB
1K SLoC

Vectune: fast Vamana indexing

License: MIT License: Apache 2.0

Vectune is a lightweight VectorDB with Incremental Indexing, based on FreshVamana. This project is implemented with the support of KinicDAO and powers the backend of KinicVectorDB for vector indexing.

Getting Start

By specifying progress-bar in features, you can check the progress of indexing.

[dependencies]
vectune = {version = "0.1.0", features = ["progress-bar"]}

To perform calculations of Euclidean distances quickly using SIMD, it is necessary to specify nightly in example. If the rust-analyzer in VSCode gives an error for #![feature(portable_simd)], please set up your .vscode/settings.json.

{
  "rust-analyzer.server.extraEnv": {
      "RUSTUP_TOOLCHAIN": "nightly"
  },
}

Example

Setup and Run

To test with the SIFT1M dataset, please execute the following command. SIFT1M is a dataset of 1 million data points, each with 128 dimensions.

curl ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz -o examples/test_data/sift.tar.gz
tar -xzvf examples/test_data/sift.tar.gz -C examples/test_data

cargo +nightly run --release --features progress-bar --example sift1m

How it works

Indexing is performed on the data using a Builder, and searches and insertions are conducted on the graph.

use vectune::{Builder, GraphInterface, PointInterface};

let points = Vec::new();
for vec in base_vectors {
    points.push(Point(vec.to_vec()));
}

let (nodes, centroid) = Builder::default()
    .progress(ProgressBar::new(1000))
    .build(points);

let mut graph = Graph::new(nodes, centroid);

let k = 50;

let (top_k_results, _visited) = vectune::search(&mut graph, &Point(query.to_vec()), k);

PointInterface Trait

You will need to define the dimensions and data type of the vectors used, as well as the method for calculating distance.

Please implement the following four methods:

  • distance(&self, other: &Self) -> f32
  • fn dim() -> u32
  • fn add(&self, other: &Self) -> Self
  • fn div(&self, divisor: &usize) -> Self

distance() can be optimized using SIMD. Please refer to ./examples/src/bin/sift1m.rs.

The following example provides a simple implementation.

use vectune::PointInterface;

#[derive(Serialize, Deserialize, Clone, Debug)]
struct Point(Vec<f32>);
impl Point {
    fn to_f32_vec(&self) -> Vec<f32> {
        self.0.iter().copied().collect()
    }
    fn from_f32_vec(a: Vec<f32>) -> Self {
        Point(a.into_iter().collect())
    }
}
impl PointInterface for Point {
    fn distance(&self, other: &Self) -> f32 {
        self.0
            .iter()
            .zip(other.0.iter())
            .map(|(a, b)| {
                let c = a - b;
                c * c
            })
            .sum::<f32>()
            .sqrt()
    }
    fn dim() -> u32 {
        384
    }
    fn add(&self, other: &Self) -> Self {
        Point::from_f32_vec(
            self.to_f32_vec()
                .into_iter()
                .zip(other.to_f32_vec().into_iter())
                .map(|(x, y)| x + y)
                .collect(),
        )
    }
    fn div(&self, divisor: &usize) -> Self {
        Point::from_f32_vec(
            self.to_f32_vec()
                .into_iter()
                .map(|v| v / *divisor as f32)
                .collect(),
        )
    }
}

GraphInterface Trait

To accommodate the entire graph on storage solutions other than SSDs or other memory types, you need to implement the GraphInterface.

Please implement the following eleven methods:

  • fn alloc(&mut self, point: P) -> usize
  • fn free(&mut self, id: &usize)
  • fn cemetery(&self) -> Vec<usize>
  • fn clear_cemetery(&mut self)
  • fn backlink(&self, id: &usize) -> Vec<usize>
  • fn get(&mut self, id: &usize) -> (P, Vec<usize>)
  • fn size_l(&self) -> usize
  • fn size_r(&self) -> usize
  • fn size_a(&self) -> f32
  • fn start_id(&self) -> usize
  • fn overwirte_out_edges(&mut self, id: &usize, edges: Vec<usize>)

self.get() is defined with &mut self because it handles caching from SSDs and other storage devices.

In vectune::search(), nodes returned by self.cemetery() are marked as tombstones and are excluded from the search results. Additionally, they are permanently deleted in vectune::delete().

You need to manage backlinks when adding or deleting nodes. This is utilized in vectune::delete().

The following example provides a simple on-memory implementation.

use vectune::GraphInterface;
use itertools::Itertools;

struct Graph<P>
where
    P: VPoint,
{
    nodes: Vec<(P, Vec<u32>)>,
    backlinks: Vec<Vec<u32>>,
    cemetery: Vec<u32>,
    centroid: u32,
}

impl<P> VGraph<P> for Graph<P>
where
    P: VPoint,
{
    fn alloc(&mut self, point: P) -> u32 {
        self.nodes.push((point, vec![]));
        self.backlinks.push(vec![]);
        (self.nodes.len() - 1) as u32
    }

    fn free(&mut self, _id: &u32) {
        // todo!()
    }

    fn cemetery(&self) -> Vec<u32> {
        self.cemetery.clone()
    }

    fn clear_cemetery(&mut self) {
        self.cemetery = Vec::new();
    }

    fn backlink(&self, id: &u32) -> Vec<u32> {
        self.backlinks[*id as usize].clone()
    }

    fn get(&mut self, id: &u32) -> (P, Vec<u32>) {
        let node = &self.nodes[*id as usize];
        node.clone()
    }

    fn size_l(&self) -> usize {
        125
    }

    fn size_r(&self) -> usize {
        70
    }

    fn size_a(&self) -> f32 {
        2.0
    }

    fn start_id(&self) -> u32 {
        self.centroid
    }

    fn overwirte_out_edges(&mut self, id: &u32, edges: Vec<u32>) {
        for out_i in &self.nodes[*id as usize].1 {
            let backlinks = &mut self.backlink(out_i);
            backlinks.retain(|out_i| out_i != id)
        }

        for out_i in &edges {
            let backlinks = &mut self.backlink(out_i);
            backlinks.push(*id);
            backlinks.sort();
            backlinks.dedup();
        }

        self.nodes[*id as usize].1 = edges;
    }
}

Indexing

  • a is the threshold for RobustPrune; increasing it results in more long-distance edges and fewer nearby edges.
  • r represents the number of edges; increasing it adds complexity to the graph but reduces the number of isolated nodes.
  • l is the size of the retention list for greedy-search; increasing it allows for the construction of more accurate graphs, but the computational cost grows exponentially.
  • seed is used for initializing random graphs; it allows for the fixation of the random graph, which can be useful for debugging.
let (nodes, centroid) = Builder::default()
    .set_a(2.0)
    .set_r(70)
    .set_l(125)
    .set_seed(11677721592066047712)
    .progress(ProgressBar::new(1000))
    .build(points);

Searching

k represents the number of top-k results. It is necessary that k <= l.

vectune::search(&mut graph, &point, k);

Inserting

vectune::insert(&mut graph, point);

Deleting

Completely remove the nodes returned by graph.cemetery() from the graph.

vectune::delete(&mut graph);

Dependencies

~3–11MB
~127K SLoC