12 releases

0.1.11 May 6, 2024
0.1.10 Nov 17, 2023
0.1.9 Jun 6, 2023
0.1.7 Mar 6, 2023
0.1.1 Nov 26, 2020

#146 in Algorithms

Download history 8/week @ 2024-06-10 3/week @ 2024-06-17 8/week @ 2024-07-15 25/week @ 2024-07-29 8/week @ 2024-08-26 1/week @ 2024-09-16 49/week @ 2024-09-23

53 downloads per month
Used in 4 crates

MIT/Apache

205KB
3.5K SLoC

Some Minhash related algorithms

This crate provides implementation of some recent algorithms deriving from the original Minhash. They have better performance and are more general.

It implements:

  • ProbMinHash2, ProbMinHash3 and ProbMinHash3a as described in O. Ertl paper: ProbMinHash. A Class of of Locality-Sensitive Hash Algorithms for the Probability Jaccard Similarity (2020) probminhash Ertl or IEEE-2022

These algorithms compute an estimation of the Jaccard weighted index via sensitive hashing. It is an extension of the Jaccard index to the case where objects have a weight, or a multiplicity associated.
This Jaccard weighted index provides a metric on discrete probability distributions as explained in : Moulton Jiang. Maximally consistent sampling and the Jaccard index of probability distributions (2018) Moulton-Jiang-ieee or Moulton-Jiang-arxiv

Noting Jp the Jaccard weighted index, then 1. - Jp defines a metric on finite discrete probabilities.
This module is the core of the crate which has two other modules.

  • Superminhash

A new minwise Hashing Algorithm for Jaccard Similarity Estimation Otmar Ertl 2017-2018 Cf superminhash Ertl

This algorithm runs on unweighted objects and can sketch on a laptop billions of objects into f32/f64 vectors. The hash values are computed by the sketch method or can be computed before entering SuperMinHash methods.

It runs in one pass on data so it can be used in streaming.

A variant of this algorithm, Superminhash2, sketch data into u32/u64 vectors but is slower. It is accessed with the sminhash2 feature.

  • SetSketch

SetSketch: Filling the gap between MinHash and HyperLogLog Otmar Ertl 2021 arxiv or vldb

This algorithm runs on unweighted objects. It is slower than SuperMinHash but can sketch billions of objects into vectors of 16 bytes integers. Morever sketches are mergeable.
We provide sketching (adapted to LSH with Jaccard distance) and a cardinality estimator of the sketched set.

  • Densification algorithms above One Permutation Hashing (known as OPH).

    • Optimal Densification for Fast and Accurate Minwise Hashing.
      Anshumali Shrivastava 2017 pmlr-2017.

    • On densification for MinWise Hashing.
      Mai, Rao, Kapilevitch, Rossi, Abbasi-Yadkori, Sinha. pmlr-2020

  • ProbOrdMinHash2 is a locality-sensitive hashing for the edit distance implemented over ProbMinHash2 as in Ertl's probordminhash2.
    It is inspired by Marcais.G et al. BioInformatics 2019, see Marcais

  • Invhash

It is just a module providing invertible hash from u32 to u32 or u64 to u64 and can be used to run a prehash on indexes. (See reference to Thomas Wang's invertible integer hash functions in invhash.rs)

Some examples

Some example of usage (more in the tests in each module) consisting to estimate intersection of contents of 2 vectors:

  • Probminhash

An example of Prominhash3a with an IndexMap
(see test probminhash::tests::test_probminhash3a_count_intersection_unequal_weights)

    type FnvIndexMap<K, V> = IndexMap<K, V, FnvBuildHasher>;
    ...
    let mut wa : FnvIndexMap::<usize,f64> = FnvIndexMap::with_capacity_and_hasher(70, FnvBuildHasher::default());
    // initialize wa ...

    let mut wb : FnvIndexMap::<usize,f64> = FnvIndexMap::with_capacity_and_hasher(70, FnvBuildHasher::default());
    // initialize ...
    let mut waprobhash = ProbMinHash3a::<usize, FnvHasher>::new(nbhash, 0);
    waprobhash.hash_weigthed_idxmap(&wa);
    //
    let mut wbprobhash = ProbMinHash3a::<usize, FnvHasher>::new(nbhash, 0);
    wbprobhash.hash_weigthed_idxmap(&wb);
    //
    let siga = waprobhash.get_signature();
    let sigb = wbprobhash.get_signature();
    let jp_approx = compute_probminhash_jaccard(siga, sigb);

An example of Probminhash3 with items sent one by one:

    let set_size = 100;
    let mut wa = Vec::<f64>::with_capacity(set_size);
    let mut wb = Vec::<f64>::with_capacity(set_size);
    // initialize wa, wb
    ....
    // probminhash
    let mut waprobhash = ProbMinHash3::<usize, FnvHasher>::new(nbhash, 0);
    for i in 0..set_size {
        if wa[i] > 0. {
            waprobhash.hash_item(i, wa[i]);
        }
    }
    //
    let mut wbprobhash = ProbMinHash3::<usize, FnvHasher>::new(nbhash, 0);
    for i in 0..set_size {
        if wb[i] > 0. {
            wbprobhash.hash_item(i, wb[i]);
        }
    }
    let siga = waprobhash.get_signature();
    let sigb = wbprobhash.get_signature();
    let jp_approx = compute_probminhash_jaccard(siga, sigb);
  • Superminhash
      let va : Vec<usize> = (0..1000).collect();
      let vb : Vec<usize> = (900..2000).collect();
      let bh = BuildHasherDefault::<FnvHasher>::default();
      let mut sminhash : SuperMinHash<usize, FnvHasher>= SuperMinHash::new(70, &bh);
      // now compute sketches
      let resa = sminhash.sketch_slice(&va);
      // we decide to reuse sminhash instead of allocating another SuperMinHash structure
      let ska = sminhash.get_hsketch().clone();
      sminhash.reinit();
      let resb = sminhash.sketch_slice(&vb);
      let skb = sminhash.get_hsketch();
      //
      let jac = get_jaccard_index_estimate(&ska, &skb).unwrap();
        ...

License

Licensed under either of

at your option.

Dependencies

~11MB
~211K SLoC