#bloom-filter #collection #probabilistic #approximate #big-data #count-min-sketch

probabilistic-collections

Various implementations of collections that use approximations to improve on running time or memory, but introduce a certain amount of error

7 releases (breaking)

0.7.0 May 11, 2020
0.6.0 Apr 24, 2020
0.5.0 Nov 4, 2018
0.4.0 Sep 9, 2018
0.1.0 Sep 6, 2018

#876 in Data structures

Download history 876/week @ 2024-07-26 624/week @ 2024-08-02 702/week @ 2024-08-09 1246/week @ 2024-08-16 853/week @ 2024-08-23 1393/week @ 2024-08-30 1174/week @ 2024-09-06 910/week @ 2024-09-13 1217/week @ 2024-09-20 762/week @ 2024-09-27 861/week @ 2024-10-04 977/week @ 2024-10-11 718/week @ 2024-10-18 1087/week @ 2024-10-25 918/week @ 2024-11-01 1425/week @ 2024-11-08

4,370 downloads per month
Used in 8 crates (5 directly)

MIT/Apache

270KB
4.5K SLoC

probabilistic-collections-rs

Documentation License: MIT License: Apache 2.0 Pipeline Status Coverage Report

probabilistic-collections contains various implementations of collections that use approximations to improve on running time or memory, but introduce a certain amount of error. The error can be controlled under a certain threshold which makes these data structures extremely useful for big data and streaming applications.

The following types of collections are implemented:

  • Approximate Membership in Set: BloomFilter, PartitionedBloomFilter, CuckooFilter, QuotientFilter
  • Scalable Approximate Membership in Set: ScalableBloomFilter, ScalableCuckooFilter
  • Approximate Membership in Stream: BSBloomFilter, BSSDBloomFilter, RLBSBloomFilter
  • Approximate Item Count: CountMinSketch
  • Approximate Distinct Item Count: HyperLogLog
  • Set similarity: MinHash, SimHash

Usage

Add this to your Cargo.toml:

[dependencies]
probabilistic-collections = "*"

For serde support, include the serde feature:

[dependencies]
probabilistic-collections = { version = "*", features = ["serde"] }

Add this to your crate root if you are using Rust 2015:

extern crate probabilistic_collections;

Caveats

If you are using this crate to create collections to be used across different platforms, you must be careful not to use keys of type [T]. The Rust standard library implementation of Hash for [T] first hashes the length of the slice, then the contents of the slice. The length is platform specific because it is of type usize. Therefore, a collection with keys of type [T] will have unexpected results when used across platforms. For example, if you were to generate and serilize a BloomFilter on i686 compiled code where the keys are of type [u8], the filter will not have the correct results when deserialized on x86_64 compiled code.

The recommended work-around to this problem is to a define a wrapper struct around a [T] with a Hash implementation that only hashes the contents of the slice.

Changelog

See CHANGELOG for more details.

References

License

probabilistic-collections-rs is dual-licensed under the terms of either the MIT License or the Apache License (Version 2.0).

See LICENSE-APACHE and LICENSE-MIT for more details.

Dependencies

~1.8–2.6MB
~49K SLoC