7 releases (breaking)
0.7.0 | May 11, 2020 |
---|---|
0.6.0 | Apr 24, 2020 |
0.5.0 | Nov 4, 2018 |
0.4.0 | Sep 9, 2018 |
0.1.0 | Sep 6, 2018 |
#876 in Data structures
4,370 downloads per month
Used in 8 crates
(5 directly)
270KB
4.5K
SLoC
probabilistic-collections-rs
probabilistic-collections
contains various implementations of collections that use approximations
to improve on running time or memory, but introduce a certain amount of error. The error can be
controlled under a certain threshold which makes these data structures extremely useful for big data
and streaming applications.
The following types of collections are implemented:
- Approximate Membership in Set:
BloomFilter
,PartitionedBloomFilter
,CuckooFilter
,QuotientFilter
- Scalable Approximate Membership in Set:
ScalableBloomFilter
,ScalableCuckooFilter
- Approximate Membership in Stream:
BSBloomFilter
,BSSDBloomFilter
,RLBSBloomFilter
- Approximate Item Count:
CountMinSketch
- Approximate Distinct Item Count:
HyperLogLog
- Set similarity:
MinHash
,SimHash
Usage
Add this to your Cargo.toml
:
[dependencies]
probabilistic-collections = "*"
For serde
support, include the serde
feature:
[dependencies]
probabilistic-collections = { version = "*", features = ["serde"] }
Add this to your crate root if you are using Rust 2015:
extern crate probabilistic_collections;
Caveats
If you are using this crate to create collections to be used across different platforms, you must be
careful not to use keys of type [T]
. The Rust standard library implementation of Hash
for [T]
first hashes the length of the slice, then the contents of the slice. The length is platform
specific because it is of type usize
. Therefore, a collection with keys of type [T]
will have
unexpected results when used across platforms. For example, if you were to generate and serilize a
BloomFilter
on i686 compiled code where the keys are of type [u8]
, the filter will not have the
correct results when deserialized on x86_64 compiled code.
The recommended work-around to this problem is to a define a wrapper struct around a [T]
with a
Hash
implementation that only hashes the contents of the slice.
Changelog
See CHANGELOG for more details.
References
- Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams
Bera, Suman K., Sourav Dutta, Ankur Narang, and Souvik Bhattacherjee. 2012. "Advanced Bloom Filter Based Algorithms for Efficient Approximate Data de-Duplication in Streams." CoRR abs/1212.3964. http://arxiv.org/abs/1212.3964.
- An improved data stream summary: the count-min sketch and its applications
Cormode, Graham, and S. Muthukrishnan. 2005. "An Improved Data Stream Summary: The Count-Min Sketch and Its Applications." J. Algorithms 55 (1). Duluth, MN, USA: Academic Press, Inc.: 58--75. https://doi.org/10.1016/j.jalgor.2003.12.001.
- Cuckoo Filter: Practically Better Than Bloom
Fan, Bin, Dave G. Andersen, Michael Kaminsky, and Michael D. Mitzenmacher. 2014. "Cuckoo Filter: Practically Better Than Bloom." In Proceedings of the 10th Acm International on Conference on Emerging Networking Experiments and Technologies, 75--88. CoNEXT '14. New York, NY, USA: ACM. https://doi.org/10.1145/2674005.2674994.
- Don't thrash: how to cache your hash on flash
Bender, Michael A., Martin Farach-Colton, Rob Johnson, Russell Kraner, Bradley C. Kuszmaul, Dzejla Medjedovic, Pablo Montes, Pradeep Shetty, Richard P. Spillane, and Erez Zadok. 2012. "Don'T Thrash: How to Cache Your Hash on Flash." Proc. VLDB Endow. 5 (11). VLDB Endowment: 1627--37. https://doi.org/10.14778/2350229.2350275.
- HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm
Heule, Stefan, Marc Nunkesser, and Alexander Hall. 2013. "HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, 683--92. EDBT '13. New York, NY, USA: ACM. https://doi.org/10.1145/2452376.2452456.
- HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. "Hyperloglog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm." In IN Aofa '07: PROCEEDINGS of the 2007 International Conference on Analysis of Algorithms.
- Less hashing, same performance: Building a better Bloom filter
Kirsch, Adam, and Michael Mitzenmacher. 2008. "Less Hashing, Same Performance: Building a Better Bloom Filter." Random Struct. Algorithms 33 (2). New York, NY, USA: John Wiley & Sons, Inc.: 187--218. https://doi.org/10.1002/rsa.v33:2.
- Min-wise independent permutations (extended abstract)
Broder, Andrei Z., Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. 1998. "Min-Wise Independent Permutations (Extended Abstract)." In Proceedings of the Thirtieth Annual Acm Symposium on Theory of Computing, 327--36. STOC '98. New York, NY, USA: ACM. https://doi.org/10.1145/276698.276781.
- Probabilistic near-duplicate detection using simhash
Sood, Sadhan, and Dmitri Loguinov. 2011. "Probabilistic Near-Duplicate Detection Using Simhash." In Proceedings of the 20th Acm International Conference on Information and Knowledge Management, 1117--26. CIKM '11. New York, NY, USA: ACM. https://doi.org/10.1145/2063576.2063737.
- Scalable Bloom Filters
Almeida, Paulo Sérgio, Carlos Baquero, Nuno Preguiça, and David Hutchison. 2007. "Scalable Bloom Filters." Inf. Process. Lett. 101 (6). Amsterdam, The Netherlands, The Netherlands: Elsevier North-Holland, Inc.: 255--61. https://doi.org/10.1016/j.ipl.2006.10.007.
License
probabilistic-collections-rs
is dual-licensed under the terms of either the MIT License or the
Apache License (Version 2.0).
See LICENSE-APACHE and LICENSE-MIT for more details.
Dependencies
~1.8–2.6MB
~49K SLoC