#key-value #binary-search #search-file #map #file #cargo-toml

yanked catalog

A file-based map to store key/value pairs in a file and get them using binary search

Uses old Rust 2015

0.1.2 Sep 9, 2017
0.1.1 Jul 28, 2016
0.1.0 Jul 25, 2016

#25 in #search-file

MIT license

29KB
313 lines

rust-catalog

A "file-backed" map, which inserts keys and values into a file in O(n) time, and gets the values in O(log-n) time using binary search and file seeking. For now, it only supports insertion and getting of (hashable) keys and values that implement the Display and FromStr traits (i.e., those which can be converted to string and parsed back from string).

See the module documentation for more information.

Usage

Note that this was an experiment, and so use it at your own risk!

Add the following to your Cargo.toml...

catalog = "0.1.2"

Have a look at the detailed example for the precise usage.


lib.rs:

The HashMap and BTreeMap in the standard library offer very good performance when it comes to inserting and getting stuff, but they're memory killers. If the "stuff" gets large - say, a trillion (1012) of them, then we're gonna be in trouble, as we'll then be needing gigs of RAM to hold the data.

Moreover, once the program quits, all the hard-earned stuff gets deallocated, and we'd have to re-insert them allover again. HashFile deals with this specific problem. It makes use of a BTreeMap for storing the keys and values. So, until it reaches the defined capacity, it offers the same performance as that of the btree-map. However, once (and whenever) it reaches the capacity, it flushes the stuff to a file (the necessary parameters can be defined in its methods).

Hence, at any given moment, the upper limit for the memory eaten by this thing is set by its capacity. This gives us good control over the space-time trade-off. But, the flushing will take O(2n) time, depending on the processor and I/O speed, as it does things on the fly with the help of iterators.

After the final manual flush, the file can be stored, moved around, and since it makes use of binary search, values can be obtained in O(log-n) time whenever they're required (depending on the seeking speed of the drive). For example, an average seek takes around 0.3 ms, and a file containing a trillion values demands about 40 seeks (in the worse case), which translates to 12 ms.

This kind of "search and seek" is already being used by databases. But, the system is simply an unnecessary complication if you just want a table with a zillion rows and only two columns (a key and a value).

See the HashFile type for more info.

Dependencies

~57KB