Version 0.1.0 (unstable release), Jul 5, 2020
sketch-duplicates
Find duplicate lines probabilistically.
Motivation
Let's say you have a directory of gzipped text files that you want to check for duplicate lines. The usual way to do this might look something like this:
zcat *.gz | sort | uniq -d
The problem is that this can become very slow for large files. sketch-duplicates provides a way to remove most unique lines, leaving mostly duplicate lines in the output.
sketch-duplicates is probabilistic and is therefore not guaranteed to remove all unique lines. A final sort | uniq -d is still necessary, but it will be much faster because most unique lines have already been removed from its input.
The example above can be rewritten to use a sketch like this:
zcat *.gz | sketch-duplicates build > sketch
zcat *.gz | sketch-duplicates filter sketch | sort | uniq -d
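To make the two passes concrete, here is a minimal Python sketch of the general idea. It assumes a counting, Bloom-filter-style table with saturating counters; this is an illustration only and not necessarily how sketch-duplicates is implemented. The names `SIZE`, `PROBES`, `positions`, `build` and `filter_lines` are hypothetical, as is the choice of hash function.

```python
import hashlib
from collections import Counter

SIZE = 1 << 16   # number of counters (hypothetical, not the tool's default)
PROBES = 4       # counters touched per line (hypothetical)

def positions(line, size, probes):
    """Derive `probes` counter indexes from a single hash of the line."""
    digest = hashlib.blake2b(line, digest_size=4 * probes).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "little") % size
            for i in range(probes)]

def build(lines, size=SIZE, probes=PROBES):
    """First pass: count every line, saturating each counter at 2."""
    sketch = bytearray(size)
    for line in lines:
        for pos in positions(line, size, probes):
            sketch[pos] = min(sketch[pos] + 1, 2)
    return sketch

def filter_lines(lines, sketch, probes=PROBES):
    """Second pass: keep only lines whose counters all reached 2."""
    size = len(sketch)
    return [line for line in lines
            if all(sketch[pos] >= 2 for pos in positions(line, size, probes))]

lines = [b"apple", b"banana", b"apple", b"cherry", b"banana", b"date"]
sketch = build(lines)
candidates = filter_lines(lines, sketch)

# Exact final step, equivalent to `sort | uniq -d` on the filtered lines.
duplicates = sorted(l for l, n in Counter(candidates).items() if n > 1)
print(duplicates)  # [b'apple', b'banana']
```

In this model a duplicated line always passes the filter, because each of its counters is incremented at least twice. A unique line can only slip through when all of its counters collide with other lines, which is why the exact final step is still needed.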
Multiple sketches can be combined using sketch-duplicates combine. This can be used to parallelize the construction of the sketch (here using GNU Parallel):
parallel 'zcat {} | sketch-duplicates build' ::: *.gz | sketch-duplicates combine > sketch
parallel 'zcat {} | sketch-duplicates filter sketch' ::: *.gz | sort | uniq -d
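Why combining works can be illustrated with a small Python model. It assumes, purely for illustration, that a sketch is an array of counters saturating at 2 and that combining adds the counters element-wise. The real on-disk format and merge rule of sketch-duplicates may differ, and `positions`, `build` and `combine` are hypothetical names.

```python
import hashlib

def positions(line, size, probes):
    """Derive `probes` counter indexes from a single hash of the line."""
    digest = hashlib.blake2b(line, digest_size=4 * probes).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "little") % size
            for i in range(probes)]

def build(lines, size=1 << 12, probes=4):
    """Count every line, saturating each counter at 2."""
    sketch = bytearray(size)
    for line in lines:
        for pos in positions(line, size, probes):
            sketch[pos] = min(sketch[pos] + 1, 2)
    return sketch

def combine(sketches):
    """Merge equally sized sketches by saturating element-wise addition."""
    combined = bytearray(len(sketches[0]))
    for sketch in sketches:
        for i, count in enumerate(sketch):
            combined[i] = min(combined[i] + count, 2)
    return combined

# Two "files" processed independently, as parallel jobs would do.
shard_a = [b"apple", b"banana"]
shard_b = [b"apple", b"cherry"]

merged = combine([build(shard_a), build(shard_b)])

# The merged sketch is identical to one built from all lines in one pass.
assert merged == build(shard_a + shard_b)
```

Because saturating addition gives the same result regardless of how the input is split, each file can be sketched by a separate job and the partial sketches merged afterwards.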
Options
-s, --size: Size of the sketch. Increasing this improves filtering accuracy but consumes more memory. The default of 8 MiB is conservative and can often be increased, depending on the use case.
-p, --probes: Number of probes to do in the sketch.
-0, --zero-terminated: Use NUL bytes as line delimiters.
Install
Install Cargo (e.g. using rustup), then run cargo install sketch-duplicates.