1 unstable release

0.1.0 Jul 5, 2020

#28 in #sketch

MIT license

15KB
299 lines

sketch-duplicates

Find duplicate lines probabilistically.

Motivation

Let's say you have a directorty of gzipped text files that you want to check for duplicate lines in. The usual way to do this might look something like this:

zcat *.gz | sort | uniq -d

The problem with this, is that it can become very slow for large files. sketch-duplicates provides a way to remove most unique lines, leaving mostly duplicate lines in the output. sketch-duplicates is probabilistic and is therefore not guarenteed to remove all unique lines. It is therefore still necessary to have a sort | uniq -d in the end but this will be much faster due to the input having most unique lines removed.

The above can example can be written to use a sketch like this:

zcat *.gz | sketch-duplicates build > sketch
zcat *.gz | sketch-duplicates filter sketch | sort | uniq -d

Multiple sketches can be combined using sketch-duplicates combine. This can be used to parallelize the construction of the sketch (here using GNU Parallel):

echo *.gz | parallel 'zcat {} | sketch-duplicates build' | sketch-duplicates combine > sketch
echo *.gz | parallel 'zcat {} | sketch-duplicates filter sketch' | sort | uniq -d

Options

  • -s, --size: Size of the sketch. Increasing this improves filtering accuracy but consumes more memory. This is set to a conservative default of 8MiB and can often be increased depending on the specific use case.
  • -p, --probes: Number of probes to do in the sketch.
  • -0, --zero-terminated: Use NULL bytes as line delimiters.

Install

Install Cargo (eg. using rustup), then run cargo install sketch-duplicates.

Dependencies

~4MB
~73K SLoC