Lexmatch
This is a simple lexicon matching tool that, given a lexicon of words or phrases, identifies all matches in a given target text, returning their exact positions. It can be used to compute a frequency list for a lexicon on a target corpus.
The implementation uses suffix arrays or hash tables. The text must be
plain-text UTF-8. The former implementation (the default) is limited to
2^32 bytes (about 4GB); the latter implementation (--tokens/--cjk) has no
such limit. The offsets in the output are UTF-8 byte positions.
This tool only does exact (or case-insensitive) matching. If you need fuzzy matching against lexicons, check out analiticcl instead.
Installation
You can build and install the latest stable release using Rust's package manager:
cargo install lexmatch
or if you want the development version after cloning this repository:
cargo install --path .
No cargo/rust on your system yet? Run sudo apt install cargo on Debian/Ubuntu
based systems, brew install rust on macOS, or use rustup.
Usage
See lexmatch --help.
Simple example:
$ lexmatch --lexicon lexicon.lst corpus.txt
The lexicon must be plain-text UTF-8 containing one entry per line; an entry need not be a single word and is not constrained in length. If the lexicon consists of Tab-Separated Values (TSV), only the first column is considered; the rest is ignored.
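For illustration, a minimal hypothetical lexicon.lst could look like the following; the entries and the second TSV column (which would be ignored) are made up here:
good	positive
bad	negative
quite good	positive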
Instead of a lexicon, you can also provide the patterns to query on the command line using --query.
By default, you will get TSV output with a column for the text, one for the occurrence count, and one with the begin position (UTF-8 byte position) for each match, so the number of columns is dynamic:
$ lexmatch --query good --query bad /tmp/republic.short.txt
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 4 193 3307 3480 278
bad 3 201 3315 3488
Matching is case-sensitive by default; add --no-case for case-insensitive
behaviour (all input and output will be lowercased, which may in rare cases
cause the UTF-8 offsets to no longer be valid for the original text).
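For instance, a case-insensitive run against the same corpus could look as
follows (a sketch; the mixed-case query terms are arbitrary examples):
$ lexmatch --no-case --query Good --query BAD /tmp/republic.short.txt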
For verbose output, add --verbose. This produces cleaner TSV (tab-separated
values) output that you can easily import into, for example, the STAM tools:
$ lexmatch --verbose --query good --query bad /tmp/republic.short.txt
Text BeginUtf8Offset EndUtf8Offset
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 193 197
good 3307 3311
good 3480 3484
good 278 282
bad 201 204
bad 3315 3318
bad 3488 3491
You may provide multiple lexicons as well as multiple text files; in such cases the output will also include the lexicon and/or text file. If multiple lexicons match, they are all returned (delimited by a semicolon). The order of the results is arbitrary.
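For example, a run with two lexicons over two corpus files might be invoked as
follows (the file names here are hypothetical placeholders):
$ lexmatch --lexicon en.lst --lexicon de.lst corpus1.txt corpus2.txt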
If you don't care about the exact positions but rather want to compute a
frequency list with the number of occurrences for each item in the lexicon or
passed through --query, then pass --count-only:
$ lexmatch --count-only --query good --query bad /tmp/republic.short.txt
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 4
bad 3
You can configure a minimum frequency threshold using --freq.
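As a sketch (assuming --freq takes the threshold as its argument and that the
threshold is inclusive), the following would report only items occurring at
least 4 times, i.e. good (4 occurrences) but not bad (3) in the example corpus:
$ lexmatch --count-only --freq 4 --query good --query bad /tmp/republic.short.txt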
Rather than match all of the lexicon against the text, you can also iterate
over tokens in the text and check if they occur in the lexicon. This uses a
hash map instead of a suffix array and is typically faster. It is more limited,
however, and cannot be used with frequency thresholds or counting. It will
always produce verbose output (similar to --verbose):
$ lexmatch --tokens --query good --query bad /tmp/republic.short.txt
Text BeginUtf8Offset EndUtf8Offset
Reading text from /tmp/republic.short.txt...
good 193 197
bad 201 204
good 278 282
good 3307 3311
bad 3315 3318
good 3480 3484
bad 3488 3491
Unlike before, you will find the matches are now returned in reading order.
If you add --coverage, you will get an extra last line with some coverage
statistics. This is useful to see how much of the text is covered by your
lexicon.
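For instance, adding the flag to the earlier --tokens invocation (a sketch;
the exact combination of flags is an assumption) would end with a line like:
$ lexmatch --tokens --coverage --query good --query bad /tmp/republic.short.txt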
#coverage (tokens) = 7/627 = 0.011164274322169059
Coverage can also be computed line-by-line, matching against multiple lexicons. We can also read directly from stdin rather than from a file by passing - as the filename:
$ echo "Is this good or bad?\nIt is quite good." | lexmatch --coverage-matrix --query good --query bad -
Reading text from -...
Line query
Is this good or bad? 0.4
It is quite good. 0.25
This can be used as a simple lexicon-based method for language detection:
$ echo "Do you know what language this is?\nUnd was ist das hier genau?\nÇa va assez bien je crois" | lexmatch -i --coverage-matrix --lexicon ~X/en.lst --lexicon ~X/de.lst --lexicon ~X/fr.lst -
Reading lexicon...
Reading lexicon...
Reading lexicon...
Reading text from -...
Line /home/proycon/exp/en.lst /home/proycon/exp/de.lst /home/proycon/exp/fr.lst Total
do you know what language this is? 1 0 0.14285714285714285 1.1428571428571428
und was ist das hier genau? 0.16666666666666666 0.8333333333333334 0.16666666666666666 1.1666666666666667
ça va assez bien je crois 0.2 0.2 0.6 1
When using --tokens (or --coverage-matrix), we rely on whitespace and
punctuation to delimit tokens. This does not work for languages such as
Chinese, Japanese and Korean that are not delimited in such a way. For such
languages, similar linear search behaviour can be attained by passing --cjk
instead, with an integer value representing the maximum character length to
explore. A greedy search will then be performed that favours longer patterns
over shorter ones.
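As a sketch (assuming --cjk takes the maximum character length directly as its
argument, and using a hypothetical Chinese lexicon zh.lst), such an invocation
could look like:
$ lexmatch --cjk 4 --lexicon zh.lst corpus.zh.txt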