Lexmatch
This is a simple lexicon matching tool that, given a lexicon of words or phrases, identifies all matches in a given target text, returning their exact positions. It can be used to compute a frequency list for a lexicon on a target corpus.
The implementation uses suffix arrays or hash tables. The text must be
plain-text UTF-8. The former implementation (the default) is limited to
2^32 bytes (about 4GB); the latter implementation (--tokens/--cjk) has no
such limit. The offsets in the output are UTF-8 byte positions.
This tool only does exact (or case-insensitive) matching. If you need fuzzy matching against lexicons, check out analiticcl instead.
Installation
You can build and install the latest stable release using Rust's package manager:
cargo install lexmatch
or if you want the development version after cloning this repository:
cargo install --path .
No cargo/rust on your system yet? Run sudo apt install cargo on Debian/Ubuntu
based systems, brew install rust on macOS, or use rustup.
Usage
See lexmatch --help.
Simple example:
$ lexmatch --lexicon lexicon.lst corpus.txt
The lexicon must be plain-text UTF-8 containing one entry per line; an entry need not be a single word and is not constrained in length. If the lexicon consists of Tab-Separated Values (TSV), only the first column is considered; the rest is ignored.
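For illustration, a minimal hypothetical lexicon.lst could look like the following; the entries and the second TSV column (which would be ignored) are made up here:
good	positive
bad	negative
quite good	positive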
Instead of a lexicon, you can also provide the patterns to query on the command line using --query.
By default, you will get TSV output with a column for the text, one for the occurrence count, and one with the begin position (UTF-8 byte position) for each match, so the number of columns is dynamic:
$ lexmatch --query good --query bad /tmp/republic.short.txt
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 4 193 3307 3480 278
bad 3 201 3315 3488
Matching is case-sensitive by default; add --no-case for case-insensitive
behaviour (all input and output will be lowercased, which may in rare cases
cause the UTF-8 offsets to no longer be valid for the original text).
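For instance, a case-insensitive run against the same corpus could look as
follows (a sketch; the mixed-case query terms are arbitrary examples):
$ lexmatch --no-case --query Good --query BAD /tmp/republic.short.txt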
For verbose output, add --verbose. This produces cleaner TSV (tab-separated
values) output that you can easily import into, for example, the STAM tools:
$ lexmatch --verbose --query good --query bad /tmp/republic.short.txt
Text BeginUtf8Offset EndUtf8Offset
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 193 197
good 3307 3311
good 3480 3484
good 278 282
bad 201 204
bad 3315 3318
bad 3488 3491
You may provide multiple lexicons as well as multiple text files; in such cases the output will also include the lexicon and/or text file. If multiple lexicons match, they are all returned (delimited by a semicolon). The order of the results is arbitrary.
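For example, a run with two lexicons over two corpus files might be invoked as
follows (the file names here are hypothetical placeholders):
$ lexmatch --lexicon en.lst --lexicon de.lst corpus1.txt corpus2.txt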
If you don't care about the exact positions but rather want to compute a
frequency list with the number of occurrences for each item in the lexicon or
passed through --query, then pass --count-only:
$ lexmatch --count-only --query good --query bad /tmp/republic.short.txt
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 4
bad 3
You can configure a minimum frequency threshold using --freq.
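As a sketch (assuming --freq takes the threshold as its argument and that the
threshold is inclusive), the following would report only items occurring at
least 4 times, i.e. good (4 occurrences) but not bad (3) in the example corpus:
$ lexmatch --count-only --freq 4 --query good --query bad /tmp/republic.short.txt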
Rather than match all of the lexicon against the text, you can also iterate
over tokens in the text and check if they occur in the lexicon. This uses a
hash map instead of a suffix array and is typically faster. It is more limited,
however, and cannot be used with frequency thresholds or counting. It will
always produce verbose output (similar to --verbose):
$ lexmatch --tokens --query good --query bad /tmp/republic.short.txt
Text BeginUtf8Offset EndUtf8Offset
Reading text from /tmp/republic.short.txt...
good 193 197
bad 201 204
good 278 282
good 3307 3311
bad 3315 3318
good 3480 3484
bad 3488 3491
Unlike before, you will find the matches are now returned in reading order.
If you add --coverage, you will get an extra last line with some coverage
statistics. This is useful to see how much of the text is covered by your
lexicon.
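For instance, adding the flag to the earlier --tokens invocation (a sketch;
the exact combination of flags is an assumption) would end with a line like:
$ lexmatch --tokens --coverage --query good --query bad /tmp/republic.short.txt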
#coverage (tokens) = 7/627 = 0.011164274322169059
Coverage can also be computed line-by-line, matching against multiple lexicons. We can also read directly from stdin rather than from a file by passing - as the filename:
$ echo "Is this good or bad?\nIt is quite good." | lexmatch --coverage-matrix --query good --query bad -
Reading text from -...
Line query
Is this good or bad? 0.4
It is quite good. 0.25
This can be used as a simple lexicon-based method for language detection:
$ echo "Do you know what language this is?\nUnd was ist das hier genau?\nÇa va assez bien je crois" | lexmatch -i --coverage-matrix --lexicon ~X/en.lst --lexicon ~X/de.lst --lexicon ~X/fr.lst -
Reading lexicon...
Reading lexicon...
Reading lexicon...
Reading text from -...
Line /home/proycon/exp/en.lst /home/proycon/exp/de.lst /home/proycon/exp/fr.lst Total
do you know what language this is? 1 0 0.14285714285714285 1.1428571428571428
und was ist das hier genau? 0.16666666666666666 0.8333333333333334 0.16666666666666666 1.1666666666666667
ça va assez bien je crois 0.2 0.2 0.6 1
When using --tokens (or --coverage-matrix), we rely on whitespace and
punctuation to delimit tokens. This does not work for languages such as
Chinese, Japanese and Korean that are not delimited in such a way. For such
languages, similar linear search behaviour can be attained by passing --cjk
instead, with an integer value representing the maximum character length to
explore. A greedy search will then be performed that favours longer patterns
over shorter ones.
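As a sketch (assuming --cjk takes the maximum character length directly as its
argument, and using a hypothetical Chinese lexicon zh.lst), such an invocation
could look like:
$ lexmatch --cjk 4 --lexicon zh.lst corpus.zh.txt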