#edit-distance #nlp #lemmatization #linguistics #command-line-tool

bin+lib sesdiff

Generates a shortest edit script (Myers' diff algorithm) to indicate how to get from the strings in column A to the strings in column B. Also provides the edit distance (levenshtein).

8 releases

0.3.1 Oct 14, 2024
0.3.0 Apr 22, 2021
0.2.0 Apr 21, 2021
0.1.5 Sep 8, 2020
0.1.2 Aug 14, 2020

#824 in Text processing


Used in analiticcl

GPL-3.0+

42KB
803 lines

Crate GitHub build GitHub release Project Status: Active – The project has reached a stable, usable state and is being actively developed.

sesdiff: Shortest Edit Script Diff

Description

This is a small and fast command line tool and Rust library that reads a two-column tab separated input from standard input and computes the shortest edit script (Myers' diff algorithm) to go from the string in column A to the string in column B. It also computes the edit distance (aka levenshtein distance).

There is also a python binding available if you want to use sesdiff from Python. The documentation here covers the command-line version.

It was written to build lemmatisers.

Installation

Install it using Rust's package manager:

cargo install sesdiff

No cargo/rust on your system yet? Do sudo apt install cargo on Debian/ubuntu based systems, brew install rust on mac, or use rustup.

This tool builds upon Dissimilar that provides the actual diff algorithm (will be downloaded and compiled in automatically).

Usage

$ sesdiff < input.tsv

Example input and output (reformatted for legibility, the first two columns correspond to the input). Output is in a four-column tab separated format:

hablaron        hablar     =[hablar]-[on]                  2
contaron        contar     =[contar]-[on]                  2
pidieron        pedir      =[p]-[i]+[e]=[di]-[eron]+[r]    6
говорим         говорить   =[говори]-[м]+[ть]              3

By default the full edit script will be provided in a simple language:

  • =[] - The text between brackets is identical in strings A and B
    • =[#n] - If you use the --abstract parameter, this will be used instead, where n represents a number indicating the length of text between that is identical in strings A and B
  • -[] - The text between brackets is removed to get to string B
  • +[] - The text between brackets is added to get to string B

For lemmatisation purposes, it makes sense for many languages to look at suffixes (from right to left) and strip common prefixes. Pass the --suffix option for that behaviour and output is now:

$ sesdiff --suffix < input.tsv
hablaron        hablar          -[on]                      2
contaron        contar          -[on]                      2
pidieron        pedir           -[eron]+[r]=[di]-[i]+[e]   6
говорим         говорить        -[м]+[ть]                  3

Note that the edit scripts in suffix mode are formulated differently than in normal mode (they start from the right too). There is also a --prefix option that strips common suffixes.

Use the --abstract parameter to get a slightly more abstract edit script that refers to the length of unchanged parts rather than their contents. You would then get:

pidieron        pedir           -[eron]+[r]=[#2]-[i]+[e]   6

Sesdiff can also apply edit scripts to our input, use the --apply flag and feed the tool tab separated input with a string in the first column and an edit script in the second, as in the the following example input2.tsv:

$ cat input2.tsv
pidieron        -[eron]+[r]=[di]-[i]+[e]

Run sesdiff as follows and a third column will be added with the solution:

$ sesdiff --suffix --apply < input2.tsv
pidieron        -[eron]+[r]=[di]-[i]+[e]                pedir

When using --apply, you can also make use of an extra --infix parameter to indicate that an edit script must be attempted to be matched with any infix in the string, including multiple. Consider the following example that replaces all letters a with o:

$ cat input3.tsv
hahaha       -[a]+[o]

$ sesdiff --infix --apply < input3.tsv
hahaha       -[a]+[o]	hohoho

In --apply mode, you can also make edit scripts applicable to multiple patterns by using the | operator. This is only allowed for deletions (-[]) and equality checks (=[]):

$ cat input4.tsv
hihaho       -[a|i|o]+[e]

$ sesdiff --infix --apply < input4.tsv
hihaho       -[a|i|o]+[e]	hehehe

License

GNU General Public Licence v3

Dependencies

~2MB
~27K SLoC