5 releases

0.2.3	Nov 23, 2024
0.2.2	May 28, 2024
0.2.1	Apr 14, 2023
0.2.0	Jan 26, 2023
0.1.1	May 17, 2022

#787 in Algorithms

Unicode-3.0

87KB
1.5K SLoC

bies

The algorithms in this project convert from a BIES matrix (the output of the LSTM segmentation neural network) to concrete segment boundaries. In BIES, B = beginning of segment; I = inside segment; E = end of segment; and S = single segment (both beginning and end).

These algorithms always produce valid breakpoint positions (at grapheme cluster boundaries); they don't assume that the neural network always predicts valid positions.

Example

For example, suppose you had the following BIES matrix:

|   B   |   I   |   E   |   S   |
|-------|-------|-------|-------|
| 0.01  | 0.01  | 0.01  | 0.97  |
| 0.97  | 0.01  | 0.01  | 0.01  |
| 0.01  | 0.97  | 0.01  | 0.01  |
| 0.01  | 0.97  | 0.01  | 0.01  |
| 0.01  | 0.01  | 0.97  | 0.01  |
| 0.01  | 0.01  | 0.01  | 0.97  |
| 0.97  | 0.01  | 0.01  | 0.01  |
| 0.01  | 0.01  | 0.97  | 0.01  |

This matrix resolves to:

01234567
SBIIESBE

The breakpoints are then: 0, 1, 5, and 8 (four segments).

However, it could be the case that the algorithm's BIES are invalid. For example, "BEE" is invalid, because the second "E" does not terminate any word. The purpose of the algorithms in this project is to guarantee that valid breakpoints and BIES are always outputted.

Algorithms

The following algorithms are implemented:

1a: Step through each grapheme cluster boundary in the string. Look at the BIES vectors for the code points surrounding the boundary. The only valid results at that boundary are {EB, ES, SB, SS} (breakpoint) or {II, BI, IE, BE} (no breakpoint). Take the sum of the valid breakpoint and no-breakpoint probabilities, and decide whether to insert a breakpoint based on which sum is higher. Repeat for all grapheme cluster boundaries in the string. The output is a list of word boundaries, which can be converted back into BIES if desired.

1b: Same as 1a, but instead of taking the sum, take the individual maximum.

2a: Step through each element in the BIES sequence. For each element, look at the triplet containing the element and both of its neighbors. By induction, assume the first element in the triplet is correct. Now, depending on whether there is a code point boundary following the element, calculate the probabilities of all valid BIES for the triplet, and based on those results, pick the most likely value for the current element.

3a: Exhaustively check the probabilities of all possible BIES for the string. This algorithm has exponential runtime.

More Information

For more information on development, authorship, contributing etc. please visit ICU4X home page.

Dependencies

~2.5MB
~52K SLoC