#unicode-text #unicode #grapheme #boundary #word #character #text

unic-segment

UNIC — Unicode Text Segmentation Algorithms

3 releases (breaking)

0.9.0 Mar 3, 2019
0.8.0 Jan 2, 2019
0.7.0 Feb 7, 2018

#339 in Internationalization (i18n)

424,914 downloads per month
Used in 850 crates (8 directly)

MIT/Apache

110KB
1.5K SLoC

This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting the boundaries of text elements, such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences.

Notes

Initial code for this component is based on unicode-segmentation.


lib.rs:

UNIC — Unicode Text Segmentation Algorithms

A component of unic: Unicode and Internationalization Crates for Rust.

This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting the boundaries of text elements, such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences (sentence segmentation is not implemented yet).

Examples

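use unic_segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};

// Each base letter stays together with the combining marks that follow it.
assert_eq!(
    Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
    &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
);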

assert_eq!(
    Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
    &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
);

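// Same clusters as the first example, written with escapes so the combining
// marks survive copy and paste; each tuple holds the starting byte offset.
assert_eq!(
    GraphemeIndices::new("a\u{310}e\u{301}o\u{308}\u{332}\r\n").collect::<Vec<(usize, &str)>>(),
    &[(0, "a\u{310}"), (3, "e\u{301}"), (6, "o\u{308}\u{332}"), (11, "\r\n")]
);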
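The offsets 0, 3, 6, and 11 in the GraphemeIndices example are UTF-8 byte positions, not character counts. The following is a minimal standard-library sketch of that arithmetic; it does not use unic-segment, and the helper name `cluster_offsets` is hypothetical, introduced only for illustration. It assumes the clusters are already known and simply sums their byte lengths.

```rust
/// Hypothetical helper: given grapheme clusters in order, return the byte
/// offset at which each one starts in the concatenated string.
fn cluster_offsets(clusters: &[&str]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(clusters.len());
    let mut pos = 0;
    for cluster in clusters {
        offsets.push(pos);
        // str::len returns the length in UTF-8 bytes, not in chars.
        pos += cluster.len();
    }
    offsets
}

fn main() {
    // The same clusters as in the example, written with escapes.
    let clusters = ["a\u{310}", "e\u{301}", "o\u{308}\u{332}", "\r\n"];

    // Each combining mark used here takes two bytes in UTF-8.
    assert_eq!('\u{310}'.len_utf8(), 2);

    // 1 + 2 = 3 bytes, 1 + 2 = 3 bytes, 1 + 2 + 2 = 5 bytes.
    assert_eq!(cluster_offsets(&clusters), vec![0, 3, 6, 11]);
}
```

Because the offsets are byte positions, they can be used directly to slice the original `&str`.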

fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}

assert_eq!(
    Words::new(
        "The quick (\"brown\") fox can't jump 32.3 feet, right?",
        has_alphanumeric,
    ).collect::<Vec<&str>>(),
    &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
);

assert_eq!(
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);
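Comparing the Words and WordBounds examples suggests that Words yields the word-boundary segments that pass the supplied predicate. The sketch below illustrates that relationship using only the standard library; it is an assumption about the behavior drawn from the examples, not the crate's implementation, and `keep_words` is a hypothetical helper. The segment list is copied from the WordBounds example rather than computed.

```rust
/// Same predicate as in the Words example: true if any char is alphanumeric.
fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}

/// Hypothetical helper: keep only the segments accepted by the predicate.
fn keep_words<'a>(segments: &[&'a str]) -> Vec<&'a str> {
    segments.iter().copied().filter(has_alphanumeric).collect()
}

fn main() {
    // Segments as listed in the WordBounds example.
    let segments = [
        "The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox",
    ];

    // Whitespace and punctuation segments are dropped.
    assert_eq!(keep_words(&segments), vec!["The", "quick", "brown", "fox"]);
}
```

A different predicate, for example one that only rejects whitespace, would keep the punctuation segments as well.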

assert_eq!(
    WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
    &[
        (0, "Brr"),
        (3, ","),
        (4, " "),
        (5, "it's"),
        (9, " "),
        (10, "29.3"),
        (14, "°"),
        (16, "F"),
        (17, "!")
    ]
);

Dependencies