#unicode #grapheme #reader #unicode-text #codepoint #text

unicode_reader

Adaptors which wrap byte-oriented readers and yield the UTF-8 data as Unicode code points or grapheme clusters

5 releases (3 stable)

1.0.2 Nov 19, 2021
1.0.1 Jun 24, 2020
1.0.0 Jun 14, 2019
0.1.1 Feb 7, 2018
0.1.0 Sep 20, 2016

#1310 in Text processing

362 downloads per month
Used in 14 crates (9 directly)

MIT/Apache

20KB
243 lines

Adaptors which wrap byte-oriented readers and yield the UTF-8 data as Unicode code points or grapheme clusters.

Unlike Unicode parsers that work on in-memory strings (for instance, unicode_segmentation, upon which this crate is built), this crate works on streams and doesn't require reading the entire input into memory. Instead, it yields graphemes or code points as it reads them.
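To make the streaming idea concrete, here is a minimal standard-library sketch of decoding code points incrementally from any `Read` source. It is not this crate's implementation: `read_code_point` and `decode_all` are hypothetical helpers introduced only for illustration, and the sketch assumes that one small read per code point is acceptable.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of unicode_reader): reads a single UTF-8
/// encoded code point from `reader`, pulling only the bytes it needs.
/// Returns `Ok(None)` on a clean end of input.
fn read_code_point<R: Read>(reader: &mut R) -> io::Result<Option<char>> {
    let mut buf = [0u8; 4];

    // Read the lead byte; zero bytes read means the stream has ended.
    if reader.read(&mut buf[..1])? == 0 {
        return Ok(None);
    }

    // The lead byte tells us how long the encoded sequence is.
    let len = match buf[0] {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "invalid UTF-8 lead byte",
            ))
        }
    };

    // Fetch the continuation bytes; a truncated sequence surfaces as an error.
    reader.read_exact(&mut buf[1..len])?;

    // Let the standard library validate the complete sequence.
    let s = std::str::from_utf8(&buf[..len])
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(s.chars().next())
}

/// Hypothetical helper: drains `reader`, collecting every code point.
/// A real streaming consumer would process each code point as it arrives
/// instead of collecting them.
fn decode_all<R: Read>(mut reader: R) -> io::Result<Vec<char>> {
    let mut out = Vec::new();
    while let Some(c) = read_code_point(&mut reader)? {
        out.push(c);
    }
    Ok(out)
}
```

At most four bytes are held at any time, so the input can be arbitrarily large. Because the sketch issues very small reads, an unbuffered source such as a file or socket would usually be wrapped in `std::io::BufReader` first.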

extern crate unicode_reader;
// Both adaptors are used below, so both must be imported.
use unicode_reader::{CodePoints, Graphemes};

use std::io::Cursor;

fn main() {
    let input = Cursor::new("He\u{302}\u{320}llo");
    let mut graphemes = Graphemes::from(input);
    assert_eq!("H",                 graphemes.next().unwrap().unwrap());
    assert_eq!("e\u{302}\u{320}",   graphemes.next().unwrap().unwrap()); // one grapheme cluster, 3 code points
    assert_eq!("l",                 graphemes.next().unwrap().unwrap());
    assert_eq!("l",                 graphemes.next().unwrap().unwrap());
    assert_eq!("o",                 graphemes.next().unwrap().unwrap());
    assert!(graphemes.next().is_none());

    let greek_bytes = vec![0xCE, 0xA7, 0xCE, 0xB1, 0xCE, 0xAF, 0xCF, 0x81, 0xCE, 0xB5,
                           0xCF, 0x84, 0xCE, 0xB5];
    // `map` consumes the iterator, so the binding does not need to be `mut`.
    let codepoints = CodePoints::from(Cursor::new(greek_bytes));
    assert_eq!(vec!['Χ', 'α', 'ί', 'ρ', 'ε', 'τ', 'ε'],
                codepoints.map(|r| r.unwrap())
                          .collect::<Vec<char>>());
}

Dependencies

~420KB