#unicode #category #properties #no-std #general

no-std unicode-general-category

Fast lookup of the Unicode General Category property for char

8 releases (1 stable)

1.0.0 Oct 30, 2024
1.0.0-pre Oct 8, 2024
0.6.0 Sep 19, 2022
0.5.1 Jan 25, 2022
0.1.0 Dec 13, 2019

#40 in Text processing

52,313 downloads per month
Used in 137 crates (11 directly)

Apache-2.0

225KB
4.5K SLoC

unicode-general-category


Fast lookup of the Unicode General Category property for char in Rust using Unicode 16.0 data. This crate is no-std compatible.

Usage

use unicode_general_category::{get_general_category, GeneralCategory};

fn main() {
    assert_eq!(get_general_category('A'), GeneralCategory::UppercaseLetter);
}

Performance & Implementation Notes

ucd-generate is used to generate tables.rs. A build script (build.rs) compiles this into a two-level lookup table. Lookup time is constant because each lookup is just an index into two arrays.

The two-level approach maps a code point to a block, then to a position within that block. This allows identical second-level blocks to be deduplicated, saving space. The code is parameterised over the block size, which must be a power of 2. The value in the build script is optimal for the data set.
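The two-level scheme described above can be illustrated with a minimal sketch. This is not the crate's generated code: the names `TwoLevel`, `SHIFT`, `build`, `get`, and `example_flat_table` are hypothetical, the block size of 16 is chosen only for readability, and plain `u8` ids stand in for `GeneralCategory` values.

```rust
use std::collections::HashMap;

// Hypothetical block size: 2^SHIFT entries per block (must be a power of 2
// so that a shift and a mask can split a code point into block and offset).
const SHIFT: u32 = 4;
const BLOCK_SIZE: usize = 1 << SHIFT;
const MASK: usize = BLOCK_SIZE - 1;

/// Illustrative two-level table, not the crate's actual layout.
struct TwoLevel {
    /// First level: one entry per block of code points, naming a unique block.
    index: Vec<u16>,
    /// Second level: the unique blocks stored back to back.
    blocks: Vec<u8>,
}

impl TwoLevel {
    /// Build from a flat table whose length is a multiple of BLOCK_SIZE.
    /// Assumes fewer than 65,536 unique blocks so that ids fit in a u16.
    fn build(flat: &[u8]) -> Self {
        assert_eq!(flat.len() % BLOCK_SIZE, 0);
        let mut seen: HashMap<&[u8], u16> = HashMap::new();
        let mut index = Vec::new();
        let mut blocks = Vec::new();
        for chunk in flat.chunks(BLOCK_SIZE) {
            let next_id = seen.len() as u16;
            // Store a block only the first time it is seen (deduplication).
            let id = *seen.entry(chunk).or_insert_with(|| {
                blocks.extend_from_slice(chunk);
                next_id
            });
            index.push(id);
        }
        TwoLevel { index, blocks }
    }

    /// Constant-time lookup: two array index operations.
    /// Panics if `cp` lies outside the range covered by the flat table.
    fn get(&self, cp: usize) -> u8 {
        let block = self.index[cp >> SHIFT] as usize;
        self.blocks[(block << SHIFT) | (cp & MASK)]
    }
}

/// Hypothetical sample data: 0 = other, 1 = uppercase ASCII, 2 = lowercase ASCII.
fn example_flat_table() -> Vec<u8> {
    let mut flat = vec![0u8; 256];
    for cp in b'A'..=b'Z' {
        flat[cp as usize] = 1;
    }
    for cp in b'a'..=b'z' {
        flat[cp as usize] = 2;
    }
    flat
}
```

With this sample data, `TwoLevel::build(&example_flat_table())` reduces 256 flat entries to 16 first-level entries plus 5 unique blocks (80 second-level entries), because every all-zero block is stored once. A larger block size shrinks the first level but makes identical blocks less likely, which is why the block size is worth tuning for the data set.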

This approach trades off some space for faster lookups. The tables take up about 45KiB. Benchmarks showed this approach to be ~5–10× faster than the typical binary search approach.
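For comparison, a binary search lookup over a sorted table of ranges might look like the sketch below. It is only an assumption about what such a baseline typically looks like, not the code that was benchmarked: `RANGES` and `lookup_binary_search` are hypothetical names, the table holds just two sample ranges, and `u8` ids again stand in for category values.

```rust
use std::cmp::Ordering;

// Hypothetical range table: (first, last, category id), sorted by code point
// and non-overlapping. A real table would have thousands of entries.
const RANGES: &[(u32, u32, u8)] = &[
    (0x41, 0x5A, 1), // 'A'..='Z'
    (0x61, 0x7A, 2), // 'a'..='z'
];

/// Look up a code point by binary search; returns 0 when no range contains it.
/// Cost grows with log2 of the number of ranges, and each step is a branch.
fn lookup_binary_search(cp: u32) -> u8 {
    RANGES
        .binary_search_by(|&(first, last, _)| {
            if last < cp {
                Ordering::Less
            } else if first > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .map(|i| RANGES[i].2)
        .unwrap_or(0)
}
```

The range table is compact, but every lookup performs several data-dependent comparisons, whereas the two-level table always performs exactly two indexing operations.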

Further optimisation may be possible by eliminating some runs of repeated values in the first-level array.

No runtime deps