#interning #string-interning #intern

intern-arc

An interner that deallocates unused values

10 releases (5 breaking)

0.6.1 Oct 2, 2023
0.6.0 Apr 2, 2023
0.5.1 Mar 30, 2023
0.5.0 Mar 14, 2021
0.1.0 Jun 9, 2020

#643 in Rust patterns

Download history 10/week @ 2024-06-22 15/week @ 2024-06-29 158/week @ 2024-07-06 221/week @ 2024-07-13 102/week @ 2024-07-20 150/week @ 2024-07-27 209/week @ 2024-08-03 184/week @ 2024-08-10 8/week @ 2024-08-17 68/week @ 2024-08-24 43/week @ 2024-08-31 9/week @ 2024-09-07 42/week @ 2024-09-14 46/week @ 2024-09-21 10/week @ 2024-09-28 39/week @ 2024-10-05

137 downloads per month
Used in 8 crates (3 directly)

Apache-2.0 OR MIT

54KB
1K SLoC

Interning library based on atomic reference counting

For the docs see docs.rs.

When developing please make use of cargo bench and cargo +nightly miri — the multithreading tests are sized such that they bring issues to the surface when run with miri (which is rather slow). Once you notice problems, debugging is easier with

export MIRIFLAGS=-Zmiri-disable-isolation
cargo +nightly miri test --features println -- --nocapture multithreading_hash

Another very useful tool is loom, run it with

LOOM_LOG=1 LOOM_LOCATION=1 LOOM_CHECKPOINT_INTERVAL=1 RUST_BACKTRACE=1 RUSTFLAGS="--cfg loom" cargo test --lib

Note: loom currently does not work because it doesn’t support Weak references.

License

At your option: Apache-2.0 or MIT


lib.rs:

Interning library based on atomic reference counting

This library differs from arc-interner (which served as initial inspiration) in that

  • interners must be created and can be dropped (for convenience functions see below)
  • it does not dispatch based on TypeId (each interner is for exactly one type)
  • it offers both Hash-based and Ord-based storage
  • it handles unsized types without overhead, so you should inline str instead of String

Unfortunately, this combination of features makes it inevitable to use unsafe Rust. The construction of unsized values has been adapted from the standard library’s Arc type; reference counting employs a parking_lot mutex since it must also ensure consistent access to and usage of the cleanup function that will remove the value from its interner. The test suite passes also under miri to check against some classes of undefined behavior in the unsafe code (including memory leaks).

API flavors

The same API is provided in two flavors:

Interning small values takes of the order of 100–200ns on a typical server CPU. The ord-based storage has an advantage when interning large values (like slices greater than 1kB). The hash-based storage has an advantage when keeping lots of values (many thousands and up) interned at the same time.

Nothing beats your own benchmarking, though.

Convenience access to interners

When employing interning for example within a dedicated deserialiser thread, it is best to create and locally use an interner, avoiding further synchronisation overhead. You can also store interners in thread-local variables if you only care about deduplication per thread.

That said, this crate also provides convenience functions based on a global type-indexed pool:

use intern_arc::{global::hash_interner};

let i1 = hash_interner().intern_ref("hello"); // -> InternedHash<str>
let i2 = hash_interner().intern_sized(vec![1, 2, 3]); // -> InternedHash<Vec<i32>>

You can also use the type-indexed pool yourself to control its lifetime:

use intern_arc::{global::OrdInternerPool, InternedOrd};

let mut pool = OrdInternerPool::new();
let i: InternedOrd<[u8]> = pool.get_or_create().intern_box(vec![1, 2, 3].into());

Dependencies

~0.4–24MB
~343K SLoC