33 releases (major breaking)
53.0.0 | Sep 17, 2024 |
---|---|
52.0.0 | Jun 9, 2024 |
51.0.0 | Mar 31, 2024 |
50.0.0 | Feb 2, 2024 |
0.6.1 | Dec 13, 2021 |
#209 in Cryptography
840 downloads per month
53KB
978 lines
arrow-digest
Unofficial Apache Arrow crate that aims to standardize stable hashing of structured data.
Motivation
Today, structured data formats like Parquet are binary-unstable / non-reproducible - writing the same logical data may result in different files on binary level depending on which writer implementation and you use and may vary with each version.
This crate provides a method and implementation for computing stable hashes of structured data (logical hash) based on Apache Arrow in-memory format.
Benefits:
- Fast way to check for equality / equivalence of large datasets
- Two parties can compare data without needing to transfer it or reveal its contents
- A step towards content addressability of structured data (e.g. when storing dataset chunks in DHTs like IPFS)
Use
// Hash single array
let array = Int32Array::from(vec![1, 2, 3]);
let digest = ArrayDigestV0::<Sha3_256>::digest(&array);
println!("{:x}", digest);
// Alternatively: Use `.update(&array)` to hash multiple arrays of the same type
// Hash record batches
let schema = Arc::new(Schema::new(vec![
Field::new("a", DataType::Int32, false),
Field::new("b", DataType::Utf8, false),
]));
let record_batch = RecordBatch::try_new(Arc::new(schema), vec![
Arc::new(Int32Array::from(vec![1, 2, 3])),
Arc::new(StringArray::from(vec!["a", "b", "c"])),
]).unwrap();
let digest = RecordsDigestV0::<Sha3_256>::digest(&record_batch);
println!("{:x}", digest);
// Alternatively: Use `.update(&batch)` to hash multiple batches with same schema
Status
While we're working towards v1
we reserve the right to break the hash stability. Create an issue if you're planning to use this crate.
- Schema hashing
- Fixed size types
- Nullability: primitive types
- Binaries
- Uft8 variants
- Structs
- Nested structs
- Nullability: nested structs
- Lists
- Lists of structs
- Dictionaries
- Intervals
- Unions
- Maps
- Metadata endianness check
- Better test coverage + fuzzing
- Performance: Benchmarks
- Performance: Parallelism
- Performance: Code optimization
Design Goals
- Be reasonably fast
- Same hash no matter how many batches the input was split into
- Same hash no matter if dictionary encoding is used
Drawbacks
- Logical hasing stops short of perfect content addressibility
- Logical hashing would need to be supported by
IPFS
and the likes, but this is a stretch as this is not a general-purpose hashing algo - A fully deterministic binary encoding with Parquet compatibility may be a better approach
- Logical hashing would need to be supported by
- Proposed method is order-dependent - it will produce different hashes if records are reordered
- Boolean hashing could be more efficient
Hashing Process
Starting from primitives and building up:
- Endinanness - always assume little endian
- Fixed Size Types
Int, FloatingPoint, Decimal, Date, Time, Timestamp
- hashed using their in-memory binary representationBool
- hash the individual values as byte-sized values1
forfalse
and2
fortrue
- Variable Size Types
Binary, LargeBinary, FixedSizeBinary, Utf8, LargeUtf8
- hash length (asu64
) followed by in-memory representation of the valueList, LargeList, FixedSizeList
- hash length of the list (asu64
) followed by the hash of the sub-array list according to its data type
- Nullability - every null value is represented by a
0
(zero) byte- Arrays without validity bitmap have same hashes as arrays that do and all items are valid
- Array Data
- (once per hashing session) Hash data type according to the table below
- Hash items sequentially using the above rules
- Record Batch Data
- (once per hashing session) For every field hash
filed_name as utf8
,nesting_level (zero-based) as u64
recursively traversing the schema in the depth-first order - For every leaf column:
- Produce a combined nullability bitmap from nullability of every parent
- Update corresponding column's hasher using above rules
- (final step) Digests of every array are fed into the combined hasher to produce the final digest
- (once per hashing session) For every field hash
Type (in Schema.fb ) |
TypeID (as u16 ) |
Followed by |
---|---|---|
Null | 0 | |
Int | 1 | unsigned/signed (0/1) as u8 , bitwidth as u64 |
FloatingPoint | 2 | bitwidth as u64 |
Binary | 3 | |
Utf8 | 4 | |
Bool | 5 | |
Decimal | 6 | bitwidth as u64 , precision as u64 , scale as u64 |
Date | 7 | bitwidth as u64 , DateUnitID |
Time | 8 | bitwidth as u64 , TimeUnitID |
Timestamp | 9 | TimeUnitID , timeZone as nullable Utf8 |
Interval | 10 | |
List | 11 | items data type |
Struct | 12 | |
Union | 13 | |
FixedSizeBinary | 3 | |
FixedSizeList | 11 | items data type |
Map | 16 | |
Duration | 17 | |
LargeBinary | 3 | |
LargeUtf8 | 4 | |
LargeList | 11 | items data type |
Note that some types (Utf8
and LargeUtf8
, Binary
FixedSizeBinary
and LargeBinary
, List
FixedSizeList
and LargeList
) are represented in the hash the same, as the difference between them is purely an encoding concern.
DateUnit (in Schema.fb ) |
DateUnitID (as u16 ) |
---|---|
DAY | 0 |
MILLISECOND | 1 |
TimeUnit (in Schema.fb ) |
TimeUnitID (as u16 ) |
---|---|
SECOND | 0 |
MILLISECOND | 1 |
MICROSECOND | 2 |
NANOSECOND | 3 |
References
Dependencies
~13–20MB
~267K SLoC