7 releases
0.1.0 | Dec 9, 2023 |
---|---|
0.0.6 | Nov 28, 2023 |
sqlite-collections
Rust collection types backed by sqlite database files.
This provides some standard-library-like collections, which may be serialized
and deserialized with arbitrary serializers. This allows you to use an
interface very similar to the std::collections ones, with these
characteristics:
- You can persist your collections to disk without serializing and deserializing the whole set. Opening a very large collection and making a small change is very efficient compared to just using serde to load and dump the whole thing.
- You can store very large structures, as big as your hard drive can handle, instead of just your memory. This handles many hundreds of gigabytes in the same way that plain SQLite does.
- You can use transactions and savepoints to roll back changes to any of your collections, or even many of them together.
- You can keep your collections across multiple files, with transactional integrity of them all together.
Portability and stability
The ds::set::DSSet and ds::map::DSMap types depend on deterministic
serialization. You aren't prevented from storing whatever is serializable in a
Set, for instance, but keep in mind:
- If you store something like a HashMap, its order can change between runs.
- Some formats have multiple representations for the same data.
  - Ciborium (the cbor feature) does not support half-precision floating point yet, so adding that support later would change its output and break determinism. In this respect, ciborium doesn't implement CBOR's deterministic encoding.
  - serde_json may change whether it escapes some Unicode values in strings across versions, which can also break determinism.
  - Accessing the containers from another programming language that doesn't serialize the same way will cause issues.
- Some serializers will serialize differently on different architectures. If your serializer doesn't behave consistently on different endianness, it will not be portable across these different architectures.
- Some datatypes are different on different architectures, primarily usize and isize. Some serialization formats will encode these differently.
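As a rough illustration of the first and last points, the sketch below hand-rolls a byte encoding for a HashMap that depends only on the map's contents. It sorts the entries before writing them and widens every usize to a fixed-width big-endian u64. The function encode_deterministic is hypothetical and is not part of sqlite-collections; it only shows the kind of property a serializer needs.

```rust
use std::collections::HashMap;

/// Encode a map into bytes that depend only on its contents.
/// Hypothetical helper for illustration; not part of sqlite-collections.
fn encode_deterministic(map: &HashMap<String, usize>) -> Vec<u8> {
    // HashMap iteration order is unspecified, so sort the entries first.
    let mut entries: Vec<(&String, &usize)> = map.iter().collect();
    entries.sort();

    let mut out = Vec::new();
    for (key, value) in entries {
        // Length prefix as a fixed-width big-endian u64, never a raw usize.
        out.extend_from_slice(&(key.len() as u64).to_be_bytes());
        out.extend_from_slice(key.as_bytes());
        // usize is 4 or 8 bytes depending on the target; widen it to u64.
        out.extend_from_slice(&(*value as u64).to_be_bytes());
    }
    out
}

fn main() {
    let mut a = HashMap::new();
    a.insert("apple".to_string(), 1usize);
    a.insert("pear".to_string(), 2usize);

    let mut b = HashMap::new();
    b.insert("pear".to_string(), 2usize);
    b.insert("apple".to_string(), 1usize);

    // Same contents, different insertion order, identical bytes.
    assert_eq!(encode_deterministic(&a), encode_deterministic(&b));
}
```

Writing the map in plain iteration order, or writing usize with to_ne_bytes, could produce different bytes for the same logical value on another run or another machine.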
To be safe, do not update your serializers without thorough testing. Direct
is always safe. postcard is probably the most reliable serde format you can
use, as long as you do not depend on types that use the Hash trait to
determine ordering, unless you can ensure a consistent order separately. cbor
and json are useful for inter-language use, but keep in mind the caveats
above, and make sure that other languages serialize in exactly the same way.
In the future, a BTreeSet and BTreeMap will be added to support
non-deterministic encodings. These will be slower, but will be guaranteed to
match based on Ord and Eq alone. This will not solve platform portability
problems, but will solve deterministic serialization concerns.
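To make the distinction concrete, here is a small sketch using a toy text format in which one value has several spellings ("7" and "007" both decode to 7). The helpers decode and compare_decoded are hypothetical and only illustrate the idea of comparing by Ord after decoding; they say nothing about how the planned types will be implemented.

```rust
use std::cmp::Ordering;

/// Toy "format" with several spellings per value: decimal text in which
/// leading zeros are allowed, so "7" and "007" both mean 7.
fn decode(stored: &str) -> u32 {
    stored.parse().expect("valid decimal text")
}

/// Compare two stored values by decoding them first (hypothetical helper).
fn compare_decoded(a: &str, b: &str) -> Ordering {
    decode(a).cmp(&decode(b))
}

fn main() {
    // Comparing the stored bytes treats the two spellings as different...
    assert_ne!("7".as_bytes(), "007".as_bytes());
    // ...while comparing the decoded values sees one and the same element.
    assert_eq!(compare_decoded("7", "007"), Ordering::Equal);

    // Byte order can also disagree with Ord: "10" sorts before "9" bytewise.
    assert!("10".as_bytes() < "9".as_bytes());
    assert_eq!(compare_decoded("10", "9"), Ordering::Greater);
}
```

Every comparison here pays for two decodes, which is the cost that makes an Ord-based set slower than one that compares stored bytes directly.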
Concurrency safety
Your collections might be open in a different thread or process through another connection. SAVEPOINT is used internally to prevent inconsistent states.
Performance
This library uses internal SAVEPOINTs to prevent inconsistent states of the database. To get the most performance without sacrificing reliability:
- Use as large a transaction as you reasonably can around all operations.
- Use an IMMEDIATE transaction when you know you will modify the database (upgrading transactions may deadlock and cause errors).
- Switching the database into journal_mode = WAL with synchronous = NORMAL can give some performance gains when there is a lot of writing.
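The SQLite statements behind these tips can be written out as below. This is only a sketch of the statement order: the constants and the write_session helper are hypothetical, nothing here is sqlite-collections API, and you would send the statements through whichever SQLite binding your program already uses. Note that the SQLite pragma is named synchronous.

```rust
/// One-time connection setup: write-ahead logging with relaxed syncing.
const SETUP: [&str; 2] = [
    "PRAGMA journal_mode = WAL",
    "PRAGMA synchronous = NORMAL",
];

/// IMMEDIATE takes the write lock up front instead of upgrading from a
/// read lock later, which is where deadlock errors can come from.
const BEGIN_WRITE: &str = "BEGIN IMMEDIATE";
const END_WRITE: &str = "COMMIT";

/// Build the statement sequence around `body`, the work done inside the
/// single large transaction (hypothetical helper).
fn write_session(body: &[&str]) -> Vec<String> {
    let mut script: Vec<String> = SETUP.iter().map(|s| s.to_string()).collect();
    script.push(BEGIN_WRITE.to_string());
    script.extend(body.iter().map(|s| s.to_string()));
    script.push(END_WRITE.to_string());
    script
}

fn main() {
    for stmt in write_session(&["-- all collection operations go here"]) {
        println!("{}", stmt);
    }
}
```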
Collections
Currently implemented
- DSSet
  - A regular set, sorted lexicographically by the stored representation.
  - Requires deterministic serialization.
  - Not completed yet. Most necessary functionality is present, but not all of it.
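Sorting by the stored representation means iteration order follows the serialized bytes, not the Ord of the original type. The sketch below uses std's BTreeSet over byte arrays as a stand-in for that behavior, assuming the stored bytes are compared bytewise; order_by_bytes is a hypothetical helper and does not use DSSet itself.

```rust
use std::collections::BTreeSet;

/// Store each value as bytes, let the set order the bytes
/// lexicographically, then decode in that order (hypothetical helper).
fn order_by_bytes(
    values: &[u32],
    encode: fn(u32) -> [u8; 4],
    decode: fn([u8; 4]) -> u32,
) -> Vec<u32> {
    let stored: BTreeSet<[u8; 4]> = values.iter().map(|&v| encode(v)).collect();
    stored.into_iter().map(decode).collect()
}

fn main() {
    let values = [1u32, 256, 2];

    // Little-endian puts the least significant byte first, so bytewise
    // order does not follow numeric order: 256 is stored as [0, 1, 0, 0].
    let le = order_by_bytes(&values, u32::to_le_bytes, u32::from_le_bytes);
    assert_eq!(le, vec![256, 1, 2]);

    // Big-endian bytes of unsigned integers sort the same way the numbers do.
    let be = order_by_bytes(&values, u32::to_be_bytes, u32::from_be_bytes);
    assert_eq!(be, vec![1, 2, 256]);
}
```

If iteration order matters to you, it is worth checking how your chosen format lays out the types you store.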
To be implemented
- DSMap
  - A regular map, sorted lexicographically by the stored representation of the key.
  - Requires deterministic serialization.
- BTreeSet
  - A set allowing non-deterministic serialization, as long as Ord is implemented. Should be quite a lot slower than a normal Set, as every comparison requires a full deserialization.
- BTreeMap
  - A map allowing non-deterministic serialization.
- List
  - A sequence ordered by an integer index.
  - This is not called Vec because it's not really an array and is not accessible as a contiguous slice.
- Deque
  - A sequence ordered by an integer index, with efficient insertion and removal at both ends.
Can't I just use SQLite directly?
Yes. This is mostly to make it easy to efficiently interact with a
SQLite-backed collection without having to think too hard about the SQL
details, as well as making it not-too-painful to swap out an existing
std::collections struct and keep the same functionality, when you need
persistence and/or huge data without filling your RAM.
If you really just want a large, persistent, and/or transaction-safe set of collections and don't need any other RDBMS functionality, this library is a good choice.