Release: 0.1.0 (unstable), Jan 15, 2025
TextFrame
TextFrame is a low-level Rust library for accessing plain text files, including plain-text corpora of considerable size. Texts do not have to be loaded into memory in their entirety; arbitrary sub-parts are loaded on demand. Requests are formulated as Unicode character offsets.
Features
This library takes care of mapping these to byte offsets (UTF-8) and loading the corresponding excerpt of the file from disk into memory. We call such an excerpt a text frame. Multiple discontinuous or partially overlapping text frames may be loaded. A frame is only loaded from disk if no already-loaded frame covers the requested offsets.
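As an illustration of the character-to-byte mapping described above, here is a minimal sketch using only the standard library. The helpers `char_to_byte` and `slice_chars` are hypothetical names for this example and are not part of the textframe API; the library itself uses a precomputed index rather than scanning the text on every request.

```rust
/// Translate a Unicode character offset into a UTF-8 byte offset.
/// Returns `None` if the offset lies beyond the end of the text.
/// Hypothetical helper for illustration; not part of the textframe API.
fn char_to_byte(text: &str, char_offset: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte, _)| byte)
        // Also allow the offset just past the last character (an end offset).
        .chain(std::iter::once(text.len()))
        .nth(char_offset)
}

/// Slice a text by character offsets; the end offset is non-inclusive.
/// Hypothetical helper for illustration; not part of the textframe API.
fn slice_chars(text: &str, begin: usize, end: usize) -> Option<&str> {
    let b = char_to_byte(text, begin)?;
    let e = char_to_byte(text, end)?;
    // `get` returns `None` instead of panicking when begin > end.
    text.get(b..e)
}
```

In `"héllo wörld"`, the character at offset 2 starts at byte 3 because `é` occupies two bytes in UTF-8, which is why character offsets cannot be used as byte offsets directly.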
Negative values in offsets are supported and are interpreted as relative to the end of the document. This also applies to 0 as an end offset. All end offsets are non-inclusive. An offset of `(0,0)` by definition covers the entire text document.
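The offset rules above can be sketched as a small function that turns a signed pair into absolute character offsets. The name `resolve_offsets` is hypothetical, and returning `None` for out-of-range or inverted offsets is an assumption of this sketch; the library may report such cases differently.

```rust
/// Resolve a signed (begin, end) pair against a text of `len` characters.
/// Negative values count from the end; an end offset of 0 means "end of text".
/// Hypothetical helper for illustration; not part of the textframe API.
fn resolve_offsets(begin: isize, end: isize, len: usize) -> Option<(usize, usize)> {
    if len > isize::MAX as usize {
        return None;
    }
    let len = len as isize;
    let b = if begin < 0 { len + begin } else { begin };
    // Note `<=`: an end offset of 0 resolves to the end of the document.
    let e = if end <= 0 { len + end } else { end };
    // Assumption of this sketch: reject anything outside 0 <= b <= e <= len.
    if b < 0 || e < b || e > len {
        return None;
    }
    Some((b as usize, e as usize))
}
```

For a document of 100 characters, `(0, 0)` resolves to `(0, 100)`, `(-5, 0)` to the last five characters `(95, 100)`, and `(-10, -2)` to `(90, 98)`.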
- This library considers text an immutable resource; text files on disk MUST NOT be modified after a `textframe::TextFile` object is associated with them.
- The mutability of `textframe::TextFile` itself only determines whether it is allowed to load further fragments from disk.
- When loading a text file, the entire file is first read in a streaming manner and an index is computed from Unicode character positions to byte positions. This index can be written to a (binary) file which acts as a cache, so the index does not have to be recomputed next time, which improves performance.
- Existing frames are never unloaded or invalidated. Any text references (`&str`) therefore share the lifetime of the `textframe::TextFile` object. Depending on the order of requests, this does mean the loaded frames may overlap and be sub-optimal.
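To make the indexing step more concrete, the following is a rough sketch of how a character-to-byte index could be computed in a single streaming pass. It is not the library's actual index format: the function name `build_index`, the choice to record only every `step`-th character, and the `Vec<u64>` layout are all assumptions made for this example.

```rust
use std::io::{self, BufRead};

/// Stream through `reader` and record the byte offset of every `step`-th
/// character (characters 0, step, 2*step, ...). `step` must be greater than 0.
/// Hypothetical sketch; the real textframe index layout may differ.
fn build_index<R: BufRead>(mut reader: R, step: usize) -> io::Result<Vec<u64>> {
    assert!(step > 0, "step must be greater than 0");
    let mut index = Vec::new();
    let mut chars = 0usize; // characters seen so far
    let mut bytes = 0u64; // bytes seen so far
    let mut line = String::new();
    loop {
        line.clear();
        // Only one line is held in memory at a time.
        if reader.read_line(&mut line)? == 0 {
            break; // end of input
        }
        for c in line.chars() {
            if chars % step == 0 {
                index.push(bytes);
            }
            chars += 1;
            bytes += c.len_utf8() as u64;
        }
    }
    Ok(index)
}
```

With such an index, a lookup for character offset `n` would start at byte `index[n / step]` and decode at most `step - 1` further characters, trading index size against lookup cost.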
Installation
Add it to your Rust project as follows:
cargo add textframe
Usage
Example:
```rust
use textframe::TextFile;

let mut textfile = TextFile::new("/tmp/test.txt", None).expect("file must load");

// gets the text from 10 to 20 (unicode points), requires a mutable instance
let text: &str = textfile.get_or_load(10,20);

// once a frame is already loaded, you can use this instead, works on an immutable instance:
let text: &str = textfile.get(10,20);
```
Related projects
- textsurf - A WebAPI around textframe. Serves text files over the web.
Licence
GNU General Public Licence v3 only