UTF-8 Buffered Reader
This crate provides functions to read UTF-8 text from any
type implementing io::BufRead through a trait, BufRead,
without waiting for newline delimiters. These functions take
advantage of buffering and return either &str or chars. Each
has an associated iterator, and some also have an equivalent
of a Map iterator that avoids allocation and cloning.
Usage
Add this crate as a dependency in your Cargo.toml:
[dependencies]
utf8-bufread = "1.0.0"
The simplest way to read a file using this crate may be something along these lines:
use std::io::{Cursor, ErrorKind};
use utf8_bufread::BufRead;

// Reader may be any type implementing io::BufRead
// We'll just use a cursor wrapping a slice for this example
let mut reader = Cursor::new("Löwe 老虎 Léopard");
loop { // Loop until EOF
    match reader.read_str() {
        Ok(s) => {
            if s.is_empty() {
                break; // EOF
            }
            // Do something with `s` ...
            print!("{}", s);
        }
        Err(e) => {
            // We should try again if we get interrupted;
            // any other error stops the loop
            if e.kind() != ErrorKind::Interrupted {
                break;
            }
        }
    }
}
Reading arbitrary-length string slices
The read_str function returns a &str of arbitrary length (up
to the reader's buffer capacity) read from the inner reader,
without cloning data, unless a valid codepoint ends up cut at
the end of the reader's buffer. Its associated iterator can be
obtained by calling str_iter; since that iterator clones the
data at each iteration, str_map is also provided.
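To illustrate the "codepoint cut at the end of the buffer" case, here is a minimal sketch that uses only the standard library. It is not this crate's implementation; split_valid_prefix is a hypothetical helper that returns the longest valid UTF-8 prefix of a byte buffer along with the leftover bytes a reader would have to carry over to the next read.

```rust
use std::str;

/// Hypothetical helper (not part of utf8-bufread): splits `buf` into its
/// longest valid UTF-8 prefix and the bytes that follow it.
fn split_valid_prefix(buf: &[u8]) -> (&str, &[u8]) {
    match str::from_utf8(buf) {
        // The whole buffer is valid: nothing is left over.
        Ok(s) => (s, &buf[buf.len()..]),
        Err(e) => {
            let (valid, rest) = buf.split_at(e.valid_up_to());
            // `valid` was just checked by `from_utf8`, so this cannot fail.
            (str::from_utf8(valid).unwrap(), rest)
        }
    }
}

fn main() {
    let bytes = "Löwe 老虎".as_bytes();
    // Simulate a buffer that ends in the middle of the last codepoint.
    let cut = &bytes[..bytes.len() - 1];
    let (text, rest) = split_valid_prefix(cut);
    println!("{:?} + {} leftover byte(s)", text, rest.len());
}
```

The leftover bytes are why some copying becomes unavoidable in that situation: the incomplete sequence has to be kept somewhere until the rest of the codepoint arrives.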
Reading codepoints
The read_char function returns a char read from the inner
reader. Its associated iterator can be obtained by calling
char_iter.
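For readers curious about what reading a single codepoint from a buffered reader involves, the following standard-library sketch may help. It is an assumption-laden illustration, not how read_char is implemented; read_one_char is a hypothetical function that peeks at the leading byte to learn the sequence length, then validates the bytes.

```rust
use std::io::{self, BufRead, Cursor, ErrorKind};

/// Hypothetical function (not part of utf8-bufread): reads one codepoint
/// from `reader`, returning `None` at EOF.
fn read_one_char<R: BufRead>(reader: &mut R) -> io::Result<Option<char>> {
    // Peek at the first byte through the reader's internal buffer.
    let first = match reader.fill_buf()?.first() {
        Some(&b) => b,
        None => return Ok(None), // EOF
    };
    // The leading byte tells how many bytes the sequence should have.
    let len = match first {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => return Err(ErrorKind::InvalidData.into()),
    };
    let mut bytes = [0u8; 4];
    reader.read_exact(&mut bytes[..len])?;
    // Full validation (continuation bytes, overlong forms) is left to std.
    let s = std::str::from_utf8(&bytes[..len])
        .map_err(|e| io::Error::new(ErrorKind::InvalidData, e))?;
    Ok(s.chars().next())
}

fn main() -> io::Result<()> {
    let mut reader = Cursor::new("老虎");
    while let Some(c) = read_one_char(&mut reader)? {
        println!("{}", c);
    }
    Ok(())
}
```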
Iterator types
This crate provides several structs for several ways of iterating over the inner reader's data:
StrIter and CodepointIter clone the data on each iteration, but use an Rc to check if the returned String buffer is still used. If not, it is re-used to avoid re-allocating.
let mut reader = Cursor::new("Löwe 老虎 Léopard");
for s in reader.str_iter().filter_map(|r| r.ok()) {
    // Do something with s ...
    print!("{}", s);
}
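The Rc-based buffer reuse described above can be sketched with the standard library alone. This is only a guess at the general idea, not the crate's actual code; SharedBuf and its fill method are hypothetical names. Rc::get_mut succeeds only when no other handle to the buffer is alive, which is exactly the condition under which the allocation can be safely overwritten.

```rust
use std::rc::Rc;

/// Hypothetical holder (not part of utf8-bufread) for a String buffer
/// that is handed out to callers behind an `Rc`.
struct SharedBuf {
    buf: Rc<String>,
}

impl SharedBuf {
    fn new() -> Self {
        SharedBuf { buf: Rc::new(String::new()) }
    }

    /// Stores `text` and returns a handle to it, reusing the previous
    /// allocation when no earlier handle is still alive.
    fn fill(&mut self, text: &str) -> Rc<String> {
        match Rc::get_mut(&mut self.buf) {
            // We hold the only reference: overwrite the buffer in place.
            Some(s) => {
                s.clear();
                s.push_str(text);
            }
            // A caller kept an earlier handle: allocate a fresh buffer.
            None => self.buf = Rc::new(text.to_owned()),
        }
        Rc::clone(&self.buf)
    }
}

fn main() {
    let mut shared = SharedBuf::new();
    let a = shared.fill("Löwe");
    drop(a); // handle released, so the next fill can reuse the buffer
    let b = shared.fill("老虎");
    let c = shared.fill("Léopard"); // `b` is still alive: new allocation
    println!("{} {} (same buffer: {})", b, c, Rc::ptr_eq(&b, &c));
}
```

In a plain for loop each item is usually dropped before the next iteration, so a scheme like this would allocate only once; collecting the items keeps every handle alive and forces a new buffer each time.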
StrMap and CodepointMap give access to the read data without allocating or copying, but that data cannot then be passed on to further iterator adapters.
let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count: usize = reader
    .str_map(|s| s.len())
    .filter_map(Result::ok)
    .sum();
println!("There are {} valid UTF-8 bytes in {}", count, s);
CharIter is similar to StrIter and the others, except that it relies on char implementing Copy and thus needs neither a buffer nor the "Rc trick".
let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count = reader
    .char_iter()
    .filter_map(Result::ok)
    .filter(|c| c.is_lowercase())
    .count();
assert_eq!(count, 9);
All these iterators may read data until EOF or an invalid
codepoint is found. If valid codepoints are read from the
inner reader, they will be returned before reporting an
error. After encountering an error or EOF, they always
return None. They always ignore any Interrupted error.
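As a rough picture of what ignoring Interrupted errors usually amounts to, here is a standard-library sketch of a retry loop. It is not taken from this crate; Flaky and read_retrying are hypothetical names, and Flaky exists only to simulate a reader that gets interrupted once.

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical test reader: fails with `Interrupted` on the first call,
/// then forwards every read to `inner`.
struct Flaky<R> {
    inner: R,
    interrupted: bool,
}

impl<R: Read> Read for Flaky<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if !self.interrupted {
            self.interrupted = true;
            return Err(ErrorKind::Interrupted.into());
        }
        self.inner.read(buf)
    }
}

/// Performs one read, retrying for as long as the error is `Interrupted`.
/// Any other error, or a successful read, is returned as-is.
fn read_retrying<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    loop {
        match reader.read(buf) {
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            other => return other,
        }
    }
}

fn main() -> io::Result<()> {
    let mut reader = Flaky { inner: "Löwe 老虎".as_bytes(), interrupted: false };
    let mut buf = [0u8; 32];
    let n = read_retrying(&mut reader, &mut buf)?;
    println!("{}", String::from_utf8_lossy(&buf[..n]));
    Ok(())
}
```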
Work in progress
This crate is still a work in progress. Part of its API can be considered stable:
- read_str, read_codepoint and read_char's behavior and signature.
- str_iter, str_map, codepoints_iter, codepoints_map and char_iter's behavior and signature.
- StrIter, StrMap, CodepointIter, CodepointMap and CharIter's API.
However some features are still considered unstable:
And some features still have to be added:
- A lossy and unchecked version of read_* (see from_utf8_lossy & from_utf8_unchecked).
- (Optional) Support for grapheme clusters using the unicode-segmentation crate, in the same fashion as read_codepoint.
- I'm open to suggestions, if you have ideas 😉
Given that I'm not the most experienced developer, you are very welcome to submit issues and pull requests here.
License
Utf8-BufRead is distributed under the terms of the Apache License 2.0; see the LICENSE file in the root directory of this repository.