#read #io #string #code-point #avoid

yanked utf8-bufread

Provides alternatives to BufRead's read_line & lines that do not stop on newlines

1.0.0 Apr 5, 2021
0.1.5 Mar 14, 2021

Apache-2.0

305KB
707 lines

UTF-8 Buffered Reader

This crate provides functions to read UTF-8 text from any type implementing io::BufRead, through a trait named BufRead, without waiting for newline delimiters. These functions take advantage of buffering and return either &str or char values. Each has an associated iterator, and some also have a Map-like equivalent that avoids allocation and cloning.

Usage

Add this crate as a dependency in your Cargo.toml:

[dependencies]
utf8-bufread = "1.0.0"

The simplest way to read a file using this crate is something along the following lines:

use std::io::{Cursor, ErrorKind};
use utf8_bufread::BufRead;

// Reader may be any type implementing io::BufRead
// We'll just use a cursor wrapping a slice for this example
let mut reader = Cursor::new("Löwe 老虎 Léopard");
loop { // Loop until EOF
    match reader.read_str() {
        Ok(s) => {
            if s.is_empty() {
                break; // EOF
            }
            // Do something with `s` ...
            print!("{}", s);
        }
        Err(e) => {
            // We should try again if we get interrupted,
            // and give up on any other kind of error
            if e.kind() != ErrorKind::Interrupted {
                break;
            }
        }
    }
}

Reading arbitrary-length string slices

The read_str function returns a &str of arbitrary length (up to the reader's buffer capacity) read from the inner reader, without cloning data unless a valid codepoint is split across the end of the reader's buffer. Its associated iterator can be obtained by calling str_iter. Because that iterator clones the data at each iteration, str_map is also provided as a non-cloning alternative.
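
To illustrate the general idea of handing out the valid UTF-8 part of a buffer without copying it, here is a minimal sketch built only on the standard library. It is not this crate's implementation: `with_valid_prefix` is a hypothetical helper, and unlike read_str it simply reports an error when a codepoint is cut at the end of the buffer instead of stitching it back together.

```rust
use std::io::{self, BufRead, ErrorKind};
use std::str;

/// Hypothetical helper, NOT part of utf8-bufread.
/// Passes the longest valid UTF-8 prefix of the reader's current buffer to
/// `f` without copying it, then consumes exactly those bytes.
/// Returns `Ok(None)` on EOF.
fn with_valid_prefix<R, F, T>(reader: &mut R, f: F) -> io::Result<Option<T>>
where
    R: BufRead,
    F: FnOnce(&str) -> T,
{
    let buf = reader.fill_buf()?;
    if buf.is_empty() {
        return Ok(None); // EOF
    }
    // `valid_up_to` gives the length of the longest valid prefix.
    let valid_len = match str::from_utf8(buf) {
        Ok(s) => s.len(),
        Err(e) => e.valid_up_to(),
    };
    if valid_len == 0 {
        // Assumption of this sketch: an invalid or truncated codepoint at
        // the start of the buffer is reported as an error.
        return Err(io::Error::new(
            ErrorKind::InvalidData,
            "invalid or truncated UTF-8 at start of buffer",
        ));
    }
    let s = str::from_utf8(&buf[..valid_len]).expect("prefix was validated above");
    let out = f(s); // the borrow of the buffer ends once `f` returns
    reader.consume(valid_len);
    Ok(Some(out))
}
```

Calling this on a Cursor wrapping "Löwe 老虎" with `|s| s.chars().count()` yields `Some(7)` and then `None`. Taking a closure rather than returning the &str is what lets the sketch consume the bytes afterwards, which is the same reason a Map-style API can avoid allocation.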

Reading codepoints

The read_char function returns a char read from the inner reader. Its associated iterator can be obtained by calling char_iter.
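
For readers curious what reading a single codepoint from a buffered reader involves, the following is a rough standard-library sketch, not this crate's code. `read_one_char` and `invalid` are hypothetical names, and the sketch makes no attempt to preserve bytes when it hits an invalid sequence.

```rust
use std::io::{self, BufRead, ErrorKind, Read};

/// Hypothetical helper for this sketch: builds an InvalidData error.
fn invalid(msg: &'static str) -> io::Error {
    io::Error::new(ErrorKind::InvalidData, msg)
}

/// Hypothetical helper, NOT part of utf8-bufread.
/// Reads one UTF-8 encoded char from `reader`, or `Ok(None)` on EOF.
fn read_one_char<R: BufRead>(reader: &mut R) -> io::Result<Option<char>> {
    // Peek at the first buffered byte to learn the encoded length.
    let first = match reader.fill_buf()?.first() {
        Some(&b) => b,
        None => return Ok(None), // EOF
    };
    let len = match first {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => return Err(invalid("unexpected continuation or invalid lead byte")),
    };
    // `read_exact` keeps reading across buffer refills, so a codepoint
    // split over two refills is still read whole.
    let mut bytes = [0u8; 4];
    reader.read_exact(&mut bytes[..len])?;
    // Full validation (overlong forms, surrogates, bad continuation bytes)
    // is delegated to the standard library.
    let s = std::str::from_utf8(&bytes[..len]).map_err(|_| invalid("invalid UTF-8 sequence"))?;
    Ok(s.chars().next())
}
```

Because a char is Copy, nothing here needs to borrow from the reader's buffer once the function returns.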

Iterator types

This crate provides several structs, each offering a different way of iterating over the inner reader's data:

  • StrIter and CodepointIter clone the data on each iteration, but use an Rc to check whether the returned String buffer is still in use. If not, it is re-used to avoid re-allocating.
let mut reader = Cursor::new("Löwe 老虎 Léopard");
for s in reader.str_iter().filter_map(|r| r.ok()) {
    // Do something with s ...
    print!("{}", s);
}
  • StrMap and CodepointMap give access to the read data without allocating or copying, but the data then cannot be passed to further iterator adapters.
let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count: usize = reader
    .str_map(|s| s.len())
    .filter_map(Result::ok)
    .sum();
println!("There are {} valid UTF-8 bytes in {}", count, s);
  • CharIter is similar to StrIter and the others, except that it relies on char implementing Copy and thus needs neither a buffer nor the "Rc trick".
let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count = reader
    .char_iter()
    .filter_map(Result::ok)
    .filter(|c| c.is_lowercase())
    .count();
assert_eq!(count, 9);

All these iterators read data until EOF is reached or an invalid codepoint is found. If valid codepoints are read from the inner reader, they will be returned before an error is reported. After encountering an error or EOF, the iterators always return None. They always ignore any Interrupted error.
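
The "Rc trick" mentioned for StrIter can be sketched with the standard library alone. The iterator below is hypothetical and is not how this crate is written: it yields whitespace-separated words rather than buffered reads, and only serves to show how `Rc::get_mut` reveals whether the previously returned String is still held by the caller.

```rust
use std::rc::Rc;
use std::str::SplitWhitespace;

/// Hypothetical iterator, NOT the crate's StrIter.
/// Yields each word as an `Rc<String>`, re-using the previous allocation
/// when the caller has already dropped the previous item.
struct RcWords<'a> {
    words: SplitWhitespace<'a>,
    buf: Rc<String>,
}

impl<'a> RcWords<'a> {
    fn new(text: &'a str) -> Self {
        RcWords {
            words: text.split_whitespace(),
            buf: Rc::new(String::new()),
        }
    }
}

impl<'a> Iterator for RcWords<'a> {
    type Item = Rc<String>;

    fn next(&mut self) -> Option<Self::Item> {
        let word = self.words.next()?;
        match Rc::get_mut(&mut self.buf) {
            // We hold the only reference: overwrite the buffer in place.
            Some(s) => {
                s.clear();
                s.push_str(word);
            }
            // The previous item is still alive somewhere: allocate anew.
            None => self.buf = Rc::new(word.to_owned()),
        }
        Some(Rc::clone(&self.buf))
    }
}
```

In a plain `for` loop each item is dropped before the next call, so the same String is re-used throughout. If the caller collects the items, every call allocates a fresh buffer and earlier items stay intact.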

Work in progress

This crate is still a work in progress. Part of its API can be considered stable:

However some features are still considered unstable:

  • Error's behavior, particularly regarding its kind and how it avoids data loss (see leftovers).

And some features still have to be added:

Given that I'm far from the most experienced developer, you are very welcome to submit issues and pull requests here.

License

Utf8-BufRead is distributed under the terms of the Apache License 2.0; see the LICENSE file in the root directory of this repository.

No runtime deps