#overlap #text #chunking #default

bin+lib overlap-chunk

A Rust library for splitting text into chunks of specified size with adjustable overlap percentage

2 releases

new 0.0.2 Mar 3, 2025
0.0.1 Feb 26, 2025

#354 in Text processing

Download history 93/week @ 2025-02-21 151/week @ 2025-02-28

244 downloads per month

MIT license

12KB
197 lines

overlap-chunk

A Rust library for splitting text into chunks of specified size with adjustable overlap percentage.

Features

Current Features

  • Basic functionality to split text into chunks of specified size
  • Option to adjust the overlap percentage between chunks
  • Command-line interface for easy text processing

Future Features

  • Chunking that respects word boundaries and sentence boundaries
  • Support for multilingual text
  • Support for streaming input

Library Usage

use overlap_chunk::ChunkOptions;
use overlap_chunk::chunk_text;

fn main() {
    let text = "This is a test text. We will split this long text into smaller chunks.";
    
    // Chunk splitting with default options (no overlap)
    let chunks = chunk_text(text, 10, None);
    println!("{:?}", chunks);
    
    // Chunk splitting with overlap (50% overlap)
    let options = ChunkOptions {
        overlap_percentage: 50,
        ..Default::default()
    };
    let chunks_with_overlap = chunk_text(text, 10, Some(options));
    println!("{:?}", chunks_with_overlap);
}

Command Line Usage

The library includes a command-line interface for processing text files:

Usage: overlap-chunk [OPTIONS] [FILE]
  If no file is specified, read from standard input

Options:
  -h, --help              Display this help message
  -s, --size SIZE         Specify chunk size (default: 100)
  -o, --overlap PERCENT   Specify overlap percentage between 0 and 90 (default: 0)

Examples

Process a file with default settings:

overlap-chunk myfile.txt

Process a file with custom chunk size and overlap:

overlap-chunk -s 50 -o 30 myfile.txt

Process standard input:

cat myfile.txt | overlap-chunk -s 50

License

MIT License

No runtime deps