1 unstable release

0.2.2	Feb 1, 2025

MIT license

TKTAX Vendor

This crate provides vendor-oriented text preprocessing for the TKTAX system. It parses textual input, segments it into tokens, and excludes terms based on a configurable stopword list. The functionality is especially helpful when generating standardized data for search, indexing, or lexical analysis (Lat. analytica lexica; Gr. λεξιλογική ανάλυση).

Features

Tokenization: Splits input on punctuation, whitespace, and numeric characters.
Stopword Filtering: Excludes generic terms (e.g., the, and, of) as well as region-specific identifiers (ny, va).
Minimum Token Length: Retains only tokens that meet a minimum length (3 characters by default, i.e. tokens longer than 2 characters).
Optional Morphological Transformations: Uncomment the stemmer logic (in preprocess_vendor_description) to enable morphological standardization (Gr. μορφολογία).
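
The first three features can be sketched as a small pipeline. This is a minimal illustration, not the crate's actual implementation: the function name sketch_preprocess, its stopwords and min_len parameters, and the case-insensitive stopword comparison are all assumptions made for the example.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the crate's pipeline (not its real code).
/// `stopwords` is expected to hold lowercase entries.
pub fn sketch_preprocess(
    input: &str,
    stopwords: &HashSet<&str>,
    min_len: usize,
) -> Vec<String> {
    input
        // Tokenization: split on every non-letter, which covers
        // whitespace, punctuation, and numeric characters.
        .split(|c: char| !c.is_alphabetic())
        // Minimum length: drop empty fragments and short tokens.
        .filter(|tok| !tok.is_empty() && tok.chars().count() >= min_len)
        // Stopword filtering: compare in lowercase (an assumption),
        // but keep the token's original casing in the output.
        .filter(|tok| !stopwords.contains(tok.to_lowercase().as_str()))
        .map(str::to_string)
        .collect()
}
```

With the stopwords the, and, of, ny, va and a threshold of 3, the input "The Bank of NY, branch 42 (VA)" reduces to Bank and branch. The crate's built-in stopword list is presumably larger than these five entries, so its results will differ.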

Usage Example

fn main() {
    let vendor_text = "Welcome to store 123 in New York (NY). We sell various items...";
    let tokens = tktax_vendor::preprocess_vendor_description(vendor_text);
    
    // tokens now holds a Vec<String> of relevant, preprocessed words.
    // e.g. ["Welcome", "sell", "various", "items"]
}

Function: `preprocess_vendor_description`

/// Splits a vendor description string into filtered tokens.
/// - Strips punctuation, numeric data, and stopwords.
/// - Returns only tokens longer than 2 characters.
pub fn preprocess_vendor_description(s: &str) -> Vec<String> {
    // ...
}

Parameters

s: The raw vendor description text.

Returns

Vec<String>: The list of filtered tokens.
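
Because the signature returns a Vec<String> rather than a set type, callers should not assume the tokens are unique; this README does not say whether duplicates are removed. For indexing, where each term is wanted once, a caller could normalize and deduplicate the result. The helper below is a hypothetical sketch and is not part of the crate.

```rust
use std::collections::BTreeSet;

/// Hypothetical caller-side helper (not provided by the crate):
/// lowercases each token and collects the distinct values in sorted order.
pub fn unique_tokens(tokens: &[String]) -> BTreeSet<String> {
    tokens.iter().map(|t| t.to_lowercase()).collect()
}
```

A BTreeSet is used here so that iteration order is deterministic, which is convenient when the tokens feed a reproducible index; a HashSet would work equally well if order does not matter.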

Contributing

Fork the repository and create a new branch for your feature or bugfix.
Make your changes, ensuring they are well-tested and documented.
Submit a pull request for review.

License

This project is licensed under the [MIT license](LICENSE).

Enjoy streamlined, efficient vendor data preprocessing with TKTAX Vendor!
