
tktax-vendor

A vendor data preprocessing component for the TKTAX system

1 unstable release

0.2.2 Feb 1, 2025

#8 in #tktax


135 downloads per month
Used in 20 crates (2 directly)

MIT license

43KB
180 lines

README.md

TKTAX Vendor

This crate provides vendor-oriented text preprocessing for the TKTAX system. It parses textual input, segments it into tokens, and excludes terms based on a configurable stopword list. The functionality is especially helpful when generating standardized data for search, indexing, or lexical analysis (Lat. analytica lexica; Gr. λεξιλογική ανάλυση).

Features

  • Tokenization: Splits input on punctuation, whitespace, and numeric characters.
  • Stopword Filtering: Excludes generic terms (e.g., the, and, of) as well as region-specific identifiers (ny, va).
  • Minimum Token Length Threshold: Retains only tokens longer than 2 characters (that is, at least 3).
  • Optional Morphological Transformations: Uncomment the stemmer logic (in preprocess_vendor_description) to enable morphological standardization (Gr. μορφολογία).
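
The features above can be pictured as a small pipeline. The following is a minimal sketch, not the crate's actual implementation: the function name `preprocess_sketch`, the explicit `stopwords` and `min_len` parameters, and the lowercasing step are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the preprocessing step described above.
/// This is NOT the code of `tktax_vendor::preprocess_vendor_description`.
fn preprocess_sketch(text: &str, stopwords: &HashSet<&str>, min_len: usize) -> Vec<String> {
    text
        // Split on anything that is not a letter: punctuation, whitespace, digits.
        .split(|c: char| !c.is_alphabetic())
        // Lowercase so stopword matching ignores case (an assumption of this sketch).
        .map(|tok| tok.to_lowercase())
        // Drop short tokens; this also removes the empty strings
        // produced by adjacent separators.
        .filter(|tok| tok.chars().count() >= min_len)
        // Drop generic and region-specific stopwords.
        .filter(|tok| !stopwords.contains(tok.as_str()))
        .collect()
}

fn main() {
    // Stopword examples taken from the feature list above.
    let stopwords: HashSet<&str> = ["the", "and", "of", "ny", "va"].iter().copied().collect();
    let tokens = preprocess_sketch("Store 123 of the Year, Richmond (VA)", &stopwords, 3);
    println!("{:?}", tokens); // prints ["store", "year", "richmond"]
}
```

Passing the stopword set and minimum length as arguments keeps the sketch easy to test; the crate itself exposes only a single `&str` parameter.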

Usage Example

fn main() {
    let vendor_text = "Welcome to store 123 in New York (NY). We sell various items...";
    let tokens = tktax_vendor::preprocess_vendor_description(vendor_text);
    
    // `tokens` is now a Vec<String> of relevant, preprocessed words,
    // e.g. ["Welcome", "sell", "various", "items"]
    // (illustrative; the actual output depends on the stopword list).
}

Function: preprocess_vendor_description

/// Splits a vendor description string into filtered tokens.
/// - Strips punctuation, numeric data, and stopwords.
/// - Returns only tokens longer than 2 characters.
pub fn preprocess_vendor_description(s: &str) -> Vec<String> {
    // ...
}

Parameters

  • s: The raw vendor description text.

Returns

  • Vec<String>: The filtered tokens.
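
As an illustration of how the returned tokens might be used for indexing or lexical analysis, here is a small sketch that counts token frequencies across several token lists. The helper `token_counts` is hypothetical and not part of the crate, and the sample tokens are hand-written rather than produced by `preprocess_vendor_description`.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: counts how often each token occurs across several
/// token lists. Each inner `Vec<String>` mirrors the shape returned by
/// `preprocess_vendor_description`.
fn token_counts(token_lists: &[Vec<String>]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for tokens in token_lists {
        for tok in tokens {
            // Insert the token with a count of 0 if unseen, then increment.
            *counts.entry(tok.clone()).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    // Hand-written sample tokens; in real use these would come from the crate.
    let lists = vec![
        vec!["coffee".to_string(), "shop".to_string()],
        vec!["coffee".to_string(), "roasters".to_string()],
    ];
    // BTreeMap iterates in sorted key order.
    for (tok, n) in token_counts(&lists) {
        println!("{}: {}", tok, n);
    }
}
```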

Contributing

  1. Fork the repository and create a new branch for your feature or bugfix.
  2. Make your changes, ensuring they are well-tested and documented.
  3. Submit a pull request for review.

License

This project is licensed under the [MIT license](LICENSE).

Enjoy streamlined, efficient vendor data preprocessing with TKTAX Vendor!

Dependencies

~26–37MB
~635K SLoC