1 unstable release: 0.2.2 (Feb 1, 2025)
TKTAX Vendor
This crate provides vendor-oriented text preprocessing for the TKTAX system. It parses textual input, segments it into tokens, and excludes terms based on a configurable stopword list. The functionality is especially helpful when generating standardized data for search, indexing, or lexical analysis (Lat. analytica lexica; Gr. λεξιλογική ανάλυση).
Features
- Tokenization: Splits input on punctuation, whitespace, and numeric characters.
- Stopword Filtering: Excludes generic terms (e.g., `the`, `and`, `of`) as well as region-specific identifiers (`ny`, `va`).
- Minimal Token Length Threshold: Retains only tokens of at least a minimum length (default is 3 characters).
- Optional Morphological Transformations: Uncomment the stemmer logic (in `preprocess_vendor_description`) to enable morphological standardization (Gr. μορφολογία).
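The first three features can be illustrated with a minimal sketch. This is not the crate's implementation: the function name `sketch_preprocess`, the constants `STOPWORDS` and `MIN_TOKEN_LEN`, the short stopword list, and the case-insensitive stopword matching are all assumptions made for illustration. Stemming is left out.

```rust
/// Hypothetical stand-in for the crate's stopword list (the real list is not shown here).
pub const STOPWORDS: &[&str] = &["the", "and", "of", "ny", "va"];

/// Hypothetical default: keep tokens of at least this many characters.
pub const MIN_TOKEN_LEN: usize = 3;

/// Illustrative sketch only: split, drop short tokens, drop stopwords.
pub fn sketch_preprocess(input: &str, stopwords: &[&str], min_len: usize) -> Vec<String> {
    input
        // Split on every non-letter: whitespace, punctuation, and digits.
        .split(|c: char| !c.is_alphabetic())
        // Discard empty fragments left between adjacent separators.
        .filter(|tok| !tok.is_empty())
        // Keep tokens with at least `min_len` characters.
        .filter(|tok| tok.chars().count() >= min_len)
        // Assumption: stopwords are lowercase and matched case-insensitively.
        .filter(|tok| !stopwords.contains(&tok.to_lowercase().as_str()))
        .map(str::to_string)
        .collect()
}
```

With these assumed defaults, `sketch_preprocess("The Bank of NY, store 42 and co.", STOPWORDS, MIN_TOKEN_LEN)` keeps only `Bank` and `store`: `The` and `and` are stopwords, `42` is numeric, and `of`, `NY`, and `co` fall below the length threshold.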
Usage Example
fn main() {
    let vendor_text = "Welcome to store 123 in New York (NY). We sell various items...";
    let tokens = tktax_vendor::preprocess_vendor_description(vendor_text);
    // tokens now holds a Vec<String> of relevant, preprocessed words,
    // e.g. ["Welcome", "sell", "various", "items"]
}
Function: preprocess_vendor_description
/// Splits a vendor description string into filtered tokens.
/// - Strips punctuation, numeric data, and stopwords.
/// - Returns only tokens longer than 2 characters.
pub fn preprocess_vendor_description(s: &str) -> Vec<String> {
    // ...
}
Parameters
- s: The raw vendor description text.
Returns
- Vec<String>: The filtered tokens.
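Because the result is a plain `Vec<String>`, downstream code can consume it with ordinary collection operations. As a rough sketch of one indexing-style use, the hypothetical helper `token_counts` below (not part of this crate) tallies how often each token occurs, ignoring case:

```rust
use std::collections::HashMap;

/// Hypothetical helper, not part of tktax_vendor:
/// counts occurrences of each token, folding case.
pub fn token_counts(tokens: &[String]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for token in tokens {
        // Lowercase the key so "Bank" and "bank" share one entry.
        *counts.entry(token.to_lowercase()).or_insert(0) += 1;
    }
    counts
}
```

The helper takes a slice, so the output of `preprocess_vendor_description` could be passed as `token_counts(&tokens)`.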
Contributing
- Fork the repository and create a new branch for your feature or bugfix.
- Make your changes, ensuring they are well-tested and documented.
- Submit a pull request for review.
License
This project is licensed under the [MIT license](LICENSE).
Enjoy streamlined, efficient vendor data preprocessing with TKTAX Vendor!