#web-crawler #spider #data-transformation #content #cleaning #chunking

spider_transformations

Transformation utils to use for Spider Web Crawler

324 stable releases

new 2.26.1 Jan 15, 2025
2.26.0 Jan 11, 2025
2.23.3 Dec 31, 2024
2.13.84 Nov 30, 2024
0.0.3 Sep 21, 2024

#1567 in Web programming

6,136 downloads per month
Used in spider_utils

MIT license

205KB
5K SLoC

spider_transformations

The Rust spider cloud transformation library built for performance, AI, and multiple locales. The library is used on Spider Cloud for data cleaning.

Usage

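Add the crate to Cargo.toml (the release list above shows the current line is 2.x, so a "0" requirement would only match the old 0.0.x releases):

[dependencies]
spider_transformations = "2"

Then transform a page:

use spider_transformations::transformation::content;

fn main() {
    // `page` comes from the spider object when streaming;
    // it is not constructed in this snippet.
    let conf = content::TransformConfig::default();
    let content = content::transform_content(&page, &conf, &None, &None);
}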

Transform types

  1. Markdown
  2. Commonmark
  3. Text
  4. Markdown (Text Map) or HTML2Text
  5. WIP: HTML2XML
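
To give a feel for what the simplest of these targets (Text) involves, here is a minimal, self-contained sketch that strips tags from an HTML string. The function name `strip_tags` is hypothetical and is not part of this crate's API; the crate's real transforms also deal with entities, scripts, styles, and encodings, which this toy version ignores.

```rust
/// Hypothetical stand-in for a "Text" style transform: drops tags, keeps text.
/// This is NOT the crate's implementation; it ignores entities, scripts, and styles.
fn strip_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => {
                in_tag = true;
                // Treat a tag boundary as a word break so "a</p><p>b" does not fuse.
                out.push(' ');
            }
            '>' if in_tag => in_tag = false,
            // Skip everything between '<' and '>'.
            _ if in_tag => {}
            _ => out.push(ch),
        }
    }
    // Collapse runs of whitespace into single spaces and trim the ends.
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    let html = "<h1>Title</h1><p>Hello world</p>";
    println!("{}", strip_tags(html));
}
```

For real pages, prefer the crate's own transform functions shown under Usage; the sketch only illustrates the shape of the problem.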

Enhancements

  1. Readability
  2. Encoding

Chunking

The transformation module includes several chunking utilities.
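
As an illustration of the general idea, the sketch below splits text into chunks of at most N words using only the standard library. The helper name `chunk_by_words` and its signature are assumptions made for this example; check the crate documentation for the actual chunking functions and their parameters.

```rust
/// Hypothetical helper (not the crate's API): split `text` into chunks
/// containing at most `max_words` whitespace-separated words each.
fn chunk_by_words(text: &str, max_words: usize) -> Vec<String> {
    // `slice::chunks` panics on a size of 0, so clamp to at least 1.
    let size = max_words.max(1);
    let words: Vec<&str> = text.split_whitespace().collect();
    words.chunks(size).map(|c| c.join(" ")).collect()
}

fn main() {
    let text = "one two three four five";
    for chunk in chunk_by_words(text, 2) {
        println!("{chunk}");
    }
}
```

Word-based chunks are a simple way to keep pieces under a rough size budget; sentence- or line-based splitting follows the same pattern with a different delimiter.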

This project includes rewrites and forks of html2md and html2text, made for performance and bug fixes.

License

MIT

Dependencies

~22–37MB
~640K SLoC