sitemapo

An implementation of the Sitemap.xml (or URL inclusion) protocol with support for the txt and xml formats and the video, image, and news extensions.

2 unstable releases

0.2.0 Jul 31, 2023
0.1.0 Jul 15, 2023

#1680 in Parser implementations

MIT license

68KB
1.5K SLoC

Also check out the other xwde projects.

An implementation of the Sitemap (or URL inclusion) protocol in the Rust programming language, with support for the txt and xml formats and the video, image, and news extensions (according to Google's specification).

Features

  • extension: enables all XML sitemap extensions. Enabled by default.
  • tokio: enables the asynchronous parsers & builders.

Examples

  • automatic parser: AutoParser.
#[derive(Debug, thiserror::Error)]
enum CustomError {
    // ..
    #[error("sitemap error: {0}")]
    Sitemap(#[from] sitemapo::Error),
    //..
}

fn main() -> Result<(), CustomError> {
    type SyncReader = std::io::BufReader<std::io::Cursor<Vec<u8>>>;
    // Fetcher callback: given a sitemap URL, return a reader over its contents.
    fn fetch(_: url::Url) -> Result<SyncReader, CustomError> {
        // ..
        unreachable!()
    }

    let sitemaps = Vec::default(); // Sitemaps listed in the robots.txt file.
    let mut parser = sitemapo::AutoParser::new_sync(&sitemaps, fetch);
    // Read records until the parser reports that none are left.
    while let Some(_record) = parser.read_sync()? {
        // ..
    }

    Ok(())
}
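
The example above starts from the sitemaps listed in a robots.txt file. As a rough sketch of where that list might come from, the std-only helper below collects the values of `Sitemap:` directives from a robots.txt body. The name `sitemap_urls` is hypothetical and not part of sitemapo, and a real crawler would more likely rely on a dedicated robots.txt parser. The returned strings would still need to be parsed into whatever URL type the parser expects.

```rust
/// Collects the URLs named by `Sitemap:` directives in a robots.txt body.
/// Hypothetical helper for illustration; it is not part of sitemapo.
pub fn sitemap_urls(robots_txt: &str) -> Vec<String> {
    robots_txt
        .lines()
        .filter_map(|line| {
            // Drop trailing comments (simplification: this also cuts URL fragments).
            let line = line.split('#').next().unwrap_or("").trim();
            // Split "field: value" on the first colon only, so the URL stays intact.
            let (field, value) = line.split_once(':')?;
            // Field names are matched case-insensitively.
            if field.trim().eq_ignore_ascii_case("sitemap") {
                let value = value.trim();
                (!value.is_empty()).then(|| value.to_string())
            } else {
                None
            }
        })
        .collect()
}
```

Lines with other fields, such as `User-agent` or `Disallow`, are skipped, so the function returns only the sitemap locations in the order they appear.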
  • parsers: TxtParser & XmlParser.
use sitemapo::{
    parse::{Parser, TxtParser},
    Error,
};

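fn main() -> Result<(), Error> {
    // A txt sitemap lists one URL per line.
    let buf = "https://example.com/file1.html".as_bytes();

    let mut parser = TxtParser::new(buf)?;
    let _rec = parser.read()?; // Read the next record.
    let _buf = parser.close()?; // Finish parsing and take back the reader.
    Ok(())
}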
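
To make the txt format itself concrete, here is a minimal std-only sketch of what reading such a file involves: one fully specified URL per line, up to the 50,000-URL limit set by the sitemaps.org protocol. It is not how TxtParser is implemented. The function name `txt_sitemap_urls` and the simple scheme check are assumptions made for illustration.

```rust
/// Upper bound on URLs per sitemap file, per the sitemaps.org protocol.
const MAX_URLS: usize = 50_000;

/// Returns the URL lines of a text sitemap, one URL per line.
/// Illustrative only: lines that do not start with an http(s) scheme are
/// skipped here, which is an assumption of this sketch, not crate behavior.
pub fn txt_sitemap_urls(body: &str) -> Vec<&str> {
    body.lines()
        .map(str::trim)
        .filter(|line| line.starts_with("http://") || line.starts_with("https://"))
        .take(MAX_URLS)
        .collect()
}
```

A production parser would also validate each URL and enforce the file size limit, which is the kind of work the crate's parsers are meant to take over.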
  • builders: TxtBuilder & XmlBuilder.
use sitemapo::{
    build::{Builder, XmlBuilder},
    record::EntryRecord,
    Error,
};

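fn main() -> Result<(), Error> {
    let buf = Vec::new(); // In-memory output buffer.
    let rec = EntryRecord::new("https://example.com/".try_into()?);

    let mut builder = XmlBuilder::new(buf)?;
    builder.write(&rec)?; // Append one entry to the sitemap.
    let _buf = builder.close()?; // Finish the document and take back the buffer.
    Ok(())
}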
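
For readers unfamiliar with the xml format, the std-only sketch below shows roughly what a minimal sitemap document looks like on the wire: a `urlset` root in the sitemaps.org namespace with one `url`/`loc` pair per entry, and entity-escaped values. It is an illustration under stated assumptions, not XmlBuilder's actual output or implementation. The names `escape_xml` and `write_urlset` are hypothetical.

```rust
use std::io::{self, Write};

/// Entity-escapes the characters the sitemap protocol requires to be escaped.
fn escape_xml(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Writes a minimal `<urlset>` document with one `<url><loc>` entry per URL,
/// then hands the writer back to the caller.
pub fn write_urlset<W: Write>(mut writer: W, urls: &[&str]) -> io::Result<W> {
    writeln!(writer, r#"<?xml version="1.0" encoding="UTF-8"?>"#)?;
    writeln!(writer, r#"<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">"#)?;
    for url in urls {
        writeln!(writer, "  <url><loc>{}</loc></url>", escape_xml(url))?;
    }
    writeln!(writer, "</urlset>")?;
    Ok(writer)
}
```

Returning the writer at the end loosely mirrors the write-then-close flow of the builder example, so the caller can recover an in-memory buffer such as a `Vec<u8>`.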

Notes

  • Extensions are not yet implemented.
  • AutoParser does not yet support txt sitemaps.

Dependencies

~4–12MB
~137K SLoC