Minimal Rust version: 1.36. Nightly Rust version: from March 30, 2020.

robots_txt

robots_txt is a lightweight robots.txt parser and generator written in Rust.

Nothing extra.

Unstable

The implementation is WIP.

Installation

robots_txt is available on crates.io and can be included in your Cargo-enabled project like this:

Cargo.toml:

[dependencies]
robots_txt = "0.7"
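
The rendering example further down also parses the host from a Url value. Assuming that type comes from the separate url crate rather than being re-exported by robots_txt, the manifest would need one more entry; the version below is only illustrative:

[dependencies]
robots_txt = "0.7"
url = "2"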

Parsing & matching paths against rules

use robots_txt::Robots;
// SimpleMatcher is assumed to live in the crate's matcher module
use robots_txt::matcher::SimpleMatcher;

static ROBOTS: &'static str = r#"

# robots.txt for http://www.site.com
User-Agent: *
Disallow: /cyberworld/map/ # this is an infinite virtual URL space
# Cybermapper knows where to go
User-Agent: cybermapper
Disallow:

"#;

fn main() {
    let robots = Robots::from_str(ROBOTS);

    let matcher = SimpleMatcher::new(&robots.choose_section("NoName Bot").rules);
    assert!(matcher.check_path("/some/page"));
    assert!(matcher.check_path("/cyberworld/welcome.html"));
    assert!(!matcher.check_path("/cyberworld/map/object.html"));

    let matcher = SimpleMatcher::new(&robots.choose_section("Mozilla/5.0; CyberMapper v. 3.14").rules);
    assert!(matcher.check_path("/some/page"));
    assert!(matcher.check_path("/cyberworld/welcome.html"));
    assert!(matcher.check_path("/cyberworld/map/object.html"));
}
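
The calls above can be wrapped into a small helper. The following is only a sketch built from the API shown in this example (from_str, choose_section, SimpleMatcher, check_path); the function name is_allowed and the SimpleMatcher import path are assumptions, not part of the documented API:

use robots_txt::Robots;
use robots_txt::matcher::SimpleMatcher; // assumed module path

// Returns true if the given user agent may fetch the given path
// according to the robots.txt contents passed in.
fn is_allowed(robots_txt: &str, user_agent: &str, path: &str) -> bool {
    let robots = Robots::from_str(robots_txt);
    let matcher = SimpleMatcher::new(&robots.choose_section(user_agent).rules);
    matcher.check_path(path)
}

fn main() {
    let txt = "User-Agent: *\nDisallow: /private/\n";
    assert!(is_allowed(txt, "NoName Bot", "/public/page.html"));
    assert!(!is_allowed(txt, "NoName Bot", "/private/page.html"));
}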

Building & rendering

main.rs:

extern crate robots_txt;
extern crate url;

use robots_txt::Robots;
// Url is assumed to come from the url crate
use url::Url;

fn main() {
    let robots1 = Robots::builder()
        .start_section("cybermapper")
            .disallow("")
            .end_section()
        .start_section("*")
            .disallow("/cyberworld/map/")
            .end_section()
        .build();

    let conf_base_url: Url = "https://example.com/".parse().expect("parse domain");
    let robots2 = Robots::builder()
        .host(conf_base_url.domain().expect("domain"))
        .start_section("*")
            .disallow("/private")
            .disallow("")
            .crawl_delay(4.5)
            .request_rate(9, 20)
            .sitemap("http://example.com/sitemap.xml".parse().unwrap())
            .end_section()
        .build();
        
    println!("# robots.txt for http://cyber.example.com/\n\n{}", robots1);
    println!("# robots.txt for http://example.com/\n\n{}", robots2);
}

As a result, we get:

# robots.txt for http://cyber.example.com/

User-agent: cybermapper
Disallow:

User-agent: *
Disallow: /cyberworld/map/


# robots.txt for http://example.com/

User-agent: *
Disallow: /private
Disallow:
Crawl-delay: 4.5
Request-rate: 9/20
Sitemap: http://example.com/sitemap.xml

Host: example.com
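
Because the built value renders through Display (that is what the println! calls above rely on), the generated rules can also be rendered to a string and written straight to a robots.txt file. A minimal sketch under that assumption, using only builder calls already shown above; the output path is arbitrary:

use std::fs;

use robots_txt::Robots;

fn main() {
    // Build a minimal rule set, render it, and persist it on disk.
    let robots = Robots::builder()
        .start_section("*")
            .disallow("/private")
            .end_section()
        .build();
    fs::write("robots.txt", robots.to_string()).expect("write robots.txt");
}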

Alternatives

License

Licensed under either of

Apache License, Version 2.0
MIT License

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
