7 releases

0.1.6 May 28, 2024
0.1.5 May 21, 2024
0.1.4 Mar 1, 2023
0.1.1 Feb 28, 2023

#737 in Web programming


Used in 3 crates (via progscrape-scrapers)

Apache-2.0 OR MIT

685KB
385 lines

urlnorm


URL normalization library for Rust, mainly designed to normalize URLs for https://progscrape.com.

The normalization algorithm uses the following heuristics:

  • The scheme of the URL is dropped, so that http://example.com and https://example.com are considered equivalent.
  • The host is normalized by dropping common prefixes such as "www." and "m.".
  • The path is normalized by removing duplicate slashes and empty path segments, so that http://example.com//foo/ and http://example.com/foo are considered equivalent.
  • The query string parameters are sorted, and analytics query parameters (e.g. utm_XYZ and the like) are removed.
  • Fragments are dropped, except for fragment patterns that are recognized as significant (/#/ and #!).
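
The heuristics above can be sketched with the standard library alone. This is an illustration, not the crate's implementation: normalize_sketch is a hypothetical helper, it splits the URL by hand instead of using a real URL parser, it treats any fragment starting with "/" or "!" as significant, and it ignores details such as ports, percent-encoding, and file extensions. Its output format (tokens, each followed by ":") is modeled on the normalization strings shown under Examples.

```rust
/// Hypothetical helper (not part of urlnorm): a std-only sketch of the
/// normalization heuristics. Returns tokens, each followed by ':'.
fn normalize_sketch(url: &str) -> String {
    // Drop the scheme, so http:// and https:// are treated alike.
    let rest = url.split_once("://").map_or(url, |(_, r)| r);

    // Split off the fragment first, then the query string.
    let (rest, fragment) = match rest.split_once('#') {
        Some((before, frag)) => (before, Some(frag)),
        None => (rest, None),
    };
    let (rest, query) = match rest.split_once('?') {
        Some((before, q)) => (before, Some(q)),
        None => (rest, None),
    };

    // Separate the host from the path.
    let (host, path) = match rest.split_once('/') {
        Some((h, p)) => (h, p),
        None => (rest, ""),
    };

    // Lowercase the host and strip "www." / "m." prefixes repeatedly.
    let lower = host.to_ascii_lowercase();
    let mut host: &str = &lower;
    loop {
        if let Some(s) = host.strip_prefix("www.") {
            host = s;
        } else if let Some(s) = host.strip_prefix("m.") {
            host = s;
        } else {
            break;
        }
    }

    // Skipping empty segments removes duplicate and trailing slashes.
    let mut tokens: Vec<String> = vec![host.to_string()];
    tokens.extend(path.split('/').filter(|s| !s.is_empty()).map(str::to_string));

    // Drop utm_* analytics parameters and sort what remains.
    let mut params: Vec<(&str, &str)> = query
        .unwrap_or("")
        .split('&')
        .filter(|p| !p.is_empty())
        .map(|p| p.split_once('=').unwrap_or((p, "")))
        .filter(|(k, _)| !k.starts_with("utm_"))
        .collect();
    params.sort();
    for (k, v) in params {
        tokens.push(k.to_string());
        if !v.is_empty() {
            tokens.push(v.to_string());
        }
    }

    // Keep only fragments that look like "#/..." or "#!..." (an approximation).
    if let Some(frag) = fragment {
        if frag.starts_with('/') || frag.starts_with('!') {
            tokens.extend(
                frag.trim_start_matches('!')
                    .split('/')
                    .filter(|s| !s.is_empty())
                    .map(str::to_string),
            );
        }
    }

    tokens.iter().map(|t| format!("{t}:")).collect()
}
```

With this sketch, http://example.com//foo/ and https://example.com/foo both map to the same string, and the order of query parameters does not affect the result.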

Usage

For long-term storage and clustering of URLs, use UrlNormalizer::compute_normalization_string to compute a representation of the URL that can be compared with standard string comparison operators.

Normalization strings do not cluster content perfectly, but they tend to group URLs that point to the same data. For more accurate clustering, this library can be paired with a more advanced DUST-aware processing algorithm (for example, see DustBuster from "Do Not Crawl in the DUST: Different URLs with Similar Text").

use url::Url;
use urlnorm::UrlNormalizer;
let norm = UrlNormalizer::default();
let url = Url::parse("http://www.google.com").unwrap();
assert_eq!(norm.compute_normalization_string(&url), "google.com:");
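
Because normalization strings compare with ordinary string equality, clustering can be reduced to grouping URLs by key. The following is a minimal std-only sketch of that idea. Both cluster_by_key and toy_key are hypothetical names introduced here; toy_key is only a stand-in so the block runs without dependencies, and in real use the key function would parse the URL and call UrlNormalizer::compute_normalization_string as in the example above.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper (not part of urlnorm): groups URLs by the string
/// that `key_fn` returns for each of them.
fn cluster_by_key<'a, F>(urls: &[&'a str], key_fn: F) -> BTreeMap<String, Vec<&'a str>>
where
    F: Fn(&str) -> String,
{
    let mut clusters: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for &url in urls {
        // URLs with equal keys land in the same bucket.
        clusters.entry(key_fn(url)).or_default().push(url);
    }
    clusters
}

/// Stand-in key function for illustration only: drops the scheme and a
/// leading "www.". A real caller would use the normalization string instead.
fn toy_key(url: &str) -> String {
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    rest.strip_prefix("www.").unwrap_or(rest).to_string()
}
```

Calling cluster_by_key(&urls, toy_key) returns a map from key to the original URLs that share it. A BTreeMap is used so that iteration order over the keys is deterministic, which is convenient when the clusters are written to long-term storage.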

For more advanced use cases, the Options struct allows users to provide custom regular expressions for normalization.

Examples

The normalization string gives an idea of what parts of the URL are considered significant:

http://efekarakus.github.io/twitch-analytics/#/revenue
efekarakus.github.io:twitch-analytics:revenue:

http://fusion.net/story/121315/maybe-crickets-arent-the-food-of-the-future-after-all/?utm_source=facebook&utm_medium=social&utm_campaign=quartz
fusion.net:story:121315:maybe-crickets-arent-the-food-of-the-future-after-all:

http://www.capradio.org/news/npr/story?storyid=382276026
capradio.org:news:npr:story:storyid:382276026:

http://www.charlotteobserver.com/2015/02/23/5534630/charlotte-city-council-approves.html#.VOxrajTF91E
charlotteobserver.com:2015:02:23:5534630:charlotte-city-council-approves:

http://www.m.webmd.com/melanoma-skin-cancer/news/20150409/fewer-us-children-getting-melanoma-study?src=RSS_PUBLIC
webmd.com:melanoma-skin-cancer:news:20150409:fewer-us-children-getting-melanoma-study:src:RSS_PUBLIC:

Dependencies

~4–6MB
~108K SLoC