#xml-parser #parser #xml-format #swissprot #trembl #uniref #uniprotkb

uniprot

Rust data structures and parser for the Uniprot database(s)

11 releases (6 breaking)

0.7.0 Oct 18, 2023
0.6.0 Oct 17, 2022
0.5.2 Feb 28, 2022
0.5.1 Jan 11, 2022
0.1.1 Jan 15, 2020

#199 in Biology

MIT license

195KB
4.5K SLoC

uniprot.rs

Rust data structures and parser for the UniProtKB database(s).


🔌 Usage

The uniprot::uniprot::parse function returns an iterator over the entries of a UniProtKB database in XML format (either SwissProt or TrEMBL). XML files for UniRef and UniParc can be parsed in the same way, with uniprot::uniref::parse and uniprot::uniparc::parse respectively.

extern crate uniprot;

let f = std::fs::File::open("tests/uniprot.xml")
    .map(std::io::BufReader::new)
    .unwrap();

for r in uniprot::uniprot::parse(f) {
    let entry = r.unwrap();
    // ... process the UniProt entry ...
}

Any BufRead implementor can be used as input, so the database files can be streamed directly from their online location, either over HTTP with a library such as reqwest or over FTP with the ftp library.
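To illustrate what accepting "any BufRead implementor" means in practice, here is a minimal, standard-library-only sketch. The function `count_entries` is a hypothetical stand-in for a parser entry point, not part of this crate: it only counts lines that open an `<entry` element. The point is that the same generic function accepts a byte slice, an in-memory cursor, or any `Read` wrapped in a `BufReader`.

```rust
use std::io::{self, BufRead, BufReader, Cursor};

/// Hypothetical stand-in for a parser entry point (not part of the crate):
/// accepts any `BufRead` and counts the lines that open an `<entry` element.
fn count_entries<R: BufRead>(reader: R) -> io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        if line?.trim_start().starts_with("<entry") {
            count += 1;
        }
    }
    Ok(count)
}

/// Feeds the same XML text through three different `BufRead` sources.
fn compare_sources(xml: &str) -> io::Result<(usize, usize, usize)> {
    // A byte slice implements `BufRead` directly.
    let from_slice = count_entries(xml.as_bytes())?;
    // So does an in-memory cursor.
    let from_cursor = count_entries(Cursor::new(xml))?;
    // Any plain `Read` (a file, a socket, an HTTP body) can be adapted
    // by wrapping it in a `BufReader`.
    let from_wrapped = count_entries(BufReader::new(xml.as_bytes()))?;
    Ok((from_slice, from_cursor, from_wrapped))
}
```

Because the function is generic over the reader, a network response body or a decompression stream could presumably be passed the same way once wrapped in a `BufReader`.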

The XML format is the same for the EBI REST API and for the UniProt API, so this library can also be used to read single entries or larger queries. For instance, you can search UniProt for a keyword and retrieve all the matching entries:

extern crate ureq;
extern crate libflate;
extern crate uniprot;

let query = "bacteriorhodopsin";
let query_url = format!("https://www.uniprot.org/uniprot/?query={}&format=xml&compress=yes", query);

let req = ureq::get(&query_url).set("Accept", "application/xml");
let reader = libflate::gzip::Decoder::new(req.call().unwrap().into_reader()).unwrap();

for r in uniprot::uniprot::parse(std::io::BufReader::new(reader)) {
    let entry = r.unwrap();
    // ... process the Uniprot entry ...
}

See the online documentation at docs.rs for more examples, and some details about the different features available.

📝 Features

  • threading (enabled by default): compiles the multithreaded parser that offers a 90% speed increase when processing XML files.
  • url-links (disabled by default): exposes the links in OnlineInformation as an url::Url.
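As a sketch of how these features might be selected in a Cargo.toml manifest: the version number is taken from the latest release listed on this page, and the fragment assumes only the two features named above matter for your build. Note that `default-features = false` turns off every default feature, including threading.

```toml
# Default setup: the threading feature is enabled.
[dependencies]
uniprot = "0.7"

# Alternative: drop the default features (single-threaded parser only)
# and opt in to url-links instead. Use one of the two declarations, not both.
# uniprot = { version = "0.7", default-features = false, features = ["url-links"] }
```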

🔍 See Also

If you're a bioinformatician and a Rustacean, you may be interested in these other libraries:

  • pubchem.rs: Rust data structures and API client for the PubChem API.
  • obofoundry.rs: Rust data structures for the OBO Foundry.
  • fastobo: Rust parser and abstract syntax tree for Open Biomedical Ontologies.
  • proteinogenic: Chemical structure generation for protein sequences as SMILES strings.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

📜 License

This library is provided under the open-source MIT license.

This project is in no way affiliated with, sponsored by, or otherwise endorsed by the UniProt Consortium. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

Dependencies

~2.3–3MB
~52K SLoC