app biodata-parsers

Scripts for parsing UniParc XML files downloaded from the UniProt website into CSV files

1 unstable release

Uses old Rust 2015

0.1.0 Jan 15, 2017

MIT license

46KB
1K SLoC

Rust 518 SLoC // 0.0% comments
Python 502 SLoC // 0.1% comments

UniParc XML parser

Process the UniParc XML file (uniparc_all.xml.gz) downloaded from the UniProt website into CSV files that can be loaded into a relational database.
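The general XML-to-CSV approach can be illustrated with a minimal Python sketch. It is not the crate's actual implementation. The element and attribute names (`entry`, `accession`, `dbReference`, `sequence`), the two-table split, and the `parse_to_csv` helper are assumptions made for illustration. The real UniParc file also declares an XML namespace, which this simplified sample omits.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; the real file is namespaced and far richer.
SAMPLE = b"""<uniparc>
  <entry>
    <accession>UPI0000000001</accession>
    <dbReference type="UniProtKB/Swiss-Prot" id="P12345" active="Y"/>
    <dbReference type="EMBL" id="AAA00001" active="N"/>
    <sequence length="5" checksum="ABC">MKVLA</sequence>
  </entry>
  <entry>
    <accession>UPI0000000002</accession>
    <dbReference type="RefSeq" id="NP_000001" active="Y"/>
    <sequence length="3" checksum="DEF">MKT</sequence>
  </entry>
</uniparc>"""


def parse_to_csv(xml_stream, seq_out, xref_out):
    """Stream entries from xml_stream and write one CSV table per output.

    seq_out gets one row per entry; xref_out gets one row per cross-reference,
    keyed by accession so the two tables can be joined in a database.
    Returns the number of entries processed.
    """
    seq_writer = csv.writer(seq_out, lineterminator="\n")
    xref_writer = csv.writer(xref_out, lineterminator="\n")
    count = 0
    # iterparse yields each element once its closing tag has been read,
    # so the whole document never needs to be held in memory at once.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "entry":
            continue
        accession = elem.findtext("accession")
        seq = elem.find("sequence")
        seq_writer.writerow(
            [accession, seq.get("length"), seq.get("checksum"), seq.text.strip()]
        )
        for ref in elem.iter("dbReference"):
            xref_writer.writerow(
                [accession, ref.get("type"), ref.get("id"), ref.get("active")]
            )
        elem.clear()  # drop the entry's children to keep memory use low
        count += 1
    return count
```

Calling `parse_to_csv(io.BytesIO(SAMPLE), seq_file, xref_file)` with two open text files (or `io.StringIO` buffers) writes one table per file. For the gzipped download, the input stream could come from `gzip.open(path, "rb")` instead.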

Example

Parsing 1 million lines takes about 5.5 seconds:

$ mkdir uniparc
$ time bash -c "zcat tests/uniparc_1mil.xml.gz | uniparc_xml_parser >/dev/null"

real    0m5.564s
user    0m5.528s
sys     0m0.132s

The full uniparc_all.xml.gz file contains about 5 billion lines.
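For a rough sense of scale, the benchmark above can be extrapolated to the full file. The sketch below assumes that throughput stays linear in the number of lines and that the 5 billion figure counts lines like those in the 1 million line test file. Neither assumption is confirmed by the source, and `estimate_hours` is a hypothetical helper.

```python
def estimate_hours(total_lines, sample_lines=1_000_000, sample_seconds=5.564):
    """Linearly scale the measured wall-clock time to total_lines."""
    return total_lines / sample_lines * sample_seconds / 3600


# 5 billion lines at the measured rate: roughly 7.7 hours
print(round(estimate_hours(5_000_000_000), 1))
```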

Dependencies

~405KB