4 releases
Uses old Rust 2015
Version | Released |
---|---|
0.1.3 | Mar 24, 2017 |
0.1.2 | Mar 20, 2017 |
0.1.1 | Mar 19, 2017 |
0.1.0 | Mar 19, 2017 |
opus_tools: Miscellaneous tools for working with the OPUS parallel corpus
These are small utilities for working with the OPUS parallel corpus, which is normally used for machine translation research. To install:
curl https://sh.rustup.rs -sSf | sh
cargo install opus_tools
opusraw2txt: Extract raw text from a raw, monolingual file
Download the file ca.raw.tar.gz from the right-hand column of the subtitle page and run:
opusraw2txt ca.raw.tar.gz
This will print a huge number of sentences on standard output in UTF-8 format for further processing.
If you want to process an entire directory of files, you could install GNU parallel and szip, and run:
ls *.raw.tar.gz |
sed 's/\.raw\.tar\.gz$//' |
parallel --joblog out.log 'opusraw2txt {}.raw.tar.gz | szip > {}.sz'
This will rapidly extract a huge number of sentences:
Extracted 26782811 sentences from 27605 files.
Extracted 80140630 sentences from 90319 files.
Extracted 79320 sentences from 89 files.
Extracted 112360292 sentences from 124815 files.
Extracted 22917237 sentences from 23492 files.
Extracted 229583 sentences from 188 files.
Extracted 7335505 sentences from 6438 files.
Extracted 38677592 sentences from 44584 files.
Extracted 101502145 sentences from 114150 files.
...and so on.
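If you want a grand total across all languages, the per-archive summary lines above can be added up with a few lines of awk. This is only a sketch: it assumes you have captured those "Extracted ..." lines in a file (the name summary.log is hypothetical, and how you capture them depends on where opusraw2txt writes them), and the function name sum_extracted is made up for this example.

```shell
# Sum the "Extracted N sentences from M files." lines read from stdin.
# Lines that do not match the pattern are ignored.
sum_extracted() {
  awk '
    /^Extracted [0-9]+ sentences from [0-9]+ files\.$/ {
      sentences += $2
      files += $5
    }
    END {
      # %.0f avoids integer overflow in awk implementations with 32-bit %d
      printf "%.0f sentences from %.0f files\n", sentences, files
    }
  '
}

# Usage, assuming the summary lines were saved to a hypothetical summary.log:
#   sum_extracted < summary.log
```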
If you see:
couldn't process OpenSubtitles2016/raw/es/2015/4544966/6155032.xml.gz (skipping):
Error: corrupt deflate stream
Error: couldn't process es.raw.tar.gz
Caused by: corrupt deflate stream
...this means that the file you downloaded was truncated before the end. As far as I can tell, this affects the master copies of es.raw.tar.gz and pt_br.raw.tar.gz.
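To spot a truncated download before starting a long extraction run, you could test each archive with gzip -t first. This is a rough sketch, not part of opus_tools: check_archive is a hypothetical helper name, and the test only covers the outer gzip layer of each .tar.gz, so it will not catch damage confined to an individual file inside the archive.

```shell
# Return 0 if the gzip stream in "$1" is complete, non-zero otherwise.
# Only the outer .gz layer is tested, not the files inside the tarball.
check_archive() {
  gzip -t "$1" 2>/dev/null
}

# Report every archive in the current directory that fails the test.
for f in *.raw.tar.gz; do
  [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
  check_archive "$f" || echo "truncated or corrupt: $f" >&2
done
```

An archive flagged here will most likely need to be downloaded again; if the master copy itself is truncated, re-downloading will not help.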
Contributing
Your feedback and contributions are welcome! For more information, see the subtitles-rs project.