2 releases
0.1.1 | Apr 2, 2023 |
---|---|
0.1.0 | Apr 2, 2023 |
#8 in #wikimedia
21 downloads per month
235KB
5K
SLoC
wikimedia
crates (including wmd
CLI tool)
Open source Rust libraries and tools for downloading and viewing data from Wikimedia Foundation, the non-profit behind Wikipedia and other projects.
There are 3 related crates in the wikimedia-rs
source repository:
wikimedia
: library to download and parse data from Wikimedia.
Crate | Documentationwikimedia-store
: library to store MediaWiki pages, supporting search and import from Wikimedia dump files.
Crate | Documentationwikimedia-download
: CLI toolwmd
to download data from Wikimedia and view it over a web interface.
Crate
To install
Install the Rust toolchain. rustup
is the
standard tool to do this.
These crates are primarily developed and tested on Linux. They are
tested occasionally on macOS. The utility scripts in bin/
are
written in bash
, and all further instructions will assume you are
running on a Unix-based machine.
The dependencies selected are cross-platform so hopefully would work on a Windows machine, but this hasn't been tested at all. If you want to run on Windows, try using WSL or Cygwin. Pull requests welcome if you want to add support!
Run this command to build and install wmd
with
Rust's package manager cargo
:
RUSTFLAGS="--cfg tracing_unstable" cargo +nightly install wikimedia-download
By default this will install wmd
to ~/.cargo/bin/wmd
, make sure this is on
your shell's path.
Viewing the downloaded pages requires pandoc
on your executable path
to convert the MediaWiki Wikitext markup to HTML.
See their releases download page,
and their installation instructions page.
Quick start
To show help instructions for wmd
's subcommands:
# Help for `wmd` and its list of subcommands
wmd help
# Help for the subcommand `download` use one of these:
wmd help download
wmd download --help
To download the text for the latest version of all articles on English Wikipedia (about 20 GB to download as of 2023-03-20):
export WMD_MIRROR_URL=https://ftp.acc.umu.se/mirror/wikimedia.org/dumps
wmd download --dump enwiki \
--version latest \
--job articlesdump
Example download finished message:
Downloading job files complete
| download_dir = /home/alex/wm/out/dumps/enwiki/20230320/articlesdump
| dump = enwiki
| version = 20230320
| job = articlesdump
By default the files will be downloaded to
~/.local/share/wmd
on Linux~/Library/Application Support/wmd
on macOSC:\Users\%USERNAME%\AppData\Local\wmd
on Windows
This can be overriden with the environment variable WMD_OUT_DIR
or
CLI argument --out-dir
; see wmd help download
for more
information.
If some files have already been downloaded, their checksums will be verified and if correct they will not be downloaded again.
Wikimedia's main data dump download server
limits the number of network connections and download rate, so it's recommended to choose a dump mirror to download from. Official mirrors are listed on these pages:
1
2.
The example in the script above is the one I've been using to test wmd
; it's located in Sweden, which is geographically close to me.
To easily retrieve the articles they must be imported into wmd
's store:
wmd import-dump --dump enwiki \
--version 20230320 \
--job articlesdump
Use the same dump and job you downloaded earlier, and the version that wmd download
reported.
An import of the latest version of all articles on English Wikipedia will occupy about 80 GB of disk storage. This is larger than the download size because the store is currently not compressed, but this is planned.
Once the import command is done, you can view the downloaded pages in the web interface:
wmd web
Example output:
2023-04-02T22:25:08.86645692Z INFO wmd::commands::web: Listening on http, url: http://localhost:8089/
at crates/wikimedia-download/src/commands/web.rs:89 on ThreadId(1)
Visit the URL in the log message: http://localhost:8089
.
Set the environment varible RUST_LOG
to configure logging levels and filtering. This application uses the tracing-subscriber
crate for logging, see their documentation for the available logging configuration directives. Note that many of these directives can be supplied separated by commas.
Shell completion setup
The currently supported shells are: bash, elvish, fish, powershell, and zsh.
Save a completion file for your shell with wmd completion
, for example for zsh:
wmd completion --shell zsh > completion.zsh
Then follow your shell's instructions to load the file.
wmd
's argument parsing is implemented with the clap
crate,
and shell completion files are generated with the
clap_complete
crate.
Development instructions
- Clone the source code with git:
git clone https://github.com/fluffysquirrels/wikimedia-rs.git
- Build with the script
bin/build
. This builds in release mode, which is strongly recommended. In debug mode without optimisations some of the data processing will be very slow. - Run tests with
bin/test
. - Build then run with
bin/wmd
.
I recommend adding a symlink to this file on your path. On my machine~/bin
is on my path, so I just needed to runln -s ${PWD}/bin/wmd ~/bin
.
Store pages are encoded using Cap'n Proto's
Rust implementation capnproto-rust
.
This requires generating accessor code (currently in crates/wikimedia-store/capnp/generated
)
from .capnp
schema files (currently in crates/wikimedia-store/capnp/
).
The accessor code is checked into the source code repository so the
generator la tools do not need to be installed to build the code when
using cargo install
or the build scripts, but if the schema files
are modified the build scripts will detect this and regenerate the
accessor code automatically.
You must install these dependencies on your executable path to regenerate the accessor code:
capnp
, the Cap'n Proto schema compiler; see the install instructions.capnpc-rust
, thecapnp
Rust plugin. Install it withcargo install capnpc
.
Dependencies
~55–74MB
~1.5M SLoC