17 stable releases (3 major)
Version | Released |
---|---|
4.0.1 | Mar 28, 2024 |
3.9.0 | Nov 13, 2023 |
3.8.0 | Oct 31, 2023 |
3.5.2 | Sep 23, 2023 |
1.3.0 | May 21, 2023 |
#1099 in Database interfaces
43 downloads per month
195KB
5K SLoC
kgdata
KGData is a library for processing Wikipedia and Wikidata dumps. What it can do:
- Clean up the dumps to ensure the data is consistent (resolve redirects, remove dangling references).
- Create embedded key-value databases to access entities from the dumps.
- Extract Wikidata ontology.
- Extract Wikipedia tables and convert the hyperlinks to Wikidata entities.
- Create Pyserini indices to search Wikidata’s entities.
- and more
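To make the first item concrete, here is a toy sketch of what resolving redirects and removing dangling references means on a tiny in-memory graph. It does not use kgdata's API. The function name `resolve_redirects`, the dict-based data layout, and the sample IDs are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def resolve_redirects(
    entities: Dict[str, List[str]], redirects: Dict[str, str]
) -> Dict[str, List[str]]:
    """Hypothetical helper: rewrite references through redirects,
    then drop references to entities that do not exist."""

    def follow(entity_id: str) -> str:
        # Follow redirect chains; `seen` guards against redirect cycles.
        seen = set()
        while entity_id in redirects and entity_id not in seen:
            seen.add(entity_id)
            entity_id = redirects[entity_id]
        return entity_id

    cleaned = {}
    for entity_id, refs in entities.items():
        resolved = [follow(ref) for ref in refs]
        # A reference is dangling if its target is not a known entity.
        cleaned[entity_id] = [ref for ref in resolved if ref in entities]
    return cleaned


# Made-up sample data: Q9 redirects to Q3, and Q404 does not exist.
entities = {"Q1": ["Q2", "Q9", "Q404"], "Q2": ["Q1"], "Q3": []}
redirects = {"Q9": "Q3"}
cleaned = resolve_redirects(entities, redirects)
print(cleaned)
```

In this sketch the reference to Q9 is rewritten to Q3 and the reference to Q404 is dropped, so every remaining reference points at an entity that is present in the data.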
For full documentation, please see the website.
Installation
From PyPI (using pre-built binaries):
pip install kgdata[spark] # omit [spark] to install Spark yourself, e.g. when your cluster runs a different version
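As a rough aid for choosing between the two install variants, the sketch below checks whether a Spark Python package is already installed and prints a matching command. It assumes that the `spark` extra pulls in the `pyspark` package from PyPI, which the text above does not confirm. The helper names `installed_version` and `suggest_install` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def suggest_install(pyspark_version: Optional[str]) -> str:
    """Hypothetical helper: pick an install command for kgdata."""
    if pyspark_version is None:
        # No Spark found locally: let the extra choose a version.
        return "pip install kgdata[spark]"
    # Assumption: the extra maps to the `pyspark` package, so we pin it
    # explicitly to keep the version that is already in place.
    return f"pip install kgdata pyspark=={pyspark_version}"


print(suggest_install(installed_version("pyspark")))
```

On a cluster, the version to pin is the one the cluster runs, which may differ from whatever is installed on the local machine.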
Dependencies
~47–77MB
~1.5M SLoC