Version 0.1.0 (unstable), released Mar 30, 2025
tpchgen-rs
Blazing fast TPC-H benchmark data generator in pure Rust!
Features
- Zero-dependency TPC-H data generator crate for easy embedding
- Blazing speed (see below)
- Batteries included, multi-threaded CLI
Benchmarks
(coming soon)
Measuring Performance
This generator is fast enough to saturate the throughput of most IO devices at
the time of writing. To see its true speed, run it on a machine with a fast IO
device (SSD or NVMe). Alternatively, use the --stdout flag and the pv command to
send the output to /dev/null and measure the throughput.
For example:
# Generate SF=100, about 100GB of data, piped to /dev/null, reporting statistics
tpchgen-cli -s 100 --stdout | pv -arb > /dev/null
# Reports something like
# 106GiB [3.09GiB/s] (3.09GiB/s)
Similarly, for Parquet output:
# Generate SF=100 in parquet format, piped to /dev/null, reporting statistics
tpchgen-cli -s 100 --format=parquet --stdout | pv -arb > /dev/null
# 38.2GiB [ 865MiB/s] ( 865MiB/s)
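If pv is not installed, a rough average rate can still be obtained by counting the bytes a command writes to stdout and dividing by the elapsed time. The sketch below is only an illustration: the throughput helper is a hypothetical name, not part of tpchgen-cli, it has whole-second resolution, and counting bytes with wc may itself limit the rate on very fast machines.

```shell
# Hypothetical helper: run a command, count the bytes it writes to stdout,
# and print the average rate. Resolution is whole seconds only.
throughput() {
  start=$(date +%s)
  bytes=$("$@" | wc -c | tr -d ' ')
  end=$(date +%s)
  secs=$((end - start))
  # Runs shorter than one second are reported as one second
  # to avoid dividing by zero.
  [ "$secs" -gt 0 ] || secs=1
  awk -v b="$bytes" -v s="$secs" \
    'BEGIN { printf "%s bytes in %ds (%.2f MiB/s)\n", b, s, b / s / 1048576 }'
}

# Example (assumes tpchgen-cli is on PATH):
#   throughput tpchgen-cli -s 100 --stdout
```

The figures reported by pv above remain the better reference; this helper is only a fallback for a quick sanity check.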
Structure
tpchgen-cli is a dbgen-compatible CLI tool that generates tables from the TPC-H
benchmark dataset.
tpchgen is the library that implements the data generation logic for TPC-H. It
can be used to embed data generation natively in Rust.
CLI Usage
We tried to keep the tpchgen-cli experience as close to dbgen as possible, so
that it can serve as a drop-in replacement.
$ tpchgen-cli -h
TPC-H Data Generator

Usage: tpchgen-cli [OPTIONS] --output-dir <OUTPUT_DIR>

Options:
  -s, --scale-factor <SCALE_FACTOR>  Scale factor to address defaults to 1 [default: 1]
  -o, --output-dir <OUTPUT_DIR>      Output directory for generated files
  -t, --tables <TABLES>              Which tables to generate (default: all) [possible values: nation, region, part, supplier, part-supp, customer, orders, line-item]
  -p, --parts <PARTS>                Number of parts to generate (for parallel generation) [default: 1]
      --part <PART>                  Which part to generate (1-based, only relevant if parts > 1) [default: 1]
  -h, --help                         Print help
For example, to generate a dataset with a scale factor of 1 (about 1GB):
$ tpchgen-cli -s 1 --output-dir=/tmp/tpch
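The --parts and --part options split generation into independent pieces, so each piece can be produced by a separate process. The following is a minimal sketch of how such invocations might be assembled. The part_commands helper is hypothetical, and giving each part its own output directory is an assumption made here to keep outputs from colliding, not documented behavior. The helper only prints commands; it does not run them.

```shell
# Hypothetical helper: print one tpchgen-cli command per part.
# Nothing is executed; the caller decides how to run the commands.
# Usage: part_commands SCALE_FACTOR PARTS OUTPUT_DIR
part_commands() {
  sf=$1
  parts=$2
  out=$3
  part=1
  while [ "$part" -le "$parts" ]; do
    # Assumption: a separate directory per part avoids file name clashes.
    echo "tpchgen-cli -s $sf --parts $parts --part $part --output-dir=$out/part-$part"
    part=$((part + 1))
  done
}

# Print the four commands for a scale factor 10 dataset split into 4 parts.
part_commands 10 4 /tmp/tpch
```

The printed lines can be reviewed first and then run on separate machines or as background jobs, depending on how much parallelism the storage can absorb.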
Contributing
Pull requests are welcome. For major changes, please open an issue first for discussion. See our contributors guide for more details.
Architecture
Please see the architecture guide for details on how the code is structured.
License
The project is licensed under the Apache 2.0 license.