
tpchgen

Blazing fast, pure Rust, zero-dependency TPC-H data generation library


new 0.1.0 Mar 30, 2025


Apache-2.0


tpchgen-rs


Blazing fast TPC-H benchmark data generator in pure Rust!

Features

  1. Zero-dependency TPC-H data generator crate for easy embedding
  2. Blazing speed (see Measuring Performance below)
  3. Batteries-included, multi-threaded CLI

Benchmarks

(coming soon)

Measuring Performance

This generator is fast enough to saturate the throughput of most I/O devices available at the time of writing. To see its true speed, run it on a machine with a fast I/O device (SSD or NVMe). Alternatively, use the --stdout flag and the pv command to send the output to /dev/null and measure the throughput.

For example:

# Generate SF=100, about 100GB of data, piped to /dev/null, reporting statistics
tpchgen-cli -s 100 --stdout | pv -arb > /dev/null
# Reports something like
# 106GiB [3.09GiB/s] (3.09GiB/s)

Similarly, for Parquet output:

# Generate SF=100 in Parquet format, piped to /dev/null, reporting statistics
tpchgen-cli -s 100 --format=parquet --stdout | pv -arb > /dev/null
# Reports something like
# 38.2GiB [ 865MiB/s] ( 865MiB/s)
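To put the sample pv figures above in perspective, the sketch below estimates how long each run took. It assumes the last figure pv prints is the average rate over the whole run, so the result is an estimate derived from the sample output, not a measured time. The helper name elapsed_seconds is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate elapsed seconds as total / rate.
# Both arguments must use the same base unit (GiB and GiB/s, or MiB and MiB/s).
elapsed_seconds() {
  LC_ALL=C awk -v total="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", total / rate }'
}

# Text output: 106 GiB at an average of 3.09 GiB/s
elapsed_seconds 106 3.09        # prints 34.3

# Parquet output: 38.2 GiB = 39116.8 MiB at an average of 865 MiB/s
elapsed_seconds 39116.8 865     # prints 45.2
```

On these sample figures the Parquet run writes fewer bytes but takes slightly longer, so bytes per second alone is not a fair comparison between the two formats.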

Structure

tpchgen-cli is a dbgen-compatible CLI tool that generates tables from the TPC-H benchmark dataset.

tpchgen is the library that implements the TPC-H data generation logic; it can be used to embed data generation natively in Rust programs.

CLI Usage

We tried to make the tpchgen-cli experience as close to dbgen as possible, so that it can serve as a drop-in replacement.

$ tpchgen-cli -h
TPC-H Data Generator

Usage: tpchgen-cli [OPTIONS] --output-dir <OUTPUT_DIR>

Options:
  -s, --scale-factor <SCALE_FACTOR>  Scale factor to address defaults to 1 [default: 1]
  -o, --output-dir <OUTPUT_DIR>      Output directory for generated files
  -t, --tables <TABLES>              Which tables to generate (default: all) [possible values: nation, region, part, supplier, part-supp, customer, orders, line-item]
  -p, --parts <PARTS>                Number of parts to generate (for parallel generation) [default: 1]
      --part <PART>                  Which part to generate (1-based, only relevant if parts > 1) [default: 1]
  -h, --help                         Print help

For example, a dataset with a scale factor of 1 (about 1GB) can be generated like this:

$ tpchgen-cli -s 1 --output-dir=/tmp/tpch
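The --parts and --part options in the help output above suggest that a table can be split into independently generated chunks. The following is a minimal sketch of how the per-part invocations might be built. It only prints the commands, and it uses only flags listed in the help text. The helper name gen_part_commands and the one-directory-per-part layout are assumptions, made here because the source does not say how output files from different parts are named.

```shell
#!/bin/sh
# Hypothetical helper: print one tpchgen-cli command per part of the line-item table.
# Usage: gen_part_commands SCALE_FACTOR PARTS OUTPUT_ROOT
gen_part_commands() {
  sf="$1"
  parts="$2"
  out="$3"
  part=1  # --part is 1-based according to the help text
  while [ "$part" -le "$parts" ]; do
    # Assumption: a separate directory per part avoids file name clashes.
    echo "tpchgen-cli -s $sf --tables line-item --parts $parts --part $part --output-dir=$out/part-$part"
    part=$((part + 1))
  done
}

# Dry run: show the four commands for SF=10 split into 4 parts.
gen_part_commands 10 4 /tmp/tpch
```

Piping the printed commands to sh would run them one after another. Whether tpchgen-cli creates missing output directories is not stated in the source, so creating them beforehand may be necessary.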

Contributing

Pull requests are welcome. For major changes, please open an issue first for discussion. See our contributors guide for more details.

Architecture

Please see the architecture guide for details on how the code is structured.

License

The project is licensed under the Apache 2.0 license.

References

  • The TPC-H specification: see the specification page.
  • The original dbgen implementation: you must submit an official request at the TPC's official website to access the dbgen software.
