45 releases (18 breaking)

0.22.3	Feb 16, 2025
0.21.1	Jan 24, 2025
0.20.0	Dec 15, 2024
0.19.0	Sep 18, 2024
0.1.4	Mar 29, 2021

MIT/Apache

CSV to Parquet

Convert CSV files to Apache Parquet. This package is part of Arrow CLI tools.

Installation

Download prebuilt binaries

You can get the latest releases from https://github.com/domoritz/arrow-tools/releases.

With Homebrew

brew install domoritz/homebrew-tap/csv2parquet

With Cargo

cargo install csv2parquet

With Cargo B(inary)Install

To avoid re-compilation and speed up installation, you can install this tool with cargo binstall:

cargo binstall csv2parquet

Usage

Usage: csv2parquet [OPTIONS] <CSV> <PARQUET>

Arguments:
  <CSV>
          Input CSV file, stdin if not present

  <PARQUET>
          Output file

Options:
  -s, --schema-file <SCHEMA_FILE>
          File with Arrow schema in JSON format

      --max-read-records <MAX_READ_RECORDS>
          The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed

      --header <HEADER>
          Set whether the CSV file has headers

          [default: true]
          [possible values: true, false]

      --delimiter <DELIMITER>
          Set the CSV file's column delimiter as a byte character

      --escape <ESCAPE>
          Specify an escape character

      --quote <QUOTE>
          Specify a custom quote character

      --comment <COMMENT>
          Specify a comment character.

          Lines starting with this character will be ignored

      --null-regex <NULL_REGEX>
          Provide a regex to match null values

  -c, --compression <COMPRESSION>
          Set the compression

          [possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd, lz4-raw]

  -e, --encoding <ENCODING>
          Sets encoding for any column

          [possible values: plain, plain-dictionary, rle, rle-dictionary, delta-binary-packed, delta-length-byte-array, delta-byte-array, byte-stream-split]

      --data-page-size-limit <DATA_PAGE_SIZE_LIMIT>
          Sets data page size limit

      --dictionary-page-size-limit <DICTIONARY_PAGE_SIZE_LIMIT>
          Sets dictionary page size limit

      --write-batch-size <WRITE_BATCH_SIZE>
          Sets write batch size

      --max-row-group-size <MAX_ROW_GROUP_SIZE>
          Sets max size for a row group

      --created-by <CREATED_BY>
          Sets "created by" property

      --dictionary <DICTIONARY>
          Sets flag to enable/disable dictionary encoding for any column

          [possible values: true, false]

      --statistics <STATISTICS>
          Sets flag to enable/disable statistics for any column

          [possible values: none, chunk, page]

  -p, --print-schema
          Print the schema to stderr

  -n, --dry
          Only print the schema

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

The --schema-file option expects the same JSON format that --dry and --print-schema output.
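
As a rough illustration of how several of these options combine, the sketch below converts a semicolon-delimited file with zstd compression and limits schema inference to the first 100 records. The file names and sample data are made up for the example, and the conversion only runs if csv2parquet is found on PATH.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir="$(mktemp -d)"

# Hypothetical sample input that uses ';' as the column delimiter.
printf 'id;name\n1;alice\n2;bob\n' > "$workdir/sample.csv"

# Run the conversion only when csv2parquet is installed.
if command -v csv2parquet >/dev/null 2>&1; then
  csv2parquet \
    --delimiter ';' \
    --max-read-records 100 \
    --compression zstd \
    "$workdir/sample.csv" "$workdir/sample.parquet"
fi
```

The delimiter is quoted so the shell does not treat ';' as a command separator.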

Examples

Convert a CSV to Parquet

csv2parquet data.csv data.parquet
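
To convert several files at once, one option is a small shell loop that reuses each file's base name. This is only a sketch: parquet_name is a hypothetical helper defined here, not part of csv2parquet, and the loop prints the command instead of running it when the tool is not installed.

```shell
# Derive the output name by swapping the .csv suffix for .parquet.
parquet_name() {
  printf '%s\n' "${1%.csv}.parquet"
}

# Convert each CSV in the current directory; skip if the glob matched nothing.
for csv in ./*.csv; do
  [ -e "$csv" ] || continue
  if command -v csv2parquet >/dev/null 2>&1; then
    csv2parquet "$csv" "$(parquet_name "$csv")"
  else
    echo "would run: csv2parquet $csv $(parquet_name "$csv")"
  fi
done
```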

Convert a CSV with no `header` to Parquet

csv2parquet --header false <CSV> <PARQUET>

Get the `schema` from a CSV with header

csv2parquet --header true --dry <CSV> <PARQUET>

Convert a CSV using `schema-file` to Parquet

Below is an example of the schema-file content:

{
  "fields": [
    {
      "name": "col1",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "col2",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}

Then pass the schema file (here schema.json) with --schema-file:

csv2parquet --header false --schema-file schema.json <CSV> <PARQUET>
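
The same workflow can be sketched end to end with a typed column. The sample data, the file names, and the column names id and name are made up for illustration, and the "Int64" type name is an assumption about how Arrow data types are spelled in this JSON format; comparing against the output of --dry for your own data is the safest way to confirm the spelling.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir="$(mktemp -d)"

# Headerless sample data: an integer column and a text column.
printf '1,alice\n2,bob\n' > "$workdir/people.csv"

# Schema file in the JSON layout shown above ("Int64" is an assumed type name).
cat > "$workdir/schema.json" <<'EOF'
{
  "fields": [
    {
      "name": "id",
      "data_type": "Int64",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "name",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}
EOF

# Run the conversion only when csv2parquet is installed.
if command -v csv2parquet >/dev/null 2>&1; then
  csv2parquet --header false --schema-file "$workdir/schema.json" \
    "$workdir/people.csv" "$workdir/people.parquet"
fi
```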

Convert a stream from standard input to standard output

Streaming avoids writing large intermediate files to disk. For example, the following command streams a CSV file from a URL, converts it, and uploads the Parquet output to S3.

curl <FILE_URL> | csv2parquet /dev/stdin /dev/stdout | aws s3 cp - <S3_DESTINATION>
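
The same pattern should apply to any command that writes CSV to standard output. As a sketch, the snippet below decompresses a gzipped CSV on the fly so the uncompressed file is never stored on disk. The file names and sample data are hypothetical, and the conversion only runs if csv2parquet is found on PATH.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir="$(mktemp -d)"

# Hypothetical compressed input.
printf 'id,name\n1,alice\n2,bob\n' | gzip -c > "$workdir/data.csv.gz"

# Decompress to a pipe and convert without an intermediate CSV file.
if command -v csv2parquet >/dev/null 2>&1; then
  gzip -dc "$workdir/data.csv.gz" \
    | csv2parquet /dev/stdin /dev/stdout > "$workdir/data.parquet"
fi
```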
