108 releases (29 stable)

6.3.1 Nov 13, 2024
6.1.1 Sep 20, 2024
6.0.0 Mar 17, 2024
5.1.2 Feb 22, 2024
0.4.0 Nov 21, 2020

#7 in Command line utilities

Download history 40/week @ 2024-08-02 397/week @ 2024-08-09 136/week @ 2024-08-16 64/week @ 2024-08-23 36/week @ 2024-08-30 241/week @ 2024-09-06 340/week @ 2024-09-13 551/week @ 2024-09-20 76/week @ 2024-09-27 151/week @ 2024-10-04 70/week @ 2024-10-11 44/week @ 2024-10-18 240/week @ 2024-10-25 84/week @ 2024-11-01 208/week @ 2024-11-08 94/week @ 2024-11-15

634 downloads per month

MIT license

165KB
3.5K SLoC

ODBC to Parquet

Licence Crates.io

A command line tool to query an ODBC data source and write the result into a parquet file.

  • Small memory footprint. Only holds one batch at a time in memory.
  • Fast. Makes efficient use of ODBC bulk reads, to lower IO overhead.
  • Flexible. Query any ODBC data source you have a driver for. MySQL, MS SQL, Excel, ...

Mapping of types in queries

The tool queries the ODBC Data source for type information and maps it to parquet type as such:

ODBC SQL Type Parquet Type
Decimal(p < 39, s) Decimal(p,s)
Numeric(p < 39, s) Decimal(p,s)
Bit Boolean
Double Double
Real Float
Float(p: 0..24) Float
Float(p >= 25) Double
Tiny Integer Signed Int8 Signed
Tiny Integer Unsigned Int8 Unsigned
Small Integer Int16
Integer Int32
Big Int Int64
Date Date
Time(p: 0..3)* Time Milliseconds
Time(p: 4..6)* Time Microseconds
Time(p: 7..9)* Time Nanoseconds
Timestamp(p: 0..3) Timestamp Milliseconds
Timestamp(p: 4..6) Timestamp Microseconds
Timestamp(p >= 7) Timestamp Nanoseconds
Datetimeoffset(p: 0..3) Timestamp Milliseconds (UTC)
Datetimeoffset(p: 4..6) Timestamp Microseconds (UTC)
Datetimeoffset(p >= 7) Timestamp Nanoseconds (UTC)
Varbinary Byte Array
Long Varbinary Byte Array
Binary Fixed Length Byte Array
All others Utf8 Byte Array

p is short for precision. s is short for scale. Intervals are inclusive.

  • Time is only supported for Microsoft SQL Server

Installation

Prerequisites

To work with this tool you need an ODBC driver manager and an ODBC driver for the data source you want to access.

Windows

An ODBC driver manager is already preinstalled on windows. So is the ODBC data sources (64Bit) and ODBC data sources (32Bit) app which you can use to discover which drivers are already available on your system.

Linux

This tool links both at runtime and during build against libodbc.so. To get it you should install unixODBC. You can do this using your systems packet manager. For ubuntu you run:

sudo apt install unixodbc-dev

OS-X

This tool links both at runtime and during build against libodbc.so. To get it you should install unixODBC. To install it I recommend the homebrew packet manager, which allows you to install it using:

brew install unixodbc

Via scoop package manager

If you have scoop package manager installed (Windows only), you can install this with:

scoop install odbc2parquet

Download binary from GitHub

https://github.com/pacman82/odbc2parquet/releases/latest

Note: Download the 32 Bit version if you want to connect to data sources using 32 Bit drivers and download the 64 Bit version if you want to connect via 64 Bit drivers. It won't work vice versa.

Via Cargo

If you have a rust tool chain installed, you can install this tool via cargo.

cargo install odbc2parquet

Build in docker from scratch

FROM rust:alpine AS builder

RUN apk add --no-cache musl-dev unixodbc-static
# In addition to unixodbc you also want to install the database drivers you need and `COPY` them over to the `runner`

WORKDIR /src/odbc2parquet
COPY . .
RUN cargo build --release

FROM scratch AS runner

COPY --from=builder /src/odbc2parquet/target/release/odbc2parquet /usr/local/bin/

ENTRYPOINT ["/usr/local/bin/odbc2parquet"]

You can install cargo from here https://rustup.rs/.

Usage

Use odbc2parquet --help to see all commands.

Query

Use odbc2parquet help query to see all options related to fetching data.

Query using connection string

odbc2parquet query \
--connection-string "Driver={ODBC Driver 17 for SQL Server};Server=localhost;UID=SA;PWD=<YourStrong@Passw0rd>;" \
out.par  \
"SELECT * FROM Birthdays"

Query using data source name

odbc2parquet query \
--dsn my_db \
--password "<YourStrong@Passw0rd>" \
--user "SA" \
out.par1 \
"SELECT * FROM Birthdays"

Use parameters in query

odbc2parquet query \
--connection-string "Driver={ODBC Driver 17 for SQL Server};Server=localhost;UID=SA;PWD=<YourStrong@Passw0rd>;" \
out.par  \
"SELECT * FROM Birthdays WHERE year > ? and year < ?" \
1990 2010

List available ODBC drivers

odbc2parquet list-drivers

List available ODBC data sources

odbc2parquet list-data-sources

Inserting data into a database

odbc2parquet insert \
--connection-string "Driver={ODBC Driver 17 for SQL Server};Server=localhost;UID=SA;PWD=<YourStrong@Passw0rd>;" \
input.par \
MyTable

Use odbc2parquet help insert to see all options related to inserting data.

Thanks to @samaguire there is a script for Powershell users which helps you to download a bunch of tables to a folder: https://github.com/samaguire/odbc2parquet-PSscripts

Dependencies

~21–40MB
~748K SLoC