4 releases

0.1.3 Feb 23, 2023
0.1.2 Feb 23, 2023
0.1.1 Feb 23, 2023
0.1.0 Feb 10, 2023

#2233 in Database interfaces

MIT license

14KB
109 lines

CSV Uploader

A custom CSV -> DB uploader program.

Speed

Trust me, you'll need speed when uploading 5M records.

Uploading is parallelized in a two-step process that repeats until the input is exhausted:

  1. We buffer records in an array as we read and parse them (e.g. 1000 records). This is the reader (main thread).
  2. Once that array fills up, we push an asynchronous upload future/task onto a stack to be executed by the uploader threads (e.g. 4 threads).

Warning!: the parallelization between threads (step 2) is still being worked on. I'm still reading up on the tokio library lol. :)
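
The two steps above can be sketched roughly as follows. This is not the program's actual code: it swaps tokio futures for plain `std::thread` workers fed through a bounded channel, so it runs with the standard library alone. The names `upload_batch` and `run_pipeline` are hypothetical, and `upload_batch` only counts records instead of talking to a database.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a real database insert.
/// It only reports how many records it was handed.
fn upload_batch(batch: Vec<String>) -> usize {
    batch.len()
}

/// Buffers `lines` into batches of `batch_size` on the calling thread (step 1)
/// and hands every full batch to one of `workers` uploader threads (step 2).
/// Returns the total number of records the uploaders processed.
fn run_pipeline<I>(lines: I, batch_size: usize, workers: usize) -> usize
where
    I: Iterator<Item = String>,
{
    assert!(batch_size > 0 && workers > 0);

    // Bounded channel: the reader blocks when the uploaders fall behind.
    let (tx, rx) = mpsc::sync_channel::<Vec<String>>(workers);
    let rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut uploaded = 0;
                loop {
                    // The lock is released at the end of this statement,
                    // so it is not held while the batch is uploaded.
                    let batch = rx.lock().unwrap().recv();
                    match batch {
                        Ok(b) => uploaded += upload_batch(b),
                        Err(_) => break, // channel closed: the reader is done
                    }
                }
                uploaded
            })
        })
        .collect();

    // Step 1: the reader fills the buffer and ships it once it is full.
    let mut buffer = Vec::with_capacity(batch_size);
    for line in lines {
        buffer.push(line);
        if buffer.len() == batch_size {
            let full = std::mem::replace(&mut buffer, Vec::with_capacity(batch_size));
            tx.send(full).unwrap();
        }
    }
    // Ship the final, partially filled batch.
    if !buffer.is_empty() {
        tx.send(buffer).unwrap();
    }
    drop(tx); // closing the channel lets the workers exit their loops

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Calling `run_pipeline(rows, 1000, 4)` with an iterator of 2500 rows would send two full batches and one batch of 500. The bounded channel is a design choice for this sketch: it keeps the reader from buffering the whole file in memory when uploads are slow.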

Custom Data

As a secondary goal, we normalize the data while we parse it.

This is highly variable and dependent on two things:

  1. The DB and the Data Types it uses.
  2. The datasets we're uploading and the type of data we've seen so far.

So our current process is:

  • Parse to JSON data types
  • Drop any empty String values
  • Parse "False" -> false, "True" -> true
  • Replace ' with " inside Strings and try parsing again (because some of the datasets we've seen quote values that way)
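
A minimal sketch of the normalization steps listed above, assuming each CSV field is handled on its own. The `Value` enum, `parse_scalar`, and `normalize_field` are hypothetical names. The enum is a small stand-in for JSON data types so the example needs no external crates, and a real implementation would more likely lean on a JSON library. Number parsing here follows Rust's `f64` rules, which are slightly looser than strict JSON.

```rust
/// Minimal stand-in for JSON scalar types (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
enum Value {
    Bool(bool),
    Number(f64),
    Text(String),
}

/// Tries to read a raw field as a JSON-like scalar.
/// Returns `None` when the field does not look like a bool, number, or quoted string.
fn parse_scalar(raw: &str) -> Option<Value> {
    match raw {
        // "True"/"False" are accepted alongside the JSON spellings.
        "true" | "True" => return Some(Value::Bool(true)),
        "false" | "False" => return Some(Value::Bool(false)),
        _ => {}
    }
    if let Ok(n) = raw.parse::<f64>() {
        // Rust also parses "inf" and "NaN"; those are not JSON numbers.
        if n.is_finite() {
            return Some(Value::Number(n));
        }
    }
    // A double-quoted token is treated as a string (no escape handling in this sketch).
    if raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"') {
        return Some(Value::Text(raw[1..raw.len() - 1].to_string()));
    }
    None
}

/// Normalizes one field: drops empty values, parses typed values,
/// and retries with ' swapped for " before falling back to plain text.
fn normalize_field(raw: &str) -> Option<Value> {
    if raw.is_empty() {
        return None; // empty String values are dropped
    }
    if let Some(v) = parse_scalar(raw) {
        return Some(v);
    }
    // Second attempt: some datasets use single quotes where JSON expects double quotes.
    let swapped = raw.replace('\'', "\"");
    if let Some(v) = parse_scalar(&swapped) {
        return Some(v);
    }
    // Anything else is kept unchanged as text.
    Some(Value::Text(raw.to_string()))
}
```

With this sketch, `normalize_field("True")` yields `Some(Value::Bool(true))`, `normalize_field("'hello'")` yields `Some(Value::Text("hello".to_string()))`, and `normalize_field("")` yields `None`.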

Supported DBs (for now)

  • RethinkDB

Data Tested

Dependencies

~18–32MB
~576K SLoC