1 unstable release
0.1.0 | Oct 11, 2023 |
---|
#52 in #write-file
145KB
1K
SLoC
scispeak
a rust parser to convert sci-seq-v3 reads into kallisto compatible formats
a CLI tool to whitelist filter sci-seq-v3 reads and convert them to a 10X-style format.
Overview
This tool is used to filter sciseq reads against their respective barcode whitelists and then output fastq file formats in the style of 10X reads.
This parses the sci-seq-v3 format, identifies the cell barcodes and UMIs and writes out a new file to resemble the 10X sequence construct to be used with other tools that have not yet adopted the sci-seq format.
sci-rna-seq3 Sequencing Construct
The sci-rna-seq3 sequencing construct is organized in the following way:
┌─'illumina_p5:29'
├─'i5:10'
├─'truseq_read_1_adapter:33'
│ ┌─'hairpin_barcode:10'
│ ├─'hairpin_adapter:6'
├─read_1─────────────────────┤
│ ├─'umi:8'
──RNA───────┤ └─'cell_bc:10'
├─'poly_T:98'
├─'read_2:98'
│ ┌─'ME:19'
├─i7_primer──────────────────┤
│ └─'s7:15'
├─'i7:10'
└─'illumina_p7:24'
Visualization from seqspec.
And so the resulting R1 and R2 files boil down to:
# R1
[linker][adapter][umi][barcode]
# R2
[cDNA]
Usage
This is a single command CLI tool. It requires just the R1 and R2 filepaths
scispeak \
-i data/SRR7827205_sample_R1.fastq.gz \
-I data/SRR7827205_sample_R2.fastq.gz;
However, it can be accelerated using multiple compression threads:
scispeak \
-i data/SRR7827205_sample_R1.fastq.gz \
-I data/SRR7827205_sample_R2.fastq.gz \
-t 8;
And can store a log file as well to keep matching statistics:
scispeak \
-i data/SRR7827205_sample_R1.fastq.gz \
-I data/SRR7827205_sample_R2.fastq.gz \
-t 8 \
-l;
Outputs
This program will output 3 files per run:
<args.prefix>_R1.fastq.gz
: A fastq with the[barcode][UMI]
construct for all reads passing the whitelist.<args.prefix>_R2.fastq.gz
: An unaltered fastq of the R2 for all reads passing the whitelist.<args.prefix>_log.json
: A log file containing the filtering statistics of the run.
Dependencies
~9–17MB
~189K SLoC