Version 0.1.1 (Sep 20, 2023), 1 unstable release
splici
A Rust implementation of the splici algorithm to build spliced/unspliced transcripts.
Overview
This implementation is written entirely in Rust and takes advantage of three bioinformatics libraries:
- gtftools: for parsing GTF files
- bedrs: for genomic interval arithmetic
- faiquery: for fast querying of indexed FASTA files
Usage
splici introns \
-f <your.fasta> \
-g <your.gtf> \
-o splici.fasta.gz;
This generates a splici reference FASTA from the transcripts and exons found in the GTF, querying their sequences from the indexed FASTA provided.
The FASTA is expected to be indexed beforehand using samtools faidx.
Getting Started
You can download the latest Ensembl DNA and GTF using ggetrs ensembl ref:
ggetrs ensembl ref -D -d dna,gtf
Unzip and index the reference DNA.
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa
Then run splici to generate your splici reference FASTA:
splici introns \
-f Homo_sapiens.GRCh38.dna.primary_assembly.fa \
-g Homo_sapiens.GRCh38.*.gtf.gz \
-o splici.fasta.gz;
Background
The splici algorithm was described by He et al. (2022); the name is shorthand for spliced + intronic sequences.
It describes a method to isolate the intronic regions of all incoming transcripts and generate the sequences of both the spliced transcripts and their intronic components.
The algorithm is applied to each gene individually.
First, all transcripts for a gene are identified. Then all intronic regions of those transcripts are identified. These intronic regions are defined as the span of each transcript minus its exonic intervals (see internal). Next, each intronic region is extended by a parameterized amount on both ends, which allows reads to align across junctions between intronic and exonic regions. Intronic regions of different isoforms generally overlap heavily, so the intronic regions are merged to avoid redundant intervals in the final sequences. The merged intronic regions are then each given a unique name and added to the splici reference.
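The intronic steps above (subtract exons, extend, merge) can be illustrated with a minimal sketch using only the Rust standard library. This is not the crate's actual code, which relies on bedrs for interval arithmetic. The `Interval` type, the function names, the half-open coordinate convention, and the `flank` parameter are all hypothetical and chosen for illustration. Clamping at the chromosome end and the naming of the resulting regions are not shown.

```rust
/// Hypothetical half-open interval [start, end) on a single chromosome.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Interval {
    start: u64,
    end: u64,
}

/// Introns of one transcript: the transcript span (first exon start to
/// last exon end) minus its exons, i.e. the gaps between sorted exons.
fn introns(exons: &[Interval]) -> Vec<Interval> {
    let mut sorted = exons.to_vec();
    sorted.sort_by_key(|iv| iv.start);
    let mut out = Vec::new();
    let mut cursor = match sorted.first() {
        Some(iv) => iv.end,
        None => return out,
    };
    for iv in &sorted[1..] {
        if iv.start > cursor {
            out.push(Interval { start: cursor, end: iv.start });
        }
        cursor = cursor.max(iv.end);
    }
    out
}

/// Extend each interval by `flank` bases on both ends (start clamped at 0).
fn extend(ivs: &[Interval], flank: u64) -> Vec<Interval> {
    ivs.iter()
        .map(|iv| Interval {
            start: iv.start.saturating_sub(flank),
            end: iv.end + flank,
        })
        .collect()
}

/// Merge overlapping or touching intervals into non-redundant ones.
fn merge(mut ivs: Vec<Interval>) -> Vec<Interval> {
    ivs.sort_by_key(|iv| iv.start);
    let mut merged: Vec<Interval> = Vec::new();
    for iv in ivs {
        if let Some(last) = merged.last_mut() {
            if iv.start <= last.end {
                last.end = last.end.max(iv.end);
                continue;
            }
        }
        merged.push(iv);
    }
    merged
}

/// Per-gene intronic regions: each inner Vec holds one transcript's exons.
fn gene_introns(transcripts: &[Vec<Interval>], flank: u64) -> Vec<Interval> {
    let extended: Vec<Interval> = transcripts
        .iter()
        .flat_map(|exons| extend(&introns(exons), flank))
        .collect();
    merge(extended)
}
```

Calling `gene_introns` with the exon lists of every transcript of a gene returns the merged, flank-extended intronic intervals for that gene. Two isoforms that share most of an intron collapse into a single interval, which is the redundancy the merging step is meant to remove.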
The spliced transcripts are generated by concatenating the exonic intervals of each transcript. These are named by the transcript ID and added to the splici reference.
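The concatenation step can be sketched as follows, again as an illustration and not the crate's own implementation. It assumes the chromosome sequence is already in memory as a string, that exons are half-open `(start, end)` byte offsets on the forward strand, and that the record header is just the transcript ID. The function names are hypothetical, and minus-strand handling is not shown.

```rust
/// Concatenate the exon sequences of one transcript in coordinate order.
/// Returns None if any exon falls outside the chromosome sequence.
fn spliced_sequence(chrom_seq: &str, exons: &[(usize, usize)]) -> Option<String> {
    let mut sorted = exons.to_vec();
    sorted.sort_by_key(|&(start, _)| start);
    let mut out = String::new();
    for (start, end) in sorted {
        // `get` returns None for out-of-range or inverted intervals.
        out.push_str(chrom_seq.get(start..end)?);
    }
    Some(out)
}

/// Format a FASTA record named by the transcript ID (assumed header layout).
fn fasta_record(transcript_id: &str, seq: &str) -> String {
    format!(">{}\n{}\n", transcript_id, seq)
}
```

Returning `Option` keeps out-of-range exons from silently producing a truncated transcript, so the caller decides whether to skip or report them.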
References
- He, D. et al. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 19, 316–322 (2022).