2 unstable releases
0.2.1 | Jan 2, 2024 |
---|---|
0.1.0 | Nov 2, 2023 |
noel
An extremely fast GTF/GFF per gene Non-Overlapping Exon Length calculator (noel) written in Rust.
Takes in a GTF/GFF file and outputs a .txt file with non-overlapping exon lengths.
Usage
Usage: noel --i <GTF/GFF> --o <OUTPUT>
Arguments:
--i <GTF/GFF>: GTF/GFF file
--o <OUTPUT>: .txt file
Options:
--help: print help
--version: print version
crate: https://crates.io/crates/noel
Installation
To install noel on your system, follow these steps:
- download Rust (on Unix; see the Rust installation page for other options):
  curl https://sh.rustup.rs -sSf | sh
- run (make sure ~/.cargo/bin is in your $PATH before running it):
  cargo install noel
- use noel with the required arguments
Build
To build noel from this repo:
- get Rust (as described above)
- run:
  git clone https://github.com/alejandrogzi/noel.git && cd noel
- run:
  cargo run --release <GTF/GFF> <OUTPUT>
  (arguments are positional, so you do not need to specify --i/--o)
Library
To include noel as a library and use it within your project, follow these steps:
- include noel = "0.2.0" or noel = "*" under [dependencies] in your Cargo.toml file, or just run cargo add noel from the command line
- the library name is noel; to use it, write:
  use noel::{noel, noel_reader};
  or
  use noel::*;
- invoke (where input is a PathBuf pointing to the GTF/GFF file):
  let exons: HashMap<String, Vec<(u32, u32)>> = noel_reader(&input)?;
  let lengths: Vec<(String, u32)> = noel(exons);
- you will end up with a list of pairs, where each gene name (gene_id) is paired with its length:
  [("ENSG00000261469", 533), ("ENSG00000150990", 6908), ("ENSG00000136490", 4751), ("ENSG00000290760", 801)]
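To make the quantity being computed concrete, here is a minimal sketch of a per-gene non-overlapping exon length calculation on the same data shapes as above. It is not noel's actual implementation: the function names non_overlapping_length and gene_lengths are hypothetical, and the sketch assumes 1-based inclusive coordinates (as in GTF/GFF) with start <= end for every exon.

```rust
use std::collections::HashMap;

/// Hypothetical helper: total number of bases covered by a gene's exons,
/// counting overlapping regions only once.
/// Assumes 1-based inclusive coordinates and start <= end for every exon.
fn non_overlapping_length(mut exons: Vec<(u32, u32)>) -> u32 {
    // Sort by start (then end) so overlapping exons become neighbours.
    exons.sort_unstable();
    let mut iter = exons.into_iter();
    let (mut cur_start, mut cur_end) = match iter.next() {
        Some(interval) => interval,
        None => return 0, // no exons, no length
    };
    let mut total = 0u32;
    for (start, end) in iter {
        if start <= cur_end {
            // Overlaps the current merged block: extend it if needed.
            cur_end = cur_end.max(end);
        } else {
            // Disjoint: close the current block and start a new one.
            total += cur_end - cur_start + 1;
            cur_start = start;
            cur_end = end;
        }
    }
    // Close the last open block.
    total + (cur_end - cur_start + 1)
}

/// Hypothetical helper: apply the calculation to every gene.
/// The result is sorted by gene_id so the output order is deterministic.
fn gene_lengths(exons: HashMap<String, Vec<(u32, u32)>>) -> Vec<(String, u32)> {
    let mut lengths: Vec<(String, u32)> = exons
        .into_iter()
        .map(|(gene, intervals)| (gene, non_overlapping_length(intervals)))
        .collect();
    lengths.sort();
    lengths
}
```

For example, exons (100, 200), (150, 250) and (300, 400) merge into the blocks 100-250 and 300-400, giving 151 + 101 = 252 bases.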
Benchmark
There are a handful of open-source tools and scripts to calculate non-overlapping exon lengths, namely the Kooi [1], Sun [2], and Slowikowski [3, 4] scripts, and gtftools (-l flag) [5]. The Non-Overlapping Exon Length calculator (NOEL; referred to simply as "noel") is introduced as a new tool that outperforms the aforementioned software in both run time and memory usage.
To assess the efficiency of noel and test the capabilities of the other available scripts and tools, I used run times and memory usage estimates based on 5 consecutive runs. This evaluation focused on two major gene annotation formats: GTF and GFF. It is worth noting, however, that only 3 tools are capable of handling GFF files: Slowikowski, Sun* (described below) and noel. Before any batch of runs, I first modified each script to be CLI-responsive. Additionally, I edited Sun's script to handle GFF inputs by changing a regex pattern. No performance-related changes or breaking structural modifications were applied.
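As an illustration of the "5 consecutive runs" protocol, the following sketch times a task repeatedly and averages the wall-clock durations. It is only an assumption about how such a measurement could be set up, not the harness used for the figures reported here; the names time_runs and mean_duration are hypothetical, and memory usage is not covered.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: run `task` `runs` times in a row and record
/// the wall-clock duration of each run.
fn time_runs<F: FnMut()>(runs: usize, mut task: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let started = Instant::now();
            task();
            started.elapsed()
        })
        .collect()
}

/// Hypothetical helper: arithmetic mean of the recorded durations,
/// or None if there is nothing to average.
fn mean_duration(durations: &[Duration]) -> Option<Duration> {
    if durations.is_empty() {
        return None;
    }
    let total: Duration = durations.iter().sum();
    Some(total / durations.len() as u32)
}
```

To time the command-line tool itself, the task could wrap a call such as std::process::Command::new("noel").args(["--i", "annotation.gtf", "--o", "lengths.txt"]).status(), where the file names are placeholders.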
Lastly, to evaluate the output consistency of the top-ranked tools (Sun, gtftools and noel), three species were used: Homo sapiens (GRCh38, GENCODE 44), Canis lupus familiaris (ROS_Cfam_1.0, Ensembl 110), and Mus musculus (GRCm39, GENCODE M33).
The diverse methodologies to calculate non-overlapping exon lengths led to noticeable differences in run times. The Kooi and Slowikowski scripts ranked last with GTF files (>250s for GENCODE 44), as did Slowikowski with GFF files (~300s for GENCODE 44), while Sun, gtftools and noel were the most efficient options (<50s for GENCODE 44). Among these top-ranked tools, noel's dominance over its competitors is quickly apparent. For GTF files, noel is noticeably faster than gtftools (4.3x faster; 4.2s vs 17.9s) and Sun's script (10.9x faster; 4.2s vs 45.7s). For GFF3 files, noel performs the calculations 12.6x faster than Sun's script (3.9s vs 49.7s).
A similar pattern is seen when examining memory usage estimates based on GTF files. Three distinct groups of tools can be identified: high-memory-consuming tools (Sun, Slowikowski, and Kooi), tools with moderate memory usage (gtftools), and the most memory-efficient option (noel). Here, noel used substantially less memory than gtftools (9.1x less; 42.9 MB vs 391.8 MB) and Kooi (73.1x less; 42.9 MB vs 3.1 GB). With GFF files, noel achieved a striking 146.1-fold reduction in memory usage compared to Slowikowski (62,700 genes).
The comparison of output from the top-ranked tools (Sun, gtftools, and noel) yielded consistently paired estimates for each species, resulting in a high correlation (R = 0.99). Notably, noel and Sun's script showed a one-to-one correspondence for every gene in all tested annotation models. In contrast, gtftools failed to process some genes, with a slight deficiency in the human and mouse models (0.05% and 0.06%, respectively) and a more substantial shortfall in the dog model (26%). Furthermore, noel was faster than the other tools in both the mouse and dog models, with a speedup of at least 2.3x.
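The correlation between two tools' per-gene estimates can be checked with a few lines of code. The sketch below assumes that R denotes Pearson's correlation coefficient, which the text does not state explicitly, and that the two slices hold lengths for the same genes in the same order; the function name pearson is hypothetical.

```rust
/// Hypothetical helper: Pearson's correlation coefficient between two
/// equally long series of per-gene lengths (paired by position).
/// Returns None if the inputs differ in length, have fewer than two
/// points, or if either series has zero variance.
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;

    // Accumulate the covariance term and both sums of squared deviations.
    let (mut sxy, mut sxx, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (a, b) in x.iter().zip(y.iter()) {
        let dx = a - mean_x;
        let dy = b - mean_y;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None; // correlation is undefined for a constant series
    }
    Some(sxy / (sxx.sqrt() * syy.sqrt()))
}
```

Two tools that report identical lengths for every gene give a coefficient of 1 (up to floating-point rounding).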
Based on this comparative analysis between noel and existing scripts and software to calculate non-overlapping exon lengths, noel represents a clear improvement in both run time and memory usage. These findings point to noel's potential as a fast and efficient way to automate non-overlapping exon length calculations.
References
[1] https://www.biostars.org/p/83901/
[2] https://gist.github.com/jsun/aeca04ee2c5b5cc53ad795b660edd6c3
[3] https://gist.github.com/slowkow/8101481
[4] https://gist.github.com/slowkow/8101509#file-coding_lengths-py
[5] Hong-Dong Li, Cui-Xiang Lin, Jiantao Zheng, GTFtools: a software package for analyzing various features of gene models, Bioinformatics, Volume 38, Issue 20, 15 October 2022, Pages 4806–4808, https://doi.org/10.1093/bioinformatics/btac561