9 releases
0.2.3 | May 22, 2024 |
---|---|
0.2.2 | Dec 24, 2023 |
0.2.1 | Aug 20, 2023 |
0.2.0 | Oct 8, 2022 |
0.1.3 | Jun 26, 2022 |
#84 in Biology
23 downloads per month
Used in atg
365KB
7K
SLoC
ATGlib
ATGlib is a Rust library to work with genomic transcript data. It handles several file formats, such as GTF, GenePred(ext) and Refgene. You can generate bed files, fasta sequences or custom feature sequences.
ATGlib is useful for Bioinformaticians and Geneticists. It can be used for all kind of data processing workflows that must handle transcript annotations. The main use case are conversions between various file formats and sanity checks for large scale QC. It is primarily targeted for human genome, but should work equally well with all other genomes.
If you are looking for an actual application, or a command line tool to work with transcripts, GTF files etc, use ATG instead. It is using ATGlib behind the scenes and provides a simple to use interface.
Documentation
The library API is mostly documented inline and available on docs.rs
Examples
For examples on how to use ATGlib on a high-level, you can check the source code of the CLI tool ATG.
Convert GTF to RefGene
use atglib::gtf::Reader;
use atglib::refgene::Writer;
use atglib::models::{TranscriptRead, TranscriptWrite};
let mut reader = Reader::from_file("tests/data/example.gtf")
.unwrap_or_else(|_| panic!("Error opening input file."));
let mut writer = Writer::from_file("/dev/null")
.unwrap_or_else(|_| panic!("Unable to open output file"));
let transcripts = reader.transcripts()
.unwrap_or_else(|err| panic!("Error parsing GTF: {}", err));
match writer.write_transcripts(&transcripts) {
Ok(_) => println!("Success"),
Err(err) => panic!("Error writing RefGene file: {}", err)
};
Work with transcripts directly
use atglib::tests::transcripts::standard_transcript;
let tx = standard_transcript();
println!(
"The {} transcript {} is on the {} strand on chromsome {} and has {} exons",
tx.gene(),
tx.name(),
tx.strand(),
tx.chrom(),
tx.exon_count()
);
if tx.is_coding() {
println!(
"The start and end positions of the CDS are: {}-{}",
tx.cds_start().unwrap(),
tx.cds_end().unwrap()
);
} else {
println!(
"The transcript is non-coding and is located at {}:{}-{}",
tx.chrom(),
tx.tx_start(),
tx.tx_end()
);
}
Limitations
I started ATGlib as a private side project to learn Rust
. I'm sure there are many parts of the code that are not idiomatic and can be improved. I'd be more than happy to receive feedback and suggestions for improvement. I also encourage everyone, who is interested, to help and contribute the ATGlib.
When it comes to functional correctness, I try my best to test the functionality on all levels. ATGlib has a very good test coverage and I also run manual checks before every release with a huge test-set. However, I cannot guarantee that everything is correct, so please use it at your own risk and report any bugs and issues via Github.
ToDo / Next tasks
- Add function to compare transcripts from two different inputs
- use Smartstring or Smallstr for gene-symbol, transcript name and chromosome
- Parallelize input parsing
- Check if exons can be stored in smaller vec
- Use std::mem::replace to move out of attributes, e.g. in TranscriptBuilder and remove Copy/Clone traits https://stackoverflow.com/questions/31307680/how-to-move-one-field-out-of-a-struct-that-implements-drop-trait
- Change
Codon
toGenomicCodon
- Update error handling and streamling error types
Known issues
GTF parsing
- NM_001371720.1 has two book-ended exons (155160639-155161619 || 155161620-155162101). During input parsing, book-ended features are merged into one exon
Dependencies
~0.4–1MB
~23K SLoC