2 releases
0.0.2 | Aug 16, 2024 |
---|---|
0.0.1 | Aug 15, 2024 |
chromsize
annoyed to have to create an index and cut it?
have to look for that old script every time?
got you. just get your chrom sizes. very fast.
but first, how is this better than any other option? yeah, just check the image below.
googled 'get chromosome sizes from fasta', grabbed every command/tool I found and benchmarked them. surprisingly, you can lose 14 seconds of your life just waiting for those chrom sizes to be calculated. crazy.
What's new in v0.0.2?
- now reads .gz!
- CI implementation
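curious how a tool can accept both .fa and .fa.gz without asking? one common trick is to peek at the first two bytes: gzip files always start with 0x1f 0x8b. the sketch below shows that idea in plain Python. it is an illustration only, not how chromsize does it internally, and `open_fasta` is a made-up helper name.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA as a text handle (hypothetical helper)."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "rt")


# demo: the same record written plain and gzipped reads back identically
record = ">chr1\nACGT\n"
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "demo.fa")
    packed = os.path.join(tmp, "demo.fa.gz")
    with open(plain, "wt") as fh:
        fh.write(record)
    with gzip.open(packed, "wt") as fh:
        fh.write(record)
    for path in (plain, packed):
        with open_fasta(path) as fh:
            print(os.path.basename(path), repr(fh.read()))
```

checking the bytes instead of the extension means a gzipped file with a misleading name still opens fine.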
Usage
Binary
Usage: chromsize --fasta <FASTA> --output <OUTPUT> [-t <THREADS>]
Arguments:
-f, --fasta <FASTA>: FASTA file
-o, --output <OUTPUT>: path to chrom.sizes
Options:
-t, --threads <THREADS>: number of threads [default: your max ncpus]
--help: print help
--version: print version
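for reference, a chrom.sizes file is the usual two-column layout: sequence name, a tab, and the sequence length in bases, one line per sequence. here is a slow, single-threaded Python sketch of that computation, just to make the expected output concrete. it is not chromsize's implementation; `naive_chromsizes` is a hypothetical name, and taking the header text up to the first whitespace as the sequence name is an assumption.

```python
import io


def naive_chromsizes(handle):
    """Return [(name, length), ...] for the FASTA records in a text handle."""
    sizes = []
    name, length = None, 0
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # a new header closes the previous record
            if name is not None:
                sizes.append((name, length))
            # assumption: the name is the header text up to the first whitespace
            fields = line[1:].split()
            name = fields[0] if fields else ""
            length = 0
        elif name is not None:
            length += len(line.strip())
    if name is not None:
        sizes.append((name, length))
    return sizes


demo = io.StringIO(">chr1 test\nACGT\nAC\n>chr2\nGGG\n")
for name, length in naive_chromsizes(demo):
    print(f"{name}\t{length}")
```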
crate: https://crates.io/crates/chromsize
Installation
to install rust and use chromsize on your system follow these steps:
- get installer:
curl https://sh.rustup.rs -sSf | sh
on unix, or go here for other options
- run
cargo install chromsize
(make sure ~/.cargo/bin is in your $PATH before running it)
- use
chromsize
with the required arguments
Library
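use std::path::PathBuf;

use chromsize;

fn main() {
    // paths to the input FASTA and the output chrom.sizes
    let input = PathBuf::from("/path/to/fasta.fa");
    let output = PathBuf::from("/path/to/chrom.sizes");
    // one (name, length) pair per sequence
    let sizes: Vec<(String, u64)> = chromsize::chromsize(&input);
    chromsize::write(sizes, &output)
}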
Python
build the port to install it as a pkg:
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize/py-chromsize
hatch shell
maturin develop --release
use it as a binary wrapper:
import chromsize as cs
input = "/path/to/fasta.fa"
output = "/path/to/chrom.sizes"
cs.write_chromsizes(input, output)
or just get them directly
import chromsize as cs
input = "/path/to/fasta.fa"
sizes = cs.get_chromsizes(input)
>>> print(sizes)
[
('chr1', 123),
('chr2', 456),
...
]
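since the result is a plain list of (name, size) tuples, regular Python is enough for any follow-up. a small sketch, with the list hard-coded to stand in for the return value shown above; `to_chromsizes_text` is a hypothetical helper, not part of the package.

```python
def to_chromsizes_text(sizes):
    """Format (name, size) pairs as tab-separated chrom.sizes lines."""
    return "".join(f"{name}\t{size}\n" for name, size in sizes)


# stand-in for the list returned in the example above
sizes = [("chr1", 123), ("chr2", 456)]

by_name = dict(sizes)                    # look up a length by sequence name
total = sum(size for _, size in sizes)   # total assembly length
largest_first = sorted(sizes, key=lambda item: item[1], reverse=True)

print(by_name["chr2"], total)
print(to_chromsizes_text(largest_first), end="")
```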
Build
to build chromsize from this repo, do:
- get rust
- run
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize
- run
cargo run --release -- -f <FASTA> -o <OUTPUT>
Container image
to build the development container image:
- run
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize
- initialize docker with
start docker
or systemctl start docker
- build the image
docker image build --tag chromsize .
- run
docker run --rm -v "[dir_where_your_fa_is]:/dir" chromsize -f /dir/<INPUT> -o /dir/<OUTPUT>
Conda (not available yet)
to use chromsize through Conda just:
conda install chromsize -c bioconda
or conda create -n chromsize -c bioconda chromsize
Nextflow (not available yet)
Benchmark
do not believe me? run the benchmark on your own:
- get .fa from any species you want (or download the ones I used from UCSC/NCBI)
- install hyperfine: https://github.com/sharkdp/hyperfine
- go to chromsize/bench and modify the
ASSEMBLIES
const with the .fa you've downloaded
- run
cargo run --release --bin chromsize-benchmark -- -d /dir/where/my/fastas/are -a show-output ignore-failure
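if you only want a quick head-to-head without the benchmark binary, hyperfine can also be called directly. the sketch below only builds the command line for two of the tools; it is not what chromsize-benchmark runs. assumptions: `chromsize` and `seqkit` are on your PATH, `genome.fa` is a placeholder file name, and `hyperfine_command` is a made-up helper. `--warmup`, `--runs` and `--command-name` are regular hyperfine options.

```python
import shlex


def hyperfine_command(assembly, warmup=1, runs=5):
    """Build a hyperfine invocation comparing two tools on one assembly."""
    quoted = shlex.quote(assembly)
    tools = {
        "chromsize": f"chromsize -f {quoted} -o chrom.sizes",
        "seqkit": f"seqkit fx2tab --length --name --header-line {quoted} > chrom.sizes",
    }
    cmd = ["hyperfine", "--warmup", str(warmup), "--runs", str(runs)]
    for name, shell_cmd in tools.items():
        # each --command-name labels the command that follows it
        cmd += ["--command-name", name, shell_cmd]
    return cmd


# print the command instead of running it, so nothing needs to be installed here
print(shlex.join(hyperfine_command("genome.fa")))
```

to actually run it, hand the list to `subprocess.run(cmd, check=True)` once hyperfine and both tools are installed.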
here is all the info and metadata from my experiment:
which tools did I use?
Tool | Command | Reference | Discussion |
---|---|---|---|
seqkit | seqkit fx2tab --length --name --header-line {assembly} > chrom.sizes | 1 | 2 |
chromsize | target/release/chromsize -f {assembly} -o chrom.sizes | 3 | |
pyfaidx | faidx {assembly} -i chromsizes > chrom.sizes | 4 | 5 |
samtools | samtools faidx {assembly} && wait \| cut -f1,2 {assembly}.fai > chrom.sizes | 6 | 5 |
faSize | faSize -detailed -tab {assembly} > chrom.sizes | 7 | |
awk1 | awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' {assembly} > chrom.sizes | 8 | 9 |
awk2 | awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' {assembly} > chrom.sizes | 8 | 9 |
bioawk1 | bioawk -c fastx '{print ">" $name ORS length($seq)}' {assembly} > chrom.sizes | 10 | 9 |
awk3 | cat {assembly} \| awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' > chrom.sizes | 8 | 11 |
bioawk2 | bioawk -c fastx '{ print $name, length($seq) }' < {assembly} > chrom.sizes | 10 | 2 |
detailed data?
Species | Assembly | Size (Gb) | chromsize | seqKit | awk1 | awk2 | awk3 | bioawk1 | bioawk2 | faSize | pyfaidx | samtools |
---|---|---|---|---|---|---|---|---|---|---|---|---|
S. cerevisiae | R64 | 0.01 | 0.004 | 0.016 (X 4.0) | 0.043 (X 10.7) | 0.043 (X 10.7) | 0.05 (X 12.5) | 0.03 (X 7.5) | 0.03 (X 7.5) | 0.054 (X 13.5) | 0.101 (X 25.2) | 0.064 (X 16.0) |
C. elegans | ce11 | 0.10 | 0.02 | 0.103 (X 5.1) | 0.409 (X 20.4) | 0.408 (X 20.4) | 0.492 (X 24.6) | 0.274 (X 13.7) | 0.274 (X 13.7) | 0.426 (X 21.3) | 0.225 (X 11.2) | 0.472 (X 23.6) |
D. melanogaster | dm6 | 0.14 | 0.028 | 0.147 (X 5.2) | 0.581 (X 20.7) | 0.583 (X 20.8) | 0.714 (X 25.5) | 0.426 (X 15.2) | 0.418 (X 14.9) | 0.633 (X 22.6) | 0.337 (X 12.0) | 0.667 (X 23.8) |
D. rerio | danRer11 | 1.37 | 0.22 | 0.742 (X 3.4) | 6.815 (X 31.0) | 6.803 (X 30.9) | 8.216 (X 37.3) | 3.946 (X 17.9) | 3.95 (X 18.0) | 7.202 (X 32.7) | 3.029 (X 13.8) | 7.633 (X 34.7) |
C. familiaris | canFam4 | 2.48 | 0.311 | 1.209 (X 3.9) | 10.158 (X 32.7) | 10.124 (X 32.6) | 12.206 (X 39.2) | 6.55 (X 21.1) | 6.518 (X 21.0) | 10.671 (X 34.3) | 4.741 (X 15.2) | 11.394 (X 36.6) |
H. sapiens | GRCh38 | 3.10 | 0.43 | 1.696 (X 3.9) | 12.393 (X 28.8) | 12.432 (X 28.9) | 13.681 (X 31.8) | 7.414 (X 17.2) | 7.284 (X 16.9) | 13.102 (X 30.5) | 6.37 (X 14.8) | 14.074 (X 32.7) |
B. bombina | aBomBom1 | 9.80 | 1.554 | 8.501 (X 5.5) | 41.676 (X 26.8) | 41.696 (X 26.8) | 49.064 (X 31.6) | 24.202 (X 15.6) | 24.374 (X 15.7) | 43.856 (X 28.2) | 19.755 (X 12.7) | 45.387 (X 29.2) |
A. mexicanum | AmbMex60DD | 28.20 | 3.327 | 14.375 (X 4.3) | 118.923 (X 35.7) | 118.422 (X 35.6) | 137.781 (X 41.4) | 57.626 (X 17.3) | 57.591 (X 17.3) | 121.257 (X 36.4) | 54.82 (X 16.5) | 128.374 (X 38.6) |
P. annectens | PAN1.0 | 40.10 | 4.606 | 18.664 (X 4.1) | 167.85 (X 36.4) | 165.701 (X 36.0) | 196.833 (X 42.7) | 91.747 (X 19.9) | 91.924 (X 20.0) | 170.475 (X 37.0) | 77.707 (X 16.9) | 181.562 (X 39.4) |
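the (X n) values appear to be each tool's time divided by chromsize's time on the same assembly, rounded to one decimal; for seqkit on GRCh38 that is 1.696 / 0.43 ≈ 3.9. a small sketch recomputing a few of them for the GRCh38 row, with the numbers copied from the table above (`fold_slower` is a made-up helper name):

```python
# GRCh38 timings copied from the table above (same unit for every tool)
grch38 = {
    "chromsize": 0.43,
    "seqkit": 1.696,
    "awk1": 12.393,
    "bioawk2": 7.284,
    "pyfaidx": 6.37,
    "samtools": 14.074,
}


def fold_slower(times, baseline="chromsize"):
    """Return how many times slower each tool is than the baseline, to one decimal."""
    base = times[baseline]
    return {
        tool: round(t / base, 1)
        for tool, t in times.items()
        if tool != baseline
    }


for tool, fold in fold_slower(grch38).items():
    print(f"{tool}: X {fold}")
```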
how well does it perform with .gz?
CHM13-T2T.fa.gz
Tool | Cores | Time |
---|---|---|
seqkit | 16 | 18.993 s ± 0.132 s |
chromsize | default (max_cpus: 16) | 7.631 s ± 0.010 s |
seqkit | default (4) | 18.525 s ± 0.520 s |
chromsize | 4 | 8.035 s ± 0.077 s |
seqkit | 2 | 18.535 s ± 0.376 s |
chromsize | 2 | 8.284 s ± 0.030 s |