bin+lib bigsig

Large-scale Sequence Search with BItsliced Genomic Signature Index (BIGSIG)

0.1.0 Aug 30, 2024

MIT license

This is a port of the colorid crate with several updates for real-world applications:

  1. Use xxh3 to support aarch64 and x86-64 platforms;
  2. Use needletail for fast processing of plain and compressed fasta/fastq files;
  3. Use a 2-bit nucleotide sequence representation via kmerutils to improve memory efficiency;
  4. Recreate the command-line interface using a recent clap (v4.3).
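
To illustrate why the 2-bit representation in item 3 saves memory, here is a minimal sketch of packing a k-mer into a single `u64`. It does not use or reproduce the kmerutils API; the function names `encode_base`, `pack_kmer`, and `unpack_kmer` and the A=0, C=1, G=2, T=3 mapping are assumptions made for this example only.

```rust
/// Map a nucleotide to a 2-bit code (assumed mapping: A=0, C=1, G=2, T=3).
/// Returns None for any other symbol, such as N.
fn encode_base(base: u8) -> Option<u64> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a k-mer of at most 32 bases into a u64, two bits per base.
/// The first base ends up in the most significant of the 2*k used bits.
fn pack_kmer(kmer: &[u8]) -> Option<u64> {
    if kmer.len() > 32 {
        return None; // 32 bases * 2 bits is all a u64 can hold
    }
    let mut packed = 0u64;
    for &base in kmer {
        packed = (packed << 2) | encode_base(base)?;
    }
    Some(packed)
}

/// Recover the ASCII bases of a packed k-mer of length k (k must be <= 32).
fn unpack_kmer(packed: u64, k: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    (0..k)
        .map(|i| {
            let shift = 2 * (k - 1 - i);
            BASES[((packed >> shift) & 0b11) as usize]
        })
        .collect()
}
```

With this layout a 31-mer occupies 62 bits, so it fits in one 8-byte integer instead of 31 bytes of ASCII text. Calling `pack_kmer(b"ACGT")` yields `Some(27)` (binary `00 01 10 11`), and a k-mer containing an ambiguous base such as `N` yields `None`.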

Credit for the original implementation goes to the original authors.

Install

git clone https://gitlab.com/Jianshu_Zhao/bigsig
cd bigsig
cargo build --release

Usage

 ************** initializing logger *****************

bigsig 0.1.0
Large-scale Sequence Search with BItsliced Genomic Signature Index (BIGSIG)

USAGE:
    bigsig [SUBCOMMAND]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    batch_identify    Identify batch of samples reads
    construct         Construct a BIGSIG
    filter            filters reads
    help              Prints this message or the help of the given subcommand(s)
    identify          identify reads based on probability
    query             query a bigsig on one or more fasta/fastq.gz files
    show              show index parameters

An example of building and querying a BIGSIG database:

bigsig construct -r ref_file_example.txt -b test -k 31 -mv 21 -s 10000000 -n 4 -t 24
bigsig query -b ./test.mxi -q ./test_data/test.fastq.gz
bigsig identify -b test.mxi -q ./test_data/test.fastq.gz -n output -t 24 --high_mem_load

Results

With the default settings, BIGSIG reports reference sequences that share >35% of their k-mers with the query. Here is the output for a query with SRA accession SRR4098796 (L. monocytogenes lineage I):

SRR4098796_1.fastq.gz	3076072	Listeria_monocytogenes_F2365	0.87	134.25	126	475266
SRR4098796_1.fastq.gz	3076072	Listeria_monocytogenes_SRR2167842	0.40	128.25	122	7831

Each output line has seven columns:

  1. the query file;
  2. the number of k-mers in the query;
  3. the reference sequence;
  4. the proportion of k-mers in the reference shared with the query;
  5. the average coverage, based on k-mers that were uniquely matched to this reference;
  6. the mode of the coverage, based on uniquely matched k-mers;
  7. the number of uniquely matched k-mers.
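
For downstream processing, a result line can be split into these seven fields. The sketch below is not part of bigsig: the `Hit` struct, its field names, and `parse_hit` are hypothetical, and it assumes the columns are tab-separated, as they appear in the example output above.

```rust
/// One result line from a query (hypothetical struct; field names are illustrative).
#[derive(Debug, PartialEq)]
struct Hit {
    query: String,
    query_kmers: u64,
    reference: String,
    shared_proportion: f64,
    mean_coverage: f64,
    modal_coverage: f64,
    unique_kmers: u64,
}

/// Parse one line of output, assuming exactly seven tab-separated columns.
/// Returns None if the column count is wrong or a number fails to parse.
fn parse_hit(line: &str) -> Option<Hit> {
    let fields: Vec<&str> = line.trim_end().split('\t').collect();
    if fields.len() != 7 {
        return None;
    }
    Some(Hit {
        query: fields[0].to_string(),
        query_kmers: fields[1].parse().ok()?,
        reference: fields[2].to_string(),
        shared_proportion: fields[3].parse().ok()?,
        mean_coverage: fields[4].parse().ok()?,
        modal_coverage: fields[5].parse().ok()?,
        unique_kmers: fields[6].parse().ok()?,
    })
}
```

Applied to the first example line, `parse_hit` returns a `Hit` whose `reference` is `Listeria_monocytogenes_F2365` and whose `shared_proportion` is 0.87. Parsed hits can then be sorted by `unique_kmers` or filtered with a stricter cutoff than the default.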

Reference

  1. Bradley, Phelim, et al. "Ultrafast search of all deposited bacterial and viral genomic data." Nature Biotechnology 37.2 (2019): 152-159.
  2. Bingmann, Timo, et al. "COBS: a compact bit-sliced signature index." String Processing and Information Retrieval: 26th International Symposium, SPIRE 2019, Segovia, Spain, October 7–9, 2019, Proceedings 26. Springer International Publishing, 2019.
