#suffix-array #search #fm-index #bioinformatics #search-query #fasta-fastq #exact-match-search

awry

Library for creating FM-indexes from FASTA/FASTQ files. AWRY is able to search at lightning speed by leveraging SIMD vectorization and multithreading over collections of queries.

7 releases

new 0.3.1 Jan 14, 2025
0.3.0 Jan 14, 2025
0.2.2 Jan 14, 2025
0.1.3 Dec 16, 2024

#80 in Biology

Download history 89/week @ 2024-12-05 396/week @ 2024-12-12 17/week @ 2024-12-19 328/week @ 2025-01-09

364 downloads per month

BSD-3-Clause

130KB
2.5K SLoC

AWRY

Avx Windowed fm-index in Rust? Yes!

Generates an Fm-Index of a given biological sequence text (Fasta or Fastq file), and implements Locate() and Search() functionalities.

AWRY is a port of a state-of-the-art, fastest in its class FM-index implementation (https://doi.org/10.1186/s13015-021-00204-6). AWRY supports parallelized searching, with parallel_count() and parallel_locate() functions. As AWRY is powered by custom SIMD operations, the library currently supports x86_64 and aarch64 machines with AVX2 and ARM Neon.

AWRY supports DNA, RNA, and protein alphabets, and is able to search at lightning speed by leveraging SIMD vectorization and multithreading over collections of queries. These alphabets are case-insensitive- A and a are considered the same symbol under the hood.

Building an FM-index

to build an fm-index, create an FmBuildArgs struct, and call FmIndex::new()

let buildArgs =  FmBuildArgs {
    input_file_src: "my_input.fa",              //sets what the input file for the database text will be
    suffix_array_output_src: None,              //will build to a default location
    suffix_array_compression_ratio: None,       // ratio of suffix array compression, 8 by default
    lookup_table_kmer_len: None,                //by default, chooses reasonable table sizes (Dna=13, Amino=5)
    alphabet: SymbolAlphabet::Nucleotide,       //alphabet to build
    max_query_len: None,                        //if set, only sort suffix array up to n positions
    remove_intermediate_suffix_array_file: true,//deletes the suffix array file if true
}

let fm_index = FmIndex::new(&buildArgs);

If you only intend to use the count function, you can set the suffix array compression to a high value like 255 to reduce memory usage.

Searching for a query

To search for a query, use to count_string and locate_string functions.

pub fn count_string(&self, query: &String) -> u64 {
    ...
}

/// Finds the locations in the original text of all isntances of the given query.
pub fn locate_string(&self, query: &String) -> Vec<u64> {
    ...
}

Searching for queries in parallel

To find a large number of queries, searching can be parallelized easily with the parallel_count and parallel_locate functions

pub fn parallel_count(&self, queries: &Vec<String>) -> Vec<u64> {
    ...
}

// Finds the locations for each query in the query list. This function uses rayon's into_par_iter() for parallelism.
pub fn parallel_locate(&self, queries: &Vec<String>) -> Vec<Vec<u64>> {
    ...
}

Dependencies

~7–17MB
~254K SLoC