7 releases
new 0.3.1 | Jan 14, 2025 |
---|---|
0.3.0 | Jan 14, 2025 |
0.2.2 | Jan 14, 2025 |
0.1.3 | Dec 16, 2024 |
#80 in Biology
364 downloads per month
130KB
2.5K
SLoC
AWRY
Avx Windowed fm-index in Rust? Yes!
Generates an Fm-Index of a given biological sequence text (Fasta or Fastq file), and implements Locate() and Search() functionalities.
AWRY is a port of a state-of-the-art, fastest in its class FM-index implementation (https://doi.org/10.1186/s13015-021-00204-6). AWRY supports parallelized searching, with parallel_count() and parallel_locate() functions. As AWRY is powered by custom SIMD operations, the library currently supports x86_64 and aarch64 machines with AVX2 and ARM Neon.
AWRY supports DNA, RNA, and protein alphabets, and is able to search at lightning speed by leveraging SIMD vectorization and multithreading over collections of queries. These alphabets are case-insensitive- A
and a
are considered the same symbol under the hood.
Building an FM-index
to build an fm-index, create an FmBuildArgs struct, and call FmIndex::new()
let buildArgs = FmBuildArgs {
input_file_src: "my_input.fa", //sets what the input file for the database text will be
suffix_array_output_src: None, //will build to a default location
suffix_array_compression_ratio: None, // ratio of suffix array compression, 8 by default
lookup_table_kmer_len: None, //by default, chooses reasonable table sizes (Dna=13, Amino=5)
alphabet: SymbolAlphabet::Nucleotide, //alphabet to build
max_query_len: None, //if set, only sort suffix array up to n positions
remove_intermediate_suffix_array_file: true,//deletes the suffix array file if true
}
let fm_index = FmIndex::new(&buildArgs);
If you only intend to use the count function, you can set the suffix array compression to a high value like 255 to reduce memory usage.
Searching for a query
To search for a query, use to count_string and locate_string functions.
pub fn count_string(&self, query: &String) -> u64 {
...
}
/// Finds the locations in the original text of all isntances of the given query.
pub fn locate_string(&self, query: &String) -> Vec<u64> {
...
}
Searching for queries in parallel
To find a large number of queries, searching can be parallelized easily with the parallel_count and parallel_locate functions
pub fn parallel_count(&self, queries: &Vec<String>) -> Vec<u64> {
...
}
// Finds the locations for each query in the query list. This function uses rayon's into_par_iter() for parallelism.
pub fn parallel_locate(&self, queries: &Vec<String>) -> Vec<Vec<u64>> {
...
}
Dependencies
~7–17MB
~254K SLoC