#fasta #fuse #bioinformatics #mount-point

bin+lib fusta

FUSTA leverages the FUSE interface to transparently manipulate multiFASTA files as independent files

3 stable releases

1.7.1 Jan 23, 2023
1.6.0 Aug 17, 2022
1.5.5 Aug 16, 2022

#191 in Biology

CECILL-C

88KB
2K SLoC

  • TL;DR #+begin_src shell fusta GRCh38.fa -o hg38 cat hg38/fasta/chr{X,Y}.fa > ~/sex_chrs.fa cat hg38/get/chr17:18108706-18179802 > MYO15A.fa rm hg38/seq/chr{3,5}.seq fusermount -u hg38 #+end_src
  • What is FUSTA? FUSTA is a FUSE-based virtual filesystem mirroring a (multi)FASTA file as a hierarchy of individual virtual files, simplifying efficient data extraction and bulk/automated processing of FASTA files.

[file:fusta.png]

The virtual files exposed by FUSTA behave like standard flat text files, and provide automatic compatibility with all existing programs. When handling large multiFASTA files, the intrinsic file caching capacities of the OS are leveraged to ensure the best experience to the user.

** Citation

If you use FUSTA, please cite [[https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac091/6851693][FUSTA: leveraging FUSE for manipulation of multiFASTA files at scale]], https://doi.org/10.1093/bioadv/vbac091

** Licensing FUSTA is distributed under the CeCILL-C (LGPLv3 compatible) license. Please see the LICENSE file for details.

You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use. ** Fedora/Rocky Linux/Alma Linux #+begin_src sudo yum install rust cargo fuse3 fuse3-devel cargo install --git https://github.com/delehef/fusta #+end_src

You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use. ** Debian #+begin_src bash sudo apt install curl fuse3 libfuse3-dev curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh # Debian cargo is outdated #+end_src then rebot your shell to update the =PATH= environment variable.

Finally, install FUSTA: #+begin_src bash cargo install --git https://github.com/delehef/fusta #+end_src You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use. ** Scientific Linux #+begin_src bash curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh sudo yum install fuse3 fuse3-devel #+end_src then rebot your shell to update the =PATH= environment variable.

Finally, install FUSTA: #+begin_src cargo install --git https://github.com/delehef/fusta #+end_src

You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use. ** macOS On macOS, you will need to install the build tools if you have not them ready yet: =xcode-select --install=

You must then download and install [[https://osxfuse.github.io/][FUSE for macOS]] in order to be able to use FUSTA.

Finally, to build FUSTA, you need to install the [[https://www.rust-lang.org/en-US/install.html][Rust compiler]]. You can then build FUSTA by running =cargo=, the Rust build tool: #+begin_src cargo install --git https://github.com/delehef/fusta #+end_src ** FreeBSD #+begin_src bash sudo pkg install rust pkgconf fusefs-libs # Install build dependencies sudo sysctl vfs.usermount=1 # enable FUSE mounting without requiring administrator permissions sudo kldload fuse # load the FUSE kernel module #+end_src

Finally, install FUSTA: #+begin_src cargo install --git https://github.com/delehef/fusta #+end_src

You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use. ** From Sources You should install FUSE (as well as its potential =devel= package), from your package manager – note that a reboot might be necessary for the kernel module to be loaded.

To build FUSTA, you need to install the [[https://www.rust-lang.org/en-US/install.html][Rust compiler]]. You can then build FUSTA by running =cargo=, the Rust build tool: #+begin_src cargo install --git https://github.com/delehef/fusta #+end_src

You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use.

  • Usage ** Quick Start These commands run =fusta= in the background, mount the FASTA file =file.fa= in an automatically created =fusta= folder, exposing all the sequences contained in =file.fa= there. The call to =tree= will display the virtual hierarchy, then =fusermount= is called to cleanly unmount the file.

#+begin_src fusta file.fa tree -h fusta/ fusermount -u fusta #+end_src ** Description Once started, =fusta= will expose the content of a FASTA file in a way that makes it usable by any piece of software using as if it were a set of independent files, detailed as follow.

For instance, here is the virtual hierarchy created by =fusta= after mounting a FASTA file containing /A. thaliana/ genome #+begin_src fusta ├── append ├── fasta │   ├── 1.fa │   ├── 2.fa │   ├── 3.fa │   ├── 4.fa │   ├── 5.fa │   ├── Mt.fa │   └── Pt.fa ├── get ├── infos.csv ├── infos.txt ├── labels.txt └── seqs ├── 1.seq ├── 2.seq ├── 3.seq ├── 4.seq ├── 5.seq ├── Mt.seq └── Pt.seq #+end_src

FUSTA supports all FUSTA files using UNIX-style line endings, including but not restricted to DNA files, protein files, gapped files, mixed-case files, and independently of their inner formatting (line wrapping, line length, /etc./). *** =infos.csv= This read-only CSV file contains a list of all the fragments present in the mounted FASTA file, with, for each of them, the standard =id= and =additional informations= field, plus a third one containing the length of the sequence. *** =infos.txt= This read-only text file provides the same informations, but in a more human-readable format. *** =labels.txt= This read-only file contains a list of all the sequence headers present in the mounted FASTA file. *** =fasta= This folder contains all the individual sequences present in the original FASTA file, exposed as virtually independent read-only FASTA files. *** =seqs= This folder contains all the individual sequences present in the original FASTA file, exposed as virtually independent read/write files containing only the sequences - without the FASTA headers, but with any newline preserved. These files can be read, copied, removed, edited, etc. as normal files, and any alteration will be reflected on the original FASTA file when fusta is closed. *** =append= This folder should be used to add new sequences to the mounted FASTA file. Any valid fasta file copied or moved to this directory will be appended to the original FASTA files. It should be noted that the process is completely transparent and the the folder will remain empty, even though the operation is successful. *** =get= This folder is used for range-access to the sequences in the mounted FASTA file. Although it is empty, any read access to a (non-existing) file following the pattern =SEQID:START-END= will return the corresponding range (1-indexed, fully-closed) in the specified sequence. It should be noted that the access skip headers and newlines, so that the =START-END= coordinates map to actual loci in the corresponding sequence and not to bytes in the mounted FASTA file. ** Examples All the following examples assume that a FASTA file has been mounted (/e.g./ =fusta -D genome.fa=), and is unmounted after manipulation (/e.g./ =fusermount -u fusta=). *** Get an overview of the file content #+begin_src shell cat fusta/infos.txt #+end_src *** Extract individual sequences as FASTA files #+begin_src shell cat fusta/fasta/chr{X,Y}.fa > ~/sex_chrs.fa #+end_src *** Extract a range of chromosome 12 #+begin_src shell cat fusta/get/chr12:12000000-12002000 #+end_src *** Remove sequences from the original file #+begin_src shell rm fusta/seq/chr{3,5}.seq #+end_src *** Add a new sequence #+begin_src shell cp more_sequences.fa fusta/append #+end_src *** Upcasing a sequence #+begin_src shell sed 's/[a-z]/\U&/g' fusta/seqs/chr21.seq | sponge fusta/seqs/chr21.seq #+end_src *** Edit the mitochondrial genome #+begin_src shell nano fusta/seq/chrMT.seq #+end_src *** Batch-rename chromosomes #+begin_src shell cd fusta/seq; for i in *; do mv ${i} chr${i}; done #+end_src *** Use independent sequences in external programs #+begin_src shell blastn mydb.db -query fusta/fasta/seq25.fa asgart fusta/fasta/chrX.fa fusta/asgart/chrY.fa --out result.json #+end_src ** Compressed FASTA files FUSTA only works with uncompressed (multi)FASTA files. If you wish to use FUSTA on compressed (multi)FASTA files, we recommend to use [[https://github.com/yhoogstrate/fastafs][FASTAFS]] as an intermediary to expose a compressed (multi)FASTA file to FUSTA without requiring to ully uncompress it. ** Runtime options #+begin_src USAGE: fusta [OPTIONS]

ARGS: A (multi)FASTA file containing the sequences to mount

OPTIONS: -C, --max-cache Set the maximum amount of memory to use to cache writes (MB) [default: 500] --cache Use either mmap, fseek(2) or memory-backed cache to extract sequences from FASTA files. WARNING: memory caching use as much RAM as the size of the FASTA file should be available. [default: mmap] [possible values: file, mmap, memory] -D, --no-daemon Do not daemonize -h, --help Print help information -o, --mountpoint Specifies the directory to use as mountpoint; it will be created if it does not exist -S, --sep Set the separator to use in CSV files [default: ,] -v Sets the level of verbosity -V, --version Print version information -W, --allow-overwrite allow FUSTA to overwrite existing sequences, when (i) appending new sequences conflicting with an existing ID, (ii) renaming sequences #+end_src

*** =--cache= The cache option is key in adapting FUSTA to your use, and for files of non-trivial size, a correct choice is the difference between a memory overflow and a smooth run:

  • =file= :: in this mode, FUSTA store all the fragments as offsets in their file, and access them through =fseek= accesses. The performances will probably be the worse, but memory consumption will be kept to the minimal.
  • =mmap= :: this mode is extremely similar to the previous one, safe that access will proceed through [[https://en.wikipedia.org/wiki/Mmap][mmmap(2)]] reads, leveraging the caching facilities of the OS -- this is the default mode.
  • =memory= :: in this mode, all fragments will directly be copied to memory. Performances will be at their best, but enough memory should be available to store the entirety of the processed files.
  • Troubleshooting *** I get a "Cannot allocate memory" error The FASTA files may be overflowing the default setting of the memory overcommit guard. You may change the overcommiting setting with =sysctl -w vm.overcommit_memory 1=, or use =--cache=file= for less performances, but less virtual memory pressure. *** I still get a "Cannot allocate memory" error Your FASTA file may contain too many fragments w.r.t. the number of mmap pages that can be mapped by a program. You may increase =max_map_count= with =sysctl -w vm.max_map_count 200000=, or use =--cache=file= for less performances, but less virtual memory pressure. *** I have another error [[https://github.com/delehef/fusta/issues][Open an issue stating your problem!]]
  • Contact If you have any question or if you encounter a problem, do not hesitate to [[https://github.com/delehef/fusta/issues][open an issue]].
  • Acknowledgments FUSTA is standing on the shoulders of, among others, [[https://github.com/cberner/fuser][fuser]], [[https://github.com/clap-rs/clap][clap]], [[https://github.com/RazrFalcon/memmap2-rs][memmap2]] and [[https://github.com/knsd/daemonize][daemonize]].
  • Changelog ** v1.7.1
  • Fix missing newline in some cases ** v1.7
  • Use 1-based, fully-closed genomic coordinates ** v1.6.1
  • Accept more characters as FASTA sequences: =\n - _ . + == ** v1.6
  • Fix truncating
  • Improved error handling
  • Better notifications
  • Add a flag to allow overwrite of existing sequences as a side-effect
  • Only ASCII alphanumerical content can be written to sequence files
  • Refuse to open FASTA files with IDs containing characters invalid in a filename ** v1.5.7
  • Update dependencies ** v1.5.6
  • The default mount point is now =fusta-{filename}= ** v1.5.4
  • Fix mountpoint not being created ** v1.5.3
  • Improve notification system ** v1.5.2
  • Daemonize by default ** v1.5.1
  • Update memmap to memmap2 ** v1.5
  • Add an index on ino for better performances at the cost of a bit of memory ** v1.4
  • Daemonize /after/ parsing FASTA files, so that (i) errors appear immediately and (ii) performances are better when launching multiple instances in parallel. ** v1.3
  • Can now cache all fragments in memory: increased RAM consumption, but starkly reduced random access time ** v1.2.1
  • Bugfixes ** v1.2
  • FUSTA is now based on fuster instead of fuse-rs
  • Various optimization let FUSTA handle >40GB FASTA files in 6GB of RAM and much better performances
  • Added an optional notification system behind the =notifications= feature gate ** v1.1.1
  • Use MMAP by default. While it may lead to unpleasant load when performing heavy operation on very large files, this should be a rather uncommon case. ** v1.1
  • FUSTA can now directly extract ranges from a sequence ** v1.0
  • Initial release

Dependencies

~13–43MB
~694K SLoC