1 unstable release

new 0.1.0 Nov 25, 2024

#32 in Biology

Custom license

140KB
2.5K SLoC

LRGE

check test

Long Read-based Genome size Estimation from overlaps

LRGE (pronounced "large") is a command line tool for estimating genome size from long read overlaps. The tool is built on top of the liblrge Rust library, which is also available as a standalone library for use in other projects.

PREPRINT/PAPER COMING SOON

Table of Contents

Installation

Precompiled binary

GitHub Downloads (all assets, all releases) GitHub Release

curl -sSL lrge.mbh.sh | sh
# or with wget
wget -nv -O - lrge.mbh.sh | sh

You can also pass options to the script like so

$ curl -sSL lrge.mbh.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of lrge, if lrge is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer [default: apple-darwin]

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: aarch64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/lrge/releases]

        -h, --help
                Display this help message

Conda

Conda Version Conda Platform Conda Downloads

conda install -c bioconda lrge

Cargo

Crates.io Version Crates.io Total Downloads

cargo install lrge

Container

Docker images are hosted on the GitHub Container registry.

Apptainer

Prerequisite: apptainer (previously Singularity)

$ URI="docker://ghcr.io/mbhall88/lrge:latest"
$ apptainer exec "$URI" lrge --help

The above will use the latest version. If you want to specify a version then use a tag like so.

$ VERSION="0.1.0"
$ URI="docker://ghcr.io/mbhall88/lrge:${VERSION}"

Docker

Prerequisite: docker

$ docker pull ghcr.io/mbhall88/lrge:latest
$ docker run ghcr.io/mbhall88/lrge:latest lrge --help

You can find all the available tags here.

Build from source

$ git clone https://github.com/mbhall88/lrge.git
$ cd lrge
$ cargo build --release
$ target/release/lrge -h

Usage

Estimate the genome size of a set of Mycobacterium tuberculosis ONT reads (true genome size: 4.40 Mbp / 4405449 bp).

$ wget -O reads.fq.gz "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR283/049/SRR28370649/SRR28370649_1.fastq.gz"
$ lrge -t 8 reads.fq.gz
[2024-11-22T03:49:53Z INFO  lrge] Running two-set strategy with 10000 target reads and 5000 query reads
[2024-11-22T03:50:10Z INFO  lrge] Estimated genome size: 4.43 Mbp (IQR: 3.16 Mbp - 4.99 Mbp)
4426642
[2024-11-22T03:50:10Z INFO  lrge] Done!

The size estimate is printed to stdout, but you can also save it to a file with the -o flag.

$ lrge -t 8 reads.fq.gz -o size.txt
[2024-11-22T03:49:53Z INFO  lrge] Running two-set strategy with 10000 target reads and 5000 query reads
[2024-11-22T03:50:10Z INFO  lrge] Estimated genome size: 4.43 Mbp (IQR: 3.16 Mbp - 4.99 Mbp)
[2024-11-22T03:50:10Z INFO  lrge] Done!
$ cat size.txt
4426642

By default, LRGE uses the two-set strategy with 10,000 target reads (-T) and 5,000 query reads (-Q). You can use the all-vs-all strategy by specifying the number of reads to use with the -n flag.

Library

You can also use the liblrge library in your Rust projects. This allows you to estimate genome size within your own applications - without needing to call out to lrge. For more details on how to use the library, see the documentation or the source code.

Standard options

$ lrge -h
Genome size estimation from long read overlaps

Usage: lrge [OPTIONS] <INPUT>

Arguments:
  <INPUT>  Input FASTQ file

Options:
  -o, --output <OUTPUT>      Output file for the estimate [default: -]
  -T, --target <INT>         Target number of reads to use (for two-set strategy; default) [default: 10000]
  -Q, --query <INT>          Query number of reads to use (for two-set strategy; default) [default: 5000]
  -n, --num <INT>            Number of reads to use (for all-vs-all strategy)
  -P, --platform <PLATFORM>  Sequencing platform of the reads [default: ont] [possible values: ont, pb]
  -t, --threads <INT>        Number of threads to use [default: 1]
  -C, --keep-temp            Don't clean up temporary files
  -D, --temp <DIR>           Temporary directory for storing intermediate files
  -s, --seed <INT>           Random seed to use - making the estimate repeatable
  -q, --quiet...             `-q` only show errors and warnings. `-qq` only show errors. `-qqq` shows nothing
  -v, --verbose...           `-v` show debug output. `-vv` show trace output
  -h, --help                 Print help (see more with '--help')
  -V, --version              Print version

Full usage

Estimate genome size of PacBio reads

$ lrge -P pb -t 8 reads.fq

Don't remove the intermidiate read and overlap files

$ lrge -C reads.fq

Use the all-vs-all strategy with 10,000 reads

$ lrge -n 10000 reads.fq

Fix the seed so that subsequent runs return the same size estimate

$ lrge -s 123 reads.fq

By default, we take the median of the finite estimates to get the final genome size estimate. If you want to include infinite estimates in the calculation

$ lrge -8 reads.fq

If you don't want the estimate to be rounded to the nearest integer 🤓

$ lrge --float-my-boat reads.fq

In the paper, we suggest using the 15th and 65th percentiles of the estimates to get a ~92% confidence interval. However, you can change these

$ lrge --q1 0.25 --q3 0.75 reads.fq

If you want to see the estimate for each read, turn on trace level logging

$ lrge -vv reads.fq

By default, the intermediate files are stored in a temporary directory. You can specify a different temporary directory

$ lrge -D ./mytemp/ reads.fq

$ lrge --help
Genome size estimation from long read overlaps

Usage: lrge [OPTIONS] <INPUT>

Arguments:
  <INPUT>
          Input FASTQ file

Options:
  -o, --output <OUTPUT>
          Output file for the estimate

          [default: -]

  -T, --target <INT>
          Target number of reads to use (for two-set strategy; default)

          [default: 10000]
  -Q, --query <INT>
          Query number of reads to use (for two-set strategy; default)

          [default: 5000]

  -n, --num <INT>
          Number of reads to use (for all-vs-all strategy)

  -P, --platform <PLATFORM>
          Sequencing platform of the reads

          [default: ont]
          [possible values: ont, pb]

  -t, --threads <INT>
          Number of threads to use

          [default: 1]

  -C, --keep-temp
          Don't clean up temporary files

  -D, --temp <DIR>
          Temporary directory for storing intermediate files

  -s, --seed <INT>
          Random seed to use - making the estimate repeatable

  -8, --inf
          Take the estimate as the median of all estimates, *including infinite estimates*

  -f, --float-my-boat
          I neeeeeed that precision! Output the estimate as a floating point number

      --q1 <FLOAT>
          The lower quantile to use for the estimate

          [default: 0.15]

      --q3 <FLOAT>
          The upper quantile to use for the estimate

          [default: 0.65]

  -q, --quiet...
          `-q` only show errors and warnings. `-qq` only show errors. `-qqq` shows nothing

  -v, --verbose...
          `-v` show debug output. `-vv` show trace output

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Method

For a full description of the method, see the paper.

Two-set strategy

The two-set strategy is the default method used by LRGE. It involves randomly selecting a two distinct subsets of reads from the input. One subset is deemed the target set ($T$) and the other the query set ($Q$). Each read $q_i$ in $Q$ is overlapped against $T$ and a genome size ($\textbf{GS}$) estimate is generated for that read ($\textbf{GS}_{T,q_i}$). The estimate is calculated based on the number of overlaps of $q_i$ with reads in $T$ ($O_{T,q_i}$), according to the formula:

\textbf{GS}_{T,q_i} \approx \frac{\vert T \vert \cdot \vert q_i \vert + \overline{t \in T} - 2 \cdot \textbf{OT}}{O_{T,q_i}}

where $\vert T \vert$ is the total size of the target set, $\vert q_i \vert$ is the length of read $q_i$, $\overline{t \in T}$ is the average length of reads in $T$, and $\textbf{OT}$ is the overlap threshold (minimum chain score in minimap2, which defaults to 100 for overlaps). See the paper for more formal/rigorous definitions.

Ultimately, the genome size estimate is the median of the finite estimates for each read in $Q$.

We use this strategy as the default as it is the most computationally efficient and the accuracy is comparable to the all-vs-all strategy. We suggest a smaller number of query reads than target reads, as this will speed things up and as we take the median of the estimates, the number of query reads (over a certain point) should not affect the accuracy of the estimate all that much.

All-vs-all strategy

The all-vs-all strategy involves overlapping some random subset (-n) of reads in the input against each other. The genome size estimate for each read is calculated as above, but we subtract one from $\vert T \vert$ to account for the fact that the read is not being overlapped against itself. We also do not factor the length of the read whose size is being estimated into the average read length calculation.

This strategy is generally more computationally expensive than the two-set strategy, but it can be more accurate. Though we did not find the difference to be statistically significant in our tests.

Results

We compared LRGE to three other methods: GenomeScope2, Mash, and Raven (see below for more info). We ran each method on 3370 read sets from PacBio or ONT data. Each of these samples is associated with a RefSeq assembly, so the true size was taken as the size of the RefSeq assembly. You can find the metadata for the samples here.

The full results are available in the paper and here. Here is a brief summary of how LRGE compares to other methods.

Results

This compares the absolute relative error as a percentage. The relative error ($\epsilon_{\text{rel}}$) is calculated as:

    \epsilon_{\text{rel}} = \frac{\hat{G} - G}{G} \cdot 100

where $G$ is the true genome size, and $\hat{G}$ is the estimated genome size. For example, a $\epsilon_{\text{rel}}$ of 50% is out (higher or lower) by 50% of the true genome size. So if the true genome size is 1 Mbp, a $\epsilon_{\text{rel}}$ of 50% would be 1.5 Mbp or 0.5 Mbp.

The following figure shows the (non-absolute) relative error for the same methods to give an indication of which methods tend to over or underestimate.

Results

Benchmark

For the full details of the methods benchmarked, see the paper. However, here is a brief summary of the results.

Benchmark

The statistical annotations above the violins are coloured by the method which has the lowest mean value for the given metric.

Alternatives

The methods we compare against are:

GenomeScope2: to get estimates from GenomeScope2, you need to first generate a k-mer spectrum. We used KMC for this. You can find a Python script that takes reads and generates a k-mer spectrum in genomescope.py. The list of parameters used can also be found in the workflow config.

Mash: we used mash sketch on the reads, which prints out the estimated genome size in the logging output. You can find the options used in the workflow config.

Raven: Raven essentially just assembles the reads - REALLLLY fast 🚀

You can find the full details of how we compared methods in the workflow.

Citation

If you use LRGE in your research, please cite the following paper:

COMING SOON

Dependencies

~8–18MB
~254K SLoC