forgers

VCF manipulation based on FORGe ranking

1 unstable release

0.1.1 Aug 26, 2024

MIT license

Background

FORGe [Pritt2018] is a model and a software tool for prioritising and filtering the variants to be included in a pangenome reference. It scores each variant's "expected positive and negative impacts on alignment accuracy and computational overhead" based on population frequency, graph repetitiveness, and/or variant proximity. Variants are then ranked by these scores, and a fraction of the top-ranked variants is used to augment the reference genome.
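The rank-then-select idea can be illustrated with a minimal sketch. It assumes the simplest possible score, the population allele frequency alone; FORGe's actual scoring may also weigh repetitiveness and proximity. The `Variant` type, `rank_by_frequency`, and `top_fraction` are hypothetical names introduced here for illustration and are not part of FORGe or forgers.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int      # 1-based position, as in VCF
    ref: str
    alt: str
    freq: float   # population allele frequency (hypothetical input)


def rank_by_frequency(variants):
    """Sort variants from highest to lowest score.

    The score is just the allele frequency; this is a stand-in for
    FORGe's scoring, not a reimplementation of it.
    """
    return sorted(variants, key=lambda v: v.freq, reverse=True)


def top_fraction(ranked, fraction):
    """Keep the leading `fraction` (0.0 to 1.0) of an already ranked list."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    keep = int(len(ranked) * fraction)
    return ranked[:keep]


variants = [
    Variant("chr1", 100, "A", "G", 0.02),
    Variant("chr1", 250, "C", "T", 0.45),
    Variant("chr1", 300, "G", "A", 0.10),
    Variant("chr1", 420, "T", "C", 0.30),
]
ranked = rank_by_frequency(variants)
selected = top_fraction(ranked, 0.5)
print([v.pos for v in selected])  # [250, 420]
```

Only the selected variants would then be passed on to graph construction.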

The FORGe implementation is, by design, compatible with HISAT2 or Bowtie workflows and cannot be integrated out of the box into other graph construction workflows, such as PGGB or vg. It also requires the input variants to be described in the 1ksnp format, which is neither as widespread nor as straightforward as VCF, so converting VCF to 1ksnp becomes an extra step.

Nor is the final ranking file generated by rank.py in a standard format, such as a sorted or filtered VCF file. This is where forgers comes into play, providing the logic needed to incorporate the FORGe model into broader workflows.

Introduction

This tool, named forgers (short for forge-rs), aims to apply the FORGe model to input VCF files and to support VCF manipulation operations based on FORGe ranking. One of the design decisions behind forgers is to work seamlessly with tools such as bcftools, so that users can pipe the VCF output of these tools into forgers, or vice versa, to build more complex variant filtration pipelines.

Usage

Currently, forgers supports two subcommands: filter and resolve.

Filter

Filter and/or annotate VCF records based on FORGe ranking

USAGE:
    forgers filter [FLAGS] [OPTIONS] [input]

FLAGS:
    -a, --annotate    Annotate the filtered records with FORGe rank
    -g, --gzip        Gzip output, detected by file extension by default
    -h, --help        Prints help information
    -V, --version     Prints version information
    -v, --verbose     Enable verbose mode

OPTIONS:
    -f, --forge-rank <forge-rank>    FORGe rank file [default: ordered.txt]
    -k, --info-key <info-key>        Annotate key for INFO field [default: FORGE]
    -o, --output <output>            Output file, stdout if not specified [default: -]
    -t, --top <top>                  Top fraction of records to keep, keeps all by default [default: 1.0]

ARGS:
    <input>    Input VCF file, stdin if not specified [default: -]
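To make the effect of --top, --annotate, and --info-key concrete, here is a rough Python sketch of rank-based filtering over VCF lines. It is not the forgers implementation. The in-memory `ranks` mapping (keyed by chromosome, position, REF, and ALT, with 1 as the best rank) is an assumption, since the layout of the FORGe rank file is not described here. Dropping records that have no rank is likewise an assumption, and `filter_vcf_lines` is a hypothetical name.

```python
def filter_vcf_lines(lines, ranks, top=1.0, annotate=False, info_key="FORGE"):
    """Keep records whose rank falls within the top fraction.

    `ranks` maps (chrom, pos, ref, alt) to an integer rank, 1 = best.
    Records without a rank are dropped (an assumption of this sketch).
    """
    cutoff = int(len(ranks) * top)
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#CHROM") and annotate:
            # Declare the new INFO key before the column header line.
            out.append(
                f'##INFO=<ID={info_key},Number=1,Type=Integer,'
                f'Description="FORGe rank">'
            )
        if line.startswith("#"):
            out.append(line)
            continue
        fields = line.split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        rank = ranks.get((chrom, int(pos), ref, alt))
        if rank is None or rank > cutoff:
            continue
        if annotate:
            tag = f"{info_key}={rank}"
            info = fields[7]
            fields[7] = tag if info in (".", "") else f"{info};{tag}"
        out.append("\t".join(fields))
    return out


lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.",
    "chr1\t250\t.\tC\tT\t.\tPASS\tAF=0.45",
    "chr1\t300\t.\tG\tA\t.\tPASS\tAF=0.10",
]
ranks = {
    ("chr1", 250, "C", "T"): 1,
    ("chr1", 300, "G", "A"): 2,
    ("chr1", 100, "A", "G"): 3,
}
for record in filter_vcf_lines(lines, ranks, top=0.7, annotate=True):
    print(record)
```

With three ranked variants and a top fraction of 0.7, the cutoff is rank 2, so the record at position 100 is dropped and the two remaining records gain a FORGE entry in their INFO column.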

Resolve

Resolve overlapping variants based on FORGe ranking; i.e. when the variants in a cluster conflict, remove them and keep the higher-ranked one in their place. When phasing information is available, it is used to determine whether two overlapping variants co-occur in any sample.

USAGE:
    forgers resolve [FLAGS] [OPTIONS] [input]

FLAGS:
    -g, --gzip       Gzip output, detected by file extension by default
    -h, --help       Prints help information
    -V, --version    Prints version information
    -v, --verbose    Enable verbose mode

OPTIONS:
    -f, --forge-rank <forge-rank>    FORGe rank file [default: ordered.txt]
    -o, --output <output>            Output file, stdout if not specified [default: -]

ARGS:
    <input>    Input VCF file, stdin if not specified [default: -]
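The sketch below illustrates one way overlap resolution of this kind could work. It rests on several assumptions that are not confirmed by forgers itself: two overlapping variants are treated as conflicting when phased genotypes place both ALT alleles on the same haplotype of some sample, any overlap is treated as a conflict when phasing is missing, and clusters are resolved greedily in rank order (1 = best). All names in it are hypothetical.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    pos: int                          # 1-based start, as in VCF
    ref: str                          # REF allele; its length defines the span
    rank: int                         # FORGe rank, 1 = best (assumed convention)
    genotypes: Optional[list] = None  # per-sample GT strings, e.g. "0|1"


def overlaps(a, b):
    """True if the reference spans of the two variants intersect."""
    a_end = a.pos + len(a.ref) - 1
    b_end = b.pos + len(b.ref) - 1
    return a.pos <= b_end and b.pos <= a_end


def conflicting(a, b):
    """Assumed rule: overlapping variants conflict if they may share a haplotype."""
    if not overlaps(a, b):
        return False
    if a.genotypes is None or b.genotypes is None:
        return True  # no genotypes at all: be conservative
    for ga, gb in zip(a.genotypes, b.genotypes):
        if "|" not in ga or "|" not in gb:
            return True  # unphased sample: co-occurrence cannot be ruled out
        for ha, hb in zip(ga.split("|"), gb.split("|")):
            if ha not in ("0", ".") and hb not in ("0", "."):
                return True  # both ALT alleles sit on the same haplotype
    return False


def resolve(variants):
    """Greedily keep the best-ranked variant of each conflicting group."""
    kept = []
    for v in sorted(variants, key=lambda v: v.rank):
        if not any(conflicting(v, k) for k in kept):
            kept.append(v)
    return sorted(kept, key=lambda v: v.pos)


variants = [
    Variant(100, "ACGT", 1, ["0|1", "0|0"]),
    Variant(102, "G", 2, ["0|1", "0|0"]),  # same haplotype as the first: dropped
    Variant(103, "T", 3, ["1|0", "0|0"]),  # other haplotype: kept
    Variant(200, "A", 4),                  # no overlap: kept
]
print([v.pos for v in resolve(variants)])  # [100, 103, 200]
```

The greedy pass is a simplification chosen for brevity; a cluster-based strategy could produce different results on longer chains of overlapping variants.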
