21 releases

new 0.8.5 Apr 6, 2025
0.7.7 Dec 27, 2024
0.7.5 Sep 20, 2024
0.7.2 May 28, 2024
0.5.5 Mar 4, 2022

#49 in Biology


769 downloads per month

MIT and GPL-3.0 licenses

5MB
7.5K SLoC

Rust 6K SLoC // 0.1% comments Shell 1K SLoC // 0.1% comments Perl 225 SLoC // 0.1% comments

nwr


nwr is a command line tool for working with NCBI taxonomy, NeWick files and assembly Reports, written in Rust.

Install

Current release: 0.8.5

cargo install nwr

# or
cargo install --path . --force # --offline

# Concurrent tests may trigger sqlite locking
cargo test -- --test-threads=1

# build under WSL 2
mkdir -p /tmp/cargo
export CARGO_TARGET_DIR=/tmp/cargo
cargo build

# build for CentOS 7
# rustup target add x86_64-unknown-linux-gnu
# pip3 install cargo-zigbuild
cargo zigbuild --target x86_64-unknown-linux-gnu.2.17 --release
ls -l $CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/

nwr help

$ nwr help
`nwr` is a command line tool for working with NCBI taxonomy, Newick files and assembly reports

Usage: nwr [COMMAND]

Commands:
  download     Download the latest releases of `taxdump` and assembly reports
  txdb         Init the taxonomy database
  ardb         Init the assembly database
  info         Information of Taxonomy ID(s) or scientific name(s)
  lineage      Output the lineage of the term
  member       List members (of certain ranks) under ancestral term(s)
  append       Append fields of higher ranks to a TSV file
  restrict     Restrict taxonomy terms to ancestral descendants
  common       Output the common tree of terms
  template     Create dirs, data and scripts for a phylogenomic research
  kb           Prints docs (knowledge bases)
  seqdb        Init the seq database
  data         Newick data commands
  ops          Newick operation commands
  viz          Newick visualization commands
  mat          Distance matrix commands
  pl-condense  Pipeline - condense subtrees based on taxonomy
  help         Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version


Subcommand groups:

* Database
    * download / txdb / ardb
* Taxonomy
    * info / lineage / member / append / restrict / common
* Assembly
    * template / kb / seqdb
* Newick
    * data label / data stat / data distance
    * Operations
        * ops order / ops rename / ops replace / ops topo / ops subtree /
          ops prune / ops reroot
    * Visualization
        * viz indent / viz comment / viz tex
    * pl-condense
* Distance matrix
    * mat pair / mat phylip / mat format / mat subset / mat compare

$ nwr data help
Newick data commands

Usage: nwr data <COMMAND>

Commands:
  label     Labels in the Newick file
  stat      Statistics about the Newick file
  distance  Output a TSV/phylip file with distances between all named nodes
  help      Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
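
To make the idea behind `label` concrete, here is a rough sketch that lists the names in a simple Newick string using standard Unix tools. It is not how nwr parses Newick; the function name `newick_labels` is hypothetical, and the sketch assumes unquoted labels and no `[comments]` in the tree.

```shell
# List every node name in a simple Newick string, one per line.
# Assumption: labels are unquoted and the tree carries no [comments].
newick_labels() {
    tr -d '\n' |
        sed -e 's/:[0-9.eE+-]*//g' |   # drop branch lengths such as ":0.12"
        tr '(),;' '\n\n\n\n' |         # split on the structural characters
        grep -v '^$'                   # skip the empty fields left behind
}

echo "((A:1,B:1)D:1,C:1)E;" | newick_labels
```

Named internal nodes (D and E here) are listed along with the leaves. For real trees, `nwr data label` is the right tool.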

$ nwr ops help
Newick operation commands

Usage: nwr ops <COMMAND>

Commands:
  order    Order nodes in a Newick file
  rename   Rename named/unnamed nodes in a Newick file
  replace  Replace node names/comments in a Newick file
  subtree  Extract a subtree
  topo     Topological information of the Newick file
  prune    Remove nodes from the Newick file
  reroot   Place the root in the middle of the desired node and its parent
  help     Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

$ nwr viz help
Newick visualization commands

Usage: nwr viz <COMMAND>

Commands:
  indent   Indent the Newick file
  comment  Add comments to node(s) in a Newick file
  tex      Visualize the Newick tree via LaTeX
  help     Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

$ nwr mat help
Distance matrix commands

Usage: nwr mat <COMMAND>

Commands:
  compare  Compare two distance matrices
  format   Convert between different PHYLIP matrix formats
  pair     Convert a PHYLIP distance matrix to pairwise distances
  phylip   Convert pairwise distances to a phylip distance matrix
  subset   Extract a submatrix from a PHYLIP matrix using a list of names
  help     Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
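
As an illustration of what converting a PHYLIP matrix to pairwise distances involves, the sketch below walks a small matrix with awk. The function name `phylip_to_pairs` is hypothetical, and the output layout (one line per unordered pair, no self-distances) is an assumption; the actual output of `nwr mat pair` may differ.

```shell
# Convert a PHYLIP distance matrix (full or lower-triangular) read from
# stdin into tab-separated "name1 name2 distance" lines.
# Assumption: names contain no whitespace.
phylip_to_pairs() {
    awk '
        NR == 1 { next }    # the first line holds the number of taxa
        NF > 0 {
            n++
            name[n] = $1
            # Field j+1 is the distance to the j-th taxon; emit only the
            # taxa seen on earlier rows so each pair appears once.
            for (j = 1; j < n && j + 1 <= NF; j++)
                printf "%s\t%s\t%s\n", name[j], $1, $(j + 1)
        }'
}

printf '3\nA 0 1 2\nB 1 0 3\nC 2 3 0\n' | phylip_to_pairs
```

Because only the fields to the left of the diagonal are read, the same function accepts a lower-triangular matrix.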

Examples

Usage of each command

For practical uses of nwr and other awesome companions, follow this page.

nwr download

nwr txdb

nwr info "Homo sapiens" 4932

nwr lineage "Homo sapiens"
nwr lineage 4932

nwr restrict "Vertebrata" -c 2 -f tests/nwr/taxon.tsv
##sci_name       tax_id
#Human   9606

nwr member "Homo"

nwr append tests/nwr/taxon.tsv -c 2 -r species -r family --id

nwr ardb
nwr ardb --genbank

nwr common "Escherichia coli" 4932 Drosophila_melanogaster 9606 Mus_musculus

# rm ~/.nwr/*.dmp

Development

cargo test --color=always --package nwr --test cli_nwr command_template -- --show-output

# Downloads are slow in debug builds, so use the release build
cargo run --release --bin nwr download

# tests/nwr/
cargo run --bin nwr txdb -d tests/nwr/

cargo run --bin nwr info -d tests/nwr/ --tsv Viruses "Actinophage JHJ-1" "Bacillus phage bg1"

cargo run --bin nwr common -d tests/nwr/ "Actinophage JHJ-1" "Bacillus phage bg1"

cargo run --bin nwr template tests/assembly/Trichoderma.assembly.tsv --ass -o stdout

seqdb

export SPECIES="$HOME/data/Archaea/Protein/Sulfolobus_acidocaldarius"

cargo run --bin nwr seqdb -d ${SPECIES} --init --strain

cargo run --bin nwr seqdb -d ${SPECIES} \
    --size <(
        hnsm size ${SPECIES}/pro.fa.gz
    ) \
    --clust

cargo run --bin nwr seqdb -d ${SPECIES} \
    --anno <(
        gzip -dcf "${SPECIES}"/anno.tsv.gz
    ) \
    --asmseq <(
        gzip -dcf "${SPECIES}"/asmseq.tsv.gz
    )

cargo run --bin nwr seqdb -d ${SPECIES} --rep f1="${SPECIES}"/fam88_cluster.tsv

echo "
    SELECT
        *
    FROM asm
    WHERE 1=1
    " |
    sqlite3 -tabs ${SPECIES}/seq.sqlite

echo "
    SELECT
        COUNT(distinct asm_seq.asm_id)
    FROM asm_seq
    WHERE 1=1
    " |
    sqlite3 -tabs ${SPECIES}/seq.sqlite

echo "
.header ON
    SELECT
        'species' AS species,
        COUNT(distinct asm_seq.asm_id) AS strain,
        COUNT(*) AS total,
        COUNT(distinct rep_seq.seq_id) AS dedup,
        COUNT(distinct rep_seq.rep_id) AS rep
    FROM asm_seq
    JOIN rep_seq ON asm_seq.seq_id = rep_seq.seq_id
    WHERE 1=1
    " |
    sqlite3 -tabs ${SPECIES}/seq.sqlite


Newick files and LaTeX

For more detailed usage, see this file.

Get data from the tree

# List all names
nwr data label tests/newick/hg38.7way.nwk

# The intersection between the nodes in the tree and the provided names
nwr data label tests/newick/hg38.7way.nwk -r "^ch" -n Mouse -n foo
nwr data label tests/newick/catarrhini.nwk -n Homo -n Pan -n Gorilla -M
# Is Pongo the sibling of Homininae?
nwr data label tests/newick/catarrhini.nwk -n Homininae -n Pongo -DM
# All leaves belonging to Hominidae
nwr data label tests/newick/catarrhini.nwk -t Hominidae -I

nwr data label tests/newick/catarrhini.nwk -c dup
nwr data label tests/newick/catarrhini.comment.nwk -c full

nwr data stat tests/newick/hg38.7way.nwk

# Various distances
nwr data distance -m root -I tests/newick/catarrhini.nwk
nwr data distance -m parent -I tests/newick/catarrhini.nwk
nwr data distance -m pairwise -I tests/newick/catarrhini.nwk
nwr data distance -m lca -I tests/newick/catarrhini.nwk

nwr data distance -m root -L tests/newick/catarrhini_topo.nwk

# Phylip distance matrix
nwr data distance -m phylip tests/newick/catarrhini.nwk

Operations of the tree

echo "((A,B),C);" | nwr ops order --ndr stdin
nwr ops order --nd tests/newick/hg38.7way.nwk

nwr ops order --list tests/newick/abcde.list tests/newick/abcde.nwk

# Order the gene tree to match the species tree
nwr ops order tests/newick/pmxc.nwk \
    --list <(nwr data label tests/newick/species.nwk)

nwr ops rename tests/newick/abc.nwk -n C -r F -l A,B -r D

nwr ops replace tests/newick/abc.nwk tests/newick/abc.replace.tsv
nwr ops replace tests/newick/abc.nwk tests/newick/abc3.replace.tsv

nwr ops topo tests/newick/catarrhini.nwk

# The behavior is very similar to `nwr data label`, but outputs a subtree instead of labels
nwr ops subtree tests/newick/hg38.7way.nwk -n Human -n Rhesus -r "^ch" -M

# Condense the subtree to a node
nwr ops subtree tests/newick/hg38.7way.nwk -n Human -n Rhesus -r "^ch" -M -c Primates

nwr ops subtree tests/newick/catarrhini.nwk -t Hominidae

nwr ops prune tests/newick/catarrhini.nwk -n Homo -n Pan

echo "((A:1,B:1)D:1,C:1)E;" |
    nwr ops reroot stdin -n B
nwr ops reroot tests/newick/catarrhini_wrong.nwk -n Cebus

nwr ops reroot tests/newick/bs.nw -n C

nwr viz tex tests/newick/bs.nw | tectonic -
mv texput.pdf bs.pdf
nwr ops reroot tests/newick/bs.nw -n C | nwr viz tex stdin | tectonic -
mv texput.pdf bs.reroot.pdf

cargo run --bin nwr pl-condense tests/newick/catarrhini.nwk -r family

Visualization of the tree

nwr viz indent tests/newick/hg38.7way.nwk --text ".   "

echo "((A,B),C);" |
    nwr viz comment stdin -n A -n C --color green |
    nwr viz comment stdin -l A,B --dot

tectonic doc/template.tex

echo "((A[color=green],B)[dot=black],C[color=green]);" |
    cargo run --bin nwr viz comment stdin -r "color="

nwr viz tex tests/newick/catarrhini.nwk -o output.tex
tectonic output.tex

nwr viz tex --bl tests/newick/hg38.7way.nwk

nwr viz tex --forest --bare tests/newick/test.forest

nwr common "Escherichia coli" 4932 Drosophila_melanogaster 9606 "Mus musculus" |
    nwr viz tex --bare stdin
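
The indentation idea behind `nwr viz indent` can be sketched with a few lines of awk. This is a simplified illustration and not nwr's implementation: the function name `newick_indent` is hypothetical, it assumes one tree per line with unquoted labels, and the exact layout nwr produces may differ.

```shell
# Print a one-line Newick tree with one node per line, indented by depth.
# The first argument is the indent unit (defaults to two spaces).
newick_indent() {
    awk -v pad="${1:-  }" '
    # Build the indent string for a given depth.
    function indent(d,    s, k) {
        s = ""
        for (k = 0; k < d; k++) s = s pad
        return s
    }
    {
        out = ""; depth = 0
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == "(") {            # open a clade: go one level deeper
                depth++
                out = out c "\n" indent(depth)
            } else if (c == ",") {     # next sibling: same level
                out = out c "\n" indent(depth)
            } else if (c == ")") {     # close a clade: back one level
                depth--
                out = out "\n" indent(depth) c
            } else {
                out = out c
            }
        }
        print out
    }'
}

echo "((A,B),C);" | newick_indent ".   "
```

Passing a visible indent unit such as `".   "` makes the nesting depth easy to read, which is the purpose of the `--text` option in the command above.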

Matrix commands

nwr mat phylip tests/clust/IBPA.fa.tsv

nwr mat pair tests/clust/IBPA.phy

cargo run --bin nwr mat format tests/clust/IBPA.phy

cargo run --bin nwr mat subset tests/clust/IBPA.phy tests/clust/IBPA.list

hnsm distance tests/clust/IBPA.fa -k 7 -w 1 |
    nwr mat phylip stdin -o tests/clust/IBPA.71.phy

cargo run --bin nwr mat compare tests/clust/IBPA.phy tests/clust/IBPA.71.phy --method all
# Sequences in matrices: 10 and 10
# Common sequences: 10
# Method  Score
# pearson 0.935803
# spearman        0.919631
# mae     0.113433
# cosine  0.978731
# jaccard 0.759106
# euclid  1.229844
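
For readers unfamiliar with the scores above, the sketch below computes two of them, the Pearson correlation and the mean absolute error (MAE), from paired distances using the textbook formulas. The function name `compare_pairs` and the two-column input layout are assumptions made for illustration; how `mat compare` pairs up matrix entries and defines each score internally is not shown in this document and may differ.

```shell
# Read tab-separated pairs (x, y) from stdin and print Pearson's r and
# the mean absolute error, using the standard textbook definitions.
compare_pairs() {
    awk -F'\t' '
        {
            n++
            sx += $1; sy += $2
            sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2
            d = $1 - $2
            sad += (d < 0 ? -d : d)      # accumulate |x - y|
        }
        END {
            if (n == 0) exit 1
            vx = sxx - sx * sx / n       # n times the variance of x
            vy = syy - sy * sy / n       # n times the variance of y
            # Pearson is undefined when either column is constant.
            if (vx > 0 && vy > 0)
                printf "pearson\t%.6f\n", (sxy - sx * sy / n) / sqrt(vx * vy)
            printf "mae\t%.6f\n", sad / n
        }'
}

printf '1\t2\n2\t4\n3\t6\n' | compare_pairs
```

In this toy input the second column is exactly twice the first, so the correlation is perfect while the MAE is not zero. This shows why the two scores are reported side by side.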

Database schema

brew install k1LoW/tap/tbls

tbls doc sqlite://./tests/nwr/taxonomy.sqlite doc/txdb

tbls doc sqlite://./tests/nwr/ar_refseq.sqlite doc/ardb

