#graph #transcript #segment #path #node #edge #chains

app tsg

A tool to analyze and manipulate transcript segment graphs (TSG)

2 releases

new 0.1.1 Mar 5, 2025
0.1.0 Mar 5, 2025

#181 in Parser implementations

Apache-2.0

99KB
2.5K SLoC

TSG - Transcript Segment Graph

TSG is a Rust library and command-line tool for creating, manipulating, and analyzing transcript segment graphs. It provides a comprehensive framework for modeling segmented transcript data, analyzing alternative splicing events, and working with genomic structural variants.

Features

  • Parse and write TSG format files
  • Build and manipulate transcript segment graphs
  • Analyze paths and connectivity between transcript segments
  • Support for various element types: nodes, edges, groups, and chains
  • Export graphs to DOT format for visualization
  • Traverse the graph to identify valid transcript paths
  • Track read identity to ensure biological validity
  • Build graphs from chains and validate path traversals
  • Support for genomic coordinates with strand information
  • Support for read evidence with types

Installation

Library

Add this to your Cargo.toml:

[dependencies]
tsg = "0.1.0"

Command-line Tool

Install the CLI tool:

cargo install tsg-cli

Library Usage

Loading a TSG file

use tsg::graph::TSGraph;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load graph from a TSG file
    let graph = TSGraph::from_file("path/to/file.tsg")?;

    // Access graph elements
    println!("Number of nodes: {}", graph.get_nodes().len());
    println!("Number of edges: {}", graph.get_edges().len());

    // Export to DOT format for visualization
    let dot = graph.to_dot()?;
    std::fs::write("graph.dot", dot)?;

    // Write the graph back out in TSG format
    graph.write_to_file("output.tsg")?;

    Ok(())
}

Creating a Graph Programmatically

use tsg::graph::{TSGraph, NodeData, EdgeData};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut graph = TSGraph::new();

    // Add nodes
    let node1 = NodeData {
        id: "node1".into(),
        reference_id: "chr1".into(),
        ..Default::default()
    };

    let node2 = NodeData {
        id: "node2".into(),
        reference_id: "chr1".into(),
        ..Default::default()
    };

    graph.add_node(node1)?;
    graph.add_node(node2)?;

    // Add an edge between nodes
    let edge = EdgeData {
        id: "edge1".into(),
        ..Default::default()
    };

    graph.add_edge("node1".into(), "node2".into(), edge)?;

    // Write to file
    graph.write_to_file("new_graph.tsg")?;

    Ok(())
}

Building a Graph from Chains

use tsg::graph::{TSGraph, Group};
use std::collections::HashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create chains
    let chains = vec![
        Group::Chain {
            id: "chain1".into(),
            elements: vec!["n1".into(), "e1".into(), "n2".into()],
            attributes: HashMap::new(),
        },
        Group::Chain {
            id: "chain2".into(),
            elements: vec!["n2".into(), "e2".into(), "n3".into()],
            attributes: HashMap::new(),
        },
    ];

    // Build graph from chains
    let graph = TSGraph::from_chains(chains)?;

    // Write to file
    graph.write_to_file("output.tsg")?;

    Ok(())
}

Finding Valid Paths Through the Graph

use tsg::graph::TSGraph;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let graph = TSGraph::from_file("transcript.tsg")?;

    // Find all valid paths through the graph
    let paths = graph.traverse()?;

    for (i, path) in paths.iter().enumerate() {
        println!("Path {}: {}", i + 1, path);
    }

    Ok(())
}

CLI Usage

The TSG command-line tool provides a convenient interface for common operations:

# Display help
tsg-cli --help

# Parse and validate a TSG file
tsg-cli validate path/to/file.tsg

# Convert a TSG file to DOT format for visualization
tsg-cli to-dot path/to/file.tsg > graph.dot

# Extract statistics from a TSG file
tsg-cli stats path/to/file.tsg

# Find all paths through the graph
tsg-cli paths path/to/file.tsg

TSG File Format

The TSG format is a tab-delimited text format representing transcript assemblies as graphs.

Record Types

Each line in a TSG file starts with a letter denoting the record type:

  • H - Header information
  • N - Node definition (exon or transcript segment)
  • E - Edge definition (splice junction or structural variant)
  • U - Unordered group (set of elements)
  • O - Ordered group (path through the graph)
  • C - Chain (alternating nodes and edges)
  • A - Attribute for any element (metadata)
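The record letters above can be dispatched on with a small classifier. This is a minimal standalone sketch that does not use the tsg crate; the `RecordType` enum and `record_type` function are hypothetical names introduced here for illustration. It also assumes that lines starting with `#` and blank lines are skipped, as the example file suggests, and it splits on any whitespace rather than strictly on tabs.

```rust
/// The seven record types listed above, keyed by the first field of a line.
/// (Hypothetical type for illustration; not part of the tsg crate.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RecordType {
    Header,
    Node,
    Edge,
    UnorderedGroup,
    OrderedGroup,
    Chain,
    Attribute,
}

/// Classify one line by its leading field.
/// Returns `None` for blank lines, `#` comments, and unknown record letters.
fn record_type(line: &str) -> Option<RecordType> {
    // Assumption: fields are separated by whitespace (tabs in a real file).
    let tag = line.split_whitespace().next()?;
    match tag {
        "H" => Some(RecordType::Header),
        "N" => Some(RecordType::Node),
        "E" => Some(RecordType::Edge),
        "U" => Some(RecordType::UnorderedGroup),
        "O" => Some(RecordType::OrderedGroup),
        "C" => Some(RecordType::Chain),
        "A" => Some(RecordType::Attribute),
        _ => None,
    }
}

fn main() {
    let text = "# Nodes\nN\tn1\tchr1:+:1000-1200\tread1:SO\nC\tchain1\tn1 e1 n2\n";
    for line in text.lines() {
        match record_type(line) {
            Some(kind) => println!("{kind:?}: {line}"),
            None => println!("skipped: {line}"),
        }
    }
}
```

Returning `Option` keeps comment and blank-line handling in one place, so a caller can filter records with `filter_map` before parsing the remaining fields.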

Conceptual Model

In the TSG model:

  1. Chains (C) are used to build the graph structure. They define the nodes and edges that make up the graph.
  2. Paths (O) are traversals through the constructed graph.
  3. The complete TSG is built by combining all nodes and edges from all chains.
  4. After constructing the graph from chains, paths can be defined to represent ways of traversing the graph.

This distinction is important: chains define what the graph is, while paths define ways to traverse the graph.
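The chain/path distinction can be illustrated with a small standalone model. This sketch does not use the tsg crate and is not its implementation: `ChainGraph`, `add_chain`, `is_valid_path`, and `strip_orientation` are hypothetical names. It assumes a chain always alternates node, edge, node, and that a trailing `+` or `-` on a path element is an orientation marker, as in the example file.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A graph assembled from chains (hypothetical type for illustration).
#[derive(Debug, Default)]
struct ChainGraph {
    nodes: BTreeSet<String>,
    /// edge id -> (source node id, sink node id)
    edges: BTreeMap<String, (String, String)>,
}

/// Drop one trailing orientation marker, if present (assumed syntax).
fn strip_orientation(element: &str) -> &str {
    element
        .strip_suffix('+')
        .or_else(|| element.strip_suffix('-'))
        .unwrap_or(element)
}

impl ChainGraph {
    /// Merge one chain (node, edge, node, ..., node) into the graph.
    fn add_chain(&mut self, elements: &[&str]) -> Result<(), String> {
        if elements.len() % 2 == 0 {
            return Err("a chain must start and end with a node".to_string());
        }
        // Even positions hold node IDs.
        for node in elements.iter().step_by(2) {
            self.nodes.insert(node.to_string());
        }
        // Every node-edge-node window starting at an even index defines an edge.
        for w in elements.windows(3).step_by(2) {
            self.edges
                .insert(w[1].to_string(), (w[0].to_string(), w[2].to_string()));
        }
        Ok(())
    }

    /// A path is valid if it only uses known nodes and each edge
    /// connects the nodes written on either side of it.
    fn is_valid_path(&self, elements: &[&str]) -> bool {
        let ids: Vec<&str> = elements.iter().copied().map(strip_orientation).collect();
        if ids.len() % 2 == 0 {
            return false;
        }
        ids.iter().step_by(2).all(|n| self.nodes.contains(*n))
            && ids.windows(3).step_by(2).all(|w| {
                self.edges
                    .get(w[1])
                    .map_or(false, |(src, dst)| src == w[0] && dst == w[2])
            })
    }
}

fn main() -> Result<(), String> {
    let mut graph = ChainGraph::default();
    // Chains define what the graph is.
    graph.add_chain(&["n1", "e1", "n2", "e2", "n3"])?;
    // Paths are checked against the graph that the chains produced.
    let path = ["n1+", "e1+", "n2+", "e2+", "n3+"];
    println!("path valid: {}", graph.is_valid_path(&path));
    Ok(())
}
```

Adding a chain mutates the graph, while checking a path only reads it, which mirrors the split between C and O records described above.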

Example

# Header information
H  TSG  1.0
H  reference  GRCh38

# Nodes (exons)
N  n1  chr1:+:1000-1200,1500-1700  read1:SO,read2:SO  ACGTACGT
N  n2  chr1:+:2000-2200  read4:SO,read5:SO  TGCATGCA
N  n3  chr1:+:2500-2700  read1:IN,read2:IN,read3:IN,read4:IN  CTGACTGA

# Edges (splice junctions)
E  e1  n1  n2  chr1,chr1,1700,2000,splice
E  e2  n2  n3  chr1,chr1,2200,2500,splice

# Chains (building the graph)
C  chain1  n1 e1 n2 e2 n3

# Paths (traversals)
O  transcript1  n1+ e1+ n2+ e2+ n3+

# Sets (grouping elements)
U  exon_set  n1 n2 n3

# Attributes (metadata)
A  N  n1  expression:f:10.5
A  O  transcript1  tpm:f:8.2

Node Format

Nodes represent exons or transcript segments with the format:

N  <id>  <genomic_location>  <reads>  [<seq>]

Where:

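  • genomic_location has the format chromosome:strand:coordinates, where coordinates is a comma-separated list of start-end intervals (e.g., chr1:+:1000-1200,1500-1700)
  • reads is a comma-separated list of read IDs, each tagged with a read type (e.g., read1:SO,read2:IN)
  • Read types include:
    • SO: Source Node
    • IN: Intermediary Node
    • SI: Sink Node
  • seq is an optional nucleotide sequence for the node (e.g., ACGTACGT)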
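The two structured node fields can be split apart with the standard library alone. The following is a rough sketch, not the tsg crate's parser: `GenomicLocation`, `Strand`, `ReadType`, `parse_location`, and `parse_reads` are hypothetical names, and the code assumes the chromosome name contains no `:` and that coordinates are non-negative integers.

```rust
#[derive(Debug, PartialEq)]
enum Strand {
    Forward,
    Reverse,
}

#[derive(Debug, PartialEq)]
enum ReadType {
    Source,       // SO
    Intermediary, // IN
    Sink,         // SI
}

/// Parsed form of `chromosome:strand:coordinates` (hypothetical type).
#[derive(Debug, PartialEq)]
struct GenomicLocation {
    chromosome: String,
    strand: Strand,
    /// (start, end) pairs in the order they were written
    intervals: Vec<(u64, u64)>,
}

/// Parse e.g. `chr1:+:1000-1200,1500-1700`. Returns `None` on malformed input.
fn parse_location(field: &str) -> Option<GenomicLocation> {
    let mut parts = field.splitn(3, ':');
    let chromosome = parts.next()?.to_string();
    let strand = match parts.next()? {
        "+" => Strand::Forward,
        "-" => Strand::Reverse,
        _ => return None,
    };
    let intervals = parts
        .next()?
        .split(',')
        .map(|iv| -> Option<(u64, u64)> {
            let (start, end) = iv.split_once('-')?;
            Some((start.parse::<u64>().ok()?, end.parse::<u64>().ok()?))
        })
        .collect::<Option<Vec<(u64, u64)>>>()?;
    Some(GenomicLocation { chromosome, strand, intervals })
}

/// Parse e.g. `read1:SO,read2:IN` into (read id, read type) pairs.
fn parse_reads(field: &str) -> Option<Vec<(String, ReadType)>> {
    field
        .split(',')
        .map(|entry| -> Option<(String, ReadType)> {
            // Split at the last ':' so the type tag is always the final part.
            let (id, kind) = entry.rsplit_once(':')?;
            let kind = match kind {
                "SO" => ReadType::Source,
                "IN" => ReadType::Intermediary,
                "SI" => ReadType::Sink,
                _ => return None,
            };
            Some((id.to_string(), kind))
        })
        .collect()
}

fn main() {
    println!("{:?}", parse_location("chr1:+:1000-1200,1500-1700"));
    println!("{:?}", parse_reads("read1:SO,read2:IN"));
}
```

Collecting into `Option<Vec<_>>` makes a single malformed interval or read entry reject the whole field, which is usually the safer default for a validator.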

Edge Format

Edges represent splice junctions or structural variants:

E  <id>  <source_id>  <sink_id>  <SV>

Where:

  • SV has the format reference_name1,reference_name2,breakpoint1,breakpoint2,sv_type (e.g., chr1,chr1,1700,2000,splice)
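As a rough illustration of the SV field layout, the five comma-separated values can be parsed into a struct. This is a standalone sketch under stated assumptions, not the crate's own `StructuralVariant` type: `SvField` and `parse_sv` are hypothetical names, breakpoints are assumed to be non-negative integers, and `sv_type` is kept as a free-form string because the set of allowed types is not listed here.

```rust
/// Parsed form of the SV field (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct SvField {
    reference_name1: String,
    reference_name2: String,
    breakpoint1: u64,
    breakpoint2: u64,
    sv_type: String,
}

/// Parse e.g. `chr1,chr1,1700,2000,splice`.
fn parse_sv(field: &str) -> Result<SvField, String> {
    let parts: Vec<&str> = field.split(',').collect();
    if parts.len() != 5 {
        return Err(format!(
            "expected 5 comma-separated values, found {}",
            parts.len()
        ));
    }
    // Shared helper so both breakpoints report errors the same way.
    let breakpoint = |s: &str| {
        s.parse::<u64>()
            .map_err(|e| format!("invalid breakpoint {s:?}: {e}"))
    };
    Ok(SvField {
        reference_name1: parts[0].to_string(),
        reference_name2: parts[1].to_string(),
        breakpoint1: breakpoint(parts[2])?,
        breakpoint2: breakpoint(parts[3])?,
        sv_type: parts[4].to_string(),
    })
}

fn main() {
    match parse_sv("chr1,chr1,1700,2000,splice") {
        Ok(sv) => println!("{sv:?}"),
        Err(msg) => eprintln!("error: {msg}"),
    }
}
```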

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

Apache-2.0
