2 releases
Version | Date |
---|---|
0.1.1 (new) | Mar 5, 2025 |
0.1.0 | Mar 5, 2025 |
TSG - Transcript Segment Graph
TSG is a Rust library and command-line tool for creating, manipulating, and analyzing transcript segment graphs. It provides a comprehensive framework for modeling segmented transcript data, analyzing alternative splicing events, and working with genomic structural variants.
Features
- Parse and write TSG format files
- Build and manipulate transcript segment graphs
- Analyze paths and connectivity between transcript segments
- Support for various element types: nodes, edges, groups, and chains
- Export graphs to DOT format for visualization
- Traverse the graph to identify valid transcript paths
- Read identity tracking to ensure biological validity
- Build graphs from chains and validate path traversals
- Support for genomic coordinates with strand information
- Support for read evidence with types
Installation
Library
Add this to your Cargo.toml:
[dependencies]
tsg = "0.1.0"
Command-line Tool
Install the CLI tool:
cargo install tsg-cli
Library Usage
Loading a TSG file
use tsg::graph::TSGraph;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load graph from a TSG file
    let graph = TSGraph::from_file("path/to/file.tsg")?;

    // Access graph elements
    println!("Number of nodes: {}", graph.get_nodes().len());
    println!("Number of edges: {}", graph.get_edges().len());

    // Export to DOT format for visualization
    let dot = graph.to_dot()?;
    std::fs::write("graph.dot", dot)?;

    // Save the graph to a new file
    graph.write_to_file("output.tsg")?;
    Ok(())
}
Creating a Graph Programmatically
use tsg::graph::{TSGraph, NodeData, EdgeData};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut graph = TSGraph::new();

    // Add nodes
    let node1 = NodeData {
        id: "node1".into(),
        reference_id: "chr1".into(),
        ..Default::default()
    };
    let node2 = NodeData {
        id: "node2".into(),
        reference_id: "chr1".into(),
        ..Default::default()
    };
    graph.add_node(node1)?;
    graph.add_node(node2)?;

    // Add an edge between nodes
    let edge = EdgeData {
        id: "edge1".into(),
        ..Default::default()
    };
    graph.add_edge("node1".into(), "node2".into(), edge)?;

    // Write to file
    graph.write_to_file("new_graph.tsg")?;
    Ok(())
}
Building a Graph from Chains
use tsg::graph::{TSGraph, Group};
use std::collections::HashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create chains
    let chains = vec![
        Group::Chain {
            id: "chain1".into(),
            elements: vec!["n1".into(), "e1".into(), "n2".into()],
            attributes: HashMap::new(),
        },
        Group::Chain {
            id: "chain2".into(),
            elements: vec!["n2".into(), "e2".into(), "n3".into()],
            attributes: HashMap::new(),
        },
    ];

    // Build graph from chains
    let graph = TSGraph::from_chains(chains)?;

    // Write to file
    graph.write_to_file("output.tsg")?;
    Ok(())
}
Finding Valid Paths Through the Graph
use tsg::graph::TSGraph;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let graph = TSGraph::from_file("transcript.tsg")?;

    // Find all valid paths through the graph
    let paths = graph.traverse()?;
    for (i, path) in paths.iter().enumerate() {
        println!("Path {}: {}", i + 1, path);
    }
    Ok(())
}
CLI Usage
The TSG command-line tool provides a convenient interface for common operations:
# Display help
tsg-cli --help
# Parse and validate a TSG file
tsg-cli validate path/to/file.tsg
# Convert a TSG file to DOT format for visualization
tsg-cli to-dot path/to/file.tsg > graph.dot
# Extract statistics from a TSG file
tsg-cli stats path/to/file.tsg
# Find all paths through the graph
tsg-cli paths path/to/file.tsg
TSG File Format
The TSG format is a tab-delimited text format representing transcript assemblies as graphs.
Record Types
Each line in a TSG file starts with a letter denoting the record type:
- H - Header information
- N - Node definition (exon or transcript segment)
- E - Edge definition (splice junction or structural variant)
- U - Unordered group (set of elements)
- O - Ordered group (path through the graph)
- C - Chain (alternating nodes and edges)
- A - Attribute for any element (metadata)
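The record-type dispatch described above can be sketched with the standard library alone. This is an illustration, not the tsg crate's parser: the names RecordType and record_type are hypothetical, and skipping lines that start with # is an assumption based on the comment lines in the example file.

```rust
/// Record kinds, keyed by the first field of a line.
/// (Hypothetical type for illustration; not part of the tsg crate.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum RecordType {
    Header,
    Node,
    Edge,
    UnorderedGroup,
    OrderedGroup,
    Chain,
    Attribute,
}

/// Classify one line by its leading field.
/// Returns `None` for blank lines, `#` comment lines (assumed to be
/// comments), and unknown record codes.
fn record_type(line: &str) -> Option<RecordType> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // The format is tab-delimited; splitting on any whitespace also
    // accepts examples where tabs were rendered as spaces.
    match line.split_whitespace().next()? {
        "H" => Some(RecordType::Header),
        "N" => Some(RecordType::Node),
        "E" => Some(RecordType::Edge),
        "U" => Some(RecordType::UnorderedGroup),
        "O" => Some(RecordType::OrderedGroup),
        "C" => Some(RecordType::Chain),
        "A" => Some(RecordType::Attribute),
        _ => None,
    }
}
```

With this sketch, record_type("C chain1 n1 e1 n2") yields Some(RecordType::Chain), while a comment line or an unknown code yields None.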
Conceptual Model
In the TSG model:
- Chains (C) are used to build the graph structure. They define the nodes and edges that make up the graph.
- Paths (O) are traversals through the constructed graph.
- The complete TSG is built by combining all nodes and edges from all chains.
- After constructing the graph from chains, paths can be defined to represent ways of traversing the graph.
This distinction is important: chains define what the graph is, while paths define ways to traverse the graph.
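To make the chain/path distinction concrete, here is a minimal standard-library sketch. It assumes a chain is a list of IDs alternating node, edge, node, and it ignores the orientation suffixes (such as the + in n1+) that paths carry in the file format. ToyGraph, graph_from_chains, and is_valid_path are hypothetical names, not the tsg crate's API.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A toy graph: node IDs plus edges stored as edge ID -> (source, sink).
#[derive(Debug, Default)]
struct ToyGraph {
    nodes: BTreeSet<String>,
    edges: BTreeMap<String, (String, String)>,
}

/// Build the graph by merging the nodes and edges of every chain.
fn graph_from_chains(chains: &[Vec<&str>]) -> Result<ToyGraph, String> {
    let mut graph = ToyGraph::default();
    for chain in chains {
        // An alternating chain that starts and ends on a node has odd length.
        if chain.len() % 2 == 0 {
            return Err(format!("chain must start and end with a node: {:?}", chain));
        }
        // Even positions hold node IDs.
        for node in chain.iter().step_by(2) {
            graph.nodes.insert(node.to_string());
        }
        // Every [node, edge, node] window at an even offset defines one edge.
        for triple in chain.windows(3).step_by(2) {
            graph.edges.insert(
                triple[1].to_string(),
                (triple[0].to_string(), triple[2].to_string()),
            );
        }
    }
    Ok(graph)
}

/// A path is valid if it only walks existing nodes and each edge
/// actually joins the nodes on either side of it.
fn is_valid_path(graph: &ToyGraph, path: &[&str]) -> bool {
    if path.len() % 2 == 0 {
        return false;
    }
    if !path.iter().step_by(2).all(|n| graph.nodes.contains(*n)) {
        return false;
    }
    path.windows(3).step_by(2).all(|t| {
        graph
            .edges
            .get(t[1])
            .map_or(false, |(src, dst)| src == t[0] && dst == t[2])
    })
}
```

In this sketch, the two chains n1 e1 n2 and n2 e2 n3 merge into one graph with three nodes and two edges, and the path n1 e1 n2 e2 n3 is then a valid traversal even though no single chain spells it out.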
Example
# Header information
H TSG 1.0
H reference GRCh38
# Nodes (exons)
N n1 chr1:+:1000-1200,1500-1700 read1:SO,read2:SO ACGTACGT
N n2 chr1:+:2000-2200 read4:SO,read5:SO TGCATGCA
N n3 chr1:+:2500-2700 read1:IN,read2:IN,read3:IN,read4:IN CTGACTGA
# Edges (splice junctions)
E e1 n1 n2 chr1,chr1,1700,2000,splice
E e2 n2 n3 chr1,chr1,2200,2500,splice
# Chains (building the graph)
C chain1 n1 e1 n2 e2 n3
# Paths (traversals)
O transcript1 n1+ e1+ n2+ e2+ n3+
# Sets (grouping elements)
U exon_set n1 n2 n3
# Attributes (metadata)
A N n1 expression:f:10.5
A O transcript1 tpm:f:8.2
Node Format
Nodes represent exons or transcript segments with the format:
N <id> <genomic_location> <reads> [<seq>]
Where:
- genomic_location is in format chromosome:strand:coordinates (e.g., chr1:+:1000-1200,1500-1700)
- reads is a comma-separated list of read IDs with types (e.g., read1:SO,read2:IN)
- Read types include:
  - SO: Source Node
  - IN: Intermediary Node
  - SI: Sink Node
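As an illustration of the two node fields, the following standard-library sketch parses a genomic_location and a reads list. The type and function names are hypothetical and do not come from the tsg crate. Accepting - as a reverse strand is an assumption; the examples in this document only show +.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Strand {
    Forward,
    Reverse, // assumed: only "+" appears in the examples
}

#[derive(Debug, PartialEq, Eq)]
struct GenomicLocation {
    chromosome: String,
    strand: Strand,
    /// (start, end) pairs in the order they were written.
    intervals: Vec<(u64, u64)>,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ReadType {
    Source,       // SO
    Intermediary, // IN
    Sink,         // SI
}

/// Parse `chromosome:strand:start-end[,start-end...]`.
fn parse_location(s: &str) -> Result<GenomicLocation, String> {
    let mut parts = s.splitn(3, ':');
    let chromosome = parts
        .next()
        .filter(|c| !c.is_empty())
        .ok_or("missing chromosome")?;
    let strand = match parts.next() {
        Some("+") => Strand::Forward,
        Some("-") => Strand::Reverse,
        other => return Err(format!("unsupported strand: {:?}", other)),
    };
    let coords = parts.next().ok_or("missing coordinates")?;

    let mut intervals = Vec::new();
    for range in coords.split(',') {
        let (start, end) = range
            .split_once('-')
            .ok_or_else(|| format!("bad range: {}", range))?;
        let start: u64 = start.parse().map_err(|e| format!("bad start: {}", e))?;
        let end: u64 = end.parse().map_err(|e| format!("bad end: {}", e))?;
        intervals.push((start, end));
    }
    Ok(GenomicLocation {
        chromosome: chromosome.to_string(),
        strand,
        intervals,
    })
}

/// Parse `id:TYPE[,id:TYPE...]` into (read ID, read type) pairs.
fn parse_reads(s: &str) -> Result<Vec<(String, ReadType)>, String> {
    let mut reads = Vec::new();
    for item in s.split(',') {
        // Split on the last ':' so the type code is always the final part.
        let (id, code) = item
            .rsplit_once(':')
            .ok_or_else(|| format!("bad read entry: {}", item))?;
        let kind = match code {
            "SO" => ReadType::Source,
            "IN" => ReadType::Intermediary,
            "SI" => ReadType::Sink,
            other => return Err(format!("unknown read type: {}", other)),
        };
        reads.push((id.to_string(), kind));
    }
    Ok(reads)
}
```

Under these assumptions, parse_location("chr1:+:1000-1200,1500-1700") produces a forward-strand location on chr1 with two intervals, and parse_reads("read1:SO,read2:IN") produces one source read and one intermediary read.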
Edge Format
Edges represent splice junctions or structural variants:
E <id> <source_id> <sink_id> <SV>
Where:
- SV is in format reference_name1,reference_name2,breakpoint1,breakpoint2,sv_type
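The SV field can be split into its five parts as sketched below, using only the standard library. SvField and parse_sv are hypothetical names rather than the tsg crate's types. Treating breakpoints as unsigned integers is an assumption drawn from the example values.

```rust
/// The five comma-separated parts of an edge's SV field.
/// (Hypothetical struct for illustration; not the crate's own type.)
#[derive(Debug, PartialEq, Eq)]
struct SvField {
    reference_name1: String,
    reference_name2: String,
    breakpoint1: u64,
    breakpoint2: u64,
    sv_type: String,
}

/// Parse `reference_name1,reference_name2,breakpoint1,breakpoint2,sv_type`.
fn parse_sv(field: &str) -> Result<SvField, String> {
    let parts: Vec<&str> = field.split(',').collect();
    match parts.as_slice() {
        // Exactly five parts are expected.
        [ref1, ref2, bp1, bp2, sv_type] => Ok(SvField {
            reference_name1: ref1.to_string(),
            reference_name2: ref2.to_string(),
            breakpoint1: bp1
                .parse()
                .map_err(|e| format!("bad breakpoint1 {:?}: {}", bp1, e))?,
            breakpoint2: bp2
                .parse()
                .map_err(|e| format!("bad breakpoint2 {:?}: {}", bp2, e))?,
            sv_type: sv_type.to_string(),
        }),
        _ => Err(format!("expected 5 comma-separated parts, got {}", parts.len())),
    }
}
```

For the edge e1 in the example file, parse_sv("chr1,chr1,1700,2000,splice") yields breakpoints 1700 and 2000 with sv_type "splice"; a field with a missing part or a non-numeric breakpoint is rejected.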
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License