#ticker #symbols #stock #document #text-document #sniffer #extracting

bin+lib ticker-sniffer

A Rust crate for extracting multiple stock ticker symbols from a text document

25 releases

new 0.1.0-alpha9 Feb 13, 2025
0.1.0-alpha10 Feb 15, 2025
0.1.0-alpha.1n Jan 29, 2025

#62 in Compression

Download history 371/week @ 2025-01-14 1000/week @ 2025-01-21 325/week @ 2025-01-28 216/week @ 2025-02-04 1416/week @ 2025-02-11

3,004 downloads per month
Used in ticker-similarity-search

MIT license

675KB
1.5K SLoC

Ticker Sniffer (Work in Progress)

made-with-rust crates.io Documentation MIT licensed CI Pipeline

Ticker Sniffer is a Rust crate for extracting U.S. stock market ticker symbols from text. It analyzes content, identifies ticker references, and calculates their frequency, returning the results as a HashMap.

Use cases include extracting tickers from news articles and search queries.

Parsing is performed using a self-contained CSV file embedded in the binary using Gzip compression, which is generated automatically during the build process. No external CSV or file-reading dependencies are required in the final build, and it is fully compatible with WASM.

Install

cargo add ticker-sniffer

Usage

CLI

echo "E-commerce giant Amazon.com Inc. joined the blue-chip index, Dow Jones Industrial Average... Walmart, Amazon, Walmart" | RUST_LOG=debug cargo run

Output

AMZN: 2
WMT: 2
DIA: 1

Code Example

use ticker_sniffer::extract_tickers_from_text;
use ticker_sniffer::types::TickerSymbolFrequencyMap;

let text = r#"E-commerce giant Amazon.com Inc. joined the blue-chip index, 
Dow Jones Industrial Average, replacing drugstore operator Walgreens Boots 
Alliance on Feb 26. The reshuffle reflects the ongoing shift in economic 
power from traditional brick-and-mortar retail to e-commerce and 
technology-driven companies.

The inclusion of Amazon in the Dow marks a significant milestone in the 
recognition of the e-commerce giant's influence and its role in the broader 
market. The shift was prompted by Walmart's decision to execute a 3-to-1 
stock split, which has reduced its stock's weighting in the index.

The Dow is a price-weighted index. So, stocks that fetch higher prices are 
given more weight. Amazon's addition has increased consumer retail exposure 
within the index, alongside enhancing the representation of various other 
business sectors that Amazon engages in, including cloud computing, digital 
streaming, and artificial intelligence, among others.

Amazon took the 17th position in the index, while Walmart's weighting dropped 
to 26 from 17. UnitedHealth Group remained the most heavily weighted stock in 
the index. Amazon's entry into the Dow Jones is not just a symbolic change but 
a reflection of the evolving priorities and dynamics within the investment world.

It signals a broader recognition of the value and impact of technology and 
e-commerce sectors, encouraging investors to perhaps rethink their investment 
approaches in light of these trends."#;


// Setting this to false will increase false positives between nouns 
// (e.g., "apple") and company names (e.g., "Apple"), but might be better 
// suited for search query inputs
let is_case_sensitive_doc_parsing = true;

let results = extract_tickers_from_text(text, is_case_sensitive_doc_parsing).unwrap();

assert_eq!(
    results,
    TickerSymbolFrequencyMap::from([
        ("AMZN".to_string(), 6),
        ("WMT".to_string(), 2),
        ("DIA".to_string(), 4),
        ("WBA".to_string(), 1),
        ("UNH".to_string(), 1),
    ])
);

Design Overview

The text search engine employs a hybrid approach to identify company names and stock symbols in documents.

Initially, it filters out stop words and applies a sequence-based tokenizer to detect potential company names, preserving word order for contextual accuracy.

Simultaneously, a secondary tokenizer uses a Bag of Words approach to identify stock symbols, which may occasionally collide with stop words.

The engine calculates a ratio by comparing the number of company name matches to exact stock symbol matches found in the document.

Based on this ratio, it determines whether to include exact stock symbol matches in the results.

Regardless of the decision, the engine ensures that stock symbols are always matched, but the contextual importance of symbols is weighted by their relationship to identified company names.

Testing

When running tests, you can use the --nocapture flag to display output from tests in the console. This is particularly useful for this package as there are tests which process several files at once.

Running All Tests

cargo test -- --nocapture

Running Specific Tests

For example, to run the tokenizer_tests module in isolation with visible output:

cargo test --test tokenizer_tests  -- --nocapture

Benching

cargo bench

Debugging

RUST_LOG=debug cargo dev

Note: dev is an aliased Cargo command, as specified in the .cargo/config.toml file.

More information about Cargo aliases can be found at: https://doc.rust-lang.org/cargo/reference/config.html#configuration-format.

Lint

If clippy is not already installed:

rustup component add clippy
cargo clippy --fix

Suggestions:

cargo clippy -- -W clippy::all

Building CLI tool

Without Logging Support

cargo build --release --bin ticker-sniffer-cli

With Logging Support

cargo build --release --bin ticker-sniffer-cli --features="logger-support"

Maintainer Note

Currently, the build process generates temporary artifacts that are included in the build but are ignored by .git. However, Rust's package verification treats these files as uncommitted changes, which can cause issues when running cargo publish.

This approach ensures that a compressed form of the company_symbol_list.csv file is bundled correctly during the build process. However, it may require improvements to avoid conflicts with Cargo’s publishing workflow.

Known Issue During Publishing

When publishing the crate, you may encounter the following error:

error: 1 files in the working directory contain changes that were not yet committed into git:

embed/COMPRESSED_COMPANY_SYMBOL_LIST_BYTE_ARRAY.bin

to proceed despite this and include the uncommitted changes, pass the `--allow-dirty` flag

Workaround

Provided that embed/COMPRESSED_COMPANY_SYMBOL_LIST_BYTE_ARRAY.bin is the only file that is the only error file mentioned, you can safely proceed with the following.

To proceed with publishing, use the --allow-dirty flag:

cargo publish --allow-dirty

License

MIT License (c) 2025 Jeremy Harris.

Dependencies

~1.7–4MB
~57K SLoC