5 releases

new 0.2.1 Jan 28, 2025
0.2.0 Jan 28, 2025
0.1.2 Jan 25, 2025
0.1.1 Jan 25, 2025
0.1.0 Jan 25, 2025

#61 in Science

Download history 270/week @ 2025-01-21

270 downloads per month

GPL-3.0 license

120KB
2K SLoC

biblib

Crates.io Documentation License: GPL-3.0

A comprehensive Rust library for parsing, managing, and deduplicating academic citations. biblib provides robust support for multiple citation formats, intelligent deduplication, and extensive metadata handling.

Note: This crate is under active development. There may be some breaking changes between versions. Please check the changelog before upgrading.

Features

Multiple Format Support

  • RIS (Research Information Systems)

    • Full tag support
    • Author name parsing
    • Journal abbreviations
  • PubMed/MEDLINE

    • Complete field coverage
    • MeSH terms support
    • Affiliation handling
  • EndNote XML

    • Full XML schema support
    • Unicode handling
    • Custom field mapping
  • CSV with Custom Mappings

    • Configurable headers
    • Multiple delimiters
    • Flexible field mapping

Intelligent Deduplication

  • DOI-based matching
  • Smart title comparison using Jaro-Winkler distance
  • Journal name/abbreviation matching
  • Configurable matching thresholds
  • Year-based grouping for performance
  • Parallel processing support

Rich Metadata Support

  • Complete author information with affiliations
  • Journal details (name, abbreviation, ISSN)
  • DOIs and other identifiers (PMID, PMC ID)
  • Comprehensive citation metadata

Installation

Add this to your Cargo.toml:

[dependencies]
biblib = "0.2.0"  # All features enabled by default

Or select specific features:

[dependencies]
biblib = { version = "0.2.0", default-features = false, features = ["csv", "ris"] }

Available features:

  • csv - CSV format support
  • pubmed - PubMed/MEDLINE format support
  • xml - EndNote XML support (requires quick-xml)
  • ris - RIS format support
  • dedupe - Citation deduplication (requires rayon and strsim)

All features are enabled by default. Disable default-features to select specific ones.

Quick Start

Basic Citation Parsing

use biblib::{CitationParser, RisParser};

// Parse RIS format
let input = r#"TY  - JOUR
TI  - Example Article
AU  - Smith, John
ER  -"#;

let parser = RisParser::new();
let citations = parser.parse(input).unwrap();
println!("Title: {}", citations[0].title);

Citation Deduplication

use biblib::dedupe::{Deduplicator, DeduplicatorConfig};

// Configure deduplication
let config = DeduplicatorConfig {
    group_by_year: true,
    run_in_parallel: true,
};

let deduplicator = Deduplicator::with_config(config);
let duplicate_groups = deduplicator.find_duplicates(&citations).unwrap();

for group in duplicate_groups {
    println!("Original: {}", group.unique.title);
    for duplicate in group.duplicates {
        println!("  Duplicate: {}", duplicate.title);
    }
}

CSV with Custom Mappings

use biblib::csv::{CsvParser, CsvConfig};

let mut config = CsvConfig::new();
config.set_header_mapping("title", vec!["Article Name".to_string()])
     .set_delimiter(b';');

let parser = CsvParser::with_config(config);
let citations = parser.parse("Article Name;Author;Year\nExample Paper;Smith J;2023").unwrap();

Supported Fields

Field Description RIS PubMed EndNote XML CSV
Title Work title
Authors Author names and affiliations
Journal Journal name and abbreviation
Year Publication year
Volume Journal volume
Issue Journal issue
Pages Page range
DOI Digital Object Identifier
PMID PubMed ID -
PMC ID PubMed Central ID
Abstract Abstract text
Keywords Keywords/tags
Language Publication language
Publisher Publisher information -
URLs Related URLs -
ISSN International Standard Serial Number
MeSH Terms Medical Subject Headings - - -

Advanced Usage

Customizing Deduplication

use biblib::dedupe::{Deduplicator, DeduplicatorConfig};

// Fine-tune deduplication settings
let config = DeduplicatorConfig {
    group_by_year: true,     // Enable year-based grouping
    run_in_parallel: true,   // Enable parallel processing
};

let deduplicator = Deduplicator::with_config(config);

Error Handling

use biblib::{CitationParser, RisParser, CitationError};

let result = RisParser::new().parse("invalid input");
match result {
    Ok(citations) => println!("Parsed {} citations", citations.len()),
    Err(CitationError::InvalidFormat(msg)) => eprintln!("Parse error: {}", msg),
    Err(e) => eprintln!("Other error: {}", e),
}

Performance Considerations

  • Use year-based grouping for large datasets (> 1000 citations)
  • Enable parallel processing for better deduplication performance
  • Consider using CSV format for very large datasets
  • Pre-process and normalize data when possible

Thread Safety

All parser implementations are thread-safe and can be shared between threads. The deduplication engine supports parallel processing through the run_in_parallel configuration option.

Contributing

We welcome contributions! Please feel free to submit pull requests. For major changes, please open an issue first to discuss what you would like to change.

Make sure to update tests as appropriate and follow the existing code style.

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Project Status

This crate is under active. There may be some breaking changes.

Support

Dependencies

~4–6.5MB
~103K SLoC