#citation #doi #deduplication #bibliography #academic #parallel-processing

deref

A Rust crate for detecting and managing duplicate academic citations

1 unstable release

new 0.1.0 Jan 24, 2025

#412 in Text processing

GPL-3.0 license

35KB
510 lines

A Rust crate for detecting and managing duplicate academic citations. It deduplicates citations by comparing several fields, including DOIs, titles, journal names, and other metadata.

Features

  • Flexible deduplication based on multiple citation fields:
    • DOI matching with title similarity
    • Title comparison using Jaro-Winkler similarity
    • Journal name and abbreviation matching
    • ISSN comparison
    • Volume and page number verification
  • Performance optimizations:
    • Optional year-based grouping
    • Parallel processing support
  • Robust text handling:
    • Unicode character normalization
    • HTML entity conversion
    • Special character handling
    • Case-insensitive comparisons
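
To make the title-comparison and text-handling features concrete, here is a minimal, standard-library-only sketch of normalizing a title and scoring two titles with Jaro-Winkler similarity. The function names `normalize_title`, `jaro`, and `jaro_winkler` are hypothetical and are not taken from the crate's API; the crate's own normalization (for example full Unicode normalization or a complete HTML entity table) is likely more thorough than this illustrative subset.

```rust
/// Hypothetical helper, not the crate's API: lowercase a title, decode a few
/// common HTML entities, replace punctuation with spaces, collapse whitespace.
fn normalize_title(raw: &str) -> String {
    // Illustrative subset of entities; `&amp;` is decoded last to avoid double decoding.
    let decoded = raw
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&amp;", "&");
    let cleaned: String = decoded
        .chars()
        .flat_map(|c| c.to_lowercase())
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect();
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Jaro similarity in [0, 1]; 1.0 means identical strings.
fn jaro(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    // Characters match only if they are equal and at most `window` positions apart.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut a_matched = vec![false; a.len()];
    let mut b_matched = vec![false; b.len()];
    let mut matches = 0usize;
    for i in 0..a.len() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_matched[j] && a[i] == b[j] {
                a_matched[i] = true;
                b_matched[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }
    // Count matched characters that appear in a different order in the two strings.
    let mut out_of_order = 0usize;
    let mut j = 0usize;
    for i in 0..a.len() {
        if !a_matched[i] {
            continue;
        }
        while !b_matched[j] {
            j += 1;
        }
        if a[i] != b[j] {
            out_of_order += 1;
        }
        j += 1;
    }
    let m = matches as f64;
    let t = (out_of_order / 2) as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}

/// Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to four characters.
fn jaro_winkler(a: &str, b: &str) -> f64 {
    let j = jaro(a, b);
    let prefix = a
        .chars()
        .zip(b.chars())
        .take(4)
        .take_while(|(x, y)| x == y)
        .count();
    j + prefix as f64 * 0.1 * (1.0 - j)
}
```

Under this sketch, "Machine Learning Basics" and "Machine Learning Basics." normalize to the same string, so their similarity is 1.0 regardless of the trailing period.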

Installation

Add this to your Cargo.toml:

[dependencies]
deref = "0.1.0"

Quick Start

use deref::{Deduplicator, Citation, Author};

// Create some sample citations
let citations = vec![
    Citation {
        id: "1".to_string(),
        title: "Machine Learning Basics".to_string(),
        authors: vec![
            Author {
                family_name: "Smith".to_string(),
                given_name: "John".to_string(),
                affiliation: None,
            }
        ],
        doi: Some("10.1234/ml.2023.001".to_string()),
        year: Some(2023),
        ..Default::default()
    },
    // Possible duplicate with slightly different title
    Citation {
        id: "2".to_string(),
        title: "Machine Learning Basics.".to_string(), // Notice the period
        authors: vec![
            Author {
                family_name: "Smith".to_string(),
                given_name: "John".to_string(),
                affiliation: None,
            }
        ],
        doi: Some("10.1234/ml.2023.001".to_string()),
        year: Some(2023),
        ..Default::default()
    },
];

// Create a deduplicator with default settings
let deduplicator = Deduplicator::new();

// Find duplicate citations
let duplicate_groups = deduplicator.find_duplicates(&citations);

// Process results
for group in duplicate_groups {
    println!("Original: {}", group.unique.title);
    for duplicate in group.duplicates {
        println!("  Duplicate: {}", duplicate.title);
    }
}

Advanced Configuration

The deduplicator can be configured with custom settings:

use deref::{Deduplicator, DeduplicatorConfig};

let config = DeduplicatorConfig {
    group_by_year: false,     // Disable year-based grouping
    run_in_parallel: true,    // Enable parallel processing
};

let deduplicator = Deduplicator::with_config(config);

Deduplication Criteria

Citations are considered duplicates based on the following criteria:

With DOIs:

  • Matching DOIs and high title similarity (≥ 0.85)
  • Matching journal names or ISSNs

Without DOIs:

  • Very high title similarity (≥ 0.93)
  • Matching volume or page numbers
  • Matching journal names or ISSNs
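
The criteria above can be sketched as a single decision function. This is only one reading of the list, not the crate's implementation: the `Record` type, the helper names, and the way the bullets are combined are all assumptions. In particular, the sketch assumes that the bullets under each heading must all hold, that a field missing on either side does not veto a match, that two different DOIs rule out a duplicate, and that the title similarity (a value in [0, 1]) has already been computed elsewhere.

```rust
/// Hypothetical record type for illustration; not the crate's `Citation`.
#[derive(Default)]
struct Record {
    doi: Option<String>,
    journal: Option<String>,
    issn: Option<String>,
    volume: Option<String>,
    pages: Option<String>,
}

/// Some(true/false) when both sides carry the field, None when it cannot be compared.
fn agree(a: &Option<String>, b: &Option<String>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x.eq_ignore_ascii_case(y)),
        _ => None,
    }
}

/// "Any of these fields match". Assumption: if none can be compared,
/// the missing evidence does not block a match.
fn any_match_or_unknown(checks: &[Option<bool>]) -> bool {
    checks.iter().any(|c| *c == Some(true)) || checks.iter().all(|c| c.is_none())
}

/// Decide whether two records are duplicates, given a precomputed title similarity.
fn is_duplicate(a: &Record, b: &Record, title_similarity: f64) -> bool {
    let venue_ok = any_match_or_unknown(&[
        agree(&a.journal, &b.journal),
        agree(&a.issn, &b.issn),
    ]);
    match agree(&a.doi, &b.doi) {
        // With DOIs: same DOI, title similarity >= 0.85, compatible venue.
        Some(true) => title_similarity >= 0.85 && venue_ok,
        // Assumption: two different DOIs are never the same work.
        Some(false) => false,
        // Without DOIs: stricter title threshold plus volume/page and venue checks.
        None => {
            let locator_ok = any_match_or_unknown(&[
                agree(&a.volume, &b.volume),
                agree(&a.pages, &b.pages),
            ]);
            title_similarity >= 0.93 && locator_ok && venue_ok
        }
    }
}
```

The stricter threshold in the DOI-less branch reflects the list above: without a shared identifier, the title has to carry more of the evidence.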

Performance Considerations

  • Enable group_by_year for large datasets (default: enabled)
  • Enable run_in_parallel to speed up processing of large datasets when year grouping is on
  • Disable year grouping only for small datasets or when year matching isn't important
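
To illustrate why year grouping helps, the sketch below buckets records by year so that pairwise comparison only happens inside each bucket, then processes the buckets on separate threads. Everything here is hypothetical: the `Entry` type, the function names, and the case-insensitive title equality used as a stand-in for the real comparison are assumptions, and the README does not say how the crate schedules its parallel work. One scoped thread per year is used purely to keep the example dependency-free; a thread pool would be the more usual choice.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical minimal record for illustration; not the crate's `Citation`.
#[derive(Debug)]
struct Entry {
    id: u32,
    year: Option<i32>,
    title: String,
}

/// Bucket entries by year so that only same-year entries are compared.
fn group_by_year(entries: &[Entry]) -> HashMap<Option<i32>, Vec<&Entry>> {
    let mut groups: HashMap<Option<i32>, Vec<&Entry>> = HashMap::new();
    for entry in entries {
        groups.entry(entry.year).or_default().push(entry);
    }
    groups
}

/// Compare every pair inside one bucket. Title equality is a placeholder
/// for a real similarity check.
fn duplicate_pairs(bucket: &[&Entry]) -> Vec<(u32, u32)> {
    let mut pairs = Vec::new();
    for (i, a) in bucket.iter().enumerate() {
        for b in &bucket[i + 1..] {
            if a.title.eq_ignore_ascii_case(&b.title) {
                pairs.push((a.id, b.id));
            }
        }
    }
    pairs
}

/// Process each year bucket on its own scoped thread and merge the results.
fn find_pairs_parallel(entries: &[Entry]) -> Vec<(u32, u32)> {
    let groups = group_by_year(entries);
    let mut all: Vec<(u32, u32)> = thread::scope(|s| {
        // Spawn all workers first, then join them.
        let handles: Vec<_> = groups
            .values()
            .map(|bucket| s.spawn(move || duplicate_pairs(bucket)))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });
    all.sort();
    all
}
```

Grouping turns one large quadratic comparison into several small ones, which is where the speed-up comes from. The cost is that two records with the same title but different years are never compared, which is why disabling year grouping makes sense when year metadata is unreliable or unimportant.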

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.

Changelog

0.1.0

  • Initial release
  • Basic deduplication functionality
  • Year-based grouping
  • Parallel processing support

Dependencies

~3.5–4.5MB
~80K SLoC