Latest release: 0.1.0 (Jan 24, 2025), 1 unstable release
deref
A Rust crate for detecting and managing duplicate academic citations. It provides robust deduplication of citations based on multiple criteria including DOIs, titles, journal names, and other metadata.
Features
- Flexible deduplication based on multiple citation fields:
  - DOI matching with title similarity
  - Smart title comparison using Jaro-Winkler distance
  - Journal name and abbreviation matching
  - ISSN comparison
  - Volume and page number verification
- Performance optimizations:
  - Optional year-based grouping
  - Parallel processing support
- Robust text handling:
  - Unicode character normalization
  - HTML entity conversion
  - Special character handling
  - Case-insensitive comparisons
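To make the title comparison idea concrete, here is a minimal, standard-library-only sketch of Jaro-Winkler similarity applied to lightly normalized titles. It is not the crate's implementation: the function names (`normalize`, `jaro`, `jaro_winkler`, `title_similarity`) are hypothetical, and the normalization shown covers only case folding and punctuation, not Unicode normalization or HTML entities.

```rust
/// Hypothetical helper: lowercase, replace non-alphanumerics with spaces,
/// then collapse runs of whitespace.
fn normalize(s: &str) -> String {
    let cleaned: String = s
        .to_lowercase()
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect();
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Jaro similarity in [0.0, 1.0].
fn jaro(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    // Two equal characters count as a match only within this distance.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut a_matched = vec![false; a.len()];
    let mut b_matched = vec![false; b.len()];
    let mut matches = 0usize;
    for i in 0..a.len() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_matched[j] && a[i] == b[j] {
                a_matched[i] = true;
                b_matched[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }
    // Count matched characters that appear in a different order.
    let mut out_of_order = 0usize;
    let mut j = 0;
    for i in 0..a.len() {
        if !a_matched[i] {
            continue;
        }
        while !b_matched[j] {
            j += 1;
        }
        if a[i] != b[j] {
            out_of_order += 1;
        }
        j += 1;
    }
    let m = matches as f64;
    let t = (out_of_order / 2) as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}

/// Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 chars,
/// using the conventional scaling factor of 0.1.
fn jaro_winkler(a: &str, b: &str) -> f64 {
    let sim = jaro(a, b);
    let prefix = a
        .chars()
        .zip(b.chars())
        .take(4)
        .take_while(|(x, y)| x == y)
        .count();
    sim + prefix as f64 * 0.1 * (1.0 - sim)
}

/// Hypothetical entry point: compare titles after normalization.
fn title_similarity(a: &str, b: &str) -> f64 {
    jaro_winkler(&normalize(a), &normalize(b))
}
```

With this sketch, two titles that differ only in case or trailing punctuation normalize to the same string and score 1.0, while unrelated titles score well below the thresholds the crate documents.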
Installation
Add this to your Cargo.toml:
[dependencies]
deref = "0.1.0"
Quick Start
use deref::{Deduplicator, Citation, Author};

// Create some sample citations
let citations = vec![
    Citation {
        id: "1".to_string(),
        title: "Machine Learning Basics".to_string(),
        authors: vec![
            Author {
                family_name: "Smith".to_string(),
                given_name: "John".to_string(),
                affiliation: None,
            }
        ],
        doi: Some("10.1234/ml.2023.001".to_string()),
        year: Some(2023),
        ..Default::default()
    },
    // Possible duplicate with slightly different title
    Citation {
        id: "2".to_string(),
        title: "Machine Learning Basics.".to_string(), // Notice the period
        authors: vec![
            Author {
                family_name: "Smith".to_string(),
                given_name: "John".to_string(),
                affiliation: None,
            }
        ],
        doi: Some("10.1234/ml.2023.001".to_string()),
        year: Some(2023),
        ..Default::default()
    },
];

// Create a deduplicator with default settings
let deduplicator = Deduplicator::new();

// Find duplicate citations
let duplicate_groups = deduplicator.find_duplicates(&citations);

// Process results
for group in duplicate_groups {
    println!("Original: {}", group.unique.title);
    for duplicate in group.duplicates {
        println!("  Duplicate: {}", duplicate.title);
    }
}
Advanced Configuration
The deduplicator can be configured with custom settings:
use deref::{Deduplicator, DeduplicatorConfig};

let config = DeduplicatorConfig {
    group_by_year: false,  // Disable year-based grouping
    run_in_parallel: true, // Enable parallel processing
};
let deduplicator = Deduplicator::with_config(config);
Deduplication Criteria
Citations are considered duplicates based on the following criteria:
With DOIs:
- Matching DOIs and high title similarity (≥ 0.85)
- Matching journal names or ISSNs
Without DOIs:
- Very high title similarity (≥ 0.93)
- Matching volume or page numbers
- Matching journal names or ISSNs
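The criteria above can be sketched as a single decision function. This is an illustration under stated assumptions, not the crate's code: the `Comparison` struct and `is_duplicate` function are hypothetical, the bullets under each heading are read as conditions that must all hold, and a pair in which only one citation has a DOI is treated as the "without DOIs" case. Only the two thresholds (0.85 and 0.93) come from the list above.

```rust
/// Hypothetical summary of how two citations compare (not a crate type).
struct Comparison {
    both_have_doi: bool,
    doi_match: bool,
    title_similarity: f64,
    journal_match: bool,
    issn_match: bool,
    volume_match: bool,
    pages_match: bool,
}

// Thresholds taken from the criteria listed above.
const DOI_TITLE_THRESHOLD: f64 = 0.85;
const NO_DOI_TITLE_THRESHOLD: f64 = 0.93;

/// Assumption: every bullet under a heading must hold for a pair to count
/// as a duplicate.
fn is_duplicate(c: &Comparison) -> bool {
    let venue_match = c.journal_match || c.issn_match;
    if c.both_have_doi {
        c.doi_match && c.title_similarity >= DOI_TITLE_THRESHOLD && venue_match
    } else {
        // No DOI to anchor on, so require a stricter title match plus
        // agreement on volume or pages.
        c.title_similarity >= NO_DOI_TITLE_THRESHOLD
            && (c.volume_match || c.pages_match)
            && venue_match
    }
}
```

Under this reading, the stricter 0.93 threshold compensates for the missing DOI: without a shared identifier, the title and the bibliographic details have to carry more of the evidence.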
Performance Considerations
- Enable group_by_year for large datasets (default: enabled)
- Use run_in_parallel for faster processing of large datasets with year grouping
- Disable year grouping only for small datasets or when year matching isn't important
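To illustrate why year-based grouping helps on large datasets, the sketch below counts pairwise comparisons with and without grouping. It is a standalone illustration, not the crate's internals: `pair_count`, `group_by_year`, and `comparisons_with_grouping` are hypothetical names, and placing citations with no year into one shared bucket is an assumption.

```rust
use std::collections::BTreeMap;

/// Number of unordered pairs among `n` items.
fn pair_count(n: usize) -> usize {
    n * n.saturating_sub(1) / 2
}

/// Bucket citation indices by publication year.
/// Assumption: citations without a year share the `None` bucket.
fn group_by_year(years: &[Option<i32>]) -> BTreeMap<Option<i32>, Vec<usize>> {
    let mut groups: BTreeMap<Option<i32>, Vec<usize>> = BTreeMap::new();
    for (idx, year) in years.iter().enumerate() {
        groups.entry(*year).or_default().push(idx);
    }
    groups
}

/// Comparisons needed when only citations from the same year are compared.
fn comparisons_with_grouping(years: &[Option<i32>]) -> usize {
    group_by_year(years)
        .values()
        .map(|bucket| pair_count(bucket.len()))
        .sum()
}
```

For six citations spread over the years 2023, 2023, 2022, 2022, 2021, and one with no year, comparing everything against everything takes `pair_count(6)` = 15 comparisons, while comparing only within year buckets takes 2. The buckets are also independent of each other, which is the kind of structure that lends itself to parallel processing.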
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
License
This project is licensed under the GPL-3.0 License - see the LICENSE file for details.
Changelog
0.1.0
- Initial release
- Basic deduplication functionality
- Year-based grouping
- Parallel processing support