agldt

Tools for handling data conforming to the standards of the Ancient Greek and Latin Dependency Treebank

2 releases

0.1.2 Dec 6, 2022
0.1.1 Dec 3, 2022
0.1.0 Dec 3, 2022
0.0.0 Dec 3, 2022

MIT license

Author: Caio Geraldes <caio.geraldes@usp.br>

Tools for parsing treebanks from AGLDT

Basic usage

use serde_xml_rs::from_str;
use std::fs::read_to_string;
use agldt::parser::*;

fn main() {
    // Read the raw treebank file into memory.
    let src = read_to_string("/path/to/agldt/tlg0007.tlg004.perseus-grc1.tb.xml").unwrap();
    // Preprocess the XML, then parse it into a `Treebank`.
    let doc = from_str::<Treebank>(&preprocess(&src)).unwrap();

    assert_eq!(doc.count_words(), 9451);
    assert_eq!(doc.count_tokens(), 10709);
}

Description of parsing stages

Preprocessing

Preprocessing rewrites the source XML so that the treebank can be deserialized.

Some oddities in the schema used in AGLDT's XML header and body would otherwise make deserializing it into a struct quite messy. This is something of a bodge, but it should do the trick.

Oddities

The main oddity in AGLDT's use of XML occurs inside the <respStmt> tag, where the <persName> tag may contain either a single string value or a series of tags:

<respStmt>
  <persName>Bridget Almas</persName>
  <resp>responsible for the annotation environment and cts:urn technology</resp>
  <address>Tufts University</address>
</respStmt>
<respStmt>
  <persName>
    <short>Vanessa Gorman</short>
    <name>Vanessa Gorman</name>
    <address>vbgorman@gmail.com</address>
    <uri>http://data.perseus.org/sosol/users/Vanessa%20Gorman</uri>
  </persName>
  <resp>annotator of the text</resp>
</respStmt>

To work around this oddity, we apply two regex replacements that move the <name> and <address> tags inside <persName>.

A handful of other oddities concern the <primary>, <secondary> and <annotator> tags inside the <sentence> tag. In the current version, these are also removed by regex.
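
To illustrate the kind of removal described above, here is a minimal sketch that uses only the standard library instead of regex. The helper name strip_element is hypothetical and is not part of the crate's API; the crate's own preprocess function may work differently. The sketch assumes the elements are not nested and are not self-closing.

```rust
/// Hypothetical helper (not part of `agldt`): removes every
/// `<tag ...>...</tag>` element from `src`, assuming no nesting.
fn strip_element(src: &str, tag: &str) -> String {
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut out = String::with_capacity(src.len());
    let mut rest = src;

    while let Some(start) = rest.find(&open) {
        let after_name = &rest[start + open.len()..];
        // Make sure we matched the whole tag name, not just a prefix of a longer one.
        let exact = after_name.starts_with('>')
            || after_name.starts_with(char::is_whitespace);

        match (exact, after_name.find(&close)) {
            (true, Some(end)) => {
                // Keep the text before the element, skip the element itself.
                out.push_str(&rest[..start]);
                rest = &after_name[end + close.len()..];
            }
            _ => {
                // Different tag or unterminated element: copy it through unchanged.
                out.push_str(&rest[..start + open.len()]);
                rest = after_name;
            }
        }
    }

    out.push_str(rest);
    out
}

/// Applies `strip_element` for the three tags named in the text.
fn strip_sentence_metadata(src: &str) -> String {
    ["primary", "secondary", "annotator"]
        .iter()
        .fold(src.to_string(), |acc, tag| strip_element(&acc, tag))
}
```

Calling strip_sentence_metadata on the raw file contents before parsing would leave only the remaining children of each <sentence>; a regex-based version is shorter but needs the regex crate.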

Finally, the head value is sometimes an empty string, which I still cannot deserialize directly. As 0 is not used anywhere else, I replace empty strings with "0".
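
The empty-head replacement can be sketched with a plain string substitution. The function name fill_empty_heads is hypothetical, and the sketch assumes the attribute is written with double quotes and preceded by a space; the crate itself may do this differently.

```rust
/// Hypothetical helper (not part of `agldt`): rewrites empty `head`
/// attributes to `head="0"`. The leading space in the pattern leaves
/// attributes whose names merely end in "head" untouched.
fn fill_empty_heads(src: &str) -> String {
    src.replace(" head=\"\"", " head=\"0\"")
}
```

Words that already carry a numeric head pass through unchanged, so the function is safe to run over the whole file.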

Deserialization

The data is deserialized with serde. I did my best to keep the metadata accessible, but some fields are still missing and will be added later.
