Language data and XML: archiving and interoperability

Simon Musgrave, Linguistics Program, Monash University (Simon.Musgrave@arts.monash.edu.au)

Language documentation
• Language documentation produces large quantities of text
  • Transcribed language events
  • Associated annotations
  • Lexica / dictionaries
  • Analyses
  • Ethnographic notes
  • …
• There is no standard software tool used by linguists
• Use of proprietary software results in file formats with limited portability
DRH 2003 - Cheltenham 2/9/03

Advantages of XML: Archiving
• Unicode compatibility is assured
  • Beyond support for different scripts, access to the full International Phonetic Alphabet character set is important for linguists
• The data model is coded explicitly
• A generic file format assures better portability and lifespan

Building an archive
• Addition of data to an XML archive should be automated
• This implies the existence of transformation scripts to move data between formats
• Creating these scripts is work that has to be done in any case
• That work can bring a second benefit

Advantages of XML: Interoperability
• Members of a research team may use different software running on different platforms
• Problems can arise in sharing data
• An important use of XML is as an interchange format
• Transformation scripts created for archiving can also be used for sharing data

Data structures - 1
• Researchers may not agree on common data structures
  • They are used to working with one tool in one particular way
  • Their interests are different
• Even if they agree on a data structure for current work, heritage data may have to be imported to the archive

Data structures - 2
• Archive files must be able to hold all the information coded in all the possible input formats: there should be no loss of data
• We can think of this in terms of the logic of attribute-value matrices: all inputs must be able to unify with the general data structure
• Where possible, correspondences will be made between the information in different input files
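The unification idea can be illustrated with a minimal Python sketch. It treats an attribute-value matrix as a nested dictionary and merges two of them, failing only when the same attribute carries conflicting atomic values. The function name `unify` and the sample entries are hypothetical and are not taken from the project's data.

```python
def unify(a: dict, b: dict) -> dict:
    """Unify two attribute-value matrices held as nested dicts.

    Raises ValueError when the same attribute has conflicting atomic values.
    """
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            # Attribute only present in b: no loss of data, just add it.
            result[key] = b_val
        elif isinstance(result[key], dict) and isinstance(b_val, dict):
            # Both values are themselves matrices: unify recursively.
            result[key] = unify(result[key], b_val)
        elif result[key] != b_val:
            raise ValueError(f"conflict on {key!r}: {result[key]!r} vs {b_val!r}")
    return result


# Hypothetical entries for the same headword from two sources.
source1 = {"form": "example", "gloss": {"english": "sample"}}
source2 = {"form": "example", "gloss": {"indonesian": "contoh"}, "pos": "n"}
merged = unify(source1, source2)
```

Under this view, the archive structure is the one that every input can unify with: it has a slot for each attribute that any source uses.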

Example: Dictionary files
• The prototype implementation of the process uses a simple type of information: dictionary files
• Source 1 is a FileMaker Pro database of lexical material from the language Nusalaut
• Source 2 is a table in an Access database containing data from several languages

Source 1

Source 2

Process overview

Stage 1 – txt to xml
• Data is exported from the database as a delimited text file
• A document type definition (DTD) is created for each source file
• This replicates the existing data structure, possibly with additions
• A Perl script reads data from the txt file and adds tags based on the DTD
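The slides describe a Perl script for this stage. As an illustration only, the same step might look like the following Python sketch, which wraps each row of a tab-delimited export in elements named after the column headers. The sample data, the column names, and the tag names `dictionary` and `entry` are all assumptions, and the sketch takes its tags from the header row rather than from a DTD.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical tab-delimited export; the column names are assumptions.
EXPORT = "headword\tgloss\tpos\nkata1\tgloss one\tn\nkata2\tgloss two\tv\n"


def delimited_to_xml(text: str, root_tag: str = "dictionary",
                     row_tag: str = "entry") -> ET.Element:
    """Turn delimited text into XML, one element per non-empty cell."""
    root = ET.Element(root_tag)
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        entry = ET.SubElement(root, row_tag)
        for field, value in row.items():
            # Skip empty cells rather than emit empty elements.
            if field and value:
                ET.SubElement(entry, field).text = value
    return root


root = delimited_to_xml(EXPORT)
xml_text = ET.tostring(root, encoding="unicode")
```

The column headers are assumed to be valid XML names; a real export would need them checked or mapped first.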

Sample: specific XML

Stage 1 – Why?
• Newer versions of commercial software offer an export-to-XML facility, but a separate script is still useful
• Importing data from a normalized database often means having access to data from more than one table
• A basic XSLT transformation takes a single input file
• Perl (or an equivalent) does not have this limitation
• Type conversion can be done using Perl
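To make the multi-table point concrete, here is a small Python sketch (standing in for the Perl mentioned above) that reads two exported tables, resolves a foreign key between them, and converts one field from text to an integer. The table layouts, column names, and values are invented for illustration.

```python
import csv
import io

# Two hypothetical exports from a normalized database.
ENTRIES = "entry_id\theadword\tlang_id\n1\tkata1\t10\n2\tkata2\t20\n"
LANGUAGES = "lang_id\tname\n10\tLanguage A\n20\tLanguage B\n"


def read_table(text: str) -> list:
    """Read a tab-delimited export into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_entries(entries_text: str, languages_text: str) -> list:
    """Replace each entry's lang_id with the language name from the second table."""
    languages = {row["lang_id"]: row["name"] for row in read_table(languages_text)}
    joined = []
    for row in read_table(entries_text):
        joined.append({
            "entry_id": int(row["entry_id"]),  # type conversion: text to int
            "headword": row["headword"],
            "language": languages[row["lang_id"]],
        })
    return joined


joined = join_entries(ENTRIES, LANGUAGES)
```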

Stage 2 – XML1 to XML2
• The DTD for the archive file has a place for all information in all input files
• More structure is imposed at this level
  • Stage 1 used only elements
  • Stage 2 uses attributes, mainly for metadata
• "Pseudo-normalization": recurring data substructures are treated as optionally recurring elements, so the archive data structure is actually more general than ANY of the inputs
• Date stamping is done at this stage
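A rough Python sketch of this second stage follows. It assumes a source-specific file in which each gloss language has its own element (`gloss_english`, `gloss_indonesian`) and rewrites it so that glosses become a repeatable `gloss` element with a `lang` attribute, while the source name and a date stamp go into attributes on the entry. Every element and attribute name here is a hypothetical choice, not the project's actual archive DTD.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical stage 1 output: elements only, one element per gloss language.
SPECIFIC = (
    "<dictionary><entry><headword>kata1</headword>"
    "<gloss_english>gloss one</gloss_english>"
    "<gloss_indonesian>gloss satu</gloss_indonesian>"
    "</entry></dictionary>"
)


def to_archive(specific_root, source_name, today=None):
    """Map a source-specific tree onto a more general archive tree."""
    stamp = (today or datetime.date.today()).isoformat()
    archive = ET.Element("archive")
    for entry in specific_root.findall("entry"):
        # Metadata goes into attributes, including the date stamp.
        out = ET.SubElement(archive, "entry", {"source": source_name, "added": stamp})
        ET.SubElement(out, "form").text = entry.findtext("headword")
        for child in entry:
            if child.tag.startswith("gloss_"):
                # Pseudo-normalization: one repeatable element instead of
                # a fixed element per language.
                lang = child.tag[len("gloss_"):]
                ET.SubElement(out, "gloss", {"lang": lang}).text = child.text
    return archive


archive = to_archive(ET.fromstring(SPECIFIC), source_name="source1")
```

Because `gloss` may occur any number of times, a source with one gloss language and a source with five both fit the same archive structure.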

Sample: General XML 1

Sample: General XML 2

Exporting Data
• XSLT with <xsl:output method="text"/>
• The only complication is undoing "pseudo-normalization"
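The slides use XSLT for export. The Python sketch below is only a stand-in that shows what undoing pseudo-normalization involves: repeatable `gloss` elements are spread back into a fixed set of columns chosen by the target format. The archive layout and all names are assumptions carried over for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive fragment with repeatable <gloss> elements.
ARCHIVE = (
    '<archive>'
    '<entry source="source1"><form>kata1</form>'
    '<gloss lang="english">gloss one</gloss>'
    '<gloss lang="indonesian">gloss satu</gloss></entry>'
    '<entry source="source1"><form>kata2</form>'
    '<gloss lang="english">gloss two</gloss></entry>'
    '</archive>'
)


def export_rows(archive_root, gloss_langs):
    """Write tab-delimited text with one fixed column per requested gloss language."""
    lines = ["\t".join(["form"] + [f"gloss_{lang}" for lang in gloss_langs])]
    for entry in archive_root.findall("entry"):
        glosses = {g.get("lang"): g.text or "" for g in entry.findall("gloss")}
        cells = [entry.findtext("form", default="")]
        # A missing gloss becomes an empty cell so every row has the same width.
        cells += [glosses.get(lang, "") for lang in gloss_langs]
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"


text = export_rows(ET.fromstring(ARCHIVE), ["english", "indonesian"])
```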

A more complex problem: aligned interlinear text
• An important way of presenting data for linguists
• Various lines of annotation; different levels have different alignment patterns

The Bird, Bow & Hughes Model
• Bird, Steven, Cathy Bow and Baden Hughes (2003). A generalised model of interlinear text. Proceedings of the EMELD Workshop.
• A general data model for representing this type of information
• Four levels:
  • Text
  • Phrase
  • Word
  • Morpheme
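The four levels can be pictured as nested containers. The Python sketch below is an assumption-laden illustration of that nesting only; the class layout, the `annotations` dictionaries, and the English example are invented here and do not reproduce the schema in the cited paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Morpheme:
    form: str
    annotations: Dict[str, str] = field(default_factory=dict)  # e.g. {"gloss": "PL"}


@dataclass
class Word:
    morphemes: List[Morpheme]
    annotations: Dict[str, str] = field(default_factory=dict)

    @property
    def form(self) -> str:
        # The word form is rebuilt from its morphemes.
        return "-".join(m.form for m in self.morphemes)


@dataclass
class Phrase:
    words: List[Word]
    annotations: Dict[str, str] = field(default_factory=dict)  # e.g. free translation


@dataclass
class Text:
    title: str
    phrases: List[Phrase] = field(default_factory=list)


dogs = Word([Morpheme("dog", {"gloss": "dog"}), Morpheme("s", {"gloss": "PL"})])
bark = Word([Morpheme("bark", {"gloss": "bark"})])
phrase = Phrase([dogs, bark], {"translation": "Dogs bark."})
text = Text("demo", [phrase])

# Morpheme-level glosses line up under the morphemes; the translation
# belongs to the phrase as a whole.
gloss_line = " ".join(
    "-".join(m.annotations["gloss"] for m in w.morphemes) for w in phrase.words
)
```

Each annotation attaches at the level it describes, which is why different lines of an interlinear display align differently.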

XML model for aligned text

Aligned text: Problems
• Various types of input:
  • Text strings with spaces and/or tabs (Shoebox)
  • Formatted text (e.g. Word tables)
  • Structured data (e.g. Spinoza database)
• Type of processing varies
  • Text strings need a lot of parsing
  • Structured data needs access to multiple tables
• Ideally, time alignment to the AV source should also be included
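As a small illustration of the parsing that space-aligned text strings require, the following Python sketch recovers columns from two annotation lines by taking the word start positions of the first line as column boundaries. It assumes that alignment is by character position with plain spaces; tabs, combining characters, or ragged alignment would need more work. The function name and the English sample are hypothetical.

```python
import re


def align_columns(lines):
    """Split space-aligned annotation lines into columns.

    Column boundaries are the positions where words start in the first line.
    """
    starts = [m.start() for m in re.finditer(r"\S+", lines[0])]
    ends = starts[1:] + [None]  # the last column runs to the end of the line
    columns = []
    for start, end in zip(starts, ends):
        columns.append([line[start:end].strip() for line in lines])
    return columns


# Hypothetical two-line interlinear fragment: forms above, glosses below.
lines = [
    "dog-s    bark",
    "dog-PL   bark",
]
columns = align_columns(lines)
```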

What is gained
• Interoperability within the project
  • Data can be imported to the archive file from one format and exported to another format
• Interoperability outside the project
  • People who wish to share data with a group will define transformations from their data formats
  • A bottom-up approach to developing standards
• Improved data modeling
  • Encourages members of the project to revise their data formats
  • Gives us help in developing high-level models for linguistic data

Future work
• Processing aligned text formats
• Using schemas rather than DTDs: data validation
• Improved version control, especially checking for duplicate or conflicting records
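One possible shape for the duplicate and conflict check is sketched below in Python. It groups records by a key field and separates keys whose records are identical from keys whose records disagree. This is a guess at what such a check could do, not a description of planned project code; the field names and records are invented.

```python
from collections import defaultdict


def find_duplicates_and_conflicts(records, key="form"):
    """Return (duplicates, conflicts) as lists of key values.

    duplicates: keys whose records are all identical.
    conflicts:  keys whose records differ in some other field.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    duplicates, conflicts = [], []
    for value, group in groups.items():
        if len(group) < 2:
            continue  # a single record is neither duplicate nor conflict
        if all(record == group[0] for record in group[1:]):
            duplicates.append(value)
        else:
            conflicts.append(value)
    return duplicates, conflicts


records = [
    {"form": "a", "gloss": "x"},
    {"form": "a", "gloss": "x"},
    {"form": "b", "gloss": "y"},
    {"form": "b", "gloss": "z"},
    {"form": "c", "gloss": "w"},
]
duplicates, conflicts = find_duplicates_and_conflicts(records)
```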

Some details
• This work is part of the project Endangered Maluku Languages: Eastern Indonesia and the Dutch Diaspora
• Funding:
  • Hans Rausing Endangered Languages Project
  • Australian Research Council
  • Faculty of Arts, Monash University
• Contacts:
  • maluku@arts.monash.edu.au
  • http://www.arts.monash.edu.au/ling/maluku
