Encoding language corpora: current trends and future directions

Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/
National Institute for Japanese Language, 2006-09-28

Overview
• History and current practices in corpus encoding: TEI P4, CES
• Open issues: multiple annotations, metadata and analytical tools
• Future directions: TEI P5, ISO TC 37
Tomaž Erjavec Dept. of Knowledge Technologies, Jozef Stefan Institute

I. Some history
• 1980s: corpora (and other language resources) were encoded in idiosyncratic formats, usually bound to specific tools
• corpora were expensive to produce, but:
  • difficult to exchange and reuse
  • quickly became obsolete
• to address these problems, the Text Encoding Initiative (TEI) was established in 1987
• the initiative came from humanities computing, with sponsorship by ACH, ALLC and ACL

Text Encoding Initiative
• TEI is the only systematic attempt to develop a fully general text encoding model and a set of encoding conventions based upon it
• intended for processing and analysis of any type of text, in any language
• main result: the TEI Guidelines for Electronic Text Encoding and Interchange
• SGML was chosen as the underlying standard for the TEI Guidelines
• drafts: TEI P1 (1990), TEI P2 (1993)

TEI P3 and P4
• the third version of the Guidelines, TEI P3 (1994), was published in two substantial green volumes (1200 pp.) and soon also on the Web
• a major revision, TEI P4, was published in 2002
• TEI P4 addresses the following issues:
  • error correction
  • equal support for XML and SGML
  • backward compatibility with TEI P3
• today, TEI P4 is the most widely used version of TEI: over 130 projects are listed on the TEI web pages

The TEI scheme
• TEI P4 consists of the written guidelines plus a set of DTD fragments
• to obtain a project-specific DTD (TEI parameterisation), the DTD fragments are combined:
  • core tagset (always present): includes the TEI header
  • base tagsets (specific text types): e.g. prose, dictionaries, drama
  • additional tagsets (particular analyses): e.g. dates & times, certainty, simple linguistic analysis
  • user extensions, which extend or modify the TEI
• a widely used parameterisation of TEI: TEI Lite

What is good about TEI
• it is a “standard”
• offers a rich vocabulary of tags with extensive documentation
• can be extended and modified
• many best-practice scenarios
• software and user community support (tei-c web pages and the tei-l mailing list)
• tutorials teaching TEI

What is bad about TEI
• steep learning curve (difficult to start using it)
• TEI is general, so tags are often too generic for the needs of particular projects, and also too deeply nested (tag bloat)
• it is often not clear how to encode a particular phenomenon (more than one possibility exists)
• while TEI is modular, it still allows many tags that a project (encoder) has no need for
• never really became accepted in the computational linguistics community
• some areas are missing or not up to date: computational lexicons, terminological databases, complex linguistic annotations

TEI for corpus encoding
• base module: TEI.prose
• additional modules:
  • TEI.corpus: additional tags in the header
  • TEI.analysis: tags for simple analytic mechanisms
  • TEI.linking: tags for linking, segmentation, and alignment
  • TEI.fs: tags for feature structure analysis

Example annotated text
<seg id="orwl.en.24" corresp="orwl.sl.24">
  <s id="Oen.1.1.4.5">
    <c type="open" ctag='"'>"</c>
    <w ana="Af" lemma="big">Big</w>
    <w ana="Ncms" lemma="brother">Brother</w>
    <w ana="Vaip3s" lemma="be">is</w>
    <w ana="Vmpp" lemma="watch">watching</w>
    <w ana="Pp2" lemma="you">you</w>
    <c ctag='"'>"</c>
    <w ana="Dd" lemma="the">the</w>
    <w ana="Ncns" lemma="caption">caption</w>
    <w ana="Vmis" lemma="say">said</w>
    <c ctag=".">.</c>
  </s>
</seg>
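One benefit of this inline encoding is that ordinary XML tools can read it. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is mentioned in the talk), that pulls word form, lemma and morphosyntactic tag out of a shortened copy of the sentence. The function name read_tokens is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotated sentence shown on the slide.
SENTENCE = """
<s id="Oen.1.1.4.5">
  <c type="open" ctag='"'>"</c>
  <w ana="Af" lemma="big">Big</w>
  <w ana="Ncms" lemma="brother">Brother</w>
  <w ana="Vaip3s" lemma="be">is</w>
  <w ana="Vmpp" lemma="watch">watching</w>
  <w ana="Pp2" lemma="you">you</w>
  <c ctag='"'>"</c>
</s>
"""


def read_tokens(xml_text):
    """Return (form, lemma, msd) triples for every <w> element, in document order."""
    sentence = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in sentence.iter("w")]


tokens = read_tokens(SENTENCE)
for form, lemma, msd in tokens:
    print(f"{form}\t{lemma}\t{msd}")
```

Punctuation (<c>) is skipped here on purpose; a real pipeline would probably keep it.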

Example morphosyntactic encoding
In text:
<w ana="Ncfda" lemma="ženska">ženskama</w>
In the MSD specification:
<fsLib>
  <fs type="Noun" id="Ncfda" select="sl" feats="N1.c N2.f N3.d N4.a"/>
  <fs type="Noun" id="Ncfdd" select="sl" feats="N1.c N2.f N3.d N4.d"/>
  <fs type="Noun" id="Ncfdg" select="sl" feats="N1.c N2.f N3.d N4.g"/>
  ...
</fsLib>
<fLib>
  <f id="N1.c" select="en ro sl cs bg et hu hr" name="Type"><sym value="common"/></f>
  <f id="N1.p" select="en ro sl cs bg et hu hr" name="Type"><sym value="proper"/></f>
  <f id="N2.m" select="en ro sl cs bg hr" name="Gender"><sym value="masculine"/></f>
  <f id="N2.f" select="en ro sl cs bg hr" name="Gender"><sym value="feminine"/></f>
  <f id="N2.n" select="en ro sl cs bg hr" name="Gender"><sym value="neuter"/></f>
  ...
</fLib>
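The indirection from the ana attribute to the feature-structure library can be followed mechanically. Below is a rough sketch, assuming Python with xml.etree.ElementTree; the wrapper element <spec> and the function expand_msd are hypothetical, and only the features visible on the slide are included, so the elided ones (N3.d, N4.a) are reported as unresolved rather than guessed.

```python
import xml.etree.ElementTree as ET

# Excerpt of the MSD specification from the slide, wrapped in a
# hypothetical <spec> root so that it parses as one document.
SPEC = """
<spec>
  <fsLib>
    <fs type="Noun" id="Ncfda" select="sl" feats="N1.c N2.f N3.d N4.a"/>
  </fsLib>
  <fLib>
    <f id="N1.c" select="en ro sl cs bg et hu hr" name="Type"><sym value="common"/></f>
    <f id="N2.f" select="en ro sl cs bg hr" name="Gender"><sym value="feminine"/></f>
  </fLib>
</spec>
"""


def expand_msd(spec_xml, msd):
    """Resolve an MSD id into (category, {attribute: value}, unresolved feature ids)."""
    root = ET.fromstring(spec_xml)
    # Index the feature library by id: "N1.c" -> ("Type", "common")
    features = {
        f.get("id"): (f.get("name"), f.find("sym").get("value"))
        for f in root.iter("f")
    }
    fs = next((fs for fs in root.iter("fs") if fs.get("id") == msd), None)
    if fs is None:
        raise KeyError(msd)
    resolved, unresolved = {}, []
    for feat_id in fs.get("feats").split():
        if feat_id in features:
            name, value = features[feat_id]
            resolved[name] = value
        else:
            unresolved.append(feat_id)  # feature not defined in this excerpt
    return fs.get("type"), resolved, unresolved


category, resolved, unresolved = expand_msd(SPEC, "Ncfda")
print(category, resolved, unresolved)
```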

CES: the Corpus Encoding Standard
• CES was developed within EU EAGLES, the Expert Advisory Group on Language Engineering Standards (1996)
• CES is an SGML DTD and a particular parameterisation (and modification) of TEI P3
• XCES (2002) is the XML version of CES
• (X)CES has been used in a number of corpus projects, mainly because it is simpler to use and understand than the full TEI
• however, there is no prescribed way to modify or extend it
• it is also less strictly maintained than the TEI

II. Open issues
• multiple annotations
• metadata
• corpus analytical tools

Multiple annotations
More and more linguistic annotation is being added to the data, e.g.
• sentences, words, punctuation, part-of-speech, (morphosyntactic) tags, multi-word units (terms), named entities, syntactic structure, co-reference annotation (anaphora), word-sense information
• also rhetorical structure: quoted speech, paragraphs, lists, …
• even more annotation can be added to multimodal data, e.g. speech signals
• furthermore, the same level of analysis can be marked up by more than one tool or annotator

How to combine these annotations?
• simply have distinct tags and attributes for each of the phenomena covered:
  • easy to understand and hand-edit
  • easy to validate
  • easy to process
• but XML requires a tree structure; what if the tags do not nest properly?

Crossing hierarchies
• simple example: page breaks vs. paragraph boundaries: <page> … …. </page> …
• a paragraph that starts on one page and ends on the next cannot be properly nested inside either page element
• a well-known problem for XML encoding, and one that is becoming more severe with multiple annotations

Solutions to crossing hierarchies
Discussed in TEI chapter 14, “Linking, Segmentation, and Alignment”:
• split elements:
  <page broken="yes" id="p1" next="p2">…</page>
  <page broken="yes" id="p2" prev="p1">…</page>
• “milestones”, i.e. empty elements:
  <page/> … …. <page/> …
• but both are somewhat difficult to process and not very general
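To illustrate why milestones are "somewhat difficult to process": the page is no longer an element, so its content has to be reassembled by walking the tree. The sketch below assumes Python with xml.etree.ElementTree; the <text> and <p> elements and the sample sentences are invented for the example, since the slide elides the actual content.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the first paragraph crosses a page boundary,
# which is marked with an empty <page/> milestone.
DOC = (
    "<text><page/><p>First paragraph starts on page one "
    "<page/>and ends on page two.</p><p>Second paragraph.</p></text>"
)


def text_by_page(xml_text):
    """Group running text by the <page/> milestone that precedes it.

    Text before the first milestone is ignored in this sketch.
    """
    pages = []

    def walk(elem):
        if elem.tag == "page":
            pages.append([])              # a milestone opens a new page
        elif elem.text and pages:
            pages[-1].append(elem.text)   # text before the first child
        for child in elem:
            walk(child)
            if child.tail and pages:
                pages[-1].append(child.tail)  # text after the child's end tag

    walk(ET.fromstring(xml_text))
    return [" ".join(part.strip() for part in parts) for parts in pages]


pages = text_by_page(DOC)
for number, content in enumerate(pages, start=1):
    print(number, content)
```

Paragraph boundaries are lost in the per-page view; recovering both hierarchies at once is exactly what makes milestones awkward.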

Stand-off markup
The general solution to crossing hierarchies is to keep markup in separate documents that only point into the text (or into other markup).
Several specific recommendations and projects:
• TEI P5 and the TEI Workgroup on Stand-Off Markup, XLink and XPointer
• Annotation Graphs with AGTK
• TIGER annotation scheme

Stand-off markup example: TIGER
<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART" morph="Def.Fem.Nom.Sg"/>
      <t id="s5_2" word="Tagung" pos="NN" morph="Fem.Nom.Sg.*"/>
      <t id="s5_3" word="hat" pos="VVFIN" morph="3.Sg.Pres.Ind"/>
      <t id="s5_4" word="mehr" pos="PIAT" morph="--"/>
      <t id="s5_5" word="Teilnehmer" pos="NN" morph="Masc.Akk.Pl.*"/>
      <t id="s5_6" word="als" pos="KOKOM" morph="--"/>
      <t id="s5_7" word="je" pos="ADV" morph="--"/>
      <t id="s5_8" word="zuvor" pos="ADV" morph="--"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
      ….
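In this scheme the syntactic nodes do not contain the words; they only point at them through idref. A small sketch of resolving those pointers follows, assuming Python with xml.etree.ElementTree. The excerpt is cut down to the two non-terminals visible on the slide and closed by hand so that it parses; the function name phrases is made up.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the TIGER example: only the terminals referenced by the
# two non-terminals shown on the slide, with the closing tags added.
TIGER = """
<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART" morph="Def.Fem.Nom.Sg"/>
      <t id="s5_2" word="Tagung" pos="NN" morph="Fem.Nom.Sg.*"/>
      <t id="s5_6" word="als" pos="KOKOM" morph="--"/>
      <t id="s5_7" word="je" pos="ADV" morph="--"/>
      <t id="s5_8" word="zuvor" pos="ADV" morph="--"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def phrases(xml_text):
    """Map each non-terminal id to (category, [(edge label, word), ...])."""
    root = ET.fromstring(xml_text)
    words = {t.get("id"): t.get("word") for t in root.iter("t")}
    result = {}
    for nt in root.iter("nt"):
        children = [
            # Fall back to the raw id when the target is not a terminal.
            (edge.get("label"), words.get(edge.get("idref"), edge.get("idref")))
            for edge in nt.iter("edge")
        ]
        result[nt.get("id")] = (nt.get("cat"), children)
    return result


result = phrases(TIGER)
for node_id, (cat, children) in result.items():
    print(node_id, cat, children)
```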

Problems with stand-off markup
• tools are needed to link the data: processing and editing are more difficult
• no automatic validity checking: consistency, cycles
• difficult to change (correct) primary data or downstream annotations
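Because an XML validator does not follow stand-off pointers, checks for consistency and cycles have to be written separately. The following is one possible sketch in Python, not tied to any particular annotation scheme: annotations are assumed to be given as a plain dictionary from an annotation id to the ids it points to, and check_references is a hypothetical helper.

```python
def check_references(edges, primary_ids):
    """Check a stand-off annotation layer.

    edges: dict mapping annotation id -> list of ids it points to
    primary_ids: set of ids that exist in the primary data
    Returns (dangling pointers, whether the annotations contain a cycle).
    """
    dangling = sorted(
        (src, dst)
        for src, targets in edges.items()
        for dst in targets
        if dst not in edges and dst not in primary_ids
    )

    # Depth-first search with three states to detect a cycle.
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {node: UNSEEN for node in edges}

    def has_cycle(node):
        state[node] = ACTIVE
        for nxt in edges[node]:
            if nxt not in edges:
                continue  # points at primary data, or is dangling
            if state[nxt] == ACTIVE:
                return True
            if state[nxt] == UNSEEN and has_cycle(nxt):
                return True
        state[node] = DONE
        return False

    cyclic = any(state[node] == UNSEEN and has_cycle(node) for node in edges)
    return dangling, cyclic


# Invented example: "vp1" points at a token id that does not exist.
layer = {"np1": ["t1", "t2"], "vp1": ["t3", "t9"], "s1": ["np1", "vp1"]}
report = check_references(layer, {"t1", "t2", "t3"})
print(report)
```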

Metadata
• description of the corpus or corpus elements
• traditional bibliographic standards (MARC)
• but computer corpora also need to be documented along other dimensions: availability, size, markup used, relation of the digital file to the source text, etc.
• EAD was developed for archives, but has many similarities to corpus description
• the TEI header is a metadata recommendation closely coupled with the data itself

TEI header
<teiHeader> is an obligatory part of every TEI document and consists of:
• <fileDesc>, file description: full bibliographical description of the computer file itself; includes information about the source or sources of the electronic text
• <encodingDesc>, encoding description: describes the relationship between the electronic text and its source: normalization, ambiguity resolution, levels of encoding or analysis, etc.
• <profileDesc>, text profile: classificatory and contextual information, e.g. subject matter; important for corpora, to perform retrievals from a body of text in terms of text type or origin (taxonomies)
• <revisionDesc>, revision history: history of changes made during the development of the electronic text

TEI header II.
• an example of a TEI header
• very detailed information is possible but, again, there are many ways to express the same information (e.g. free text or structured in elements)
• stricter but poorer alternatives exist: Dublin Core

Dublin Core
• the Dublin Core Metadata Initiative (DCMI) was founded in 1995 with the aim of creating a core set of metadata descriptions for Web-based resources, useful for categorizing the Web for easier search and retrieval
• the Dublin Core Metadata Element Set (DCES) defines 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
• can be extended
• DC is used e.g. by the Open Language Archives Community (OLAC)
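As an illustration of how flat a Dublin Core description is compared with a TEI header, here is a small sketch that builds a record for this talk. It assumes Python with xml.etree.ElementTree; the wrapper element <metadata> and the function dublin_core_record are hypothetical, while the element names are the 15 listed above.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields):
    """Build a record from (element name, value) pairs; elements may repeat."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields:
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


record = dublin_core_record([
    ("title", "Encoding language corpora: current trends and future directions"),
    ("creator", "Tomaž Erjavec"),
    ("date", "2006-09-28"),
])
print(ET.tostring(record, encoding="unicode"))
```

Anything beyond the fixed element set, such as the markup used or the relation to a source text, has no dedicated place here, which is the "poorer" side of the trade-off.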

Corpus analytical tools
Currently, many corpus exploration tools exist, and they typically offer:
• search with regular expressions over strings
• sometimes search over (lemma/PoS) annotations
• concordance and word frequency list display of results
• sometimes search and display of parallel corpora
• sometimes basic statistical tests (keywordness, collocation strength)
• examples: WordSmith, MonoConc, IMS CQP, Manatee/Bonito, SARA/Xaira, TIGERSearch
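The two most basic functions in this list, regular-expression search with a concordance display and a word frequency list, can be sketched in a few lines. This is a toy illustration in Python using only the standard library, not a description of how any of the tools named above work; the sample text and the function names are invented.

```python
import re
from collections import Counter


def concordance(text, pattern, width=20):
    """Return (left context, keyword, right context) for every regex match."""
    hits = []
    for match in re.finditer(pattern, text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits


def frequency_list(text):
    """Count lower-cased word forms, most frequent first."""
    return Counter(re.findall(r"\w+", text.lower())).most_common()


text = "Big Brother is watching you, the caption said. Big Brother is watching."
hits = concordance(text, r"\bwatch\w*", width=12)
for left, keyword, right in hits:
    print(f"{left:>12} [{keyword}] {right}")
print(frequency_list(text)[:3])
```

Searching over lemma or PoS annotations would need the tokens and their attributes rather than a raw string, which is where dedicated corpus tools come in.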

What is missing
• possibility to combine different types of annotation in queries and displays, esp. for multimodal corpora
• integration of more powerful statistical methods, esp. for collocations and parallel corpora
• tools targeted at different types of users (e.g. Sketch Engine)
• merging of digital library viewers with corpus concordancing software

Corpora vs. digital libraries
• classical reference corpora were composed of samples, and were interesting only for their linguistic content
• today, more and more corpora contain integral texts, which are of interest in themselves (e.g. historical texts)
• conversely, digital libraries are growing in size and accessibility and are also becoming interesting for linguistic research
• what is needed is a system that can perform two tasks: enable selection of (fragments of) heavily structured (multimedia, text-critical) texts for reading, and allow for concordance views of selections
• currently the only available (open-source) system that attempts this is PhiloLogic from the University of Chicago

III. Future directions
Two directions in the standardisation of corpus and language resource annotation:
• the next version of TEI, version P5
• work by ISO TC 37 SC4

TEI P5
• the next version of TEI, currently at beta stage: available, but not stable
• significantly revised and brought in line with current practices
• not backward compatible with P3/P4 (although scripts exist for conversion)
• formal specification based on the ISO Relax NG schema language (although DTDs and W3C Schemas are also available)
• parameterisation also produces dedicated documentation

ISO TC 37
• ISO TC 37: ISO Technical Committee on Terminology, est. 1952
• maybe best known for ISO 639 and MARTIF
• in 2002 it changed its name to Technical Committee on Terminology and Other Language Resources
• it also established ISO TC 37/SC 4, the Sub-Committee on Language Resource Management

ISO TC 37 SC4 WGs
• WG 1: Basic descriptors and mechanisms for language resources
  • terminology used in language resources
  • basic mechanisms and data structures for linguistic representation
  • metadata representation scheme to document linguistic information structures and processes
• WG 2: Representation schemes
  • definition of annotation/representation schemes for morpho-syntax and syntax
  • representation scheme for the semantic content of multimodal information
  • metadata for discourse-level representation scheme
• WG 3: Multilingual text representation
  • translation memory and alignment of parallel corpora
  • segmentation and counting algorithms
  • meta-markup for Globalization, Internationalization and Localization (GIL)
• WG 4: Lexical databases
  • standardization of lexical representation formats for the various types of NLP applications (Machine Readable Lexica)
• WG 5: Workflow of language resource management
  • standardization of guidelines for language validation and net-based distributed cooperative work

WG4 standards
• Language resource management - Feature structures
• Language resource management - Lexical markup framework (LMF)
• Language resource management - Morpho-syntactic annotation framework (MAF)
• all under development!

Conclusions
• I presented some history, the current state and possible future directions in the standardisation of encoding, mainly of corpora
• the main recommendation (for me!) still seems to be TEI: it combines tradition with innovation

Thank you!
