Prim(j)ena MULTEXT-East standarda i normi TEI u izradi paralelnih korpusa
Applikation des MULTEXT-East und der TEI-Normen bei der Erstellung von Parallelkorpora
Application of MULTEXT-East and TEI in the compilation of parallel corpora
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/
Overview
• The need for standardisation
• Corpus encoding in TEI
• MULTEXT-East morphosyntactic descriptions
Why standards (for digital language resources)?
• public documentation (+ software)
• (semi)automated validation
• application independent
• platform independent
• do not become obsolete (as fast)
• However:
  • they demand time to understand and use
  • there are (too) many, and not all are accepted
  • they are not perfectly tuned to any one application (overhead)
TEI: the Text Encoding Initiative
• The TEI Guidelines are a vocabulary for describing text for scholarly purposes
• They consist of:
  • XML schemas
  • documentation
• Releases: P3 (1994), P4 (2002), P5 (0.9, 2007)
• Developed by the TEI Consortium
• Large user base, web site, mailing list, tutorials, yearly meetings
• Increasingly popular for digital libraries, text-critical editions, …, and to a certain extent for corpora
Jp-Sl dictionary
<entry id="jaslo.113">
  <form type="hw">
    <orth type="roma">akeru</orth>
    <orth type="kana">あける</orth>
    <orth type="kanji">開ける</orth>
  </form>
  <gramGrp><pos>V1</pos> <subc>trans.</subc></gramGrp>
  <form type="infl">
    <orth type="v-masu">あけます</orth>
    <orth type="v-te">あけて</orth>
    <orth type="v-nai">あけない</orth>
  </form>
  <trans><tr>odpreti</tr></trans>
  <eg><q>穴(あな)をあける</q> <tr>narediti luknjo</tr></eg>
  <eg><q>窓(まど)を開ける</q> <tr>odpreti okno</tr></eg>
  <xr type="related"> <lbl>prim.</lbl> <ref>開く(あく)</ref> <lbl>intr.</lbl> </xr>
  <usg type="level">4</usg>
</entry>
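One practical benefit of such an encoding is that a generic XML parser can read the entry. The following is a minimal sketch in Python, assuming the entry is available as a standalone, well-formed fragment without namespaces or a DTD. The function name `headword_summary` and the shortened sample are illustrative and not part of the dictionary project.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the entry shown on the slide.
ENTRY = """<entry id="jaslo.113">
  <form type="hw">
    <orth type="roma">akeru</orth>
    <orth type="kana">あける</orth>
    <orth type="kanji">開ける</orth>
  </form>
  <gramGrp><pos>V1</pos><subc>trans.</subc></gramGrp>
  <trans><tr>odpreti</tr></trans>
</entry>"""


def headword_summary(entry_xml):
    """Collect headword spellings, part of speech and translations."""
    entry = ET.fromstring(entry_xml)
    # The headword block is the <form> whose type attribute is "hw".
    headword = entry.find("form[@type='hw']")
    forms = {orth.get("type"): orth.text for orth in headword.findall("orth")}
    return {
        "id": entry.get("id"),
        "forms": forms,
        "pos": entry.findtext("gramGrp/pos"),
        "translations": [tr.text for tr in entry.findall("trans/tr")],
    }


summary = headword_summary(ENTRY)
print(summary["forms"]["roma"], summary["pos"], summary["translations"])
```

Because the element and attribute names come from a published vocabulary, the same lookup logic would be expected to carry over to other dictionaries encoded in the same way.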
Example: MULTEXT-East “1984”, Serbian
<text id="mteo-sr." lang="sr">
  <body id="Osr" lang="sr">
    <div id="Osr.1" n="1" type="part">
      <head>Prvi deo</head>
      <div id="Osr.1.2" n="1" type="chapter">
        <head>1.</head>
        <p id="Osr.1.2.2">
          <s id="Osr.1.2.2.1">Bio je vedar i hladan aprilski dan; na časovnicima je izbijalo trinaest.</s>
          <s id="Osr.1.2.2.2"><name>Vinston Smit</name>, brade zabijene u nedra da izbegne ljuti vetar, hitro zamače u staklenu kapiju stambene zgrade <hi rend="it">Pobeda</hi>, no nedovoljno hitro da bi sprećio jednu spiralu oštre prašine da uđe zajedno s njim.</s>
        </p>
        …
MULTEXT-East
• MULTEXT-East: EU project (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
• Based on the results of the EU MULTEXT project (~West)
• Goal: to produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
  • morphosyntactic specifications (EAGLES / MULTEXT)
  • morphosyntactically annotated parallel corpus
  • inflectional lexica
  • multilingual comparable and speech corpora
  • language processing tools
History of MULTEXT-East resources
• First release 1998 on CD-ROM: already extended with new languages
• Resources available on the Web since 1998: http://nl.ijs.si/ME/
• Second release 2002 (EU CONCEDE): re-encoding in XML/TEI, harmonisation
• Third release 2004: merge of first two releases, further languages
• Fourth release 2007 (?)
The Languages of MULTEXT-East
• Slavic:
  • Russian (East Slavic)
  • Czech (West Slavic)
  • Slovene (South West Slavic)
  • Resian (Slovene dialect)
  • Croatian (South West Slavic): Marko Tadić
  • Serbian (South West Slavic): C. Krstev, D. Vitas
  • Bulgarian (South East Slavic)
• In progress:
  • Macedonian
  • Persian
• Germanic: English
• Romance: Romanian
• Baltic:
  • Latvian
  • Lithuanian
• Finno-Ugric:
  • Estonian
  • Hungarian
• (BalkaNet:
  • Greek
  • Turkish)
The MULTEXT morphosyntactic trinity
• MULTEXT-East morphosyntactic specifications (Croatian, Serbian)
• MULTEXT-East morphosyntactic lexica (Serbian)
• MULTEXT-East morphosyntactically annotated "1984" corpus (Serbian)
1. Morphosyntactic specifications
• Based on EAGLES / MULTEXT
• Define the parts of speech (PoS), their attributes and values
• The specs are a document containing:
  • introduction
  • common tables
  • language-specific sections
• Written in LaTeX, with PDF & HTML output
• Derived XML/TEI encoding as feature structures
• In Version 4, the specifications are to be fully in TEI/XML
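To make the idea of "PoS, attributes and values" concrete, the sketch below expands a positional MSD string such as Ncfsn into attribute-value pairs. The table `NOUN_ATTRIBUTES` is an assumption: it is a small illustrative fragment for nouns written from memory, not a copy of the official tables, and the names `CATEGORIES` and `decode_msd` are hypothetical. A hyphen is treated here as "attribute not applicable". The actual specifications should be consulted for real work.

```python
# Illustrative fragment only (an assumption, not the official tables):
# attribute positions and value codes for nouns.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

# The first character of an MSD selects the category (part of speech).
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional MSD string into an attribute-value dictionary."""
    if not msd or msd[0] not in CATEGORIES:
        raise ValueError(f"unsupported MSD: {msd!r}")
    name, attributes = CATEGORIES[msd[0]]
    features = {"PoS": name}
    # Each following character fills the attribute at the same position.
    for (attribute, values), code in zip(attributes, msd[1:]):
        if code == "-":  # assumed marker for a non-applicable attribute
            continue
        if code not in values:
            raise ValueError(f"unknown {attribute} code {code!r} in {msd!r}")
        features[attribute] = values[code]
    return features


print(decode_msd("Ncfsn"))
```

The resulting dictionary is essentially a flat feature structure, which is the form the derived XML/TEI encoding of the specifications is described as using.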
Example common table (table image not preserved)
Example language-specific table (table image not preserved)
2. The lexica
• Medium-sized morphosyntactic lexica
• Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian
• Approximately all word-forms of about 15,000 lemmas
• A lexical entry is composed of three fields:
  • the word-form: the inflected form of the word
  • the lemma: the base form of the word
  • the morphosyntactic description (MSD)
Example: Slovene lexicon
abeced      abeceda   Ncfdg
abeced      abeceda   Ncfpg
abeceda     =         Ncfsn
abecedah    abeceda   Ncfdl
abecedah    abeceda   Ncfpl
abecedam    abeceda   Ncfpd
abecedama   abeceda   Ncfdd
abecedama   abeceda   Ncfdi
abecedami   abeceda   Ncfpi
abecede     abeceda   Ncfpa
abecede     abeceda   Ncfpn
abecede     abeceda   Ncfsg
abecedi     abeceda   Ncfda
abecedi     abeceda   Ncfdn
…
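The three-field entries can be read with a few lines of code. The sketch below assumes that the fields are separated by whitespace and that "=" in the lemma field means the lemma is identical to the word-form; both are assumptions inferred from the example, and the function names `parse_lexicon` and `analyses_by_form` are illustrative.

```python
from collections import defaultdict

# A few rows copied from the example; fields are tab-separated here.
SAMPLE = """\
abeced\tabeceda\tNcfdg
abeced\tabeceda\tNcfpg
abeceda\t=\tNcfsn
abecedah\tabeceda\tNcfdl
"""


def parse_lexicon(text):
    """Return a list of (word_form, lemma, msd) triples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        form, lemma, msd = line.split()
        if lemma == "=":  # assumed shorthand: lemma equals the word-form
            lemma = form
        entries.append((form, lemma, msd))
    return entries


def analyses_by_form(entries):
    """Index the lexicon by word-form; ambiguous forms get several analyses."""
    index = defaultdict(list)
    for form, lemma, msd in entries:
        index[form].append((lemma, msd))
    return dict(index)


index = analyses_by_form(parse_lexicon(SAMPLE))
print(index["abeced"])
```

Indexing by word-form makes the ambiguity visible: "abeced" has two analyses (genitive dual and genitive plural), and a tagger has to choose between them in context.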
3. The “1984” corpus
• Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr, …))
• Structurally annotated
• Sentence-aligned with English
• Words annotated with lemma and MSD
• Encoded in TEI P4 (XML)
Example linguistic encoding
Context-disambiguated lemmas and MSDs
<text id="Osl." lang="sl">
  <body>
    <div type="part" id="Osl.1">
      <div type="chapter" id="Osl.1.2">
        <p id="Osl.1.2.2">
          <s id="Osl.1.2.2.1">
            <w lemma="biti" ana="Vcps-sma">Bil</w>
            <w lemma="biti" ana="Vcip3s--n">je</w>
            <w lemma="jasen" ana="Afpmsnn">jasen</w>
            <c>,</c>
            <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
            <w lemma="aprilski" ana="Aopmsn">aprilski</w>
            <w lemma="dan" ana="Ncmsn">dan</w>
            <w lemma="in" ana="Ccs">in</w>
            <w lemma="ura" ana="Ncfpn">ure</w>
            <w lemma="biti" ana="Vcip3p--n">so</w>
            <w lemma="biti" ana="Vmps-pfa">bile</w>
            <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
            <c>.</c>
          </s>
          …
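The word-level annotation can be extracted with a standard XML parser. The following is a minimal sketch, assuming a single well-formed `<s>` element without namespaces or DTD-defined entities (real corpus files may need more setup). The function name `read_tokens` and the shortened sample are illustrative.

```python
import xml.etree.ElementTree as ET

# The first tokens of the sentence shown on the slide.
SENTENCE = """<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w>
  <c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>"""


def read_tokens(sentence_xml):
    """Return the sentence id and a list of (token, lemma, msd) tuples."""
    sentence = ET.fromstring(sentence_xml)
    tokens = []
    for element in sentence:
        if element.tag == "w":
            # Words carry a lemma and an MSD (in the ana attribute).
            tokens.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":
            # Punctuation has no lemma or MSD in this encoding.
            tokens.append((element.text, None, None))
    return sentence.get("id"), tokens


sentence_id, tokens = read_tokens(SENTENCE)
for token in tokens:
    print(sentence_id, *token)
```

The sentence id is kept alongside the tokens because ids of this kind are what a sentence alignment with the English text would refer to.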
Utility of MULTEXT-East language resources (LRs)
• For some languages, the specifications became the “national” standard
• Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
• A base dataset for further annotation and experiments:
  • Word-sense disambiguation
  • WordNet development and evaluation
  • Syntactic parser induction
• Teaching aid in HLT courses
• ~ 100 registered users
• As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian, Bosnian?
Corpora using TEI + MULTEXT-East
• Reference corpus of Slovene: FIDA (100 million words), FIDA+ (600 million words), plus other Slovene corpora
• Croatian National Corpus: HNK (100 million words)
• Various Romanian corpora, …
• En-Sl parallel annotated corpus: SVEZ-IJS (10 million words)
Conclusions
• TEI provides a rich and flexible infrastructure for encoding parallel corpora: metadata, corpus and document structure, alignment, linguistic analysis
• MULTEXT-East provides a harmonised, common infrastructure for word-level morphosyntactic descriptions
• Both have already been used for a number of corpora
• Maybe also for BKS (Bosnian/Croatian/Serbian)?