Tomaž Erjavec Department of Knowledge Technologies Jožef Stefan Institute, Ljubljana

Prim(j)ena MULTEXT-East standarda i normi TEI u izradi paralelnih korpusa
Applikation des MULTEXT-East und der TEI-Normen bei der Erstellung von Parallelkorpora
Application of MULTEXT-East and TEI in the compilation of parallel corpora
Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/

Overview
• The need for standardisation
• Corpus encoding in TEI
• MULTEXT-East morphosyntactic descriptions

Why standards (for digital language resources)?
• public documentation (+ software)
• (semi)automated validation
• application independent
• platform independent
• do not become obsolete (as fast)
However:
• they take time to understand and use
• there are (too) many, and not all are accepted
• they are not perfectly tuned to the application (overhead)

TEI: the Text Encoding Initiative
• TEI Guidelines are a vocabulary to describe text for scholarly purposes
• They consist of:
  • XML schemas
  • documentation
• P3 (1994), P4 (2002), P5 (0.9, 2007)
• being developed by the TEI Consortium
• large user base, web site, mailing list, tutorials, yearly meetings
• increasingly popular for digital libraries, text-critical editions, … and, to a certain extent, for corpora

Jp-Sl dictionary
<entry id="jaslo.113">
  <form type="hw">
    <orth type="roma">akeru</orth>
    <orth type="kana">あける</orth>
    <orth type="kanji">開ける</orth>
  </form>
  <gramGrp><pos>V1</pos> <subc>trans.</subc></gramGrp>
  <form type="infl">
    <orth type="v-masu">あけます</orth>
    <orth type="v-te">あけて</orth>
    <orth type="v-nai">あけない</orth>
  </form>
  <trans><tr>odpreti</tr></trans>
  <eg><q>穴（あな）をあける</q> <tr>narediti luknjo</tr></eg>
  <eg><q>窓（まど）を開ける</q> <tr>odpreti okno</tr></eg>
  <xr type="related">
    <lbl>prim.</lbl> <ref>開く（あく）</ref> <lbl>intr.</lbl>
  </xr>
  <usg type="level">4</usg>
</entry>

Example: MULTEXT-East “1984”, Serbian
<text id="mteo-sr." lang="sr">
  <body id="Osr" lang="sr">
    <div id="Osr.1" n="1" type="part">
      <head>Prvi deo</head>
      <div id="Osr.1.2" n="1" type="chapter">
        <head>1.</head>
        <p id="Osr.1.2.2">
          <s id="Osr.1.2.2.1">Bio je vedar i hladan aprilski dan; na časovnicima je izbijalo trinaest.</s>
          <s id="Osr.1.2.2.2"><name>Vinston Smit</name>, brade zabijene u nedra da izbegne ljuti vetar, hitro zamače u staklenu kapiju stambene zgrade <hi rend="it">Pobeda</hi>, no nedovoljno hitro da bi sprećio jednu spiralu oštre prašine da uđe zajedno s njim.</s>
        </p>
        …

MULTEXT-East
• MULTEXT-East: EU project (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
• Based on the results of EU MULTEXT (~West)
• Aim: to produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
  • morphosyntactic specifications (EAGLES / MULTEXT)
  • morphosyntactically annotated parallel corpus
  • inflectional lexica
  • multilingual comparable and speech corpora
  • language processing tools

History of MULTEXT-East resources
• First release 1998 on CD-ROM: already extended with new languages
• Resources since 1998 available on the Web: http://nl.ijs.si/ME/
• Second release 2002 (EU CONCEDE): re-encoding in XML/TEI, harmonisation
• Third release 2004: merge of first two releases, further languages
• Fourth release 2007 (?)

The Languages of MULTEXT-East
• Slavic:
  • Russian (East Slavic)
  • Czech (West Slavic)
  • Slovene (South West Slavic)
  • Resian (Slovene dialect)
  • Croatian (South West Slavic): Marko Tadić
  • Serbian (South West Slavic): C. Krstev, D. Vitas
  • Bulgarian (South East Slavic)
• In progress:
  • Macedonian
  • Persian
• Germanic: English
• Romance: Romanian
• Baltic:
  • Latvian
  • Lithuanian
• Finno-Ugric:
  • Estonian
  • Hungarian
• (BalkaNet: Greek, Turkish)

The MULTEXT morphosyntactic trinity
• MULTEXT-East morphosyntactic specifications (Croatian, Serbian)
• MULTEXT-East morphosyntactic lexica (Serbian)
• MULTEXT-East morphosyntactically annotated "1984" corpus (Serbian)

1. Morphosyntactic specifications
• Based on EAGLES / MULTEXT
• Define the parts of speech (PoS), their attributes and values
• The specs are a document containing:
  • introduction
  • common tables
  • language particular sections
• Written in LaTeX → PDF & HTML
• Derived XML/TEI encoding as feature structures
• In Version 4, the specifications are to be fully in TEI/XML
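To make the idea of "PoS, their attributes and values" concrete, here is a minimal Python sketch that expands a positional MSD such as Ncfsn into attribute-value pairs. The table below is a hand-written excerpt for nouns only and is an assumption for illustration; the authoritative inventories are in the specifications themselves. The function name decode_msd is hypothetical, and treating "-" as "attribute not applicable" is inferred from tags such as Vcps-sma.

```python
# Illustrative excerpt only: a hand-written table for noun MSDs.
# The real attribute and value inventories live in the specifications.
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

# Only nouns are covered in this sketch; other categories would be added here.
CATEGORIES = {"N": ("Noun", NOUN_POSITIONS)}


def decode_msd(msd):
    """Expand a positional MSD string into a dict of attribute-value pairs."""
    if not msd or msd[0] not in CATEGORIES:
        raise ValueError(f"unsupported MSD: {msd!r}")
    category, positions = CATEGORIES[msd[0]]
    features = {"PoS": category}
    for (attribute, values), code in zip(positions, msd[1:]):
        if code == "-":  # assumed: hyphen marks an attribute that does not apply
            continue
        # Unknown codes are kept visibly marked rather than silently dropped.
        features[attribute] = values.get(code, f"?{code}")
    return features


print(decode_msd("Ncfsn"))
```

The positional layout keeps tags compact, while the decoded form is closer to the feature-structure view mentioned above.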

Example common table

Example language specific table

2. The lexica
• Medium-size morphosyntactic lexica
• Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian
• ~ all word-forms of approx. 15,000 lemmas
• A lexical entry is composed of three fields:
  • the word-form: the inflected form of the word
  • the lemma: the base form of the word
  • the morphosyntactic description (MSD)

Example: Slovene lexicon
abeced     abeceda  Ncfdg
abeced     abeceda  Ncfpg
abeceda    =        Ncfsn
abecedah   abeceda  Ncfdl
abecedah   abeceda  Ncfpl
abecedam   abeceda  Ncfpd
abecedama  abeceda  Ncfdd
abecedama  abeceda  Ncfdi
abecedami  abeceda  Ncfpi
abecede    abeceda  Ncfpa
abecede    abeceda  Ncfpn
abecede    abeceda  Ncfsg
abecedi    abeceda  Ncfda
abecedi    abeceda  Ncfdn
…
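The three-field entries can be read with a few lines of Python. This is a sketch under two assumptions not stated on the slide: fields are separated by whitespace, and "=" in the lemma field means the lemma is identical to the word-form. The function name parse_lexicon is hypothetical.

```python
from collections import defaultdict

# A few entries copied from the Slovene lexicon example.
SAMPLE = """\
abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
"""


def parse_lexicon(text):
    """Map each word-form to the list of (lemma, MSD) readings it can have."""
    readings = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        word_form, lemma, msd = fields
        if lemma == "=":  # assumed shorthand: lemma equals the word-form
            lemma = word_form
        readings[word_form].append((lemma, msd))
    return dict(readings)


lexicon = parse_lexicon(SAMPLE)
print(lexicon["abeced"])   # two readings: dual and plural genitive
print(lexicon["abeceda"])  # one reading, lemma filled in from the word-form
```

Keeping a list of readings per word-form reflects that one form (here abeced) is ambiguous between several MSDs, which is what a tagger later has to resolve in context.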

3. The “1984” corpus
• Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
• Structurally annotated
• Sentence aligned with English
• Words annotated with lemma and MSD
• Encoded in TEI P4 (XML)

Example linguistic encoding
Context disambiguated lemmas and MSDs
<text id="Osl." lang="sl">
  <body>
    <div type="part" id="Osl.1">
      <div type="chapter" id="Osl.1.2">
        <p id="Osl.1.2.2">
          <s id="Osl.1.2.2.1">
            <w lemma="biti" ana="Vcps-sma">Bil</w>
            <w lemma="biti" ana="Vcip3s--n">je</w>
            <w lemma="jasen" ana="Afpmsnn">jasen</w>
            <c>,</c>
            <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
            <w lemma="aprilski" ana="Aopmsn">aprilski</w>
            <w lemma="dan" ana="Ncmsn">dan</w>
            <w lemma="in" ana="Ccs">in</w>
            <w lemma="ura" ana="Ncfpn">ure</w>
            <w lemma="biti" ana="Vcip3p--n">so</w>
            <w lemma="biti" ana="Vmps-pfa">bile</w>
            <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
            <c>.</c>
          </s>
          …
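Word-level annotation of this kind is straightforward to read back with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a shortened, self-closed copy of the sentence above; the function name read_sentence is hypothetical, and a real corpus file would need the whole document parsed rather than a single s element.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example sentence, closed so that it is well-formed.
SAMPLE = """\
<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w>
  <c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>
"""


def read_sentence(xml_text):
    """Return the sentence id and a list of (token, lemma, MSD) triples.

    Punctuation (<c>) carries no annotation, so lemma and MSD are None.
    """
    sentence = ET.fromstring(xml_text)
    tokens = []
    for element in sentence:
        if element.tag == "w":
            tokens.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":
            tokens.append((element.text, None, None))
    return sentence.get("id"), tokens


sentence_id, tokens = read_sentence(SAMPLE)
for token, lemma, msd in tokens:
    print(token, lemma, msd)
```

The same triples are what a tagger or lemmatizer would be trained and tested on when the corpus is used as a dataset.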

Utility of MULTEXT-East LRs
• Specifications became, for some, the “national” standard
• Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
• A base dataset for further annotation and experiments:
  • Word-sense disambiguation
  • WordNet development and evaluation
  • Syntactic parser induction
• Teaching aid in HLT courses
• ~ 100 registered users
• As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian, Bosnian?

Corpora using TEI+MULTEXT-East
• Reference corpus of Slovene: FIDA (100Mw), FIDA+ (600Mw) (+ other Sl. corpora)
• Croatian National Corpus: HNK (100Mw)
• Various Romanian corpora, …
• En-Sl parallel annotated corpus: SVEZ-IJS (10Mw)

Conclusions
• TEI provides a rich and flexible infrastructure to encode parallel corpora: meta-data, corpus and document structure, alignment, linguistic analysis
• MULTEXT-East provides a harmonised and common infrastructure for word-level morphosyntactic descriptions
• Both have already been used for a number of corpora
• Maybe also for BKS?

Thank you!
