The FIDA & MULTEXT-East language resources

Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/
Gralis 2006, Institut für Slawistik der Universität Graz, 2006-05-09

Overview
• Background
• FIDA: a reference corpus of Slovene
• MULTEXT-East: morphosyntactic resources for Central and East-European languages
• Other language resources for Slovene
Tomaž Erjavec Dept. of Knowledge Technologies, Jožef Stefan Institute

Language Resources
• Language resources (LRs) comprise three layers of data:
  • corpora: mono- or multilingual, reference or specialised, … (variously annotated)
  • lexica: vocabularies, morphosyntactic, syntactic, semantic, (ontologies)
  • standards: linguistic and technical encoding
• LRs, esp. corpora, are used for empirical language research:
  • linguistic studies: (annotated) corpus + (sophisticated) search engine
  • human language technology R&D: testing and training datasets

Part I. The FIDA corpus
Slovene reference corpus for linguistic studies

FIDA http://www.fida.net/
Joint project (1997-2000) of
• Filozofska fakulteta: Vojko Gorjanc, Marko Stabej, Špela Vintar
• Institut Jožef Stefan: Tomaž Erjavec
• DZS: Simon Krek
• Amebis: Peter Holozan, Miro Romih
Financed by industry partners

Characteristics of FIDA
• monolingual
• synchronic
• written language
• reference
• representative
• balanced
• annotated

Sizes
• Total: 103,513,072 words
• 29,177 texts
• Avg. text length: 3,548 words
• Largest texts:
  • Leksikon DZS: 508,370 words
  • 69 texts > 100,000 words
• Smallest texts:
  • 2,648 texts < 100 words
  • 2 x <w>rezgrtshdrghgth4</w>
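The average text length follows directly from the two totals on this slide. A quick check, with variable names chosen here purely for illustration:

```python
# Figures copied from the "Sizes" slide.
total_words = 103_513_072
total_texts = 29_177

# Mean text length in words, rounded to the nearest word.
avg_length = round(total_words / total_texts)
print(avg_length)  # 3548
```

The mean is pulled up by a small number of very large texts (69 texts exceed 100,000 words), so it says little about a typical text.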

Time Composition
• Oldest/most recent text: 1989/2000
• Average date: 1997-02
• Texts/words with unknown date: 3.94%/8.28%

FIDA taxonomy: publication types
…
Ft.P.P.O (published) 95.72%
Ft.P.P.O.K (books) 22.71%
Ft.P.P.O.P (periodicals) 70.50%
Ft.P.P.O.P.C (newspaper) 46.59%
Ft.P.P.O.P.C.D (daily) 32.67%
Ft.P.P.O.P.C.T (weekly) 66.18%
Ft.P.P.O.P.C.V (multi-weekly) 17.74%
…

FIDA taxonomy: text types
Ft.Z (text type) 99.47%
Ft.Z.N (non-fiction) 93.57%
Ft.Z.N.N (non-professional) 75.14%
Ft.Z.N.S (professional) 18.37%
Ft.Z.N.S.H (hum. & soc. sci.) 10.57%
Ft.Z.N.S.N (nat. & tech. sci.) 6.04%
Ft.Z.U (fiction) 5.90%
Ft.Z.U.D (drama) 0.10%
Ft.Z.U.P (poetry) 0.17%
Ft.Z.U.R (prose) 5.12%
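The dotted taxonomy codes read naturally as paths in a tree. The sketch below assumes each code's parent is the code with its last component removed, and that all percentages are shares of the whole corpus. Both are interpretations of the slide, not documented FIDA behaviour. It computes, for each category, the share not covered by its listed subcategories.

```python
from decimal import Decimal

# Shares copied from the text-type slide (percent of corpus, by assumption).
SHARES = {
    "Ft.Z": "99.47",
    "Ft.Z.N": "93.57",
    "Ft.Z.N.N": "75.14",
    "Ft.Z.N.S": "18.37",
    "Ft.Z.N.S.H": "10.57",
    "Ft.Z.N.S.N": "6.04",
    "Ft.Z.U": "5.90",
    "Ft.Z.U.D": "0.10",
    "Ft.Z.U.P": "0.17",
    "Ft.Z.U.R": "5.12",
}


def parent(code):
    """Return the code one level up, e.g. 'Ft.Z.N.S' -> 'Ft.Z.N'."""
    return code.rsplit(".", 1)[0]


def unassigned(shares):
    """Map each category with listed children to (own share - sum of children)."""
    child_sum = {}
    for code, pct in shares.items():
        up = parent(code)
        if up in shares:
            child_sum[up] = child_sum.get(up, Decimal("0")) + Decimal(pct)
    return {code: Decimal(shares[code]) - total for code, total in child_sum.items()}


gaps = unassigned(SHARES)
for code, gap in gaps.items():
    print(code, gap)
```

Under these assumptions fiction and non-fiction add up exactly to the Ft.Z total, while the lower levels leave small remainders, which may correspond to texts that were not classified any further.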

Markup of FIDA
• corpus elements annotated with meta-data (bibliographic, taxonomy)
• text linguistically annotated
• encoded according to international standards and recommendations
  • technical: SGML, TEI P3
  • linguistic: MULTEXT-East (MULTEXT, EAGLES)

Linguistic annotation

Accessibility
Exploitation by partners:
• DZS: new dictionaries
• Amebis: development of HLT
• Arts faculty: teaching
• IJS: research on HLT
Availability to the public:
• access via concordance engine by Amebis
• free access, but displays only a few hits
• possibility of academic licences
FIDA (web site) no longer maintained!

FIDA+ http://www.fidaplus.net/
• FIDA Plus project:
  • Filozofska fakulteta, Fakulteta za družbene vede, Institut Jožef Stefan
  • DZS, Amebis
• Financed by the ministry + industry partners
• Extend the corpus with
  • Web materials
  • spoken component
• Better linguistic markup
• Free concordances: up to 100 lines
• Also possibility of licences

Concordancer

Output

Extended searches

Corpus “Nova Beseda” http://bos.zrc-sazu.si/
• being developed at the Institute for Slovene language, ZRC SAZU (Primož Jakopin)
• Web concordancer with no hit limit
• now larger than FIDA
• but much less varied: fiction, Delo, DZ
• not linguistically annotated

Part II. MULTEXT-East
multilingual morphosyntactic resources for HLT development

MULTEXT-East resources
• MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
• Based on the results of EU MULTEXT (~West)
• To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
  • corpus encoding standardisation (TEI / CES)
  • multilingual parallel, comparable, speech corpora
  • morphosyntactic specifications (EAGLES / MULTEXT)
  • (inflectional) lexicon
  • annotated corpus
  • language processing tools

History of MULTEXT-East resources
• First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
• Resources since 1998 available on the Web: http://nl.ijs.si/ME/
• Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
• Third release 2004: merge of first two releases, further languages
• Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bilateral projects

The Languages of MULTEXT-East
• Germanic: English
• Romance: Romanian
• Baltic:
  • Latvian
  • Lithuanian
• Finno-Ugric:
  • Estonian
  • Hungarian
• Slavic:
  • Russian (East Slavic)
  • Czech (West Slavic)
  • Slovene (South West Slavic)
  • Resian (Slovene dialect)
  • Croatian (South West Slavic)
  • Serbian (South West Slavic)
  • Bulgarian (South East Slavic)
• In progress:
  • Macedonian
  • Persian

Version 3
• Available on http://nl.ijs.si/ME/V3/
• Some parts completely free, others free for research → Web licence
• Web pages give:
  • extensive documentation
  • bibliography list
  • web licence form
  • resource download

The MULTEXT morphosyntactic trinity
• MULTEXT-East morphosyntactic specifications
• MULTEXT-East morphosyntactic lexica
• MULTEXT-East morphosyntactically annotated "1984" corpus

1. Morphosyntactic specifications
• Based on EAGLES / MULTEXT
• Define PoS, their attributes and values
• The specs are a document containing:
  • introduction
  • common tables
  • language-particular sections
• Written in LaTeX → PDF & HTML
• Derived XML/TEI encoding as feature structures

Example common table

Example language-specific section
• table (shows only categories actually used)
• notes
• combinations
• lexicon
• for Slovene (FIDA): localisation of category names

Morphosyntactic Complexity

2. The lexica
• Medium-size morphosyntactic lexica
• Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian
• ~ all word-forms of approx. 15,000 lemmas
• A lexical entry is composed of three fields:
  • the word-form: the inflected form of the word
  • the lemma: the base form of the word
  • the morphosyntactic description (MSD)
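An MSD packs the PoS and its attribute values into one positional string, for example Ncfsn for a common feminine singular nominative noun. The sketch below decodes noun MSDs only. The attribute table is a simplified reading of the noun positions (type, gender, number, case) and not the full specification, and the function name is hypothetical.

```python
# Simplified, noun-only attribute table (an assumption, not the complete spec).
# Position 0 is the category letter; positions 1..4 follow this order.
NOUN_ATTRIBUTES = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd):
    """Expand a noun MSD such as 'Ncfsn' into a feature dictionary."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"category": "noun"}
    # zip stops at the shorter sequence, so extra trailing positions are ignored.
    for (name, values), code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code != "-":  # '-' marks an attribute that does not apply
            features[name] = values[code]
    return features


print(decode_noun_msd("Ncfsn"))
```

Other categories (verbs, adjectives, numerals, …) have their own position tables, so a full decoder would dispatch on the first letter.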

Example: Slovene lexicon
abeced     abeceda  Ncfdg
abeced     abeceda  Ncfpg
abeceda    =        Ncfsn
abecedah   abeceda  Ncfdl
abecedah   abeceda  Ncfpl
abecedam   abeceda  Ncfpd
abecedama  abeceda  Ncfdd
abecedama  abeceda  Ncfdi
abecedami  abeceda  Ncfpi
abecede    abeceda  Ncfpa
abecede    abeceda  Ncfpn
abecede    abeceda  Ncfsg
abecedi    abeceda  Ncfda
abecedi    abeceda  Ncfdn
…
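A lexicon in this three-field layout is easy to load into a lookup table. The sketch below assumes whitespace-separated fields and reads "=" in the lemma field as "lemma identical to the word-form", which is how the example above appears to use it. The function names are hypothetical.

```python
from collections import defaultdict

# A few entries copied from the example above.
SAMPLE = """\
abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
"""


def parse_lexicon(text):
    """Return a list of (word_form, lemma, msd) triples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        form, lemma, msd = line.split()
        if lemma == "=":  # assumed shorthand: lemma equals the word-form
            lemma = form
        entries.append((form, lemma, msd))
    return entries


def index_by_form(entries):
    """Group the possible (lemma, msd) readings under each word-form."""
    index = defaultdict(list)
    for form, lemma, msd in entries:
        index[form].append((lemma, msd))
    return dict(index)


lexicon = index_by_form(parse_lexicon(SAMPLE))
print(lexicon["abeced"])
```

Ambiguous word-forms such as abeced keep all their readings; choosing among them in context is the job of a tagger.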

Lexicon sizes

3. The “1984” corpus
• Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
• Structurally annotated
• Sentence aligned with English
• Words annotated with lemma and MSD
• Encoded in TEI P4 (XML)

Example linguistic encoding
Sentence alignment & context-disambiguated lemmas and MSDs
<text id="Osl." lang="sl">
 <body>
  <div type="part" id="Osl.1">
   <div type="chapter" id="Osl.1.2">
    <p id="Osl.1.2.2">
     <s id="Osl.1.2.2.1">
      <w lemma="biti" ana="Vcps-sma">Bil</w>
      <w lemma="biti" ana="Vcip3s--n">je</w>
      <w lemma="jasen" ana="Afpmsnn">jasen</w>
      <c>,</c>
      <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
      <w lemma="aprilski" ana="Aopmsn">aprilski</w>
      <w lemma="dan" ana="Ncmsn">dan</w>
      <w lemma="in" ana="Ccs">in</w>
      <w lemma="ura" ana="Ncfpn">ure</w>
      <w lemma="biti" ana="Vcip3p--n">so</w>
      <w lemma="biti" ana="Vmps-pfa">bile</w>
      <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
      <c>.</c>
     </s>
…
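Token-level annotation of this kind can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a shortened copy of the sentence above. It assumes a well-formed fragment with no namespaces or DTD entities, which may not hold for the distributed corpus files, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example sentence, made well-formed for illustration.
FRAGMENT = """\
<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w>
  <c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>
"""


def read_sentence(sentence_xml):
    """Return (sentence id, list of (token, lemma, msd)) for one <s> element."""
    s = ET.fromstring(sentence_xml)
    tokens = []
    for el in s:
        if el.tag == "w":    # word: carries lemma and MSD attributes
            tokens.append((el.text, el.get("lemma"), el.get("ana")))
        elif el.tag == "c":  # punctuation: no linguistic annotation
            tokens.append((el.text, None, None))
    return s.get("id"), tokens


sentence_id, tokens = read_sentence(FRAGMENT)
for token in tokens:
    print(token)
```

The sentence id is the natural key for sentence alignment, since aligned units in the other languages can refer to it.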

Quantifying the corpus

Utility of MULTEXT-East LRs
• Specifications became, for some, the “national” standard
• Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
• A base dataset for further annotation and experiments:
  • Word-sense disambiguation
  • WordNet development and evaluation
  • Syntactic parser induction
• Teaching aid in HLT courses
• ~ 100 registered users
• As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian

LRs @ JSI http://nl.ijs.si/nl.html#Resource
Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS

Overview of Slovene LRs and services @ Slovenian Language Technologies Society http://nl.ijs.si/sdjt/

Thank you!
