MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages
Tomaž Erjavec
http://nl.ijs.si/et/
Department of Knowledge Technologies
Jožef Stefan Institute
Ljubljana, Slovenia
Dublin, April 3rd, 2009
Overview of the talk
• Part-of-speech tagging, tagsets and interoperability
• MULTEXT(-East) morphosyntactic specifications
• Languages, formats, transformations
• An application: JOS resources for Slovene
• Conclusions
Dublin, 4.4.2009
Part-of-speech tagging
• The task of assigning the correct PoS tag to each word in a running text, e.g.
  Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB about/IN 123.5/CD million/CD additional/JJ Delmed/NNP common/JJ shares/NNS to/TO Fresenius/NNP …
• Important HLT infrastructure
• Very useful annotations for linguists
• Some applications:
    • pre-processing step for further analyses: lemmas, syntactic structure, etc.
    • text indexing, e.g. nouns are more useful than verbs
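The word/TAG notation in the example above is easy to process mechanically. The following is a minimal Python sketch, assuming whitespace-separated tokens in which the last slash separates the word from its tag; the function name `parse_tagged` is made up for illustration and is not part of any tagger.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so the token ',/,' and
        # words that themselves contain a slash are still handled.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sentence = "Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB"
tagged = parse_tagged(sentence)

# Text indexing use case from the slide: keep only the nouns (NN, NNS, NNP).
nouns = [word for word, tag in tagged if tag.startswith("NN")]
```

Here `nouns` holds the words tagged NN or NNP in the fragment, which is the kind of filtering a text indexer could apply.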
Methods of PoS tagging
• PoS tagging:
    • determine the ambiguity class of the word (saw → NN | VBD)
    • disambiguate to the correct tag in (local) context (“I saw/VBD a saw/NN”)
• Tagger training:
    • manually annotated corpus: source of probabilities for tags given a (local) context
    • + (lexicon: gives possible tags for each word-form)
• Popular taggers:
    • TnT (HMM tagger), TreeTagger (decision trees), TBL (transformation-based learning)
• Tagging usefulness as well as accuracy crucially depends on the tagset
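The two steps (look up the ambiguity class, then disambiguate in local context) can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the lexicon, the transition scores (invented, not estimated from any corpus) and the greedy left-to-right strategy. Real taggers such as TnT use properly estimated probabilities and a global search rather than this simplification.

```python
# Toy lexicon: word-form -> ambiguity class (set of possible tags).
LEXICON = {"I": {"PRP"}, "saw": {"NN", "VBD"}, "a": {"DT"}}

# Invented scores for tag bigrams (previous tag, current tag).
# In a trained tagger these would come from a manually annotated corpus.
TRANSITIONS = {
    ("<s>", "PRP"): 5,
    ("PRP", "VBD"): 4,
    ("PRP", "NN"): 1,
    ("VBD", "DT"): 4,
    ("DT", "NN"): 6,
    ("DT", "VBD"): 0,
}


def tag(words):
    """Greedy tagging: pick the candidate tag that best follows the previous tag."""
    previous = "<s>"  # sentence-start marker
    result = []
    for word in words:
        candidates = sorted(LEXICON[word])  # step 1: ambiguity class
        best = max(candidates,              # step 2: disambiguate in context
                   key=lambda t: TRANSITIONS.get((previous, t), 0))
        result.append((word, best))
        previous = best
    return result


tagged = tag(["I", "saw", "a", "saw"])
```

With these toy scores the first "saw" comes out as VBD (after a pronoun) and the second as NN (after a determiner), matching the slide's example.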
English tagsets
• Tagging was first developed for English (Brown, CLAWS, PTB tagsets)
• English is an inflectionally very poor language → small tagsets of ~50 different tags
• Tags are typically “synthetic”, i.e. the tag does not transparently map to features, e.g.:
    • to/TO (PoS?)
    • Delmed/NNP (number?)
    • shares/NNS (number?)
Tagsets for other languages
• Other languages often have many more morphosyntactic features associated with a word, so their tagsets are larger
• e.g. Slovene nouns:
    • type: common, proper
    • gender: masculine, feminine, neuter
    • number: singular, dual, plural
    • case: nom., gen., dat., acc., loc., ins.
    • (animacy: yes, no)
    • = 104 “PoS” tags just for Nouns
• Russian, Czech, Slovene: ~1000-2000 word-level syntactic tags
PoS tags vs. MSDs
• PoS tags:
    • used in corpora for corpus annotations / tagging
    • typically synthetic
• Morphosyntactic Descriptions (MSDs):
    • used in inflectional lexica for lexical annotations / morphological analysis
    • typically analytic
• Relation of PoS tagsets to MSD tagsets/features
    • in general: |PoS| < |MSD|
    • but in most MULTEXT-East languages: [PoS] ≡ [MSD]
Developing a multilingual morphosyntactic framework
• Interoperability: Tagsets developed for various languages (or even for the same language) have no connection with each other and are often poorly documented
• Best practice: Languages that do not yet have a tagset could benefit from an operational framework in which to model it
So, wouldn’t it be nice to have:
• an open, standardised, documented, flexible model for MSD/PoS tagset design,
• that would be instantiated for lots of languages,
• and could be simply applied to any language?
EU standardisation efforts
• EAGLES: Expert Advisory Group for Language Engineering Standards (1993-1996)
• MULTEXT: Multilingual Text Tools and Corpora (1995)
• MULTEXT-East: MULTEXT for Central and Eastern European Languages:
    • Version 1: TELRI edition (1998)
    • Version 2: Concede edition (2002)
    • Version 3: TEI edition (2004)
    • Version 4: MondiLex edition (2009?)
• ...
• ISO / TC 37 / LMF / isoCat (2008)
MULTEXT-East morphosyntactic resources
• Basic Language Resource Kit:
    • specifications: define features and MSDs
    • lexica (~15,000 lemmas): triplets: word-form / lemma / MSD
    • parallel corpus: MSD and lemma annotated
• Freely available for research: http://nl.ijs.si/ME/
1984: aligned and annotated
MULTEXT-East languages
The MULTEXT(-East) morphosyntactic specifications
• They specify that e.g. “Ncmsn”
    • corresponds to the feature-structure [Noun, Type=common, Gender=masculine, Number=singular, Case=nominative]
    • is a valid MSD for Slovene
• Specifications consist of
    • Front matter
    • Common part - common definitions for all languages (features)
    • Language particular parts - particulars for each language (MSD set)
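The mapping from an MSD string to a feature structure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the specifications' own tooling: the table below covers nouns only, the codes c, m, s, n, g and d are taken from the examples in this talk, and the remaining value codes (p, f, a, l, i and so on) are assumed. The name `decode_msd` is hypothetical.

```python
# Noun attributes in positional order, each with its one-character value codes.
# Codes not shown in the talk's examples are assumptions.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

# First character of the MSD selects the category and its attribute table.
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Return (category name, {attribute: value}) for a positional MSD."""
    name, attributes = CATEGORIES[msd[0]]
    features = {}
    # zip stops at the shorter sequence, so omitted trailing positions
    # simply yield no feature.
    for (attribute, values), code in zip(attributes, msd[1:]):
        if code != "-":  # '-' marks an attribute that does not apply
            features[attribute] = values[code]
    return name, features


category, features = decode_msd("Ncmsn")
```

For "Ncmsn" this yields the category Noun with Type=common, Gender=masculine, Number=singular and Case=nominative, as on the slide.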
V4 specs draft in HTML
Specifications in Version 4
• Encoded in XML / teiLite (in Version 3: LaTeX)
• TEI = Text Encoding Initiative Guidelines P4
• Still “book-like” in form, to make authoring easier
• XSLT into other formats:
    • HTML
    • tabular mapping formats (e.g. MSD to features)
    • XML/TEI feature library
    • (OWL)
The common specifications
• Define categories (“parts-of-speech”)
• For each category define features, i.e. attributes and their values
• For each attribute-value specify for which languages it is appropriate
• Give positional mapping to MSDs:
    • each attribute assigned a position
    • each attribute-value assigned a one-character code
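The positional mapping can also be run in the other direction, from features to an MSD. The Python sketch below is a minimal illustration for nouns only. The attribute order follows the Slovene noun example in this talk; most value codes, the use of "-" for an unset attribute and the dropping of trailing hyphens are assumptions about the convention, and `encode_noun` is a hypothetical name.

```python
# The list order fixes each attribute's position;
# each value maps to a one-character code (partly assumed).
NOUN_POSITIONS = [
    ("Type", {"common": "c", "proper": "p"}),
    ("Gender", {"masculine": "m", "feminine": "f", "neuter": "n"}),
    ("Number", {"singular": "s", "dual": "d", "plural": "p"}),
    ("Case", {"nominative": "n", "genitive": "g", "dative": "d",
              "accusative": "a", "locative": "l", "instrumental": "i"}),
    ("Animacy", {"no": "n", "yes": "y"}),
]


def encode_noun(features):
    """Build a positional MSD for a noun from an {attribute: value} dict."""
    codes = []
    for attribute, value_codes in NOUN_POSITIONS:
        value = features.get(attribute)
        # An attribute that is not set is written as '-'.
        codes.append(value_codes[value] if value is not None else "-")
    # Trailing hyphens carry no information and are dropped.
    return "N" + "".join(codes).rstrip("-")


msd = encode_noun({"Type": "common", "Gender": "masculine",
                   "Number": "singular", "Case": "genitive"})
```

Because the position carries the attribute name, the same letter can be reused across attributes (n is both neuter and nominative) without ambiguity.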
Common table (HTML)
Common table (source XML/teiLite)
Language particular sections
• Recap the feature definitions for the language
• Add “combinations”, i.e. feature co-occurrence restrictions
• Add “lexicon”, i.e. list of all valid MSDs for the language
• Possibly localise the features and codes
• Possibly give notes and examples
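One practical use of the per-language list of valid MSDs is checking annotations against it. The Python sketch below assumes the list has already been loaded into a set; the three MSDs shown are only an excerpt taken from examples in this talk, and the function name and sample tokens are invented for illustration.

```python
# Excerpt only: a real list would contain every valid MSD for the language.
VALID_MSDS_SLV = {"Ncmsn", "Ncmsg", "Ncmsd"}


def invalid_annotations(tagged_tokens, valid_msds):
    """Return the (word, msd) pairs whose MSD is not in the valid set."""
    return [(word, msd) for word, msd in tagged_tokens if msd not in valid_msds]


# Hypothetical annotated tokens; 'Ncmsx' uses a code that is not in the list.
sample = [("w1", "Ncmsn"), ("w2", "Ncmsx"), ("w3", "Ncmsd")]
problems = invalid_annotations(sample, VALID_MSDS_SLV)
```

A set gives constant-time membership tests, which matters once the list grows to the thousand or more MSDs mentioned earlier for Slavic languages.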
Combinations
Lexicon
Jezikoslovno označevanje slovenščine (Linguistic Annotation of Slovene)
http://nl.ijs.si/jos
JOS as a bridge to MULTEXT-East Version 4
[Diagram with boxes: FidaPLUS corpus; MTE V3 slv specifications; JOS corpora; JOS (slv) specifications; MTE V4 specifications; MTE V4 (slv) specifications]
JOS specifications
• XML/teiLite + XSLT transforms
• Allow reordering of attribute positions (Vm-----d → Vmd)
• i18n / slv+eng:
    • translation: specifications
    • localisation: attributes, values, codes
    • localisation: TEI element names
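The effect of reordering attribute positions can be shown with a small Python sketch. It treats the MSD purely positionally and makes no claim about which verb attributes are involved; the permutation used below is an assumption chosen to reproduce the slide's example, and `reorder_msd` is a hypothetical name.

```python
def reorder_msd(msd, new_order):
    """Reorder the attribute positions of an MSD.

    new_order[i] is the old 0-based attribute position that moves to
    new position i (the category letter is not counted).
    """
    category, codes = msd[0], msd[1:]
    # Pad with '-' so that every position named in new_order exists.
    padded = codes.ljust(len(new_order), "-")
    reordered = "".join(padded[old] for old in new_order)
    # Trailing hyphens carry no information and are dropped.
    return category + reordered.rstrip("-")


# Assumed permutation: move the 7th attribute up to 2nd place.
short = reorder_msd("Vm-----d", [0, 6, 1, 2, 3, 4, 5])
```

Moving frequently used attributes to the front means the unused ones end up at the end, where their hyphens can be omitted, which is what shortens the MSD.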
MSD conversion tables
• Tabular UTF-8 files
    • MSD-slv to -eng
    • MSD to features
    • Collating sequence
e.g.
01N0101010100  Somei  Ncmsn
01N0101010200  Somer  Ncmsg
01N0101010300  Somed  Ncmsd

Ncmsn  Noun  Type=common  Gender=masculine  Number=singular  Case=nominative  Animacy=0
Ncmsg  Noun  Type=common  Gender=masculine  Number=singular  Case=genitive  Animacy=0
Ncmsd  Noun  Type=common  Gender=masculine  Number=singular  Case=dative  Animacy=0
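Rows of the MSD-to-features table are straightforward to read into a lookup structure. The Python sketch below assumes whitespace-separated columns laid out as in the example rows (MSD, category, then Attribute=value pairs); the actual column separator of the distributed files is not stated in the talk, and the function names are hypothetical.

```python
def parse_feature_row(line):
    """Parse one 'MSD Category Attr=value ...' row into its three parts."""
    msd, category, *pairs = line.split()
    # Split each pair on the first '=' only, in case a value contains one.
    features = dict(pair.split("=", 1) for pair in pairs)
    return msd, category, features


def load_feature_table(lines):
    """Build an MSD -> (category, features) lookup from table rows."""
    table = {}
    for line in lines:
        if line.strip():  # skip blank lines
            msd, category, features = parse_feature_row(line)
            table[msd] = (category, features)
    return table


rows = [
    "Ncmsn Noun Type=common Gender=masculine Number=singular Case=nominative Animacy=0",
    "Ncmsg Noun Type=common Gender=masculine Number=singular Case=genitive Animacy=0",
]
table = load_feature_table(rows)
```

Values are kept as strings, so the "0" given for Animacy is stored as is rather than interpreted.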
Adding a new language
• XSLT scripts:
    • mtems-split.xsl: make a template for the language particular section of a new language
    • mtems-merge: merge a new language particular section into the common tables
• Maybe shortly to be tested on new Slavic languages in the scope of MondiLex
Critiques
• It’s just an exercise in encoding anyway
• Same is different, different is same
• The Procrustean bed of standards
• Policy change: from unification to harmonisation (hippy school)
Conclusions
• Presented work-in-progress on “standardisation” of multilingual morphosyntactic specifications
• Specifications are a de facto standard for several languages (Romanian, Slovene, Croatian)
• Could serve as “hub” encoding for multilingual applications, e.g. MT
• and as a framework for new languages
Further work
• Finishing MTE V4!
• Distribution: LDC, ELDA
• Relation to ISO-TC37 standards:
    • LMF, isoCAT
• Connecting to GOLD ontology
• Adding new languages:
    • Slavic completion
    • Western European: MULTEXT
    • Japanese: chasen tagset, jpWaC(-L2)
    • Irish? ☺