Language Resources for Maltese
Mike Rosner
Dept. Artificial Intelligence, University of Malta
mike.rosner@um.edu.mt
Malta
Team
• Mike Rosner, Dept AI, UoM
• Ray Fabri, Inst. Linguistics, UoM
• Duncan Attard, RA, Dept AI, UoM
• Albert Gatt, Aberdeen and UoM
• … and others
Outline
• Maltese Language
• MLRS
• Corpus
• Lexicon
• Conclusion
• Demo
Maltese Language
• National language of the Maltese Islands (along with English)
• c. 1M native speakers (Malta, Australia, Canada, UK)
• Real language
• Mixed language
  • Arabic: kelb (dog)
  • Romance: karozza (car)
  • English: swiċċ; ners; owkej
• Latin script + some special characters
  • ċ, ġ, ħ, ż, għ, ie
• Vowels are written (unlike Arabic)
  • kiteb
Semitic Morphology
• Root-and-template based
• Root has 3 consonants, e.g. "k t b"
• Template is a pattern of consonants and vowels, e.g. CVCVC
• Vocalism = 2 vowels, e.g. "i e"
• Word formed by interdigitation
  • interdigitate(ktb, ie, CVCVC) → kiteb
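The interdigitation step can be illustrated with a minimal Python sketch. The function name and argument order follow the slide; everything else (the slot-by-slot filling strategy, the error handling) is an assumption made for illustration and not the project's implementation.

```python
def interdigitate(root: str, vocalism: str, template: str) -> str:
    """Fill each C slot with the next root consonant and each V slot
    with the next vowel of the vocalism (illustrative sketch only)."""
    if template.count("C") != len(root) or template.count("V") != len(vocalism):
        raise ValueError("template slots do not match root/vocalism length")
    consonants = iter(root)
    vowels = iter(vocalism)
    letters = []
    for slot in template:
        if slot == "C":
            letters.append(next(consonants))
        elif slot == "V":
            letters.append(next(vowels))
        else:
            raise ValueError(f"unknown template slot: {slot!r}")
    return "".join(letters)


print(interdigitate("ktb", "ie", "CVCVC"))  # kiteb
```

The sketch treats every root consonant as a single character, so a digraph such as għ would need a list-based representation instead of a plain string.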
Semitic Morphology
• ħadem: work (verb)
• ħaddiem: worker
• ħidma: work (noun)
• nħadem: be worked (verb passive)
• ħaddem: caused to work
Plural Formation
• Sound plural: formed by suffixes
  • (a) Romance: karozza/karozzi (car); tappit/tappiti (carpet)
  • (b) Semitic: ikla/ikliet (food)
• Broken plural: change of stem, drop of vowel
  • qamar/qmura
  • tifel/tfal
  • ġdid/ġodda (new)
  • tappit/twapet (carpet)
Morpho-Syntactic Features
• Verb-less sentences
  • Il-karozza ġdida / the car is new
• Construct state (inalienable possession)
  • Id it-tifel / the boy's hand
• Sun letters
  • ix-xemx / the sun
  • it-tifel / the boy
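The sun-letter examples above can be sketched as a small rule: the l of the definite article assimilates to a following sun letter. The list of sun letters below is not given on the slide and is supplied here as an assumption; the function name is hypothetical, and vowel-initial nouns and other special cases are deliberately ignored.

```python
# Assumed inventory of Maltese sun letters (not listed on the slide).
SUN_LETTERS = ("ċ", "d", "n", "r", "s", "t", "x", "ż", "z")


def add_definite_article(noun: str) -> str:
    """Prefix a consonant-initial noun with the definite article,
    assimilating the article's l to a sun letter (simplified sketch)."""
    first = noun[0].lower()
    if first in SUN_LETTERS:
        return f"i{first}-{noun}"
    return f"il-{noun}"


for noun in ("xemx", "tifel", "karozza"):
    print(add_definite_article(noun))
```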
Construct State
• Id it-tifel fil-but
• hand (def) the boy in the pocket
• The boy's hand (is) in the pocket
Verbs with Semitic Inflections
• Italian borrowing: spjega, explain (It. spiegare)
  • jispjega: he explains
  • nispjegaw: we explain
  • spjegat: she explained
  • spjegajt: I explained, etc.
• English borrowing: ixxuttja, kick a football (Eng. shoot)
  • jixxuttja: he kicks
  • nixxuttjaw: we kick
  • ixxuttjat: she kicked
  • ixxuttjajt: I kicked, etc.
Clitic Pronouns
• bgħatthielux
• bgħat − t − hie − lu − x
• send past to her it not 1SM
• I didn't send it to her
Summary
• Mixed language
• Morphology and syntax are more closely intertwined than in other European languages (typical of Semitic languages)
• Empirical work needs to be carried out to establish a correct morphosyntactic description
• Lack of systematic language resources
Language Resources
• Natural language processing systems and tools,
• Linguistic research that yields new knowledge about the language itself, and
• Language-related industries such as software localization, translation, publishing etc.
Maltese Language Resource Server (MLRS)
• RTDI National Project
• Main deliverables:
  • Maltese National Corpus (Server)
  • Computational Lexicon (Server)
• Subsidiary deliverables: tools for access, creation and maintenance of resources
  • Tokeniser
  • Part of Speech Tagger
  • NP Chunker
Same Data, Different Services
Corpus
• Representative
• Accessible to
  • contributors
  • editors
  • other users
• Multiple levels of annotation
• Word extraction
2 Dimensional Corpus
Text Category
Levels of Annotation
c. 20 Text Categories
Corpus Website
Wordlist Management
• User submits text, files or page URLs.
• These resources are scanned; the words are extracted from them and displayed.
• User edits the resulting lists of extracted words manually.
• User submits final version for incorporation into the wordlist database.
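The scan-and-extract step above might look roughly like the following Python sketch. The function name, the letters-only tokenisation rule and the lowercasing are assumptions for illustration; the slides do not describe how MLRS actually extracts words.

```python
import re
from collections import Counter

# A word is taken to be a maximal run of letters; Python's Unicode-aware
# regex treats Maltese characters such as ċ, ġ, ħ and ż as letters.
WORD = re.compile(r"[^\W\d_]+")


def extract_wordlist(text: str) -> Counter:
    """Return each lowercased word in the text with its frequency."""
    return Counter(word.lower() for word in WORD.findall(text))


wordlist = extract_wordlist("Il-karozza ġdida. Il-karozzi ġodda.")
for word, count in sorted(wordlist.items()):
    print(word, count)
```

Splitting at hyphens separates the article from its noun (il + karozza), which is only one of several plausible choices; the manual editing step on the slide is where such decisions would be reviewed.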
Current Corpus
• 50M words at level 0, predominantly news, legal, government. Some fiction.
• Submission requires a signed agreement from contributors.
• Level 0
  • catalogue: visible to all
  • contents: only visible to submitter
• Level 1 and higher
  • catalogue and contents: visible to all
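The visibility rules above can be written down as a small access check. This is a sketch only: the `Document` type, its fields and the function names are hypothetical and do not come from the MLRS code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical corpus entry: annotation level plus submitter name."""
    level: int
    submitter: str


def can_view_catalogue(doc: Document, user: str) -> bool:
    # The catalogue entry is visible to everyone at every level.
    return True


def can_view_contents(doc: Document, user: str) -> bool:
    # Level 0 contents are restricted to the submitter;
    # from level 1 upwards the contents are visible to all.
    if doc.level >= 1:
        return True
    return user == doc.submitter


raw = Document(level=0, submitter="alice")
print(can_view_contents(raw, "alice"), can_view_contents(raw, "bob"))
```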
Morphosyntactic Annotation: Level III
• Tagset: a predetermined collection of tags for Maltese (Albert Gatt/Ray Fabri)
• Brill Tagger (Brill 1996)
• Training phase: hand tagging
• Each tag can be regarded as a set of attribute/value pairs
• For example, the tag NCS stands for {Cat=noun, Type=common, Num=sing}
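Reading a tag as attribute/value pairs can be sketched as follows. Only the NCS example comes from the slide; the assumption that each letter position encodes one attribute, and the extra codes marked as hypothetical, are illustrative and not part of the actual tagset.

```python
# One lookup table per tag position. Entries marked "hypothetical"
# are invented for illustration and are not taken from the tagset.
CAT = {"N": "noun"}
TYPE = {"C": "common", "P": "proper"}  # "P" is hypothetical
NUM = {"S": "sing", "P": "plur"}       # "P" is hypothetical


def decode_tag(tag: str) -> dict:
    """Expand a three-letter tag into attribute/value pairs."""
    if len(tag) != 3:
        raise ValueError(f"expected a three-letter tag, got {tag!r}")
    return {"Cat": CAT[tag[0]], "Type": TYPE[tag[1]], "Num": NUM[tag[2]]}


print(decode_tag("NCS"))
```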
Lexicon - Aims
• Broad coverage
• Support for different kinds of lexical information
  • Syntactic (Part of Speech + other)
  • Phonetic spelling
  • Translation (En)
• Interaction with linguist over Internet
Lexicon Construction: Workflow
• Extract wordlists from text (automatic)
• Identify/correct headwords (semi-automatic)
  • Alignment techniques (Dalli 2001)
  • Automatic prefix/suffix recognition (Attard 2004)
• For each headword, construct lexical entry (manual)
  • Led (Lexicon Editor)
Lexicon Editor
Object Description Language
• OO language for handling dependencies between lexical fields.
• Primarily affects linguist interface.
• An ODL description contains the following parts in order:
  • Enumeration Declarations
  • Class Declarations
  • Rules (Optional)
  • Macro Definitions (Optional)
ODL Example
    enum Number { Singular, Plural, Dual }
    class NOUN {
        Cat = noun;
        Type = common | proper;
        Number = *;
    }
    class PRONOUN: NOUN {
        Case = *;
    }
    if (Number == Plural) { !Gender }
Current Status
• Website (http://mlrs.cs.um.edu.mt)
  • User classes (public; linguist; administrator)
• Corpus
  • Web interface
  • Tools: level 0; level 1; level 2
  • Collection approx 50MB @ level 0
• Lexicon
  • Editor/Browser
  • ODL version 0
Future Work
• Manual annotation:
  • POS annotation to train tagger
  • Migration of level 0 to level 1
• Morphological component
  • Morphological analyser/synthesiser
  • Relationships between lexical entries
• HPSG integration. Stefan Muller, Saarbruecken.
• Compatibility/integration with existing lexical resources (cf WordNet)
• Language-enabled tools
  • Spellchecker
  • IE
  • Translation
Inheritance and Morphology
j a s l u
{ pers=1, mamma=wasal, num=plur }
Conclusion
• Cross-disciplinary (Ling/CLing/CS) project presents challenges.
• Training of automatic tagger has been a bottleneck.
• Stable funding/support required beyond life of project.
Valletta in Winter