The Welsh Natural Language Toolkit
WNLT & CYMRIE
Dr. Andreas Vlachidis, Dr. Daniel Cunliffe, Prof. Douglas Tudhope
Hypermedia Research Group
andreas.vlachidis@southwales.ac.uk
http://hypermedia.research.southwales.ac.uk/kos/wnlt/
https://sourceforge.net/projects/wnlt/
Overview
• Background
• Toolkit - WNLT
• Named Entity Pipeline - CYMRIE
• Known Limitations
• Future Directions
Background
WNLT: funded by the Welsh-language technology and digital media grant.
Aims: to develop a suite of open source software modules that enable Welsh Language computational linguistic applications.
Background
Builds on the General Architecture for Text Engineering (GATE) by adapting and expanding existing modules:
• Tokenizer
• Sentence Splitter
• Part of Speech Tagger
• Lemmatizer
• ANNIE (Gazetteer – NE Transducer)
http://www.gate.ac.uk
The Toolkit
Adapting to Welsh: a series of steps enabling processing of Welsh text that involved
• Algorithmic Arrangements
  • Adding New Classes
  • Expanding and Overriding
• Knowledge Based Input
  • Glossaries, Lexicons, Gazetteers
  • Rules and Configuration
Tokenizer
Splits text into very simple tokens such as numbers, punctuation and words (upper case, lower case, orth).
Components: Classes (Java), Rules (RegExp), Post Process (JAPE)
Adapting Tokenizer
Hyphenation
• Place names: Llanarmon-yn-Ial
• Commonly used prefix: cyd-ddefnyddir
• Separate constituents: cybydd-dod
Adapting Tokenizer
Apostrophe
• Initial vowel loss (concatenation): Dw i'n hoffi (yn)
• Medial vowel loss: i'engoed
• Final consonant loss: cryf hapusa'
Adapting Tokenizer
• Ordinals: 1af, 2il, 3ydd
• Special cases (compound prepositions)
  • Ar gyfer (for)
  • Er mwyn (for the sake of)
  • Yn erbyn (against)
  • Oddi am (off - from)
  • Oddi ar (off)
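The tokenizer adaptations listed on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the toolkit's Java/JAPE implementation: the pattern names, the `tokenize` function, and the decision to split the clitics 'n and 'r into their own tokens are all hypothetical choices made for the example.

```python
import re

# Compound prepositions from the slide, kept together as single tokens.
COMPOUND_PREPOSITIONS = ("ar gyfer", "er mwyn", "yn erbyn", "oddi am", "oddi ar")

LETTERS = r"[^\W\d_]+"  # one or more Unicode letters (covers ŵ, ŷ, â, ...)

TOKEN_RE = re.compile(
    r"(?P<compound>\b(?:" + "|".join(re.escape(p) for p in COMPOUND_PREPOSITIONS) + r")\b)"
    # Ordinal suffixes limited to the three shown on the slide.
    r"|(?P<ordinal>\b\d+(?:af|il|ydd)\b)"
    r"|(?P<number>\d+)"
    # Assumption: 'n and 'r (initial vowel loss of yn / yr) become separate tokens.
    r"|(?P<clitic>'[nr]\b)"
    # Words keep internal hyphens and apostrophes, plus one trailing apostrophe.
    # A closing quotation mark written as ' would also attach here; a real
    # tokenizer needs extra handling for that case.
    r"|(?P<word>" + LETTERS + r"(?:-" + LETTERS + r"|'(?![nr]\b)" + LETTERS + r")*(?:'(?!\w))?)"
    r"|(?P<punct>[^\w\s])",
    re.IGNORECASE,
)


def tokenize(text):
    """Return (kind, text) pairs for each token found in `text`."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]
```

Calling `tokenize("Dw i'n hoffi Llanarmon-yn-Ial")` keeps the hyphenated place name as one word token and emits `'n` as a separate clitic token, while `tokenize("hapusa'")` keeps the final apostrophe attached to the word.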
Sentence Splitter
The sentence splitter segments the text into sentences.
Components: Classes (Java), Lexicon of Abbreviations
Adapting Sentence Splitter
• Uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.
• Nearly 400 abbreviations
Adapting Sentence Splitter
• Abbreviations list
  • Linguistic: abs (absolute), cfst (synonym)
  • Narrative: Brth (British), e.e (for example)
  • Science: Seic (Psychology), Tiwt (Teutonic)
  • Spatial: Morg (Glamorgan)
  • Temporal: C.C (B.C), Mer (Wednesday)
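As a rough illustration of how an abbreviation list helps separate sentence-ending full stops from other kinds, the following Python sketch skips any full stop whose preceding word is a known abbreviation. The function name, the tiny abbreviation subset, and the sample sentence are assumptions made for the example; the toolkit itself uses GATE's Java sentence splitter with a list of nearly 400 entries.

```python
import re

# Small illustrative subset taken from the slide; the real list is far longer.
ABBREVIATIONS = {"abs", "cfst", "brth", "e.e", "seic", "tiwt", "morg", "c.c", "mer"}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on . ! ? unless the full stop closes a known abbreviation."""
    sentences = []
    start = 0
    for m in re.finditer(r"[.!?]", text):
        end = m.end()
        # A terminator must be followed by whitespace or the end of the text.
        if end < len(text) and not text[end].isspace():
            continue
        if m.group() == ".":
            words = text[start:m.start()].split()
            last_word = words[-1].lower().strip("(\"'") if words else ""
            if last_word in abbreviations:
                continue  # abbreviation full stop, not a sentence boundary
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With the made-up input `"Ganwyd ef ym Morg. yn 1950. Symudodd wedyn."` the stop after `Morg` is skipped and the text is split into two sentences. A sentence that genuinely ends in an abbreviation would be merged with the next one, which is the usual trade-off of a purely list-based approach.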
Part of Speech Tagger
• Produces a part-of-speech tag as an annotation on each word
Components: Classes (Java), Lexicon (txt), RuleSet
Part of Speech Tagger
• EURFA Dictionary
  • Eurfa is the largest Welsh dictionary under a free license
  • Contains verbal inflections
  • Does not contain mutated forms of nouns
  • 210,557 records
  • Records contain
    • Lemma
    • Part of Speech
• http://www.eurfa.org.uk/
Part of Speech Tagger
• EURFA tags: adj, adv, comp, cond, conj, dem, dim, f, fut, imper, hyp, imperf, infin, int, m, mf, neg, n, num, ord, past, pluperf, poss, pl, prep, preq, pron, quan, rel, sg, sp, subj, v
• EURFA tags are mapped to Hepple tags
• Hepple tags: CC CD DT EX FW IN JJ JJR JJS JJSS LS MD NN NNP NNPS NNS NP NPS PDT POS PP PRPR$ PRP PRP$ RB RBR RBS RP STAART SYM TO UH VBD VBG VBN VBP VB VBZ WDT WP$ WP WRB
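The slides do not show the actual mapping table, so the following Python sketch only illustrates the general idea of collapsing a set of EURFA features into a single Hepple tag. Every pairing in it (for example `adj` to `JJ`, or `n` plus `pl` to `NNS`) is a guess made for the example, as are the names `SIMPLE_MAP` and `map_tags`.

```python
# Hypothetical one-to-one pairings; NOT the toolkit's real mapping table.
SIMPLE_MAP = {
    "adj": "JJ",
    "adv": "RB",
    "conj": "CC",
    "prep": "IN",
    "num": "CD",
    "int": "UH",
    "pron": "PRP",
}


def map_tags(eurfa_tags):
    """Collapse a list of EURFA features into one Hepple tag (illustrative)."""
    tags = set(eurfa_tags)
    if "n" in tags:
        # Assumption: number decides between singular and plural noun tags.
        return "NNS" if "pl" in tags else "NN"
    if "v" in tags:
        # Assumption: only the past tense is distinguished in this sketch.
        return "VBD" if "past" in tags else "VB"
    for tag in eurfa_tags:
        if tag in SIMPLE_MAP:
            return SIMPLE_MAP[tag]
    return "NN"  # fallback chosen for the example only
```

For instance, `map_tags(["n", "m", "pl"])` yields `NNS` under these assumptions. Gender features such as `m` and `f` have no counterpart among the Hepple tags, so a mapping of this kind necessarily loses some information.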
Lemmatizer
• Considering one token at a time, it identifies its lemma. Uses a range of techniques in cascading order to address major mutations.
Components: Rules (RegExp), Postprocess, Validation, Lexicon (txt), Classes (Java), Mutation
Lemmatizer - Lexicon
• The lexicon is invoked first
Lemmatizer - Rules
• The rules address some* plural forms
*Plural formation in Welsh is challenging
Lemmatizer - Postprocess
• The postprocess uses contextual rules for finding mutation (soft, aspirate, nasal)
• Primarily addressing contact mutation
• Reverting to 'original' from a mutated form
• Reverted form is validated via Gazetteer
• Some mutation cases cannot be resolved contextually (e.g. soft mutation of G)
Lemmatizer - Validation
The reverted form is validated against Eurfa*
• If it exists in the glossary, it is valid
• Otherwise the reverted term is dropped
*Eurfa contains many nouns but not their mutated forms
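The revert-then-validate step can be sketched as follows. This is a simplified illustration, not the toolkit's code: it ignores the contextual trigger rules mentioned on the Postprocess slide, uses a four-word stand-in for Eurfa, and the names `demutate_candidates` and `revert_mutation` are hypothetical. The mutation tables themselves follow the standard Welsh soft, nasal and aspirate mutations.

```python
# Mutated initial -> possible radical initials, for each mutation type.
SOFT = {"b": ["p"], "d": ["t"], "g": ["c"], "f": ["b", "m"],
        "dd": ["d"], "l": ["ll"], "r": ["rh"]}
NASAL = {"mh": ["p"], "nh": ["t"], "ngh": ["c"],
         "m": ["b"], "n": ["d"], "ng": ["g"]}
ASPIRATE = {"ph": ["p"], "th": ["t"], "ch": ["c"]}


def demutate_candidates(word):
    """List possible unmutated forms of `word`, in table order."""
    w = word.lower()
    candidates = []
    for table in (SOFT, NASAL, ASPIRATE):
        # Try the longest mutated prefix first (e.g. "ngh" before "ng" and "n").
        for prefix in sorted(table, key=len, reverse=True):
            if w.startswith(prefix):
                candidates.extend(r + w[len(prefix):] for r in table[prefix])
                break
    # Soft mutation deletes an initial g, so any word may have lost one.
    candidates.append("g" + w)
    return candidates


def revert_mutation(word, lexicon):
    """Return the validated radical form, or None if no candidate is valid."""
    w = word.lower()
    if w in lexicon:  # the lexicon is consulted first
        return w
    for candidate in demutate_candidates(w):
        if candidate in lexicon:  # keep only forms the lexicon confirms
            return candidate
    return None  # unvalidated reverted forms are dropped


# Stand-in for Eurfa, for demonstration only.
LEXICON = {"cath", "tref", "gardd", "pen"}
```

Under these assumptions `revert_mutation("nghath", LEXICON)` returns `cath` and `revert_mutation("ardd", LEXICON)` returns `gardd`. The second case shows why validation matters: because soft mutation simply deletes the g, the only evidence for the reverted form is its presence in the lexicon.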
CYMRIE
• Information Extraction system for Welsh
• CYMRIE adapts ANNIE to Welsh
• ANNIE (A Nearly-New Information Extraction System): GATE's IE system
CYMRIE
• CYMRIE adapts ANNIE to Welsh input using a modified version of ANNIE targeted at the requirements of the Welsh language
  • A wide range of gazetteer lists
  • Named Entity Rules (NE Transducer)
• CYMRIE does not currently include a co-reference resolution module
CYMRIE - Gazetteer
• The Gazetteer contains:
  • 70 Welsh lists
    • Newly introduced
    • Result of translation
  • 70,674 unique entries (not including spelling variations)
  • 51 original ANNIE lists relating to person names, place names, company names, etc.
CYMRIE - Gazetteer
• Major type groups
  • Date
  • Government
  • Facility
  • Job Title
  • Location
  • Organisation
  • Person
  • Stop-word
  • Time
  • Title
• Minor type groups
  • Male
  • Female
  • Mountain
  • University
  • etc.
• Vocabulary resources
  • TermCymru
  • The Welsh Assembly website
  • Wikipedia
  • National Gazetteer of Wales
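A gazetteer of this kind is essentially a lookup from token sequences to a major and minor type. The Python sketch below shows a longest-match lookup over a token list; the three entries, the `lookup` function, and the tuple layout of the result are assumptions for illustration and do not reproduce CYMRIE's lists or GATE's gazetteer implementation.

```python
# Hypothetical entries: lower-cased token tuple -> (major type, minor type).
GAZETTEER = {
    ("cymru",): ("location", "country"),
    ("yr", "wyddfa"): ("location", "mountain"),
    ("prifysgol", "caerdydd"): ("organisation", "university"),
}
MAX_ENTRY_LENGTH = max(len(key) for key in GAZETTEER)


def lookup(tokens):
    """Return (start, end, major, minor) for each longest gazetteer match."""
    annotations = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shrink.
        for n in range(min(MAX_ENTRY_LENGTH, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                major, minor = GAZETTEER[key]
                annotations.append((i, i + n, major, minor))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return annotations
```

For example, `lookup(["Mae", "Prifysgol", "Caerdydd", "yng", "Nghymru"])` annotates tokens 1 to 3 as an organisation of minor type university. The nasal-mutated form `Nghymru` is not matched by the `cymru` entry, which hints at why mutation handling and the gazetteer have to work together.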
CYMRIE – NE Transducer
• Named Entity Recognition (Semantic Tagger)
• Adapting ANNIE's NE Transducer to Welsh
  • Person (gender: male, female)
  • Location (locType: region, airport, city, country, county, province)
  • Organization (orgType: company, department, government, newspaper)
  • Money / Percent
  • Date (kind: date, time, dateTime)
  • Address (kind: email, url, phone, postcode, complete, ip, other)
CYMRIE – NE Transducer
• The adapted rules addressed
  • Syntactic behaviour (adjective after noun)
    • Post Brenhinol
    • Cymdeithas Hanes Ysbyty Gogledd Cymru
  • Use of the definite article
    • Heddlu 'r Abertawe
  • Controlled vocabulary in rules
    • a, ac (and), San, Sant (saint)
  • Validation through noun phrase (proper nouns e.g. Mae)
Performance - Evaluation
• Gold Standard
  • 2221 tokens
  • 230 entities (Date, Location, Organisation, Percent, Person)
• Results*
  • Tokenizer: Recall 99%, Precision 98%, F1 99%
  • POS: Recall 82%, Precision 81%, F1 81%
  • Lemma: Recall 80%, Precision 79%, F1 80%
  • NER: Recall 89%, Precision 86%, F1 87%
*Partial matches are weighted as "half-matches" (average mode)
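The "half-match" weighting can be made concrete with a short Python sketch. It assumes the usual four counts from an annotation comparison (correct, partial, missing, spurious); the function name and the example counts are hypothetical and are not taken from the evaluation reported above.

```python
def precision_recall_f1(correct, partial, missing, spurious, partial_weight=0.5):
    """Score an annotation run, counting each partial match as half a match."""
    matched = correct + partial_weight * partial
    response_total = correct + partial + spurious  # everything the system produced
    key_total = correct + partial + missing        # everything in the gold standard
    precision = matched / response_total if response_total else 0.0
    recall = matched / key_total if key_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With made-up counts of 8 correct and 2 partial matches and nothing missing or spurious, precision and recall both come out at 0.9: the two partial matches contribute 1.0 in total rather than 2.0. Setting `partial_weight` to 1.0 or 0.0 gives the lenient and strict variants of the same measures.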
Known Limitations
• The finite nature of knowledge resources (i.e. the dictionary) vs the non-finite nature of language
• The role of contextual evidence in part of speech tagging
  • e.g. deciding when "y" is a definite article and when it is a pronoun
• Mutations beyond "simple" contact ("transitive" relationship mutations, e.g. sosban fach wen)
Known Limitations
• Evaluation
  • Currently the work is evaluated against a small Gold Standard
• The Knowledge Base input is critical
  • Enhancing Eurfa with additional terms and lemmas
• Contextual Analysis
  • Combination of training (Machine Learning) and Rules
Future Directions
• Improve performance of WNLT via:
  • Enhancing Knowledge Resources
  • Improving Rules / RegExp Files
  • Adding POS context driven rules
• Expand the scope of CYMRIE application
  • CorCenCC (Welsh Corpus) project?
  • Other entities of interest
Future Directions
• Move into Social Media Analysis
  • Sentiment Analysis
  • Twitter feeds
  • etc.
• Potential new modules
  • Co-reference
  • Noun Phraser
  • Verb Chunker
Acknowledgements
• Special thanks to
  • Welsh for Adults – Glamorgan Centre (USW)
    • Gareth Clee for helping with grammar
  • School of Welsh – Cardiff University
    • Jeremy Evas
    • Benjamin Screen for his help on evaluation and translation
The Welsh Natural Language Toolkit
WNLT & CYMRIE
Dr. Andreas Vlachidis, Dr. Daniel Cunliffe, Prof. Douglas Tudhope
Hypermedia Research Unit
andreas.vlachidis@southwales.ac.uk
http://hypermedia.research.southwales.ac.uk/kos/wnlt/
https://sourceforge.net/projects/wnlt/