Welsh Natural Language Toolkit: Enhancing Computational Linguistic Applications

The Welsh Natural Language Toolkit: WNLT & CYMRIE
Dr. Andreas Vlachidis, Dr. Daniel Cunliffe, Prof. Douglas Tudhope
Hypermedia Research Group
andreas.vlachidis@southwales.ac.uk
http://hypermedia.research.southwales.ac.uk/kos/wnlt/
https://sourceforge.net/projects/wnlt/

Overview
• Background
• Toolkit - WNLT
• Named Entity Pipeline - CYMRIE
• Known Limitations
• Future Directions

Background
WNLT: funded by the Welsh-language technology and digital media grant
Aims: to develop a suite of open source software modules that enable Welsh Language computational linguistic applications

Background
Builds on the General Architecture for Text Engineering (GATE) by adapting and expanding existing modules.
• Tokenizer
• Sentence Splitter
• Part of Speech Tagger
• Lemmatizer
• ANNIE (Gazetteer – NE Transducer)
http://www.gate.ac.uk

The Toolkit
Adapting to Welsh: a series of steps enabling processing of Welsh text that involved
• Algorithmic Arrangements
• Adding New Classes
• Expanding and Overriding
• Knowledge Based Input
• Glossaries, Lexicons, Gazetteers
• Rules and Configuration

Tokenizer
Splits text into very simple tokens such as numbers, punctuation and words (upper case, lower case, orth)
Tokenizer components: Classes (Java), Rules (RegExp), Post Process (JAPE)

Adapting Tokenizer
Hyphenation
• Place names: Llanarmon-yn-Ial
• Commonly used prefix: cyd-ddefnyddir
• Separate constituents: cybydd-dod

Adapting Tokenizer
Apostrophe
• Initial vowel loss (concatenation): Dw i'n hoffi (yn)
• Medial vowel loss: i'engoed
• Final consonant loss: cryf hapusa'

Adapting Tokenizer
• Ordinals: 1af, 2il, 3ydd
• Special cases (compound prepositions)
• Ar gyfer (for)
• Er mwyn (for the sake of)
• Yn erbyn (against)
• Oddi am (off - from)
• Oddi ar (off)
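The tokenizer adaptations above can be illustrated with a small regular-expression sketch. This is not the WNLT implementation (which uses Java classes, RegExp rules and JAPE post-processing in GATE); it is a minimal Python approximation. The compound prepositions and ordinal suffixes come from the slides, while the clitic list ('n, 'r), the choice to keep every hyphenated form as one token, and all names in the code are assumptions made for illustration.

```python
import re

# Compound prepositions and ordinal suffixes are taken from the slides;
# the clitic list ('n, 'r) is an assumption made for this sketch.
COMPOUNDS = ["ar gyfer", "er mwyn", "yn erbyn", "oddi am", "oddi ar"]
LETTERS = r"[^\W\d_]+"
CLITIC = r"'(?:n|r)\b"
# A word may contain hyphens or medial apostrophes and may end in an
# apostrophe, but a following clitic such as 'n is left for the clitic rule.
WORD = (LETTERS + r"(?:(?:-|'(?!(?:n|r)\b))" + LETTERS + r")*"
        + r"(?:'(?![^\W\d_]))?")

TOKEN_RE = re.compile(
    r"(?P<compound>\b(?:" + "|".join(COMPOUNDS) + r")\b)"
    r"|(?P<ordinal>\d+(?:af|il|ydd)\b)"
    r"|(?P<clitic>" + CLITIC + r")"
    r"|(?P<word>" + WORD + r")"
    r"|(?P<number>\d+)"
    r"|(?P<punct>[^\w\s])",
    re.IGNORECASE,
)


def tokenize(text):
    """Return (kind, token) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for sample in ("Dw i'n hoffi Llanarmon-yn-Ial.", "Ar gyfer y 3ydd tro"):
        print(tokenize(sample))
```

The alternatives are ordered so that multi-word and special forms are tried before plain words, which is one simple way to make the special cases win over the general rule.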

Sentence Splitter
The sentence splitter segments the text into sentences.
Sentence Splitter components: Classes (Java), Lexicon of Abbreviations

Adapting Sentence Splitter
• Uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.
• Nearly 400 abbreviations

Adapting Sentence Splitter
Abbreviations List
• Linguistic: abs (absolute), cfst (synonym)
• Narrative: Brth (British), e.e (for example)
• Science: Seic (Psychology), Tiwt (Teutonic)
• Spatial: Morg (Glamorgan)
• Temporal: C.C (B.C), Mer (Wednesday)
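A minimal sketch of abbreviation-aware sentence splitting is given here in Python, not in the Java used by the GATE module. The abbreviation set contains only the examples listed on the slide (the real list has nearly 400 entries), and the function name and the sample sentence are assumptions made for illustration.

```python
import re

# Only the abbreviations shown on the slide; the toolkit's list is much longer.
ABBREVIATIONS = {"abs", "cfst", "brth", "e.e", "seic", "tiwt", "morg", "c.c", "mer"}

# Candidate boundaries: sentence punctuation followed by whitespace or end of text.
BOUNDARY_RE = re.compile(r"[.!?]+(?=\s|$)")


def split_sentences(text):
    """Split text at full stops that do not terminate a known abbreviation."""
    sentences, start = [], 0
    for m in BOUNDARY_RE.finditer(text):
        words = text[start:m.start()].split()
        last_word = words[-1].lower() if words else ""
        if m.group() == "." and last_word in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    print(split_sentences("Ganwyd ef ym Morg. yn 1900. Bu farw yn 1980."))
```

A lookup of the word before the full stop is the simplest possible use of such a list; it cannot tell an abbreviation that happens to end a sentence from one in the middle of it.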

Part of Speech Tagger
• Produces a part-of-speech tag as an annotation on each word
POS components: Classes (Java), Lexicon (txt), RuleSet

Part of Speech Tagger
• EURFA Dictionary: Eurfa is the largest Welsh dictionary under a free license
• Contains verbal inflections
• Does not contain mutated forms of nouns
• 210,557 records
• Records contain: Lemma, Part of Speech
• http://www.eurfa.org.uk/

Part of Speech Tagger
• EURFA Tags: adj, adv, comp, cond, conj, dem, dim, f, fut, imper, hyp, imperf, infin, int, m, mf, neg, n, num, ord, past, pluperf, poss, pl, prep, preq, pron, quan, rel, sg, sp, subj, v
• Mapping: EURFA tags are mapped to Hepple tags
• Hepple Tags: CC CD DT EX FW IN JJ JJR JJS JJSS LS MD NN NNP NNPS NNS NP NPS PDT POS PP PRPR$ PRP PRP$ RB RBR RBS RP STAART SYM TO UH VBD VBG VBN VBP VB VBZ WDT WP$ WP WRB
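The slides do not give the actual EURFA-to-Hepple correspondence, so the following Python sketch only illustrates the general idea of a rule-ordered tag mapping. Every pairing in `RULES` is a hypothetical example chosen for illustration, not the mapping used in WNLT, and the function name is likewise an assumption.

```python
# Hypothetical, illustrative pairings only: the real WNLT mapping is not
# given in the slides. Each rule is (required EURFA tags, Hepple tag);
# more specific rules are listed before more general ones.
RULES = [
    ({"n", "pl"}, "NNS"),
    ({"n"}, "NN"),
    ({"adj"}, "JJ"),
    ({"adv"}, "RB"),
    ({"v", "past"}, "VBD"),
    ({"v", "infin"}, "VB"),
    ({"prep"}, "IN"),
    ({"conj"}, "CC"),
    ({"num"}, "CD"),
]


def to_hepple(eurfa_tags):
    """Return the Hepple tag of the first rule satisfied by the record's tags."""
    tags = set(eurfa_tags)
    for required, hepple in RULES:
        if required <= tags:  # all required EURFA tags are present
            return hepple
    return None  # no rule applies


if __name__ == "__main__":
    print(to_hepple(["n", "m", "pl"]))
    print(to_hepple(["v", "past"]))
```

Ordering the rules from specific to general lets a record carrying several EURFA tags (for example part of speech plus number) resolve to a single Hepple tag.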

POS – Class and Lexicon

Lemmatizer
• Considering one token at a time, it identifies its lemma. Uses a range of techniques in a cascading order to address major mutations
Lemmatizer components: Rules (RegExp), Postprocess, Validation, Lexicon (txt), Classes (Java), Mutation

Lemmatizer - Lexicon
• The lexicon is invoked first

Lemmatizer - Rules
• The rules address some* plural forms
*Plural formation in Welsh is challenging

Lemmatizer - Postprocess
• The postprocess uses contextual rules for finding mutations (soft, aspirate, nasal)
• Primarily addresses contact mutation
• Reverts a mutated form to its 'original' form
• The reverted form is validated via the gazetteer
• Some mutation cases cannot be resolved contextually (e.g. soft mutation of G)

Lemmatizer - Validation
The reverted form is validated against Eurfa
• If it exists in the glossary, it is valid
• Otherwise the reverted term is dropped
*Eurfa contains many nouns but not their mutated forms
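The cascade described above (lexicon first, then mutation reversal, then validation) can be sketched as follows. This is a simplified Python illustration, not the WNLT code: it ignores the contextual rules, works only on word-initial letters, and skips the soft mutation of G (where the initial letter is deleted). The tiny `LEXICON` stands in for Eurfa, and all names are assumptions.

```python
# Mutated initial -> possible radical initials (standard Welsh mutations).
# Soft mutation of g (deletion of the initial) is deliberately left out.
REVERSALS = [
    ("ngh", ["c"]), ("mh", ["p"]), ("nh", ["t"]), ("ng", ["g"]),   # nasal
    ("ph", ["p"]), ("th", ["t"]), ("ch", ["c"]),                   # aspirate
    ("dd", ["d"]), ("b", ["p"]), ("d", ["t"]), ("g", ["c"]),       # soft
    ("f", ["b", "m"]), ("l", ["ll"]), ("r", ["rh"]),               # soft
    ("m", ["b"]), ("n", ["d"]),                                    # nasal
]

# Tiny stand-in for the Eurfa lexicon, for illustration only.
LEXICON = {"cath", "pen", "tad", "basged", "mam", "llyfr", "dinas"}


def candidates(word):
    """Generate every radical form the word could have been mutated from."""
    out = []
    for mutated, radicals in REVERSALS:
        if word.startswith(mutated):
            rest = word[len(mutated):]
            out.extend(radical + rest for radical in radicals)
    return out


def revert_mutation(word, lexicon=LEXICON):
    """Lexicon first; otherwise return the first reverted form found in it."""
    lower = word.lower()
    if lower in lexicon:
        return lower
    for candidate in candidates(lower):
        if candidate in lexicon:  # validation step
            return candidate
    return None  # unvalidated reversions are dropped


if __name__ == "__main__":
    for form in ("gath", "nghath", "mhen", "thad", "fam"):
        print(form, "->", revert_mutation(form))
```

Generating all candidates and letting the lexicon filter them keeps the reversal table simple; the cost is that an ambiguous form resolves to whichever valid candidate is tried first.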

CYMRIE
• Information Extraction system for Welsh
• CYMRIE adapts ANNIE to Welsh
• ANNIE (A Nearly-New Information Extraction System): GATE’s IE system

CYMRIE
• CYMRIE adapts ANNIE to Welsh input using a modified version of ANNIE targeted at the requirements of the Welsh language
• A wide range of gazetteer lists
• Named Entity Rules (NE Transducer)
• CYMRIE does not currently include a co-reference resolution module

CYMRIE - Gazetteer
• The Gazetteer contains: 70 Welsh lists
• Newly introduced
• Result of translation
• 70,674 unique entries (not including spelling variations)
• 51 original ANNIE lists relating to person names, place names, company names, etc.

CYMRIE - Gazetteer
Major type groups
• Date
• Government
• Facility
• Job Title
• Location
• Organisation
• Person
• Stop-word
• Time
• Title
Vocabulary Resources
• TermCymru
• The Welsh Assembly website
• Wikipedia
• National Gazetteer of Wales
Minor type groups
• Male
• Female
• Mountain
• University
• etc.

CYMRIE – NE Transducer
• Named Entity Recognition (Semantic Tagger)
• Adapting ANNIE's NE Transducer to Welsh
• Person (gender: male, female)
• Location (locType: region, airport, city, country, county, province)
• Organization (orgType: company, department, government, newspaper)
• Money / Percent
• Date (kind: date, time, dateTime)
• Address (kind: email, url, phone, postcode, complete, ip, other)

CYMRIE – NE Transducer
The adapted rules addressed
• Syntactic behaviour (adjective after noun): Post Brenhinol, Cymdeithas Hanes Ysbyty Gogledd Cymru
• Use of the definite article: Heddlu 'r Abertawe
• Controlled vocabulary in rules: a, ac (and), San, Sant (saint)
• Validation through noun phrase (proper nouns e.g. Mae)

CYMRIE – NE Transducer

Performance - Evaluation
Gold Standard
• 2,221 tokens
• 230 entities (Date, Location, Organisation, Percent, Person)
Results
• Tokenizer: Recall 99%, Precision 98%, F1 99%
• POS: Recall 82%, Precision 81%, F1 81%
• Lemma: Recall 80%, Precision 79%, F1 80%
• NER: Recall 89%, Precision 86%, F1 87%
*Partial matches are weighted as “half-matches” (average mode)
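The "half-match" weighting can be made concrete with a short Python sketch. It assumes the usual definition in which a partial match contributes 0.5 to the matched count in both precision and recall, with F1 computed from those two values; the slides do not spell out the formula, so this is an assumption. The counts in the example are invented for illustration and are not taken from the evaluation above.

```python
def average_mode_scores(correct, partial, missing, spurious):
    """Precision, recall and F1 with partial matches counted as half-matches."""
    matched = correct + 0.5 * partial
    response_total = correct + partial + spurious  # annotations produced
    key_total = correct + partial + missing        # annotations in the gold standard
    precision = matched / response_total if response_total else 0.0
    recall = matched / key_total if key_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Invented counts, for illustration only.
    p, r, f = average_mode_scores(correct=8, partial=2, missing=1, spurious=2)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

With these invented counts, the matched total is 8 + 0.5 × 2 = 9, giving precision 9/12 and recall 9/11.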

Known Limitations
• The finite nature of knowledge resources (i.e. the dictionary) vs the non-finite nature of language
• The role of contextual evidence in part of speech tagging, e.g. when “y” is a definite article and when it is a pronoun
• Mutations beyond “simple” contact (“transitive” relationship mutations, e.g. sosban fach wen)

Known Limitations
• Evaluation: currently the work is evaluated against a small Gold Standard
• The Knowledge Base input is critical: enhancing Eurfa with additional terms and lemmas
• Contextual Analysis: combination of training (Machine Learning) and rules

Future Directions
• Improve performance of WNLT via:
• Enhancing Knowledge Resources
• Improving Rules / RegExp Files
• Adding POS context driven rules
• Expand the scope of CYMRIE application
• CorCenCC (Welsh Corpus) project?
• Other entities of interest

Future Directions
• Move into Social Media Analysis
• Sentiment Analysis
• Twitter feeds
• etc.
• Potential new modules
• Co-reference
• Noun Phraser
• Verb Chunker

Acknowledgements
Special thanks to
• Welsh for Adults – Glamorgan Centre (USW): Gareth Clee for helping with grammar
• School of Welsh – Cardiff University: Jeremy Evas; Benjamin Screen for his help on evaluation and translation

The Welsh Natural Language Toolkit: WNLT & CYMRIE
Dr. Andreas Vlachidis, Dr. Daniel Cunliffe, Prof. Douglas Tudhope
Hypermedia Research Unit
andreas.vlachidis@southwales.ac.uk
http://hypermedia.research.southwales.ac.uk/kos/wnlt/
https://sourceforge.net/projects/wnlt/
