Using Technology Transfer to Advance Automatic Lemmatisation for Setswana
Overview
• Introduction
• Lemmatisation
  • Lemmatisation in Setswana
  • Lemmatisation in Afrikaans
• Methodology
  • Memory-based Learning
  • Architecture
  • Data
  • Implementation
• Conclusion
Introduction I
31 March 2009; Athens
• South Africa has 11 official languages
• Of these, English has the most human language technology (HLT) resources
• The situation is changing
• The SA Government is supporting initiatives to develop core linguistic resources and technologies
Introduction II
• Focus: using technology transfer for
  • Improving existing linguistic resources
  • Fast-tracking development
• Improving an existing Setswana lemmatiser by applying a method developed for Afrikaans
Lemmatisation: Overview
• Process whereby the inflected forms of a word are normalised to the lemma (base form)
  • swim, swimming, swam -> swim
• Lemmatisation is an important process for many NLP tasks
  • Information Retrieval
  • Morphological Analysis
Lemmatisation: Overview
• Not to be confused with stemming
  • Stemming: the process whereby a word is reduced to its stem by removing both inflectional and derivational morphemes
• Two popular approaches to lemmatisation
  • Rule-based approach
  • Statistical/data-driven approach
Lemmatisation: Setswana
• First automatic lemmatiser for Setswana developed by Brits (2006)
• Found that only stems (and not roots) can act independently as words
  • Stems should be accepted as lemmas
• Brits formalised rules for determining lemmas
  • Implemented as finite-state transducers
• Accuracy: 62.17% when evaluated on a dataset containing 295 randomly selected words
Lemmatisation: Afrikaans
• 2003: Ragel, with an accuracy of 67% when evaluated on a 1,000-word data set
• Disappointing accuracy motivated development of another lemmatiser using a different approach
• New lemmatiser called Lia
  • Based on a data-driven machine learning method
  • 73,000 lemma-annotated words
  • Accuracy of 92.8% on new data
• Motivated the application of machine learning methods to lemmatisation in Setswana
Methodology: Memory-based Learning
• Based on the k-NN (k-nearest neighbour) algorithm
• All instances of a certain problem correspond to points in an n-dimensional space
• Nearest neighbours are computed using some form of distance metric
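The k-NN idea on this slide can be illustrated with a minimal Python sketch. It is not the implementation used in the talk: the overlap distance (counting mismatching feature positions), the function names, and the toy instance base are all assumptions chosen for illustration.

```python
from collections import Counter

def overlap_distance(a, b):
    """Count the feature positions where two equal-length vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def knn_classify(memory, query, k=1):
    """Return the majority class among the k stored instances nearest to query."""
    # Python's sort is stable, so ties keep the order in which instances were stored.
    ranked = sorted(memory, key=lambda inst: overlap_distance(inst[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy instance base of (feature vector, class label) pairs; values are illustrative only.
memory = [
    (("a", "b", "c"), "X"),
    (("a", "b", "d"), "X"),
    (("z", "y", "c"), "Y"),
]
```

Here `knn_classify(memory, ("z", "y", "q"))` returns `"Y"`, because the third stored instance differs from the query in only one position. No model is trained: the "memory" is simply the stored training instances.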
Methodology: Architecture
Methodology: Data
• Memory-based learning (MBL) requires large amounts of data
• Only 2,947 lemma-annotated Setswana words available (Brits’s evaluation set)
• 2,947 words is a very small data set in memory-based learning terms
Methodology: Data
• MBL requires that lemmatisation be performed as a classification task
• Data should consist of feature vectors with assigned class labels
  • Feature vectors: letters of the word
  • Class label: transformation from word to lemma
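One way to turn "letters of the word" into fixed-length feature vectors is sketched below. The slides do not specify the layout, so the right alignment, the vector width of 12, and the `=` padding symbol are assumptions, and `word_to_features` is a hypothetical helper name.

```python
def word_to_features(word, width=12, pad="="):
    """Right-align the letters of `word` in a fixed-width tuple, padding on the left.

    Words longer than `width` keep only their last `width` letters.
    """
    letters = list(word[-width:])
    return tuple([pad] * (width - len(letters)) + letters)
```

For example, `word_to_features("swam", width=6)` gives `("=", "=", "s", "w", "a", "m")`. Right alignment keeps word endings in the same feature positions across words, which suits suffix changes; a language with heavy prefixing might call for a different layout.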
Methodology: Data
• Deriving class labels
  • Based on the longest common substring of the word form and its lemma
  • The class indicates the string that needs to be removed, as well as possible replacement strings, during the transformation from word form to lemma
  • Positions of the character strings that need to be removed are indicated as L (left) or R (right)
  • If the word form and lemma are identical, the awarded class is “0”
Methodology: Data
Deriving classes
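The class derivation described above can be sketched in Python with the standard library's `difflib`. The exact label syntax is not shown in the slides, so the format used here is hypothetical: `L` or `R` followed by the removed string, `>` before any replacement string, and `+` joining a left and a right part.

```python
from difflib import SequenceMatcher

def derive_class(word, lemma):
    """Derive a transformation label from a (word form, lemma) pair."""
    if word == lemma:
        return "0"  # identical word form and lemma
    matcher = SequenceMatcher(None, word, lemma, autojunk=False)
    m = matcher.find_longest_match(0, len(word), 0, len(lemma))
    parts = []
    # Compare what precedes (L) and what follows (R) the longest common substring.
    for side, removed, added in (
        ("L", word[:m.a], lemma[:m.b]),
        ("R", word[m.a + m.size:], lemma[m.b + m.size:]),
    ):
        if removed or added:
            parts.append(f"{side}{removed}>{added}" if added else f"{side}{removed}")
    return "+".join(parts)
```

With the examples from the lemmatisation overview, `derive_class("swimming", "swim")` yields `"Rming"`, `derive_class("swam", "swim")` yields `"Ram>im"`, and `derive_class("swim", "swim")` yields `"0"`.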
Methodology: Implementation
• Data
  • 90% for training
  • 10% for evaluation
• First version (default algorithmic parameters)
  • 46.25% accuracy
• Parameter optimisation
  • 58.98% accuracy
• Accuracy is still below that of Brits’s rule-based version (62.17%)
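The 90/10 split and the accuracy figure can be sketched as follows. How the data was actually partitioned is not stated in the slides; the seeded random shuffle and the helper names `split_90_10` and `accuracy` are assumptions.

```python
import random

def split_90_10(items, seed=0):
    """Shuffle a copy of `items` and split it into 90% training and 10% evaluation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * 0.9)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, gold):
    """Percentage of predicted lemmas that exactly match the gold lemmas."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)
```

Applied to 2,947 items, `split_90_10` returns 2,652 training items and 295 evaluation items, which matches the data-set sizes quoted elsewhere in the talk.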
Methodology: Implementation
Error analysis indicated obvious mistakes
Methodology: Implementation
Solution: add class distributions to the output and implement a “back-off” mechanism
Resulted in a further increase in accuracy: 64.06%
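The slides do not spell out how the back-off works. One plausible reading, sketched below purely as an assumption, is that an "obvious mistake" is a predicted class whose removal string does not occur at the indicated edge of the word, in which case the next most probable class in the distribution is tried. The label format (`0`, `L…`, `R…`, optional `>replacement`) and both function names are hypothetical.

```python
def apply_class(word, label):
    """Apply a transformation label to `word`; return None if it cannot apply."""
    if label == "0":
        return word
    side, body = label[0], label[1:]
    removed, _, added = body.partition(">")
    if side == "L" and word.startswith(removed):
        return added + word[len(removed):]
    if side == "R" and word.endswith(removed):
        return word[:len(word) - len(removed)] + added
    return None  # the string to remove is not present at that edge

def lemmatise_with_backoff(word, distribution):
    """Try classes from most to least probable; fall back to the word itself."""
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    for label, _ in ranked:
        lemma = apply_class(word, label)
        if lemma is not None:
            return lemma
    return word
```

For instance, `lemmatise_with_backoff("swimming", {"Red": 0.6, "Rming": 0.3, "0": 0.1})` returns `"swim"`: the top class `Red` cannot apply because the word does not end in "ed", so the sketch backs off to `Rming`.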
Conclusion
• The machine learning-based lemmatiser is only 1.9 percentage points more accurate than the rule-based version (64.06% vs 62.17%)
• Small in comparison to the increase of roughly 25 percentage points obtained for Afrikaans
• Size of the training data
  • 2,652 words compared to 73,000 for Afrikaans
• Increasing the amount of training data should increase the accuracy
• Most important result: technology transfer
Acknowledgements
The work of Jeanetta H. Brits, performed under the supervision of Rigardt Pretorius and Gerhard B. van Huyssteen