Using Machine Learning to Annotate Data for NLP Tasks Semi-Automatically
Overview
• Introduction
• End-User Requirements
• Solution: Design & Implementation
• Evaluation
• Conclusion
Human Language Technologies
• HLTs depend on the availability of linguistic data
  • Specialized lexicons
  • Annotated and raw corpora
  • Formalized grammar rules
• Creation of such resources
  • Expensive and protracted
  • Especially for less-resourced languages
Less-resourced Languages
• "languages for which few digital resources exist; and thus, languages whose computerization poses unique challenges. [They] are languages with limited financial, political, and legal resources…" (Garrett, 2006)
• Implicit in this definition:
  • Lack of human resources (little attention in research or discussions)
  • Lack of computational linguists working on these languages
• Research question:
  • How could one facilitate the development of linguistic data by enabling non-experts to collaborate in the computerization of less-resourced languages?
Methodology I
• Empowering linguists and mother-tongue speakers to deliver annotated data
  • High quality
  • Shortest possible time
• Escalate the annotation of linguistic data by mother-tongue speakers
  • User-friendly environments
  • Bootstrapping
  • Machine learning instead of rule-based techniques
Methodology II
• The general idea:
  • Development of gold standards
  • Development of annotated data
  • Bootstrapping
• With the click of a button:
  • Annotate data
  • Train machine-learning algorithm
Central Point of Departure I
• Annotators are invaluable resources
• Based on experiences with less-resourced languages
• Annotators have mostly word processing skills
  • Used to a GUI-based environment
  • Usually limited skills in a computational or programming environment
• In the worst cases, annotators have difficulties with
  • File management
  • Unzipping
  • Proper encoding of text files
Central Point of Departure II
• Aim of this project: enabling annotators to focus on what they are good at, namely enriching data with expert linguistic knowledge
• Training the machine learner occurs automatically
End-user Requirements I
• Unstructured interviews with four annotators
  1. What do you find unpleasant about your work as an annotator?
  2. What will make your life as an annotator easier?
End-user Requirements II
1. What do you find unpleasant about your work as an annotator?
  • Repetitiveness
  • Lack of concentration/motivation
  • Feeling “useless”
  • Do not see results
End-user Requirements III
2. What will make your life as an annotator easier?
  • Friendly environment (i.e. GUI-based, and not lists of words)
  • Bite-sized chunks of data rather than endless lists
  • Correcting data rather than annotating from scratch
    • Program should already suggest a possible annotation
    • Click or drag
  • Reference works need to be available
  • Automatic data management
Solution: TurboAnnotate
• User-friendly annotating environment
• Bootstrapping with machine learning
• Creating gold standards/annotated lists
• Inspired by DictionaryMaker (Davel and Peche, 2006) and Alchemist (University of Chicago, 2004)
[Figure: DictionaryMaker]
[Figure: Alchemist]
[Figure: Simplified Workflow of TurboAnnotate, steps 1 to 3]
Step 1: Create Gold Standard
• Create gold standard
  • Independent test set for evaluating performance
  • 1000 random instances used
  • Annotator only has to select one data file
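The sampling in Step 1 can be illustrated with a minimal Python sketch. It is not TurboAnnotate's own code (the tool is written in Perl); the function name `split_gold_standard`, the fixed seed, and the choice to deduplicate the word list first are all assumptions made for illustration.

```python
import random

def split_gold_standard(words, gold_size=1000, seed=0):
    """Draw a random held-out gold standard and return (gold, remaining).

    Assumption: the data file is a flat list of words, and duplicates are
    removed before sampling so that no word ends up in both sets.
    """
    unique = sorted(set(words))
    if len(unique) < gold_size:
        raise ValueError("not enough distinct words for the gold standard")
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    gold = rng.sample(unique, gold_size)
    gold_set = set(gold)
    remaining = [w for w in unique if w not in gold_set]
    return gold, remaining

# Toy data standing in for the single data file the annotator selects.
words = [f"w{i}" for i in range(5000)]
gold, remaining = split_gold_standard(words)
print(len(gold), len(remaining))
```

Keeping the gold standard disjoint from the remaining base list is what makes it usable as an independent test set later on.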
Step 2: Verify Annotations
• New data sourced from base list
• Automatically annotated by classifier
• Presented to annotator in the "Annotate" tab
[Figure: TurboAnnotate: Annotation Environment]
Step 3: Verify Annotated Set
• Bootstrapping (inspired by DictionaryMaker)
• 200 words per chunk, trained in the background
• Annotator verifies
  • Click “accept” or correct the instance
• Verified data serve as training data
• Iterative process until the desired results are reached
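The bootstrapping cycle described in Steps 2 and 3 can be sketched roughly as follows. This is a hedged illustration, not the tool's implementation: `train_memory_model` is a toy stand-in for the real learner, and `verify` and `good_enough` are hypothetical callbacks representing the human annotator and the check against the gold standard.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]

def train_memory_model(examples: List[Tuple[str, str]]) -> Model:
    """Toy stand-in for the learner: memorise verified words, echo unseen ones."""
    memory: Dict[str, str] = dict(examples)
    return lambda word: memory.get(word, word)

def bootstrap(base_list: List[str],
              verify: Callable[[str, str], str],
              chunk_size: int = 200,
              good_enough: Callable[[Model], bool] = lambda model: False):
    """Annotate the base list chunk by chunk, retraining after every chunk."""
    verified: List[Tuple[str, str]] = []
    model = train_memory_model(verified)
    for start in range(0, len(base_list), chunk_size):
        chunk = base_list[start:start + chunk_size]
        # The classifier proposes an annotation for every word in the chunk.
        suggestions = [(word, model(word)) for word in chunk]
        # The annotator accepts or corrects each suggestion.
        verified.extend((word, verify(word, s)) for word, s in suggestions)
        # Verified data become training data for the next round.
        model = train_memory_model(verified)
        if good_enough(model):
            break
    return verified, model

# Simulated annotator who always knows the right answer (hypothetical data).
oracle = {f"w{i}": f"w-{i}" for i in range(450)}
verified, model = bootstrap(list(oracle), lambda word, suggestion: oracle[word])
print(len(verified), model("w7"))
```

With a learner that generalises, the suggestions improve with every retraining round, so the annotator increasingly only has to click "accept".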
The Machine Learning System I
• Tilburg Memory-Based Learner (TiMBL)
  • Wide success and applicability in the field of natural language processing
  • Available for research purposes
  • Relatively easy to use
• On the downside
  • Performs best with large quantities of data
• For the tasks of hyphenation and compound analysis, TiMBL performs well with small quantities of data
The Machine Learning System II
• Default parameter settings used
• Task-specific feature selection
• Performance is evaluated against the gold standard
• For hyphenation and compound analysis, accuracy is determined at word level and not per instance
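The difference between word-level and per-instance accuracy can be made concrete with a small sketch. It assumes each word is represented as a sequence of per-letter class labels and that a word counts as correct only if every label matches; the label symbols and function names are illustrative.

```python
def instance_accuracy(gold, predicted):
    """Percentage of individual instances (letter positions) labelled correctly."""
    pairs = [(g, p)
             for gold_word, pred_word in zip(gold, predicted)
             for g, p in zip(gold_word, pred_word)]
    return 100.0 * sum(g == p for g, p in pairs) / len(pairs)

def word_accuracy(gold, predicted):
    """Percentage of words whose complete label sequence is correct."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)

# Two four-letter words; the second prediction misses a single break.
gold = [list("==-="), list("=-==")]
predicted = [list("==-="), list("====")]

print(instance_accuracy(gold, predicted))  # 87.5
print(word_accuracy(gold, predicted))      # 50.0
```

A single wrong instance makes the whole word wrong, so word-level accuracy is the stricter measure and reflects what an end user of a hyphenator or compound analyser would notice.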
Features I
• All input words are converted to feature vectors
  • Splitting window
  • Context: 3 positions (left and right)
• Class
  • Hyphenation: indicating a break
  • Compound analysis: 3 possible classes
    • + indicating word boundary
    • _ indicating valence morpheme
    • = no break
Features II
• Example: eksamenlokaal (‘examination room’)
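A minimal sketch of the windowing idea, using the example word with the compound-analysis classes from the previous slide. Several details are assumptions, since the slides do not spell them out: the `#` padding symbol, the decision to attach a boundary class to the letter that precedes it, and the analysis `eksamen+lokaal` used as input.

```python
def windowed_instances(annotated, context=3, pad="#", no_break="="):
    """Turn an annotated word into (features, class) pairs, one per letter.

    Features are the focus letter plus `context` letters on either side.
    Assumption: a marker ('+' or '_') becomes the class of the letter before it.
    """
    letters, classes = [], []
    for ch in annotated:
        if ch in "+_":
            if classes:
                classes[-1] = ch
        else:
            letters.append(ch)
            classes.append(no_break)
    padded = [pad] * context + letters + [pad] * context
    width = 2 * context + 1
    return [(padded[i:i + width], classes[i]) for i in range(len(letters))]

instances = windowed_instances("eksamen+lokaal")
for features, label in instances:
    print(" ".join(features), label)
```

Each printed row is one training instance: seven letter features followed by the class, with only the window centred on the final `n` of `eksamen` carrying the `+` class.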
Parameter Optimisation I
• Large variations in accuracy occur when parameter settings of MBL algorithms are changed
• Finding the best combination of parameters
  • Exhaustive searches undesirable
  • Slow and computationally expensive
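To give a feel for why exhaustive search becomes expensive, the sketch below counts the combinations in a small parameter grid. The grid is purely hypothetical: the parameter names and values are illustrative labels for the kinds of choices a memory-based learner exposes, not actual TiMBL flags, and the 30 seconds per run is an invented figure.

```python
from itertools import product
from math import prod

# Hypothetical grid; names and values are placeholders, not real TiMBL options.
grid = {
    "k": [1, 3, 5, 7, 9],
    "distance_metric": ["overlap", "mvdm", "jeffrey"],
    "feature_weighting": ["none", "gain_ratio", "info_gain", "chi_squared"],
    "class_voting": ["majority", "inverse_distance", "inverse_linear", "exponential_decay"],
}

# Every extra parameter multiplies the number of train-and-test runs.
n_combinations = prod(len(values) for values in grid.values())
all_settings = list(product(*grid.values()))

def exhaustive_hours(seconds_per_run):
    """Total time for one train-and-test run per combination."""
    return n_combinations * seconds_per_run / 3600

print(n_combinations)        # 240
print(exhaustive_hours(30))  # 2.0
```

Even this modest grid needs 240 runs, and the count grows multiplicatively with each added parameter or value, which is the motivation for a search heuristic.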
Parameter Optimisation II
• Alternative: Paramsearch (Van den Bosch, 2005)
  • Delivers combinations of algorithmic parameters that are estimated to perform well
• PSearch
  • Our own modification of Paramsearch
  • Only run after all data has been annotated
  • Ensures the best possible classifier
Criteria
• Two criteria
  • Accuracy
  • Human effort (time)
• Evaluated on the tasks of hyphenation and compound analysis for Afrikaans and Setswana
• Four human annotators
  • Two experienced in annotating
  • Two considered novices in the field
Accuracy
• Two kinds of accuracy
  • Classifier accuracy
  • Human accuracy
• Expressed as percentage of correctly annotated words over total number of words
• Gold standard excluded as training data
[Figure: Classifier Accuracy (Hyphenation)]
Human Accuracy
• Two separate unseen datasets of 200 words for each language
  • The first dataset was annotated in an ordinary text editor
  • The second dataset was annotated with TurboAnnotate
[Figure: Human Accuracy]
Human Effort I
• Two questions
  • Is it faster to annotate with TurboAnnotate?
  • What would the predicted saving on human effort be on a large dataset?
[Figure: Human Effort II]
Human Effort III
• 1 minute faster to annotate 200 words with TurboAnnotate
• Larger dataset (40,000 words)
  • Difference of only circa 3.5 uninterrupted human hours
• This picture changes when the effect of bootstrapping is considered
  • Extrapolating to 42,967 words
    • Saving of 51 hours (68%) for hyphenation
    • Saving of 9 hours (41%) for compound analysis
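The arithmetic behind these figures can be checked with a short sketch. It assumes the gain from the interface alone is exactly one minute per 200-word chunk and that the quoted percentages are relative to the annotation time without bootstrapping; neither assumption is stated explicitly on the slide, and the underlying timings are not reproduced here.

```python
WORDS = 40_000
CHUNK_SIZE = 200
MINUTES_SAVED_PER_CHUNK = 1

# Interface-only saving: one minute for each 200-word chunk.
chunks = WORDS // CHUNK_SIZE
interface_saving_hours = chunks * MINUTES_SAVED_PER_CHUNK / 60

def implied_baseline_hours(saved_hours, saved_fraction):
    """Total hours without bootstrapping implied by a saving and its percentage."""
    return saved_hours / saved_fraction

print(chunks)                                     # 200
print(round(interface_saving_hours, 1))           # 3.3
print(round(implied_baseline_hours(51, 0.68)))    # 75 (hyphenation)
print(round(implied_baseline_hours(9, 0.41)))     # 22 (compound analysis)
```

Under these assumptions the interface alone saves a little over three hours on 40,000 words, in line with the circa 3.5 hours quoted, whereas the bootstrapping figures imply baselines of roughly 75 and 22 hours, so most of the gain comes from the classifier's suggestions.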
Conclusion
• TurboAnnotate helps to increase the accuracy of human annotators
• Saves human effort
Future Work
• Other lexical annotation tasks
  • Creating lexicons for spelling checkers
  • Creating data for morphological analysis
  • Stemming
  • Lemmatization
• Improve GUI
• Network solution
• Active learning
• Experiment with C5.0
Obtaining TurboAnnotate
• Requirements:
  • Linux
  • Perl 5.8
  • Gtk+ 2.10
  • TiMBL 5.1
• Open-source
• Available at http://www.nwu.ac.za/ctext
Acknowledgements
• This work was supported by a grant from the South African National Research Foundation (GUN: FA2004042900059).
• We also acknowledge the inputs and contributions of
  • Ansu Berg
  • Pieter Nortjé
  • Rigardt Pretorius
  • Martin Schlemmer
  • Wikus Slabbert