Accelerating Corpus Annotation through Active Learning • Peter McClanahan, Eric Ringger, Robbie Haertel, George Busby, Marc Carmen, James Carroll, Kevin Seppi, Deryle Lonsdale (PROJECT ALFA) • March 15, 2008 – AACL
Our Situation • Operating on a fixed budget • Need morphologically annotated text • For a significant part of the project, we can only afford to annotate a subset of the text • Where should we focus manual annotation effort in order to produce a complete annotation of the highest quality?
Improving on Manual Tagging • Can we reduce the amount of human effort while maintaining high levels of accuracy? • Utilize probabilistic models for computer-aided tagging • Focus: Active Learning for Sequence Labeling
Outline • Statistical Part-of-Speech Tagging • Active Learning • Query by Uncertainty • Query by Committee • Cost Model • Experimental Results • Conclusions
Probabilistic POS Tagging • Task: predict a tag for a word in a given context • Actually: predict a probability distribution over possible tags • Example: tagging "<s> Perhaps the biggest hurdle" one word at a time • Perhaps: RB .87, RBS .05, ..., FW 6E-7 • the: DT .95, NN .003, ..., FW 1E-8 • biggest: JJS .67, JJ .17, ..., UH 3E-7 • hurdle: NN .85, VV .13, ..., CD 2E-7
Probabilistic POS Tagging • Train a model to perform precisely this task: estimate a distribution over tags for a word in a given context • Probabilistic POS tagging is a fairly easy computational task • On the Penn Treebank (Wall Street Journal text), a no-context baseline that simply chooses the most frequent tag for each word reaches 92.45% accuracy (a sketch of this baseline follows) • Much smarter approaches exist, e.g., Maximum Entropy Markov Models (MEMMs)
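To make the baseline concrete, here is a minimal sketch (in Python, not the project's code) of the no-context baseline described above: tag each word with the tag it received most often in the training data. The corpus format and the fallback tag for unseen words are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """tagged_sentences: iterable of [(word, tag), ...] lists."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # Keep only each word's single most frequent tag
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_most_frequent(lexicon, words, default_tag="NN"):
    """Tag each word with its most frequent training tag (default_tag for unseen words)."""
    return [(w, lexicon.get(w, default_tag)) for w in words]

# Toy example
train = [[("the", "DT"), ("biggest", "JJS"), ("hurdle", "NN")]]
lexicon = train_most_frequent_tag(train)
print(tag_most_frequent(lexicon, ["the", "hurdle", "unseen-word"]))
```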
Sequence Labeling • Instead of predicting one tag, we must predict an entire tag sequence • The best tag for each word individually might give an unlikely sequence • We use the Viterbi algorithm, a dynamic programming approach that finds the best sequence (see the sketch below)
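The following is a minimal sketch of the Viterbi dynamic program for a first-order tagger. The `score(prev_tag, tag, i)` interface is an assumption standing in for the trained model's log probabilities (with an MEMM, it would be log P(tag | prev_tag, features at position i)); it is not the project's actual implementation.

```python
def viterbi(words, tags, score, start_tag="<s>"):
    """Return the highest-scoring tag sequence for `words`.

    score(prev_tag, tag, i): log score of assigning `tag` at position i
    given the previous tag (assumed to come from a trained model).
    """
    if not words:
        return []
    # best[i][t] = (best log score of any tag sequence for words[:i+1]
    #               ending in tag t, backpointer to the previous tag)
    best = [{t: (score(start_tag, t, 0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p][0] + score(p, t, i))
            column[t] = (best[i - 1][prev][0] + score(prev, t, i), prev)
        best.append(column)
    # Follow backpointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(reversed(path))
```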
Features for POS Tagging • A combination of lexical, orthographic, contextual, and frequency-based information, approximately following Toutanova & Manning (2000) • For each word: the textual form of the word itself; the POS tags of the preceding two words • Additional features: all prefix and suffix substrings up to 10 characters long; whether the word contains hyphens, digits, or capitalization (a sketch of this feature set follows)
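A minimal sketch of the feature set listed above: the word form, the POS tags of the preceding two words, prefix/suffix substrings up to 10 characters, and simple orthographic checks. Feature names and the function signature are illustrative assumptions, not the project's actual code.

```python
def pos_features(words, i, prev_tag, prev2_tag, max_affix=10):
    """Extract a binary feature dictionary for the word at position i."""
    word = words[i]
    feats = {
        "word=" + word.lower(): 1,          # textual form of the word
        "tag-1=" + prev_tag: 1,             # tag of the preceding word
        "tag-2=" + prev2_tag: 1,            # tag of the word before that
        "has_hyphen": int("-" in word),
        "has_digit": int(any(ch.isdigit() for ch in word)),
        "is_capitalized": int(word[:1].isupper()),
    }
    # All prefix and suffix substrings up to max_affix characters
    for k in range(1, min(max_affix, len(word)) + 1):
        feats["prefix=" + word[:k]] = 1
        feats["suffix=" + word[-k:]] = 1
    return feats

# Example: features for "Perhaps" at the start of the sentence
print(sorted(pos_features(["Perhaps", "the", "biggest", "hurdle"], 0, "<s>", "<s>")))
```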
More Data for Less Work • The goal is to produce annotated corpora with less time and annotator effort • Active Learning: a machine learning meta-algorithm in which a model is trained with the help of an oracle • The oracle assists the annotation process: human annotator(s) in practice, simulated here by selectively querying hidden labels on pre-labeled data
Active Learning Data Sets • Annotated: data labeled by oracle • Unannotated: data not (yet) labeled by oracle • Test: labeled data used to determine accuracy
Active Learning Models • Sentence selector model: chooses the sentences to be annotated • Tagging model (MEMM): tags the sentences
[Diagram: the active learning loop] • The automatic tagger and the sentence selector model are trained on the annotated data • Together they pick the most "important" samples from the unannotated data • The automatically tagged samples go to the oracle, which corrects them • The corrected sentences are added to the annotated data, and the models are retrained • Held-out test data is used to measure accuracy • The loop then repeats on the remaining unannotated data (a minimal code sketch of this loop follows)
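The loop in the diagram can be summarized in code. This is a minimal sketch, not the project's implementation: the selection criterion shown (one minus the probability of the best tag sequence, a simple query-by-uncertainty score) and the train/tag/oracle interfaces are assumptions for illustration.

```python
def active_learning_loop(annotated, unannotated, oracle, train, tag_with_prob,
                         batch_size=100, rounds=10):
    """annotated / unannotated: lists of sentences (tagged / untagged);
    oracle(tagged_sentence) -> corrected sentence (simulates the human annotator);
    train(annotated) -> model; tag_with_prob(model, sentence) -> (tags, prob)."""
    for _ in range(rounds):
        if not unannotated:
            break
        model = train(annotated)                         # retrain on current annotations
        scored = []
        for sentence in unannotated:
            tags, prob = tag_with_prob(model, sentence)  # automatic tagger
            scored.append((1.0 - prob, sentence, tags))  # higher score = less certain
        scored.sort(key=lambda x: x[0], reverse=True)    # sentence selector
        batch = scored[:batch_size]                      # most "important" samples
        for _, sentence, tags in batch:
            corrected = oracle(list(zip(sentence, tags)))  # oracle corrects the tags
            annotated.append(corrected)
            unannotated.remove(sentence)
    return train(annotated)
```

In each round the model is retrained, accuracy can be measured on the held-out test data, and only the sentences the current model is least certain about are sent to the oracle.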