Accelerating Corpus Annotation through Active Learning • Peter McClanahan, Eric Ringger, Robbie Haertel, George Busby, Marc Carmen, James Carroll, Kevin Seppi, Deryle Lonsdale (PROJECT ALFA) • March 15, 2008 – AACL
Our Situation • Operating on a fixed budget • Need morphologically annotated text • For a significant part of the project, we can only afford to annotate a subset of the text • Where should we focus manual annotation effort in order to produce a complete annotation of the highest quality?
Improving on Manual Tagging • Can we reduce the amount of human effort while maintaining high levels of accuracy? • Utilize probabilistic models for computer-aided tagging • Focus: Active Learning for Sequence Labeling
Outline • Statistical Part-of-Speech Tagging • Active Learning • Query by Uncertainty • Query by Committee • Cost Model • Experimental Results • Conclusions
Probabilistic POS Tagging • Task: predict a tag for a word in a given context • Actually: predict a probability distribution over possible tags • Example: tagging "<s> Perhaps the biggest hurdle" one word at a time • Perhaps: RB .87, RBS .05, ..., FW 6E-7 • the: DT .95, NN .003, ..., FW 1E-8 • biggest: JJS .67, JJ .17, ..., UH 3E-7 • hurdle: NN .85, VV .13, ..., CD 2E-7
Probabilistic POS Tagging • Train a model to perform precisely this task: estimate a distribution over tags for a word in a given context • Probabilistic POS tagging is a fairly easy computational task • On the Penn Treebank (Wall Street Journal text), a no-context baseline that simply chooses the most frequent tag for each word reaches 92.45% accuracy (a sketch of this baseline follows) • Much smarter approaches exist, e.g., Maximum Entropy Markov Models (MEMMs)
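To make the baseline concrete, here is a minimal sketch (in Python, not the project's code) of the no-context baseline described above: tag each word with the tag it received most often in the training data. The corpus format and the fallback tag for unseen words are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """tagged_sentences: iterable of [(word, tag), ...] lists."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # Keep only each word's single most frequent tag
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_most_frequent(lexicon, words, default_tag="NN"):
    """Tag each word with its most frequent training tag (default_tag for unseen words)."""
    return [(w, lexicon.get(w, default_tag)) for w in words]

# Toy example
train = [[("the", "DT"), ("biggest", "JJS"), ("hurdle", "NN")]]
lexicon = train_most_frequent_tag(train)
print(tag_most_frequent(lexicon, ["the", "hurdle", "unseen-word"]))
```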
Sequence Labeling • Instead of predicting one tag, we must predict an entire tag sequence • The best tag for each word individually might give an unlikely sequence • We use the Viterbi algorithm, a dynamic programming approach that finds the best sequence (see the sketch below)
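The following is a minimal sketch of the Viterbi dynamic program for a first-order tagger. The `score(prev_tag, tag, i)` interface is an assumption standing in for the trained model's log probabilities (with an MEMM, it would be log P(tag | prev_tag, features at position i)); it is not the project's actual implementation.

```python
def viterbi(words, tags, score, start_tag="<s>"):
    """Return the highest-scoring tag sequence for `words`.

    score(prev_tag, tag, i): log score of assigning `tag` at position i
    given the previous tag (assumed to come from a trained model).
    """
    if not words:
        return []
    # best[i][t] = (best log score of any tag sequence for words[:i+1]
    #               ending in tag t, backpointer to the previous tag)
    best = [{t: (score(start_tag, t, 0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p][0] + score(p, t, i))
            column[t] = (best[i - 1][prev][0] + score(prev, t, i), prev)
        best.append(column)
    # Follow backpointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(reversed(path))
```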
Features for POS Tagging • A combination of lexical, orthographic, contextual, and frequency-based information, approximately following Toutanova & Manning (2000) • For each word: the textual form of the word itself; the POS tags of the preceding two words • Additional features: all prefix and suffix substrings up to 10 characters long; whether the word contains hyphens, digits, or capitalization (a sketch of this feature set follows)
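A minimal sketch of the feature set listed above: the word form, the POS tags of the preceding two words, prefix/suffix substrings up to 10 characters, and simple orthographic checks. Feature names and the function signature are illustrative assumptions, not the project's actual code.

```python
def pos_features(words, i, prev_tag, prev2_tag, max_affix=10):
    """Extract a binary feature dictionary for the word at position i."""
    word = words[i]
    feats = {
        "word=" + word.lower(): 1,          # textual form of the word
        "tag-1=" + prev_tag: 1,             # tag of the preceding word
        "tag-2=" + prev2_tag: 1,            # tag of the word before that
        "has_hyphen": int("-" in word),
        "has_digit": int(any(ch.isdigit() for ch in word)),
        "is_capitalized": int(word[:1].isupper()),
    }
    # All prefix and suffix substrings up to max_affix characters
    for k in range(1, min(max_affix, len(word)) + 1):
        feats["prefix=" + word[:k]] = 1
        feats["suffix=" + word[-k:]] = 1
    return feats

# Example: features for "Perhaps" at the start of the sentence
print(sorted(pos_features(["Perhaps", "the", "biggest", "hurdle"], 0, "<s>", "<s>")))
```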
More Data for Less Work • The goal is to produce annotated corpora with less time and annotator effort • Active Learning: a machine learning meta-algorithm in which a model is trained with the help of an oracle • The oracle assists the annotation process: human annotator(s) in practice, simulated here by selectively querying hidden labels on pre-labeled data
Active Learning Data Sets • Annotated: data labeled by oracle • Unannotated: data not (yet) labeled by oracle • Test: labeled data used to determine accuracy
Active Learning Models • Sentence selector model: chooses the sentences to be annotated • Tagging model (MEMM): tags the sentences
[Diagram: the active learning loop] • The automatic tagger and the sentence selector model are trained on the annotated data • Together they pick the most "important" samples from the unannotated data • The automatically tagged samples go to the oracle, which corrects them • The corrected sentences are added to the annotated data, and the models are retrained • Held-out test data is used to measure accuracy • The loop then repeats on the remaining unannotated data (a minimal code sketch of this loop follows)
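The loop in the diagram can be summarized in code. This is a minimal sketch, not the project's implementation: the selection criterion shown (one minus the probability of the best tag sequence, a simple query-by-uncertainty score) and the train/tag/oracle interfaces are assumptions for illustration.

```python
def active_learning_loop(annotated, unannotated, oracle, train, tag_with_prob,
                         batch_size=100, rounds=10):
    """annotated / unannotated: lists of sentences (tagged / untagged);
    oracle(tagged_sentence) -> corrected sentence (simulates the human annotator);
    train(annotated) -> model; tag_with_prob(model, sentence) -> (tags, prob)."""
    for _ in range(rounds):
        if not unannotated:
            break
        model = train(annotated)                         # retrain on current annotations
        scored = []
        for sentence in unannotated:
            tags, prob = tag_with_prob(model, sentence)  # automatic tagger
            scored.append((1.0 - prob, sentence, tags))  # higher score = less certain
        scored.sort(key=lambda x: x[0], reverse=True)    # sentence selector
        batch = scored[:batch_size]                      # most "important" samples
        for _, sentence, tags in batch:
            corrected = oracle(list(zip(sentence, tags)))  # oracle corrects the tags
            annotated.append(corrected)
            unannotated.remove(sentence)
    return train(annotated)
```

In each round the model is retrained, accuracy can be measured on the held-out test data, and only the sentences the current model is least certain about are sent to the oracle.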