1 / 75

Accelerating Corpus Annotation through Active Learning

Accelerating Corpus Annotation through Active Learning. Peter McClanahan Eric Ringger, Robbie Haertel, George Busby, Marc Carmen, James Carroll , Kevin Seppi, Deryle Lonsdale (PROJECT ALFA) ‏. March 15, 2008 – AACL. Our Situation. Operating on a fixed budget

frye
Download Presentation

Accelerating Corpus Annotation through Active Learning

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Accelerating Corpus Annotation through Active Learning Peter McClanahan Eric Ringger, Robbie Haertel, George Busby, Marc Carmen, James Carroll , Kevin Seppi, Deryle Lonsdale (PROJECT ALFA)‏ March 15, 2008 – AACL

  2. Our Situation • Operating on a fixed budget • Need morphologically annotated text • For a significant part of the project, we can only afford to annotate a subset of the text • Where to focus manual annotation efforts in order to produce a complete annotation of highest quality?

  3. Improving on Manual Tagging • Can we reduce the amount of human effort while maintaining high levels of accuracy? • Utilize probabilistic models for computer-aided tagging • Focus: Active Learning for Sequence Labeling

  4. Outline • Statistical Part-of-Speech Tagging • Active Learning • Query by Uncertainty • Query by Committee • Cost Model • Experimental Results • Conclusions

  5. Probabilistic POS Tagging <s> Perhaps the biggest hurdle

  6. Probabilistic POS Tagging <s> Perhaps the biggest hurdle ? Task: predict a tag for a word in a given context Actually: predict a probability distribution over possible tags

  7. Probabilistic POS Tagging RB .87 RBS .05 ... FW 6E-7 <s> Perhaps the biggest hurdle Task: predict a tag for a word in a given context Actually: predict a probability distribution over possible tags

  8. Probabilistic POS Tagging RB .87 RBS .05 ... FW 6E-7 <s> Perhaps the biggest hurdle ? Task: predict a tag for a word in a given context Actually: predict a probability distribution over possible tags

  9. Probabilistic POS Tagging RB .87 DT .95 RBS .05 NN .003 ... ... FW 6E-7 FW 1E-8 <s> Perhaps the biggest hurdle Task: predict a tag for a word in a given context Actually: predict a probability distribution over possible tags

  10. Probabilistic POS Tagging RB .87 DT .95 RBS .05 NN .003 ... ... FW 6E-7 FW 1E-8 <s> Perhaps the biggest hurdle ? Task: predict a tag for a word in a given context Actually: predict a probability distribution over possible tags

  11. Probabilistic POS Tagging RB .87 DT .95 JJS .67 RBS .05 NN .003 JJ .17 ... ... ... FW 6E-7 FW 1E-8 UH 3E-7 <s> Perhaps the biggest hurdle Task: predict a tag for a word in a given context Actually: predict a probability distribution over possible tags

  12. Probabilistic POS Tagging RB .87 DT .95 JJS .67 RBS .05 NN .003 JJ .17 ... ... ... FW 6E-7 FW 1E-8 UH 3E-7 <s> Perhaps the biggest hurdle Task: predict a tag for a word in a given context Actually: predict a probability distribution over possible tags

  13. Probabilistic POS Tagging RB .87 DT .95 JJS .67 NN .85 RBS .05 NN .003 JJ .17 VV .13 ... ... ... ... FW 6E-7 FW 1E-8 UH 3E-7 CD 2E-7 <s> Perhaps the biggest hurdle Task: predict a tag for a word in a given context Actually: predict a probability distribution over possible tags

  14. Probabilistic POS Tagging • Train a model to perform precisely this task: • Estimate a distribution over tags for a word in given contexts. • Probabilistic POS Tagging is a fairly easy computational task • On the Penn Treebank (Wall Street Journal text) • No context • Choosing most frequent tag for a given word • 92.45% accuracy • There are much smarter things to do • Maximum Entropy Markov Models

  15. MEMMs

  16. MEMMs

  17. Sequence Labeling • Insteadof predicting one tag, we must predict entire sequence • The best tag for each word might give an unlikely sequence • We use Viterbi algorithm—a dynamic programming approach that gives best sequence.

  18. Features for POS Tagging • Combination of lexical, orthographic, contextual, and frequency-based information. • Following Toutanova & Manning (2000) approximately • For each word: • the textual form of the word itself • the POS tags of the preceding two words • all prefix and suffix substrings up to 10 characters long • whether words contain hyphens, digits, or capitalization • Additional features • prefix and suffix substrings up to 10 characters long

  19. More Data for Less Work • The goal is to produce annotated corpora with less time and annotator effort • Active Learning • A machine learning meta-algorithm • A model is trained with the help of an oracle • Oracle helps with the annotation process • Human annotator(s)‏ • Simulated with selective querying of hidden labels on pre-labeled data

  20. Active Learning Data Sets • Annotated: data labeled by oracle • Unannotated: data not (yet) labeled by oracle • Test: labeled data used to determine accuracy

  21. Active Learning Models • Sentence Selecting Model • Model that chooses the sentences to be annotated • MEMM • Model that tags the sentences

  22. Test Data Active Learning Algorithm Unannotated Data Annotated Data

  23. Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  24. Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  25. Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  26. Most “Important” Samples Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  27. Most “Important” Samples Oracle Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  28. Most “Important” Samples Oracle Test Data Active Learning Algorithm Unannotated Data Annotated Data Corrected Sentences Automatic Tagger Sentence Selector Model

  29. Most “Important” Samples Oracle Test Data Active Learning Algorithm Unannotated Data Annotated Data Corrected Sentences Automatic Tagger Sentence Selector Model

  30. Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  31. Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  32. Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  33. Most “Important” Samples Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  34. Most “Important” Samples Most “Important” Samples Oracle Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  35. Most “Important” Samples Oracle Test Data Active Learning Algorithm Unannotated Data Annotated Data Corrected Sentences Automatic Tagger Sentence Selector Model

  36. Most “Important” Samples Oracle Test Data Active Learning Algorithm Unannotated Data Annotated Data Corrected Sentences Automatic Tagger Sentence Selector Model

  37. Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  38. Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  39. Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  40. Most “Important” Samples Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  41. Most “Important” Samples Oracle Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  42. Most “Important” Samples Oracle Test Data Active Learning Algorithm Unannotated Data Annotated Data Corrected Sentences Automatic Tagger Sentence Selector Model

  43. Most “Important” Samples Oracle Test Data Active Learning Algorithm Unannotated Data Annotated Data Corrected Sentences Automatic Tagger Sentence Selector Model

  44. Test Data Active Learning Algorithm Unannotated Data Annotated Data Automatic Tagger Sentence Selector Model

  45. Test Data Active Learning Algorithm … Unannotated Data … Annotated Data Automatic Tagger Sentence Selector Model

More Related