Maximizing Labeling Quality with Multiple Noisy Labelers

Get Another Label? Using Multiple, Noisy Labelers • Panos Ipeirotis, Stern School of Business, New York University • Joint work with Victor Sheng and Foster Provost

Motivation • Many tasks rely on high-quality labels for objects: • relevance judgments • duplicate database records • image recognition • song categorization • videos • Labeling can be relatively inexpensive, using Mechanical Turk, ESP game …

ESP Game (by Luis von Ahn)

Mechanical Turk Example “Are these two documents about the same topic?”

Mechanical Turk Example

Motivation • Labels can be used in training predictive models • Duplicate detection systems • Image recognition • Web search • But: labels obtained from the above sources are noisy. This directly affects the quality of the learned models • How can we know the quality of annotators? • How can we know the correct answer? • How can we best use noisy annotators?

Quality and Classification Performance • Labeling quality increases → classification quality increases [Figure: classification performance curves for labeling quality Q = 1.0, 0.8, 0.6, 0.5]

How to Improve Labeling Quality • Find better labelers • Often expensive, or beyond our control • Use multiple, noisy labelers: repeated-labeling • Our focus

Our Focus: Labeling using Multiple Noisy Labelers • Multiple labelers and resulting label quality • Multiple labelers and classification quality • Selective label acquisition

Majority Voting and Label Quality • Ask multiple labelers, keep majority label as “true” label • Quality is probability of majority label being correct • P is probability of individual labeler being correct [Figure: majority-label quality curves for P = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
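The quality of a majority label can be worked out directly from the binomial distribution. The sketch below assumes binary labels, independent labelers who all share the same accuracy p, and an odd number of labelers so that ties cannot occur; the function name `majority_quality` is illustrative, not from the slides.

```python
from math import comb

def majority_quality(p, k):
    """Probability that the majority of k independent labelers,
    each correct with probability p, yields the correct label."""
    if k % 2 == 0:
        raise ValueError("use an odd number of labelers to avoid ties")
    need = k // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(need, k + 1))

if __name__ == "__main__":
    for p in (0.4, 0.5, 0.6, 0.8):
        print(p, [round(majority_quality(p, k), 3) for k in (1, 3, 5, 11)])
```

Under these assumptions, quality rises with more labelers when p > 0.5, stays at 0.5 when p = 0.5, and falls when p < 0.5.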

So… • Multiple noisy labelers improve quality • (Sometimes) quality of multiple noisy labelers better than quality of best labeler in set So, should we always get multiple labels?

Tradeoffs for Classification • Get more labels → Improve label quality → Improve classification • Get more examples → Improve classification [Figure: classification performance curves for labeling quality Q = 1.0, 0.8, 0.6, 0.5]

Basic Labeling Strategies • Get as many data points as possible, one label each • Repeatedly-label everything, same number of times

Repeat-Labeling vs. Single Labeling • P = 0.6 (labeling quality), K = 5 (labels per example) • With high noise, repeated labeling is better than single labeling [Figure: repeated vs. single labeling curves]

Repeat-Labeling vs. Single Labeling • P = 0.8 (labeling quality), K = 5 (labels per example) • With low noise, more (single-labeled) examples are better [Figure: single vs. repeated labeling curves]

Estimating Labeler Quality • (Dawid, Skene 1979): “Multiple diagnoses” • Assume equal qualities • Estimate “true” labels for examples • Estimate qualities of labelers given the “true” labels • Repeat until convergence
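The iterative procedure above can be sketched as follows. This is a deliberately simplified variant, not the full Dawid and Skene model: it assumes binary labels, a uniform class prior, and a single accuracy number per labeler rather than a full confusion matrix. The function name, the starting quality of 0.7, and the clamping bounds are all assumptions made for illustration.

```python
from collections import defaultdict

def estimate_labeler_quality(votes, n_iter=20, init_quality=0.7):
    """votes: list of (example_id, labeler_id, label) with label in {0, 1}.
    Returns (posterior, quality): Pr(true label = 1) per example and
    estimated accuracy per labeler."""
    examples = {e for e, _, _ in votes}
    labelers = {l for _, l, _ in votes}
    # Start by assuming all labelers are equally good (assumed value > 0.5).
    quality = {l: init_quality for l in labelers}
    posterior = {}
    for _ in range(n_iter):
        # Step 1: estimate the "true" label of each example given the qualities.
        like1 = {e: 1.0 for e in examples}
        like0 = {e: 1.0 for e in examples}
        for e, l, y in votes:
            q = quality[l]
            like1[e] *= q if y == 1 else 1 - q
            like0[e] *= q if y == 0 else 1 - q
        posterior = {e: like1[e] / (like1[e] + like0[e]) for e in examples}
        # Step 2: estimate each labeler's quality given the soft "true" labels.
        agree = defaultdict(float)
        total = defaultdict(float)
        for e, l, y in votes:
            p1 = posterior[e]
            agree[l] += p1 if y == 1 else 1 - p1
            total[l] += 1
        # Clamp away from 0 and 1 so the likelihoods never become exactly zero.
        quality = {l: min(max(agree[l] / total[l], 0.01), 0.99) for l in labelers}
    return posterior, quality

if __name__ == "__main__":
    truth = [1, 0, 1, 0, 1, 0]
    votes = []
    for i, t in enumerate(truth):
        votes += [(i, "A", t), (i, "B", t), (i, "C", 1 - t)]  # C is always wrong
    print(estimate_labeler_quality(votes))
```

A fixed iteration count stands in for a real convergence check, and very long vote lists would call for log-likelihoods to avoid underflow.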

Selective Repeated-Labeling • We have seen: • With noise and enough (noisy) examples, getting multiple labels is better than single-labeling • Can we do better? • Use an uncertainty score to select which data points receive additional labels, e.g. {+,-,+,+,-,+,+} vs. {+,+,+,+}

Natural Candidate: Entropy • Entropy is a natural measure of label uncertainty: • E({+,+,+,+,+,+})=0 • E({+,-, +,-, +,- })=1 Strategy: Get more labels for high-entropy examples

What Not to Do: Use Entropy • Improves at first, hurts in the long run [Figure: entropy vs. round robin]

Why not Entropy • In the presence of noise, entropy will be high even with many labels • Entropy is scale invariant • (3+ , 2-) has same entropy as (600+ , 400-)
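The entropy values quoted on these slides, and the scale-invariance problem, can be checked with a few lines of code. This is a minimal sketch assuming binary labels and base-2 logarithms; `label_entropy` is an illustrative name.

```python
from math import log2

def label_entropy(pos, neg):
    """Shannon entropy (in bits) of a multiset of pos '+' and neg '-' labels."""
    total = pos + neg
    if total == 0:
        raise ValueError("need at least one label")
    h = 0.0
    for count in (pos, neg):
        if count:  # a class with zero labels contributes nothing
            frac = count / total
            h -= frac * log2(frac)
    return h

if __name__ == "__main__":
    print(label_entropy(6, 0))      # all labels agree
    print(label_entropy(3, 3))      # evenly split
    print(label_entropy(3, 2))      # five labels
    print(label_entropy(600, 400))  # same proportions, a thousand labels
```

Because entropy depends only on the proportions, the last two calls give the same value even though the second multiset is far stronger evidence about the true label.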

Estimating Label Uncertainty (LU) • Observe +’s and –’s and compute Pr{+|obs} and Pr{-|obs} • Label uncertainty = tail of beta distribution [Figure: beta probability density function over 0.0 to 1.0, with the tail S_LU and the point 0.5 marked]

Label Uncertainty • p = 0.7 • 5 labelers (3+, 2-) • Entropy ~ 0.97
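A beta-tail score of this kind can be computed without external libraries when the label counts are integers. The sketch below assumes a uniform prior, so the posterior over Pr{+} is Beta(pos + 1, neg + 1), and takes the score to be the smaller of the two tails on either side of 0.5; both choices are assumptions made here, as are the function names. It relies on the identity that, for integer parameters, the beta CDF equals a binomial tail sum.

```python
from math import comb

def beta_cdf_half(a, b):
    """CDF at 0.5 of Beta(a, b) for positive integers a and b,
    via the identity I_x(a, b) = Pr[Binomial(a + b - 1, x) >= a]."""
    n = a + b - 1
    return sum(comb(n, j) for j in range(a, n + 1)) / 2**n

def label_uncertainty(pos, neg):
    """Posterior mass on the minority side of 0.5, assuming a uniform prior."""
    cdf = beta_cdf_half(pos + 1, neg + 1)
    return min(cdf, 1 - cdf)

if __name__ == "__main__":
    print(label_uncertainty(3, 2))      # 22/64, about 0.344
    print(label_uncertainty(600, 400))  # vanishingly small
```

Unlike entropy, this score separates (3+, 2-) from (600+, 400-): the first is still quite uncertain, the second is essentially settled.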

Comparison [Figure: label uncertainty vs. uniform round robin]

Model Uncertainty (MU) • However, we do not have only labelers • A classifier can also give us labels! • Model uncertainty: get more labels for ambiguous/difficult examples • Intuitively: make sure that difficult cases are correct [Figure: scatter of + and - examples, with a few examples marked ?]

Label + Model Uncertainty • Label and model uncertainty (LMU): avoid examples where either strategy is certain
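One way to combine the two scores so that an example is skipped whenever either strategy is certain is a geometric mean, which is zero as soon as one factor is zero. The sketch below is an assumption about how such a combination could look, not a formula stated on the slides: model uncertainty is taken as the distance of the classifier's predicted Pr{+} from a confident prediction, and all function names are illustrative.

```python
import math

def model_uncertainty(prob_positive):
    """0.5 when the classifier is maximally unsure, 0.0 when it is certain.
    prob_positive is the classifier's estimated Pr{+} (assumed input)."""
    return 0.5 - abs(prob_positive - 0.5)

def label_model_uncertainty(s_lu, s_mu):
    """Geometric mean of label and model uncertainty scores:
    low whenever either score is low."""
    return math.sqrt(s_lu * s_mu)

if __name__ == "__main__":
    # The labels disagree, but the model is fairly sure about the example.
    print(label_model_uncertainty(0.34, model_uncertainty(0.95)))
    # Both the labels and the model are unsure: highest priority for relabeling.
    print(label_model_uncertainty(0.34, model_uncertainty(0.55)))
```

Examples would then be ranked by the combined score, with the highest-scoring ones receiving the next labels.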

Comparison • Model Uncertainty alone also improves quality [Figure: Label + Model Uncertainty vs. Label Uncertainty vs. uniform round robin]

Classification Improvement

Conclusions • Gathering multiple labels from noisy users is a useful strategy • Under high noise, almost always better than single-labeling • Selectively labeling using label and model uncertainty is more effective

More Work to Do • Estimating the labeling quality of each labeler • Increased compensation vs. labeler quality • Example-conditional quality issues (some examples more difficult than others) • Multiple “real” labels • Hybrid labeling strategies using “learning-curve gradient”

Other Projects • SQoUT project: Structured Querying over Unstructured Text, http://sqout.stern.nyu.edu • Faceted Interfaces • EconoMining project: The Economic Value of User Generated Content, http://economining.stern.nyu.edu

SQoUT: Structured Querying over Unstructured Text • Information extraction applications extract structured relations from unstructured text • Example text: “May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…” [Figure: Information Extraction System (e.g., NYU’s Proteus); Disease Outbreaks in The New York Times]

SQoUT: The Questions (SIGMOD’06, TODS’07, + in progress) • Text Databases → Extraction System(s) • Retrieve documents from database/web/archive • Process documents • Extract output tuples • Questions: How do we retrieve the documents? How to configure the extraction systems? What is the execution time? What is the output quality?

EconoMining Project: Show me the Money! • Basic Idea • Opinion mining is an important application of information extraction • Opinions of users are reflected in some economic variable (price, sales) • Applications (in increasing order of difficulty) • Buyer feedback and seller pricing power in online marketplaces (ACL 2007) • Product reviews and product sales (KDD 2007) • Importance of reviewers based on economic impact (ICEC 2007) • Hotel ranking based on “bang for the buck” (WebDB 2008) • Political news (MSM, blogs), prediction markets, and news importance

Some Indicative Dollar Values • Natural method for extracting sentiment strength and polarity • Captures misspellings as well • Naturally captures the pragmatic meaning within the given context • Example: “good packaging” -$0.56 [Figure: phrases placed on a Negative to Positive scale]

Thanks! Q & A?
