Crowdscale Shared Task Challenge 2013
Qiang Liu (UC Irvine), Jian Peng (MIT CSAIL), Alexander Ihler (UC Irvine)
Crowdsourcing
• Collect data and knowledge at large scale
• Experts: time-consuming & expensive
• Crowdsourcing: combine many non-experts
Crowdsourcing for Labeling
• Goal: estimate the true labels z_i from noisy worker labels {L_ij}
[Figure: bipartite graph connecting tasks to workers]
Baseline Methods
• Majority Voting
  • All the workers have the same performance
• Two-coin Model (Dawid & Skene, 1979)
  • Each worker characterized by a confusion matrix (axes: true answer vs. worker j's answer)
  • Learned by expectation maximization (EM)
• One-coin Model
  • Each worker characterized by a single accuracy parameter
• Other methods: GLAD [Whitehill et al. 09], belief propagation [Liu et al. 12], minimax entropy [Zhou et al. 12], …
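The one-coin model above can be sketched compactly. This is a minimal illustrative implementation, not the authors' code: each worker j has a single accuracy p_j, labels are initialized by (soft) majority voting, and EM alternates between re-estimating worker accuracies and task-label posteriors. All function and variable names are assumptions for this sketch.

```python
# One-coin EM for crowdsourced label aggregation (Dawid & Skene-style,
# simplified so each worker has a single accuracy parameter p_j).
# Illustrative sketch only; names are not from the talk's implementation.
import numpy as np

def one_coin_em(labels, n_classes, n_iters=50):
    """labels: dict mapping (task i, worker j) -> observed label in {0..K-1}."""
    tasks = sorted({i for i, _ in labels})
    workers = sorted({j for _, j in labels})
    ti = {t: a for a, t in enumerate(tasks)}
    wi = {w: a for a, w in enumerate(workers)}
    K = n_classes
    # Initialize task-label posteriors with soft majority voting.
    post = np.zeros((len(tasks), K))
    for (i, j), l in labels.items():
        post[ti[i], l] += 1
    post /= post.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # M-step: worker accuracy = expected fraction of correct answers.
        num = np.zeros(len(workers))
        den = np.zeros(len(workers))
        for (i, j), l in labels.items():
            num[wi[j]] += post[ti[i], l]
            den[wi[j]] += 1
        acc = np.clip(num / den, 1e-6, 1 - 1e-6)
        # E-step: P(z_i = k) ∝ Π_j p_j^[L_ij = k] * ((1-p_j)/(K-1))^[L_ij != k]
        logp = np.zeros((len(tasks), K))
        for (i, j), l in labels.items():
            p = acc[wi[j]]
            logp[ti[i]] += np.log((1 - p) / (K - 1))
            logp[ti[i], l] += np.log(p) - np.log((1 - p) / (K - 1))
        logp -= logp.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)
    return {t: int(post[ti[t]].argmax()) for t in tasks}, post
```

Majority voting is the special case where every worker is given the same weight; the EM refinement up-weights workers whose answers agree with the current consensus.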
In Practice …
• Model selection
  • Standard models may not work
• Special structures on the classes
• Unbalanced labels
Two Datasets
• Google Fact Judgment Dataset
  • 42,624 queries; 57 trained raters; 576 gold queries
  • Answers: {No, Yes, Skip}
• CrowdFlower Sentiment Judgment Dataset
  • 98,980 questions; 1,960 workers; 300 gold queries
  • Answers: 0 (Negative), 1 (Neutral), 2 (Positive), 3 (not related), 4 (I can't tell)
• Challenges: special classes "Skip" and "I can't tell"; ambiguity of queries
Evaluation Metric
• Averaged recall: the unweighted mean of the per-class recalls
• Open question: are the special classes "Skip" and "I can't tell" included in the evaluation?
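Averaged recall treats every class equally regardless of how many gold instances it has; a quick sketch (function name is illustrative):

```python
# Averaged recall: unweighted mean of per-class recalls, so a class with
# 9 gold instances counts as much as one with 531.
def average_recall(gold, pred, classes):
    recalls = []
    for c in classes:
        idx = [i for i, g in enumerate(gold) if g == c]
        if not idx:
            continue  # skip classes absent from the gold set
        correct = sum(1 for i in idx if pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```

This is why the metric implicitly up-weights minority classes, as the next slides discuss.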
Important Properties
• Unbalanced labels (on the gold data)
[Figure: bar charts of gold-label counts. Google Data (576 gold): counts 531, 26, 19 across {No, Yes, Skip}. CrowdFlower Data (300 gold): counts 92, 72, 70, 57, 9 across {0 (Negative), 1 (Neutral), 2 (Positive), 3 (not related), 4 (I can't tell)}; only 9 instances of "I can't tell" in the reference data.]
Evaluation Metric
• The importance of minority classes is up-weighted
  • E.g., on the Google gold data, class "Skip" is 531/26 ≈ 20 times more important than class "Yes"
• Minority classes are difficult to predict
  • E.g., only 9 "I can't tell" instances in the gold data — difficult to generalize; risk of overfitting
Google Fact Judgment Dataset
• Model selection (MV, one-/two-coin EM): majority vote is the best
  • 57 "trained" workers with high and uniform accuracies (around 0.7 and above)
  • But not good enough …
[Figure: histogram of workers' accuracies (# of workers vs. accuracy)]
Google Fact Judgment Dataset
• Our Algorithm: for each query i
  • Compute the fraction of each label submitted by the raters: c_i(yes), c_i(no), c_i(skip)
  • If c_i(yes) > 0.4: label_i = yes
  • Else if c_i(no) > 0.8: label_i = no
  • Otherwise: label_i = skip
• Return {label_i}
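The per-query threshold rule above can be sketched directly (the function name and dict representation are assumptions for illustration):

```python
# Threshold rule for one query: label by label fractions with
# class-specific thresholds (0.4 for "yes", 0.8 for "no"),
# falling back to "skip". `counts` maps each answer to its fraction.
def google_rule(counts):
    if counts.get("yes", 0.0) > 0.4:
        return "yes"
    if counts.get("no", 0.0) > 0.8:
        return "no"
    return "skip"
```

The asymmetric thresholds compensate for the unbalanced gold labels: predicting the minority classes more eagerly helps averaged recall, which up-weights them.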
CrowdFlower Sentiment Judgment Dataset
• Model selection: one-coin EM is best
[Figure: histogram of workers' accuracies (# of workers vs. accuracy)]
• Overall confusion matrix (classes 0–4):
  256  47  14  24  27
   22 280  26  35  22
   11  43 308  30   9
    9  22   6 456  14
    7  16  13   6  17
• Class 4 is rarely recovered correctly → removing class 4 may improve performance
CrowdFlower Sentiment Judgment Dataset
• Our algorithm:
  1. Remove all class-4 labels from the data; run one-coin EM to get posterior distributions μ_i over the remaining classes {0, 1, 2, 3}
  2. If c_i(4) > 0.5 or entropy(μ_i) > log(4) − 0.27, then label_i = 4 ("I can't tell"); otherwise label_i = argmax_k μ_i(k)
• Return {label_i}
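Step 2 of the algorithm can be sketched as follows. This is an illustrative reconstruction, assuming the posterior `mu` over classes {0, 1, 2, 3} comes from the one-coin EM of step 1 and that a near-uniform posterior (entropy within 0.27 nats of the log 4 maximum) signals an ambiguous query:

```python
# Step-2 decision rule for one query: output class 4 ("I can't tell")
# when most raters chose 4 or the EM posterior is nearly uniform;
# otherwise return the most probable remaining class.
import math

def crowdflower_rule(c4_fraction, mu):
    entropy = -sum(p * math.log(p) for p in mu if p > 0)  # in nats
    if c4_fraction > 0.5 or entropy > math.log(4) - 0.27:
        return 4
    return max(range(len(mu)), key=lambda k: mu[k])
```

Removing class 4 before running EM keeps the unreliable "I can't tell" votes from corrupting the worker-accuracy estimates, while the entropy test still lets the model output class 4 for genuinely ambiguous queries.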