Crowdscale Shared Task Challenge 2013
Qiang Liu (UC Irvine), Jian Peng (MIT CSAIL), Alexander Ihler (UC Irvine)
Crowdsourcing
• Collect data and knowledge at large scale
• Experts: time-consuming & expensive
• Crowdsourcing: combine many non-experts
Crowdsourcing for Labeling
• Goal: estimate the true labels z_i from noisy worker labels {L_ij}
[Figure: bipartite graph connecting tasks to workers]
Baseline Methods
• Majority Voting
  • All the workers have the same performance
• Two-coin Model (Dawid & Skene, 1979)
  • Each worker characterized by a confusion matrix (axes: true answer vs. worker j's answer)
  • Learned by expectation maximization (EM)
• One-coin Model
  • Each worker characterized by a single accuracy parameter
• Other methods: GLAD [Whitehill et al. 09], belief propagation [Liu et al. 12], minimax entropy [Zhou et al. 12], …
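The one-coin model above can be sketched compactly. This is a minimal illustrative implementation, not the authors' code: each worker j has a single accuracy p_j, labels are initialized by (soft) majority voting, and EM alternates between re-estimating worker accuracies and task-label posteriors. All function and variable names are assumptions for this sketch.

```python
# One-coin EM for crowdsourced label aggregation (Dawid & Skene-style,
# simplified so each worker has a single accuracy parameter p_j).
# Illustrative sketch only; names are not from the talk's implementation.
import numpy as np

def one_coin_em(labels, n_classes, n_iters=50):
    """labels: dict mapping (task i, worker j) -> observed label in {0..K-1}."""
    tasks = sorted({i for i, _ in labels})
    workers = sorted({j for _, j in labels})
    ti = {t: a for a, t in enumerate(tasks)}
    wi = {w: a for a, w in enumerate(workers)}
    K = n_classes
    # Initialize task-label posteriors with soft majority voting.
    post = np.zeros((len(tasks), K))
    for (i, j), l in labels.items():
        post[ti[i], l] += 1
    post /= post.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # M-step: worker accuracy = expected fraction of correct answers.
        num = np.zeros(len(workers))
        den = np.zeros(len(workers))
        for (i, j), l in labels.items():
            num[wi[j]] += post[ti[i], l]
            den[wi[j]] += 1
        acc = np.clip(num / den, 1e-6, 1 - 1e-6)
        # E-step: P(z_i = k) ∝ Π_j p_j^[L_ij = k] * ((1-p_j)/(K-1))^[L_ij != k]
        logp = np.zeros((len(tasks), K))
        for (i, j), l in labels.items():
            p = acc[wi[j]]
            logp[ti[i]] += np.log((1 - p) / (K - 1))
            logp[ti[i], l] += np.log(p) - np.log((1 - p) / (K - 1))
        logp -= logp.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)
    return {t: int(post[ti[t]].argmax()) for t in tasks}, post
```

Majority voting is the special case where every worker is given the same weight; the EM refinement up-weights workers whose answers agree with the current consensus.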
In Practice …
• Model selection
  • Standard models may not work
• Special structures on the classes
• Unbalanced labels
Two Datasets
• Google Fact Judgment Dataset
  • 42,624 queries; 57 trained raters; 576 gold queries
  • Answers: {No, Yes, Skip}
• CrowdFlower Sentiment Judgment Dataset
  • 98,980 questions; 1,960 workers; 300 gold queries
  • Answers: 0 (Negative), 1 (Neutral), 2 (Positive), 3 (not related), 4 (I can't tell)
• Challenges: special classes "Skip" and "I can't tell"; ambiguity of queries
Evaluation Metric
• Averaged recall: the unweighted mean of the per-class recalls
• Open question: are the special classes "Skip" and "I can't tell" included in the evaluation?
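Averaged recall treats every class equally regardless of how many gold instances it has; a quick sketch (function name is illustrative):

```python
# Averaged recall: unweighted mean of per-class recalls, so a class with
# 9 gold instances counts as much as one with 531.
def average_recall(gold, pred, classes):
    recalls = []
    for c in classes:
        idx = [i for i, g in enumerate(gold) if g == c]
        if not idx:
            continue  # skip classes absent from the gold set
        correct = sum(1 for i in idx if pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```

This is why the metric implicitly up-weights minority classes, as the next slides discuss.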
Important Properties
• Unbalanced labels (on the gold data)
[Figure: bar charts of gold-label counts. Google Data (576 gold): counts 531, 26, 19 across {No, Yes, Skip}. CrowdFlower Data (300 gold): counts 92, 72, 70, 57, 9 across {0 (Negative), 1 (Neutral), 2 (Positive), 3 (not related), 4 (I can't tell)}; only 9 instances of "I can't tell" in the reference data.]
Evaluation Metric
• The importance of minority classes is up-weighted
  • E.g., on the Google gold data, class "Skip" is 531/26 ≈ 20 times more important than class "Yes"
• Minority classes are difficult to predict
  • E.g., only 9 "I can't tell" instances in the gold data — difficult to generalize; risk of overfitting
Google Fact Judgment Dataset
• Model selection (MV, one-/two-coin EM): majority vote is the best
  • 57 "trained" workers with high and uniform accuracies (around 0.7 and above)
  • But not good enough …
[Figure: histogram of workers' accuracies (# of workers vs. accuracy)]
Google Fact Judgment Dataset
• Our Algorithm: for each query i
  • Compute the fraction of each label submitted by the raters: c_i(yes), c_i(no), c_i(skip)
  • If c_i(yes) > 0.4: label_i = yes
  • Else if c_i(no) > 0.8: label_i = no
  • Otherwise: label_i = skip
• Return {label_i}
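The per-query threshold rule above can be sketched directly (the function name and dict representation are assumptions for illustration):

```python
# Threshold rule for one query: label by label fractions with
# class-specific thresholds (0.4 for "yes", 0.8 for "no"),
# falling back to "skip". `counts` maps each answer to its fraction.
def google_rule(counts):
    if counts.get("yes", 0.0) > 0.4:
        return "yes"
    if counts.get("no", 0.0) > 0.8:
        return "no"
    return "skip"
```

The asymmetric thresholds compensate for the unbalanced gold labels: predicting the minority classes more eagerly helps averaged recall, which up-weights them.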
CrowdFlower Sentiment Judgment Dataset
• Model selection: one-coin EM is best
[Figure: histogram of workers' accuracies (# of workers vs. accuracy)]
• Overall confusion matrix (classes 0–4):
  256  47  14  24  27
   22 280  26  35  22
   11  43 308  30   9
    9  22   6 456  14
    7  16  13   6  17
• Class 4 is rarely recovered correctly → removing class 4 may improve performance
CrowdFlower Sentiment Judgment Dataset
• Our algorithm:
  1. Remove all class-4 labels from the data; run one-coin EM to get posterior distributions μ_i over the remaining classes {0, 1, 2, 3}
  2. If c_i(4) > 0.5 or entropy(μ_i) > log(4) − 0.27, then label_i = 4 ("I can't tell"); otherwise label_i = argmax_k μ_i(k)
• Return {label_i}
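Step 2 of the algorithm can be sketched as follows. This is an illustrative reconstruction, assuming the posterior `mu` over classes {0, 1, 2, 3} comes from the one-coin EM of step 1 and that a near-uniform posterior (entropy within 0.27 nats of the log 4 maximum) signals an ambiguous query:

```python
# Step-2 decision rule for one query: output class 4 ("I can't tell")
# when most raters chose 4 or the EM posterior is nearly uniform;
# otherwise return the most probable remaining class.
import math

def crowdflower_rule(c4_fraction, mu):
    entropy = -sum(p * math.log(p) for p in mu if p > 0)  # in nats
    if c4_fraction > 0.5 or entropy > math.log(4) - 0.27:
        return 4
    return max(range(len(mu)), key=lambda k: mu[k])
```

Removing class 4 before running EM keeps the unreliable "I can't tell" votes from corrupting the worker-accuracy estimates, while the entropy test still lets the model output class 4 for genuinely ambiguous queries.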