
Crowdscale Shared Task Challenge 2013


Presentation Transcript


  1. Crowdscale Shared Task Challenge 2013 Qiang Liu (UC Irvine), Jian Peng (MIT CSAIL), Alexander Ihler (UC Irvine)

  2. Crowdsourcing • Collect data and knowledge at large scale • Experts: Time-consuming & expensive • Crowdsourcing: Combine many non-experts

  3. Crowdsourcing for Labeling • Goal: estimate the true label z_i of each task from the noisy labels {L_ij}. [Figure: tasks and workers connected by the submitted labels L_ij.]
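The slides show the task–worker graph rather than code; as a working representation for the sketches that follow, the noisy labels {L_ij} can be stored as a dictionary keyed by (task, worker) pairs. The names below are illustrative, not from the talk.

```python
# Hypothetical representation: labels[(task i, worker j)] = submitted label L_ij.
labels = {
    ("q1", "rater_a"): "yes",
    ("q1", "rater_b"): "no",
    ("q2", "rater_a"): "skip",
    ("q2", "rater_c"): "yes",
}
gold = {"q1": "yes"}  # the small set of gold/reference queries with known z_i
```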

  4. Baseline Methods • Majority Voting • All the workers have the same performance
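A minimal majority-vote sketch over that toy representation; the slide does not specify tie-breaking, so ties are broken arbitrarily here.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Pick the most common label per task, trusting every worker equally."""
    votes = defaultdict(Counter)
    for (task, _worker), lab in labels.items():
        votes[task][lab] += 1
    # most_common(1) breaks ties arbitrarily (not specified on the slide)
    return {task: c.most_common(1)[0][0] for task, c in votes.items()}
```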

  5. Baseline Methods • Majority Voting • All the workers have the same performance • Two-coin Model (Dawid & Skene 79) • Each worker characterized by a confusion matrix • Learned by expectation maximization (EM) [Figure: worker j's confusion matrix, indexed by true answer × worker j's answer.]
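The slide only names the two-coin model; below is a compact EM sketch of the Dawid & Skene confusion-matrix model over the same toy dictionary. The majority-vote initialization, the small smoothing constant, and the fixed iteration count are my own choices, not details from the talk.

```python
import numpy as np

def dawid_skene_em(labels, classes, n_iter=50, smooth=1e-2):
    """Two-coin model (Dawid & Skene, 1979) fit by EM.

    labels: dict mapping (task, worker) -> observed label.
    Returns {task: posterior distribution over `classes`}.
    """
    tasks = sorted({t for t, _ in labels})
    workers = sorted({w for _, w in labels})
    t_idx = {t: i for i, t in enumerate(tasks)}
    w_idx = {w: j for j, w in enumerate(workers)}
    c_idx = {c: k for k, c in enumerate(classes)}
    obs = [(t_idx[t], w_idx[w], c_idx[l]) for (t, w), l in labels.items()]
    n_t, n_w, n_c = len(tasks), len(workers), len(classes)

    # Initialize soft labels mu[i] from vote fractions (i.e., smoothed majority vote).
    mu = np.full((n_t, n_c), smooth)
    for i, j, k in obs:
        mu[i, k] += 1.0
    mu /= mu.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and one confusion matrix per worker, conf[j, true, given].
        prior = mu.sum(axis=0) + smooth
        prior /= prior.sum()
        conf = np.full((n_w, n_c, n_c), smooth)
        for i, j, k in obs:
            conf[j, :, k] += mu[i]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over each task's true label given all its workers' answers.
        log_mu = np.tile(np.log(prior), (n_t, 1))
        for i, j, k in obs:
            log_mu[i] += np.log(conf[j, :, k])
        log_mu -= log_mu.max(axis=1, keepdims=True)
        mu = np.exp(log_mu)
        mu /= mu.sum(axis=1, keepdims=True)

    return {t: mu[t_idx[t]] for t in tasks}
```

Hard predictions are then the argmax of each task's posterior.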

  6. Baseline Methods • Majority Voting • All the workers have the same performance • Two-coin Model (Dawid & Skene 79) • Each worker characterized by a confusion matrix • Learned by expectation maximization (EM) • One-coin Model • Each worker characterized by an accuracy parameter [Figure: worker j's confusion matrix, indexed by true answer × worker j's answer.]
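The one-coin model is the special case where each worker's confusion matrix collapses to a single accuracy p_j (correct with probability p_j, errors spread uniformly over the other classes). A sketch under the same toy representation, again with my own smoothing/initialization choices and a uniform class prior for simplicity:

```python
import numpy as np

def one_coin_em(labels, classes, n_iter=50, smooth=1e-2):
    """One-coin model fit by EM: one accuracy parameter per worker."""
    tasks = sorted({t for t, _ in labels})
    workers = sorted({w for _, w in labels})
    t_idx = {t: i for i, t in enumerate(tasks)}
    w_idx = {w: j for j, w in enumerate(workers)}
    c_idx = {c: k for k, c in enumerate(classes)}
    obs = [(t_idx[t], w_idx[w], c_idx[l]) for (t, w), l in labels.items()]
    n_t, n_w, n_c = len(tasks), len(workers), len(classes)

    mu = np.full((n_t, n_c), smooth)          # soft labels, initialized from vote counts
    for i, j, k in obs:
        mu[i, k] += 1.0
    mu /= mu.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: each worker's accuracy = expected fraction of labels matching the true class.
        hits = np.full(n_w, smooth)
        counts = np.full(n_w, 2 * smooth)
        for i, j, k in obs:
            hits[j] += mu[i, k]
            counts[j] += 1.0
        p = hits / counts

        # E-step: posterior over true labels (uniform class prior for simplicity).
        log_mu = np.zeros((n_t, n_c))
        for i, j, k in obs:
            contrib = np.full(n_c, np.log((1.0 - p[j]) / (n_c - 1)))
            contrib[k] = np.log(p[j])
            log_mu[i] += contrib
        log_mu -= log_mu.max(axis=1, keepdims=True)
        mu = np.exp(log_mu)
        mu /= mu.sum(axis=1, keepdims=True)

    return ({t: mu[t_idx[t]] for t in tasks},
            {w: p[w_idx[w]] for w in workers})
```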

  7. Baseline Methods • Majority Voting • All the workers have the same performance • Two-coin Model (Dawid & Skene 79) • Each worker characterized by a confusion matrix • Learned by expectation maximization (EM) • One-coin Model • Each worker characterized by an accuracy parameter • Other methods: • GLAD [Whitehill et al. 09], Belief propagation [Liu et al. 12], Minimax entropy [Zhou et al. 12] …

  8. In Practice … • Model Selection • Standard models may not work • Special structures on the classes • Unbalanced labels

  9. Two Datasets • Google Fact Judgment Dataset • 42,624 queries; 57 trained raters; 576 gold queries • Answers: {No, Yes, Skip} • CrowdFlower Sentiment Judgment Dataset • 98,980 questions; 1,960 workers; 300 gold queries • Answers: 0 (Negative), 1 (Neutral), 2 (Positive), 3 (not related), 4 (I can’t tell) • Special classes “skip”, “I can’t tell” • Ambiguity of queries

  10. Evaluation Metric • Averaged Recall: the mean of the per-class recalls on the gold queries, AvgRecall = (1/C) Σ_c Recall_c • Special classes “skip”, “I can’t tell” • Included in the evaluation?
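Averaged recall here means the unweighted mean of the per-class recalls on the gold queries. A minimal sketch; whether the special classes are passed to `include_classes` is exactly the open question on this slide.

```python
from collections import defaultdict

def averaged_recall(gold, predicted, include_classes=None):
    """Mean of per-class recalls, computed on the gold (reference) queries."""
    correct, total = defaultdict(int), defaultdict(int)
    for task, true_label in gold.items():
        total[true_label] += 1
        if predicted.get(task) == true_label:
            correct[true_label] += 1
    classes = include_classes if include_classes is not None else sorted(total)
    recalls = [correct[c] / total[c] for c in classes if total[c] > 0]
    return sum(recalls) / len(recalls)
```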

  11. Important Properties • Unbalanced labels (on the gold data) [Figure: bar charts of gold-label counts. Google Data: 531, 26, and 19 instances across {No, Yes, Skip}. CrowdFlower Data: 92, 72, 70, 57, and 9 instances across classes 0 (Negative), 1 (Neutral), 2 (Positive), 3 (not related), 4 (I can’t tell); only 9 instances of “I can’t tell” in the reference data.]

  12. Evaluation Metric • The importance of minority classes is up-weighted: Class “Skip” is 531/26 ≈ 20 times more important than Class “Yes” • Difficult to predict minority classes • E.g., only 9 “I can’t tell” instances in the gold data, difficult to generalize → Overfitting!
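A quick numeric check of the ≈20× figure, assuming the Google gold counts implied by this slide (531 “Yes”, 26 “Skip”, and therefore 19 “No” out of the 576 gold queries; the figure residue does not preserve the exact label-to-count mapping):

```python
# Under averaged recall over C classes, each gold instance of a class with
# n_c items contributes 1 / (C * n_c) to the score.
counts = {"yes": 531, "skip": 26, "no": 19}   # assumed mapping, see lead-in
C = len(counts)
weight = {c: 1.0 / (C * n) for c, n in counts.items()}
print(round(weight["skip"] / weight["yes"], 1))   # 20.4, i.e. 531/26
```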

  13. Google Fact Judgment Dataset • Model selection (MV, one/two-coin EM): • Majority vote is the best • 57 “trained” workers • High and uniform accuracies • But not good enough … [Histogram: # of workers vs. workers’ accuracies]

  14. Google Fact Judgment Dataset • Our Algorithm:
      For each query i:
        Count the percentages of labels submitted by the raters: c_i(yes), c_i(no), c_i(skip)
        If c_i(yes) > 0.4:      label_i = yes
        Else if c_i(no) > 0.8:  label_i = no
        Otherwise:              label_i = skip
      End
      Return {label_i}
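A direct transcription of that rule into code; the 0.4 and 0.8 thresholds are taken from the slide, and the label-fraction bookkeeping reuses the toy (task, worker) → label dictionary from earlier.

```python
from collections import Counter, defaultdict

def google_fact_rule(labels):
    """Thresholded voting: "yes" if >40% of a query's raters said yes,
    "no" if >80% said no, and "skip" otherwise."""
    votes = defaultdict(Counter)
    for (task, _worker), lab in labels.items():
        votes[task][lab] += 1
    predictions = {}
    for task, counter in votes.items():
        n = sum(counter.values())
        c = {lab: counter[lab] / n for lab in ("yes", "no", "skip")}
        if c["yes"] > 0.4:
            predictions[task] = "yes"
        elif c["no"] > 0.8:
            predictions[task] = "no"
        else:
            predictions[task] = "skip"
    return predictions
```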

  15. CrowdFlower Sentiment Judgment Dataset • Model selection: • One-coin EM is best [Histogram: # of workers vs. workers’ accuracies] • Overall confusion matrix (rows and columns indexed by classes 0–4):
             0     1     2     3     4
        0  256    47    14    24    27
        1   22   280    26    35    22
        2   11    43   308    30     9
        3    9    22     6   456    14
        4    7    16    13     6    17

  16. CrowdFlower Sentiment Judgment Dataset • Model selection: • One-coin EM is best [Histogram and overall confusion matrix as on the previous slide] • Removing Class 4 may improve performance
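Per-class agreement read off the printed matrix shows why dropping class 4 is tempting. A quick check, assuming rows and columns both run over classes 0–4 in the printed order (the transcript does not preserve the original axis orientation):

```python
import numpy as np

conf = np.array([            # confusion matrix as printed on the slide
    [256,  47,  14,  24,  27],
    [ 22, 280,  26,  35,  22],
    [ 11,  43, 308,  30,   9],
    [  9,  22,   6, 456,  14],
    [  7,  16,  13,   6,  17],
])
per_class = conf.diagonal() / conf.sum(axis=1)
print(np.round(per_class, 2))   # class 4 ("I can't tell") is far below the rest
```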

  17. CrowdFlower Sentiment Judgment Dataset • Our algorithm:
      1. Remove all class-4 labels from the data; run one-coin EM to get posterior distributions μ_i over the remaining classes {0, 1, 2, 3}.
      2. If c_i(4) > 0.5 or entropy(μ_i) > log(4) − 0.27, then label_i = 4 (“I can’t tell”); otherwise label_i = argmax_k μ_i(k).
      Return {label_i}
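A sketch of this two-stage rule, reusing the hypothetical one_coin_em helper from the baseline sketch above. Treating a query as class 4 (“I can’t tell”) when the test fires is my reading of the truncated slide, and the entropy uses natural log so that log(4) − 0.27 is on the right scale.

```python
import numpy as np
from collections import Counter, defaultdict

def crowdflower_rule(labels, entropy_slack=0.27):
    """Drop class-4 votes, run one-coin EM on classes 0-3, then fall back to
    class 4 for queries that are mostly class 4 or have a high-entropy posterior."""
    votes = defaultdict(Counter)
    for (task, _worker), lab in labels.items():
        votes[task][lab] += 1
    frac4 = {t: c[4] / sum(c.values()) for t, c in votes.items()}

    reduced = {tw: lab for tw, lab in labels.items() if lab != 4}
    posteriors, _ = one_coin_em(reduced, classes=[0, 1, 2, 3])

    predictions = {}
    for task in votes:
        mu = posteriors.get(task)
        if mu is None:                     # every vote for this query was class 4
            predictions[task] = 4
            continue
        entropy = -np.sum(mu * np.log(np.clip(mu, 1e-12, None)))
        if frac4[task] > 0.5 or entropy > np.log(4) - entropy_slack:
            predictions[task] = 4          # ambiguous -> "I can't tell"
        else:
            predictions[task] = int(np.argmax(mu))
    return predictions
```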

  18. Thank You
