
Enhancing Data Quality in Crowdsourcing: Rejecting Spammers for Higher Accuracy

Explore how to improve crowdsourced data quality by leveraging redundancy and rejecting spammers. Learn cost-efficient methods to ensure accurate results and combat biases in worker classifications.





Presentation Transcript


  1. Spam? No, thanks! Panos Ipeirotis – New York University “Crowdsourcing Work” Meetup

  2. Panos Ipeirotis - Introduction • New York University, Stern School of Business “A Computer Scientist in a Business School” http://behind-the-enemy-lines.blogspot.com/ Email: panos@nyu.edu

  3. Example: Build an Adult Web Site Classifier • Need a large number of hand-labeled sites • Get people to look at sites and classify them as: G (general), PG (parental guidance), R (restricted), X (porn) • Cost/Speed Statistics • Undergrad intern: 200 websites/hr, cost: $15/hr • MTurk: 2500 websites/hr, cost: $12/hr

  4. Bad news: Spammers! • Worker ATAMRO447HWJQ • labeled X (porn) sites as G (general audience)

  5. Improve Data Quality through Repeated Labeling • Get multiple, redundant labels using multiple workers • Pick the correct label based on majority vote • 1 worker: 70% correct • 11 workers: 93% correct • Probability of correctness increases with number of workers • Probability of correctness increases with quality of workers
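
The effect of redundancy can be sanity-checked with a simple model. The sketch below assumes a binary decision, workers who err independently, and a common per-worker accuracy p; none of these assumptions is stated on the slide, and `majority_correct` is an illustrative name.

```python
from math import comb

def majority_correct(p, n):
    """Probability that a majority of n independent workers, each correct
    with probability p, picks the right one of two labels (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of workers to avoid ties")
    need = n // 2 + 1  # smallest number of correct votes that wins the vote
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

print(f"{majority_correct(0.7, 1):.3f}")   # a single worker
print(f"{majority_correct(0.7, 11):.3f}")  # eleven workers
```

Under these assumptions eleven 70%-accurate workers come out at roughly 92%, close to the 93% quoted on the slide; the small gap presumably reflects a different model behind the original figure.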

  6. But Majority Voting is Expensive • Single Vote Statistics • MTurk: 2500 websites/hr, cost: $12/hr • Undergrad: 200 websites/hr, cost: $15/hr • 11-vote Statistics • MTurk: 227 websites/hr, cost: $12/hr • Undergrad: 200 websites/hr, cost: $15/hr
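
The 227 websites/hr figure is 2500 / 11, since each site now consumes eleven labels. A small sketch of the resulting cost per classified site, using only the rates quoted on the slide (the function and variable names are illustrative):

```python
def cost_per_site(sites_per_hour, dollars_per_hour, votes_per_site=1):
    """Dollar cost of one classified site when every site receives
    votes_per_site labels at the quoted single-vote throughput."""
    return dollars_per_hour * votes_per_site / sites_per_hour

mturk_single = cost_per_site(2500, 12)    # one MTurk label per site
mturk_11 = cost_per_site(2500, 12, 11)    # eleven MTurk labels per site
undergrad = cost_per_site(200, 15)        # one intern label per site

print(f"MTurk, 1 vote:   ${mturk_single:.4f} per site")
print(f"MTurk, 11 votes: ${mturk_11:.4f} per site")
print(f"Undergrad:       ${undergrad:.4f} per site")
```

Redundancy multiplies the MTurk cost per site by eleven, which is why reducing the number of votes needed matters.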

  7. Using redundant votes, we can infer worker quality • Look at our spammer friend ATAMRO447HWJQ together with 9 other workers • We can compute error rates for each worker • Error rates for ATAMRO447HWJQ:
       P[X → X] = 9.847%    P[X → G] = 90.153%
       P[G → X] = 0.053%    P[G → G] = 99.947%
     Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…
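
One simple way to obtain per-worker error rates from redundant votes is to treat the majority label of each site as a stand-in for the truth and tabulate what each worker answered against it. This is a minimal sketch of that idea, not necessarily the estimator used in the talk; the vote data, worker IDs, and function names are made up for illustration.

```python
from collections import Counter, defaultdict

def majority_labels(votes):
    """votes: list of (worker_id, site_id, label). Returns site_id -> majority label."""
    per_site = defaultdict(Counter)
    for _, site, label in votes:
        per_site[site][label] += 1
    return {site: counts.most_common(1)[0][0] for site, counts in per_site.items()}

def worker_error_rates(votes, worker_id):
    """Estimate P[true -> assigned] for one worker, using majority labels as truth."""
    consensus = majority_labels(votes)
    counts = defaultdict(Counter)
    for worker, site, label in votes:
        if worker == worker_id:
            counts[consensus[site]][label] += 1
    return {true: {assigned: n / sum(row.values()) for assigned, n in row.items()}
            for true, row in counts.items()}

# Hypothetical votes: two careful workers and one who answers G for everything.
votes = [
    ("w1", "s1", "X"), ("w2", "s1", "X"), ("spam", "s1", "G"),
    ("w1", "s2", "X"), ("w2", "s2", "X"), ("spam", "s2", "G"),
    ("w1", "s3", "G"), ("w2", "s3", "G"), ("spam", "s3", "G"),
]
rates = worker_error_rates(votes, "spam")
print(rates)  # the always-G worker maps both true classes to G
```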

  8. Rejecting spammers and Benefits • Random answers error rate = 50% • Average error rate for ATAMRO447HWJQ: 45.2%
       P[X → X] = 9.847%    P[X → G] = 90.153%
       P[G → X] = 0.053%    P[G → G] = 99.947%
     Action: REJECT and BLOCK • Results: • Over time you block all spammers • Spammers learn to avoid your HITs • You can decrease redundancy, as quality of workers is higher

  9. After rejecting spammers, quality goes up • Spam keeps quality down • Without spam, workers are of higher quality • Need less redundancy for same quality • Same quality of results for lower cost • With spam: 1 worker 70% correct, 11 workers 93% correct • Without spam: 1 worker 80% correct, 5 workers 94% correct
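
How much redundancy a given quality target needs can be estimated under a simplified model: binary labels, independent workers, and equal per-worker accuracy p. These are assumptions, not taken from the slides, and `workers_needed` is an illustrative name.

```python
from math import comb

def workers_needed(p, target, max_workers=99):
    """Smallest odd number of workers whose majority vote is correct with
    probability >= target, or None if max_workers is not enough."""
    for n in range(1, max_workers + 1, 2):
        need = n // 2 + 1  # votes required for a majority
        prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))
        if prob >= target:
            return n
    return None

print(workers_needed(0.7, 0.92))  # per-worker accuracy with spam on the slide
print(workers_needed(0.8, 0.92))  # per-worker accuracy without spam on the slide
```

With a 92% target this model asks for 11 workers at 70% accuracy and 5 workers at 80% accuracy, the same redundancy levels the slide reports.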

  10. Correcting biases • Classifying sites as G, PG, R, X • Sometimes workers are careful but biased • Classifies G → P and P → R • Average error rate for ATLJIK76YH1TF: 45.0% • Error rates for worker ATLJIK76YH1TF:
       P[G → G] = 20.0%    P[G → P] = 80.0%    P[G → R] = 0.0%      P[G → X] = 0.0%
       P[P → G] = 0.0%     P[P → P] = 0.0%     P[P → R] = 100.0%    P[P → X] = 0.0%
       P[R → G] = 0.0%     P[R → P] = 0.0%     P[R → R] = 100.0%    P[R → X] = 0.0%
       P[X → G] = 0.0%     P[X → P] = 0.0%     P[X → R] = 0.0%      P[X → X] = 100.0%
     Is ATLJIK76YH1TF a spammer?
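
The 45.0% average error rate appears to be the mean, over the four true classes, of the probability that the worker assigns a wrong label. A short check of that reading, using the matrix from the slide (the function name is illustrative):

```python
def average_error_rate(confusion):
    """confusion[true][assigned] = P[true -> assigned].
    Returns the unweighted mean, over true classes, of the probability of a wrong label."""
    per_class = [1.0 - row.get(true, 0.0) for true, row in confusion.items()]
    return sum(per_class) / len(per_class)

# Error rates for worker ATLJIK76YH1TF, copied from the slide (zero entries omitted).
biased_worker = {
    "G": {"G": 0.2, "P": 0.8},
    "P": {"R": 1.0},
    "R": {"R": 1.0},
    "X": {"X": 1.0},
}
print(f"{average_error_rate(biased_worker):.1%}")
```

The raw error rate is high even though the mistakes are systematic: every label this worker gives except R points to exactly one true class.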

  11. Correcting biases • Error rates for worker ATLJIK76YH1TF:
       P[G → G] = 20.0%    P[G → P] = 80.0%    P[G → R] = 0.0%      P[G → X] = 0.0%
       P[P → G] = 0.0%     P[P → P] = 0.0%     P[P → R] = 100.0%    P[P → X] = 0.0%
       P[R → G] = 0.0%     P[R → P] = 0.0%     P[R → R] = 100.0%    P[R → X] = 0.0%
       P[X → G] = 0.0%     P[X → P] = 0.0%     P[X → R] = 0.0%      P[X → X] = 100.0%
     • For ATLJIK76YH1TF, we simply need to compute the “non-recoverable” error rate (technical details omitted) • Non-recoverable error rate for ATLJIK76YH1TF: 9% • Technical hint: The “condition number” of the matrix [how easy it is to invert the matrix] is a good indicator of spamminess
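
The slide omits the details, so the following is only one plausible way to make “non-recoverable” error concrete: use Bayes' rule to turn each label a worker can give into a posterior over true classes, then measure how much uncertainty is left. The uniform class priors, the 0/1 misclassification cost, and the function name are all assumptions made for this sketch; because the talk's priors are not given, the result is not expected to reproduce the 9% figure.

```python
def expected_residual_cost(confusion, priors):
    """confusion[true][assigned] = P[true -> assigned]; priors[true] = P(true).
    Returns the expected 0/1 cost that remains after correcting for the worker's bias."""
    classes = list(priors)
    total = 0.0
    for assigned in classes:
        # Joint probability of each true class together with this assigned label.
        joint = {t: priors[t] * confusion[t].get(assigned, 0.0) for t in classes}
        p_assigned = sum(joint.values())
        if p_assigned == 0.0:
            continue  # the worker never gives this label
        posterior = [joint[t] / p_assigned for t in classes]
        # Chance that two independent draws from the posterior disagree.
        total += p_assigned * (1.0 - sum(p * p for p in posterior))
    return total

biased = {"G": {"G": 0.2, "P": 0.8}, "P": {"R": 1.0}, "R": {"R": 1.0}, "X": {"X": 1.0}}
spammer = {"X": {"X": 0.09847, "G": 0.90153}, "G": {"X": 0.00053, "G": 0.99947}}

biased_cost = expected_residual_cost(biased, {c: 0.25 for c in "GPRX"})
spammer_cost = expected_residual_cost(spammer, {"X": 0.5, "G": 0.5})
print(f"biased worker:  {biased_cost:.3f} (random baseline with 4 classes: 0.750)")
print(f"spammer:        {spammer_cost:.3f} (random baseline with 2 classes: 0.500)")
```

Under these assumptions the biased worker retains far less residual error relative to a random labeler than the spammer does, which matches the slide's point that a consistent bias can be corrected while uninformative answers cannot.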

  12. Too much theory? Open source implementation available at: http://code.google.com/p/get-another-label/ • Input: • Labels from Mechanical Turk • Cost of incorrect labelings (e.g., X → G costlier than G → X) • Output: • Corrected labels • Worker error rates • Ranking of workers according to their quality • Alpha version, more improvements to come! • Suggestions and collaborations welcome!

  13. Thank you! Questions? “A Computer Scientist in a Business School” http://behind-the-enemy-lines.blogspot.com/ Email: panos@nyu.edu
