
Crowdsourcing Blog Track Top News Judgments at TREC


Presentation Transcript


  1. Crowdsourcing Blog Track Top News Judgments at TREC Richard McCreadie, Craig Macdonald, Iadh Ounis {richardm, craigm, ounis}@dcs.gla.ac.uk

  2. Outline
  • Relevance Assessment and TREC (4 slides)
  • Crowdsourcing Interface (4 slides)
  • Research Questions and Results (6 slides)
  • Conclusions and Best Practices (1 slide)

  3. Relevance Assessment and TREC Slides 4-7/20

  4. Relevance Assessment
  • Relevance assessments are vital when evaluating information retrieval (IR) systems at TREC
    • Is this document relevant to the information need expressed in the user query?
  • Created by human assessors
    • Specialist paid assessors, e.g. TREC assessors, or the researchers themselves
    • Typically, only one assessor per judgment (for cost reasons)

  5. Limitations
  • Creating relevance assessments is costly
    • $$$
    • Time
    • Equipment (lab, computers, electricity, etc.)
  • May not scale well
    • How many people are available to make assessments?
    • Can the work be done in parallel?

  6. Task
  • Could we do relevance assessment using crowdsourcing at TREC?
  • TREC 2010 Blog Track, top news stories identification subtask
    • System task: “What are the newsworthy stories on day d for a category c?”
    • Crowdsourcing task: Was the story ‘Sony Announces NGP’ an important story on the 1st of February for the Science/Technology category?
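To make the judging unit concrete, the sketch below models one crowdsourcing question as a small Python record; the field names and the story identifier are illustrative, not the track's actual data format.

```python
from dataclasses import dataclass

@dataclass
class JudgingUnit:
    """One question posed to a crowd assessor (illustrative names only)."""
    story_id: str   # hypothetical identifier, not a real corpus docno
    headline: str
    day: str
    category: str

# The crowdsourcing task reduces to one binary question per unit:
# "Was this story important on this day for this category?"
unit = JudgingUnit("story-001", "Sony Announces NGP", "February 1", "Science/Technology")
print(f"Was '{unit.headline}' an important {unit.category} story on {unit.day}?")
```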

  7. Crowdsourcing Interface Slides 9-12/20

  8. Crowdsourcing HIT Interface
  • The HIT embeds an externally hosted iframe showing the instructions, the day d and category c, and the list of stories to be judged
  • Each story is assigned a judgment: [+] Important, [-] Not Important, [x] Wrong Category, [?] Not Yet Judged
  • A comment box and a submit button complete the interface
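The slides do not show the code behind the HIT, but an externally hosted interface of this kind is typically attached to a HIT through MTurk's ExternalQuestion XML. The sketch below is a minimal illustration using today's boto3 client (which post-dates this 2010-era work): the judging URL, frame height, and HIT durations are assumptions, while the reward and three assignments per HIT follow the figures later in the deck.

```python
# Sketch: attach an externally hosted judging page to a HIT via ExternalQuestion.
import boto3  # assumes AWS credentials with MTurk access are configured

EXTERNAL_QUESTION = """\
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://judging.example.org/judge?day=DAY&amp;category=CATEGORY</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>"""

mturk = boto3.client("mturk")
hit = mturk.create_hit(
    Title="Judge the importance of 32 news stories for one day and category",
    Description="Mark each story as important, not important, or wrong category",
    Keywords="news, relevance, judging",
    Reward="0.50",                      # per-HIT reward from the slides
    MaxAssignments=3,                   # three workers per HIT
    AssignmentDurationInSeconds=3600,   # assumed value
    LifetimeInSeconds=86400,            # assumed value
    Question=EXTERNAL_QUESTION,
)
print(hit["HIT"]["HITId"])
```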

  9. External Interface
  • The interface was hosted on our servers in Glasgow, with MTurk loading it for each worker
  • Requires interaction
    • Catches out bots that only look for simple input fields to fill
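A minimal sketch of what such an externally hosted endpoint could look like is given below, using Flask; the routes and form field names are hypothetical, although MTurk really does pass assignmentId, hitId, workerId and turkSubmitTo to the external URL so that submissions can be tied back to individual workers.

```python
# Sketch of an externally hosted judging endpoint (hypothetical routes/fields).
from flask import Flask, request, jsonify

app = Flask(__name__)
JUDGMENT_LOG = []  # in practice this would be a database table

@app.route("/judge", methods=["GET"])
def show_interface():
    # MTurk loads this URL inside the HIT's iframe and appends worker details.
    worker_id = request.args.get("workerId", "PREVIEW")
    hit_id = request.args.get("hitId", "")
    # ...render the story list for the requested day/category here...
    return f"Judging interface for worker {worker_id}, HIT {hit_id}"

@app.route("/submit", methods=["POST"])
def record_judgments():
    # Recording every submission server-side lets us reproduce exactly what
    # each worker saw and did, and spot bot-like behaviour.
    JUDGMENT_LOG.append({
        "worker": request.form.get("workerId"),
        "assignment": request.form.get("assignmentId"),
        "labels": request.form.getlist("label"),   # one of +, -, x, ? per story
        "comment": request.form.get("comment", ""),
    })
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(port=8080)
```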

  10. Manual Summary Evaluation
  • Hosting the judging interface externally allows us to record and reproduce what each worker (worker 1/2/3) sees
  • We can see at a glance whether the judgments make sense (is this a bot?)
  • We can compare across judgments easily
  • We can check whether the work has been done at all

  11. Submitting Larger HITs
  • We have each worker judge 32 stories from a single day and category per HIT
  • Two reasons:
    • Newsworthiness is relative: it provides background for workers as to the stories of the day
    • It promotes worker commitment to the task
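As an illustration, stories can be grouped into 32-story HITs per (day, category) pair along the lines of the sketch below; the helper names are made up rather than taken from the authors' pipeline. Under this reading, 8,000 stories in blocks of 32 give 250 distinct story sets, and three workers per set yield the 750 paid HITs and 24,000 judgments reported later in the deck.

```python
# Sketch: group pooled stories into 32-story HITs per (day, category).
from collections import defaultdict

HIT_SIZE = 32

def build_hits(stories):
    """stories: iterable of dicts with 'day', 'category' and 'story_id' keys."""
    by_topic = defaultdict(list)
    for s in stories:
        by_topic[(s["day"], s["category"])].append(s)
    hits = []
    for (day, category), group in by_topic.items():
        # Slice each day/category pool into fixed-size story sets.
        for i in range(0, len(group), HIT_SIZE):
            hits.append({"day": day, "category": category,
                         "stories": group[i:i + HIT_SIZE]})
    return hits
```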

  12. Experimental Results Slides 14-20/20

  13. Research Questions
  • Was crowdsourcing Blog Track judgments fast and cheap?
  • Are there high levels of agreement between assessors?
    • Is having redundant judgments even necessary?
  • If we use worker agreement to infer multiple grades of importance, how would this affect the final ranking of systems at TREC?
  • Overall: was crowdsourcing a good idea, and can we do better?

  14. Experimental Setup
  • 8,000 news stories (statMAP pooling, depth 32) over 50 topic days
  • Three workers per HIT; 24,000 judgments in total
  • 750 HITs at $0.50 per HIT: $412.50 total (includes 10% fees)
  • US worker restriction
  • 6 batches, with incremental improvements between batches [O. Alonso, SIGIR ’09]
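A quick back-of-envelope check of these figures is sketched below (the per-judgment cost quoted on the next slide appears to exclude the 10% fee):

```python
# Sanity-check the setup figures reported on the slide.
stories         = 8_000
workers_per_hit = 3
stories_per_hit = 32
reward_per_hit  = 0.50   # USD
fee_rate        = 0.10   # MTurk requester fee

judgments = stories * workers_per_hit        # 24,000
hits      = judgments // stories_per_hit     # 750
base_cost = hits * reward_per_hit            # $375.00 before fees
total     = base_cost * (1 + fee_rate)       # $412.50 including 10% fees

print(f"{judgments} judgments, {hits} HITs, total ${total:.2f}")
print(f"per judgment (excluding fees): ${base_cost / judgments:.4f}")  # ~$0.0156
```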

  15. Is Crowdsourcing Relevance Assessments Fast and Cheap?
  • Quick?
    • First HITs were accepted within 10 minutes of launch
    • Each batch took less than 5 hours; batches completed quickly, so speed was not an issue
    • Workers took less time than expected, and got faster over time
    • Caveat: with few HITs per batch, HITs might be difficult to find soon after launch
  • Cheap?
    • $412.50 in total ($0.0156 per judgment)
    • On average, 38% above a $2 per hour wage

  16. Assessment Quality
  • Are the assessments of good quality?
    • Evaluate agreement between workers
  • Mean agreement: 69%
    • Ellen Voorhees reported only 32.8% [E.M. Voorhees, IPM, 2000]
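The slides do not spell out how the 69% figure was computed; one plausible definition is the mean pairwise agreement between the three workers on each HIT, sketched below with made-up labels.

```python
# Sketch: mean pairwise agreement between workers (assumed definition).
from itertools import combinations

def pairwise_agreement(labels_by_worker):
    """labels_by_worker: list of equal-length label lists, one per worker."""
    agree = total = 0
    for a, b in combinations(labels_by_worker, 2):
        for x, y in zip(a, b):
            agree += (x == y)
            total += 1
    return agree / total

# Three workers judging the same 5 stories (+ important, - not important):
w1 = ["+", "-", "+", "+", "-"]
w2 = ["+", "-", "-", "+", "-"]
w3 = ["+", "+", "+", "+", "-"]
print(f"mean pairwise agreement: {pairwise_agreement([w1, w2, w3]):.2f}")
```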

  17. Do we need redundant judgments?
  • What would have happened to the ranking of TREC systems if we had only used a single worker per HIT?
    • Consider each of the three judgments per HIT as coming from a separate “meta-worker”
  • System rankings are not stable in the top ranks: multiple ranking swaps among the top systems
    • Two groups of systems: the top 3 are ~0.15 apart, the bottom 3 ~0.3 apart
  • Do we need to average over three workers? Yes!
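One common way to quantify this kind of rank (in)stability is a rank correlation such as Kendall's tau between the system ordering under the full three-worker judgments and the ordering under a single meta-worker's judgments. The sketch below is illustrative only: the slides do not state which measure was used, and the system scores are invented.

```python
# Sketch: rank correlation between two system orderings (invented scores).
from scipy.stats import kendalltau

systems = ["sysA", "sysB", "sysC", "sysD", "sysE", "sysF"]
score_three_workers = [0.61, 0.58, 0.55, 0.31, 0.28, 0.25]  # qrels from 3 workers
score_single_worker = [0.55, 0.60, 0.52, 0.30, 0.29, 0.24]  # qrels from 1 meta-worker

tau, p_value = kendalltau(score_three_workers, score_single_worker)
print(f"Kendall's tau between the two system rankings: {tau:.2f}")
# A tau well below 1.0 indicates ranking swaps, as seen among the top systems.
```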

  18. Conclusions and Best Practices
  • Crowdsourcing top stories relevance assessments can be done successfully at TREC
    • ...but we need at least three assessors for each story
  • Best practices:
    • Don’t be afraid to use larger HITs
    • If you have an existing interface, integrate it with MTurk
    • Gold judgments are not the only validation method
    • Re-cost your HITs as necessary
  • Questions?
