Crowdsourcing Blog Track Top News Judgments at TREC
Richard McCreadie, Craig Macdonald, Iadh Ounis
{richardm, craigm, ounis}@dcs.gla.ac.uk
Outline
• Relevance Assessment and TREC (4 slides)
• Crowdsourcing Interface (4 slides)
• Research Questions and Results (6 slides)
• Conclusions and Best Practices (1 slide)
Relevance Assessment and TREC
Relevance Assessment
• Relevance assessments are vital when evaluating information retrieval (IR) systems at TREC
  • Is this document relevant to the information need expressed in the user query?
• They are created by human assessors
  • Specialist paid assessors, e.g. TREC assessors
  • The researchers themselves
• Typically, only one assessor judges each document (for cost reasons)
Limitations
• Creating relevance assessments is costly
  • Money
  • Time
  • Equipment (lab, computers, electricity, etc.)
• It may not scale well
  • How many people are available to make assessments?
  • Can the work be done in parallel?
Task
• Could we do relevance assessment using crowdsourcing at TREC?
• TREC 2010 Blog Track, top news stories identification subtask
  • System task: “What are the newsworthy stories on day d for a category c?”
  • Crowdsourcing task: “Was the story ‘Sony Announces NGP’ an important story on the 1st of February for the Science/Technology category?”
Crowdsourcing Interface
Crowdsourcing HIT Interface
[Figure: the judging interface, served as an externally hosted iframe. It shows the instructions, the category c and day d, the list of stories to be judged, a per-story judgment control ([+] Important, [-] Not Important, [x] Wrong Category, [?] Not Yet Judged), a comment box, and a submit button.]
External Interface
• The judging interface was hosted on our own servers in Glasgow and embedded in the HIT as an iframe
• It requires genuine interaction with the page
  • This catches out bots, which only look for simple input fields to fill
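With an externally hosted interface, the HIT published to MTurk is just an ExternalQuestion pointing at the Glasgow-hosted page. The sketch below illustrates that pattern using the present-day boto3 MTurk client rather than the 2010-era API used at the time; the URL, frame height, and duration settings are illustrative assumptions, while the $0.50 reward and three assignments per HIT match the setup described later.

import boto3

# Hypothetical URL for the externally hosted judging page (not the real one).
EXTERNAL_QUESTION = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchema/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.dcs.gla.ac.uk/topstories/judge?hit=42</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>
""".strip()

mturk = boto3.client("mturk", region_name="us-east-1")

hit = mturk.create_hit(
    Title="Judge the importance of news stories for one day and category",
    Description="Mark each story as Important, Not Important, or Wrong Category",
    Reward="0.50",                      # per-HIT payment used in the experiments
    MaxAssignments=3,                   # three workers judge every HIT
    AssignmentDurationInSeconds=3600,   # illustrative limits, not the original settings
    LifetimeInSeconds=3 * 24 * 3600,
    Question=EXTERNAL_QUESTION,         # MTurk renders the hosted page in an iframe
)
print("Created HIT", hit["HIT"]["HITId"])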
Manual Summary Evaluation
• Hosting the judging interface externally allows us to record and reproduce exactly what each worker saw
  • We can see at a glance whether a worker’s judgments make sense (for example, is this a bot?)
  • We can compare across workers’ judgments easily
  • We can check whether the work has been done at all
Submitting Larger HITs
• We have each worker judge 32 stories from a single day and category per HIT
• Two reasons:
  • Newsworthiness is relative: seeing 32 stories gives workers background on the other stories of that day
  • It promotes worker commitment to the task
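To make the batching concrete, here is a minimal sketch, assuming the pooled stories are available as records with day, category, and story id fields (the field names are illustrative), of how stories from one day and category could be split into HITs of 32.

from collections import defaultdict

HIT_SIZE = 32  # stories judged per HIT

def build_hits(pooled_stories):
    """pooled_stories: iterable of dicts with 'day', 'category' and 'story_id' keys."""
    by_day_category = defaultdict(list)
    for story in pooled_stories:
        by_day_category[(story["day"], story["category"])].append(story["story_id"])

    hits = []
    for (day, category), story_ids in by_day_category.items():
        # Split this day/category pool into chunks of up to 32 stories.
        for i in range(0, len(story_ids), HIT_SIZE):
            hits.append({"day": day, "category": category,
                         "stories": story_ids[i:i + HIT_SIZE]})
    return hits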
Experimental Results
Research Questions
• Was crowdsourcing a good idea?
  • Was crowdsourcing Blog Track judgments fast and cheap?
  • Are there high levels of agreement between assessors?
• Can we do better?
  • Are redundant judgments even necessary?
  • If we use worker agreement to infer multiple grades of importance, how would this affect the final ranking of systems at TREC?
Experimental Setup
• 8,000 news stories (statMAP pooling, depth 32)
• 50 topic days
• Three workers per HIT
• 24,000 judgments in total
• 750 HITs in total
• $0.50 per HIT, $412.50 in total (includes 10% fees)
• US worker restriction
• 6 batches, with incremental improvements [O. Alonso, SIGIR ’09]
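The quoted figures are internally consistent; the quick check below derives them from the per-HIT reward and HIT size. Reading the 750 HITs as 250 unique HITs times three workers, and the 10% as MTurk's requester fee, are assumptions.

STORIES = 8000
STORIES_PER_HIT = 32
WORKERS_PER_HIT = 3
REWARD_PER_HIT = 0.50

unique_hits = STORIES // STORIES_PER_HIT      # 250 unique HITs
assignments = unique_hits * WORKERS_PER_HIT   # 750 paid HIT assignments
judgments = assignments * STORIES_PER_HIT     # 24,000 judgments
base_cost = assignments * REWARD_PER_HIT      # $375.00 paid to workers
total_cost = base_cost * 1.10                 # $412.50 including a 10% fee

print(unique_hits, assignments, judgments, base_cost, total_cost)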
Is Crowdsourcing Relevance Assessments Fast and Cheap?
• Quick?
  • The first HITs were accepted within 10 minutes of launch
  • Each batch took less than 5 hours
  • Batches completed quickly, so speed was not an issue: workers took less time than expected, and they got faster over time
  • With few HITs per batch, HITs might be difficult to find soon after launch
• Cheap?
  • $412.50 ($0.0156 per judgment)
  • Average pay worked out 38% above a $2 per hour wage
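As a rough illustration, derived from the figures above rather than reported directly, here are the per-judgment cost and the working time those numbers imply.

reward_paid = 750 * 0.50                   # $375 before fees
per_judgment = reward_paid / 24000         # ~$0.0156 per judgment
hourly_wage = 2.00 * 1.38                  # ~$2.76/hour effective pay
minutes_per_hit = 0.50 / hourly_wage * 60  # ~10.9 minutes to judge 32 stories

print(per_judgment, hourly_wage, minutes_per_hit)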
Assessment Quality
• Are the assessments of good quality?
  • We evaluate agreement between workers
  • Mean agreement: 69%
  • For comparison, Ellen Voorhees reported only 32.8% [E.M. Voorhees, IPM 2000]
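A minimal sketch of a pairwise agreement measure of this kind: for each story, compare the labels of the three workers across all worker pairs and average. The exact measure reported in the paper may differ, and the data layout below is assumed.

from itertools import combinations

def mean_pairwise_agreement(labels_per_story):
    """labels_per_story: one list of the three workers' labels per story."""
    agree, total = 0, 0
    for labels in labels_per_story:
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total if total else 0.0

# Toy example: two stories, three workers each.
print(mean_pairwise_agreement([
    ["important", "important", "not_important"],  # 1 of 3 pairs agree
    ["important", "important", "important"],      # 3 of 3 pairs agree
]))  # -> (1 + 3) / 6 ≈ 0.67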
Do we need redundant judgments?
• What would have happened to the ranking of TREC systems if we had used only a single worker per HIT?
  • Treat each of the three judgment sets per HIT as coming from a separate meta-worker, and rank the systems under each
• System rankings are not stable in the top ranks: there are multiple ranking swaps among the top systems
  • Two groups of systems emerge: the top 3 are ~0.15 apart, the bottom 3 ~0.3 apart
• Do we need to average over three workers? Yes!
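A minimal sketch of aggregating the three redundant labels per story: a simple majority vote for the binary label, plus an importance grade derived from how many workers marked the story important. This mirrors the idea of using worker agreement to infer multiple grades, though the actual aggregation used for the track assessments may differ.

def aggregate(worker_labels):
    """worker_labels: the three workers' labels for one story."""
    votes = sum(1 for label in worker_labels if label == "important")
    majority = "important" if votes >= 2 else "not_important"
    grade = votes  # 0 = unanimously unimportant ... 3 = unanimously important
    return majority, grade

print(aggregate(["important", "not_important", "important"]))  # ('important', 2)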
Conclusions and Best Practices
• Crowdsourcing top stories relevance assessments can be done successfully at TREC
  • . . . but we need at least three assessors for each story
• Best practices
  • Don’t be afraid to use larger HITs
  • If you have an existing interface, integrate it with MTurk
  • Gold judgments are not the only validation method
  • Re-cost your HITs as necessary
Questions?