CROWDSOURCING Massimo Poesio Part 2: Games with a Purpose
GAMES WITH A PURPOSE • Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK • GWAPs do not rely on altruism or financial incentives to entice people to perform certain actions • The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP • Games at www.gwap.com • ESP • Verbosity • TagATune • Other games • Peekaboom • Phetch
ESP • The first GWAP, developed by von Ahn and his group (2003/2004) • The problem: obtain accurate descriptions of images to be used • To train image search engines • To develop machine learning approaches to vision • The goal: label the majority of the images on the Web
ESP: THE GAME • Two partners are picked at random from the large number of players online • They are not told who their partner is, and cannot communicate with them • They are both shown the same image • The goal: guess how their partner will describe the image, and type that description • Hence, the ESP game • If any string typed by one player matches a string typed by the other, both score points (see the matching sketch below)
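A minimal sketch, in Python, of how this output-agreement check might work. The normalization and function names are illustrative and are not taken from the actual ESP implementation.

```python
# Hypothetical sketch of the ESP agreement check (not the real implementation).
# Labels are normalized before comparison; the round is won as soon as the two
# players have typed any label in common.

def normalize(label: str) -> str:
    """Lower-case and strip whitespace so 'Dog ' and 'dog' match."""
    return label.strip().lower()

def agreed_label(guesses_a: list[str], guesses_b: list[str]) -> str | None:
    """Return the first label typed by both players, or None if no match yet."""
    seen_b = {normalize(g) for g in guesses_b}
    for g in guesses_a:
        if normalize(g) in seen_b:
            return normalize(g)
    return None

# Example: the players agree on 'dog', so they score for this image.
print(agreed_label(["puppy", "Dog"], ["animal", "dog", "pet"]))  # -> 'dog'
```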
THE CHALLENGE: SCORES • One of the motivating factors is to try to score as many points as possible • Hourly, daily, weekly, and monthly scores are shown
THE CHALLENGE: TIMING • Partners try to agree on as many images as they can during 2½ minutes • The thermometer on the side indicates how many images they have agreed on • If they agree on 15 images they score bonus points
TABOO WORDS • To ensure the production of a large number of specific labels, some words are declared TABOO and not allowed • Taboo words are obtained from the game itself: any word that has been agreed upon by players who were shown a picture earlier becomes a taboo word for that image
GOOD LABELS, COMPLETING AN IMAGE • A label is considered “good” when more than N players produce it (with N a parameter of the game) • An image is “done” when its list of taboo words is so extensive that most players pass on it (a sketch of both mechanisms follows)
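An illustrative sketch, assuming simple in-memory dictionaries, of how the taboo-word and good-label bookkeeping could be combined; the function names, the threshold value, and the "done" heuristic are hypothetical, not von Ahn's code.

```python
from collections import defaultdict

# Hypothetical bookkeeping: count how often each label is produced for an image,
# turn agreed-upon labels into taboo words, and treat frequent labels as "good".

N = 2  # agreement threshold; a tunable parameter of the game

label_counts = defaultdict(lambda: defaultdict(int))  # image_id -> label -> count
taboo_words = defaultdict(set)                        # image_id -> taboo labels

def record_label(image_id, label):
    """Called whenever a player types a (non-taboo) label for an image."""
    if label not in taboo_words[image_id]:
        label_counts[image_id][label] += 1

def record_agreement(image_id, label):
    """Called when a pair agrees on a label: it becomes taboo for that image."""
    taboo_words[image_id].add(label)

def good_labels(image_id):
    """Labels produced by more than N players are treated as reliable."""
    return [l for l, c in label_counts[image_id].items() if c > N]

def looks_done(image_id, pass_fraction):
    """An image can be retired once most players pass on it."""
    return pass_fraction > 0.5  # illustrative threshold
```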
IMPLEMENTATION • Pre-recorded game play • Especially at the beginning, and at quiet times, there won’t always be players to pair with • In these cases a player is paired against a recorded ‘hand’ from a previous game with the same picture • Cheating • Players could cheat in a number of ways, including agreeing on labels in advance or playing against themselves • A number of mechanisms are in place to counter these cases • Selecting images
SOME STATISTICS • In the 4 months between August 9th 2003 and December 10th 2003 • 13,630 players • 1.2 million labels for 293,760 images • 80% of players played more than once • By 2008: • 200,000 players • 50 million labels
ANALYSIS • The numbers indicate that the game is fun to play • Exciting factors: • Playing with a partner • Playing against time
QUALITY OF THE LABELS • For IMAGE SEARCH: • choose 10 of the labels produced and check which images a search for them returns • Compare labels produced by players with labels produced by participants in an experiment • 15 participants, 20 images drawn from the 1,000 with more than 5 labels • 83% of game labels were also produced by participants • Manual assessment of labels (‘would you use these labels to describe this image?’) • 15 participants, 20 images • 85% of words rated useful
VERBOSITY • … or, the game approach to collecting commonsense knowledge • Motivation: slow progress both on CYC (5 million facts collected) and on Open Mind Commonsense (around 700,000 facts)
THE GAME • Based on an existing game, TABOO: • Players have to guess a word • One of the players gives hints concerning the word • In Verbosity, there are two players, the DESCRIBER and the GUESSER, and a SECRET WORD
TEMPLATES IN VERBOSITY • As in Open Mind Commonsense, templates are used to ensure that the relations / properties of interest are collected • The Describer produces hints by filling in a template
TEMPLATES • _ is a kind of _ • _ is used for _ • _ is typically near/in/on _ • _ is the opposite of _ / _ is related to _
EMULATION • As in the ESP game, pre-recorded games are used when a player cannot be paired with another player • The asymmetry of the game causes a problem not encountered in the ESP game • Describer: can simply replay the behavior of a previous describer • Guesser: not so easy
RESULTS • The only published results I’m aware of predate the actual release of the game, so I don’t know about the QUANTITY • Quality: • Six raters were asked whether 200 facts collected using Verbosity are ‘true’ • Around 85% success
PEEKABOOM • Objective: collect data about the presence of objects in images in order to train vision algorithms for object detection
THE GAME • Two players • They take turns at playing ‘Peek’ and ‘Boom’ • ‘Boom’ gets a picture with an associated word; ‘Peek’ has to guess the associated word • ‘Boom’ reveals parts of the picture to ‘Peek’ by clicking on it (each click reveals a circular area with a 20-pixel radius; see the sketch below)
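A small sketch of how clicks might translate into revealed pixels, assuming the image is a simple pixel grid. The 20-pixel radius comes from the slide; the data structures and names are illustrative, not Peekaboom's actual code.

```python
import numpy as np

RADIUS = 20  # pixels revealed around each click (from the slide)

def reveal_mask(height: int, width: int, clicks: list[tuple[int, int]]) -> np.ndarray:
    """Return a boolean mask where True marks pixels visible to 'Peek'."""
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=bool)
    for cy, cx in clicks:
        # reveal every pixel within RADIUS of the click position
        mask |= (ys - cy) ** 2 + (xs - cx) ** 2 <= RADIUS ** 2
    return mask

# Example: two nearby clicks reveal roughly 2 * pi * 20**2 pixels, minus overlap.
m = reveal_mask(480, 640, [(100, 200), (110, 210)])
print(m.sum(), "pixels revealed")
```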
IMPLEMENTATION • Images and their labels come from ESP • Cheating: • Player queue (wait until next ‘matching interval’ – one every 10 seconds – to start playing) • IP address checks (to make sure players are not paired with themselves) • Blocking bots: ‘seed images’ (previously annotated) and blacklist
EVALUATION: USER STATISTICS • Usage: • 1 month in 2005 • 14,153 players • 1,122,998 completed rounds • The average player played around 158 images (about 72 minutes)
EVALUATION: ACCURACY OF DATA • Accuracy of bounding boxes • Choose 50 images played by at least two pairs • Have four volunteers draw bounding boxes • OVERLAP(A,B) = AREA(A∩B) / AREA(A∪B) • Average: 0.75 • Accuracy of pings • 50 images as above • Three subjects decide if a ping is ‘inside the object’ • Result: 100% (a sketch of the OVERLAP measure follows)
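A hedged sketch of the overlap score on the slide, OVERLAP(A,B) = AREA(A∩B)/AREA(A∪B), i.e. intersection over union for axis-aligned bounding boxes; the (x_min, y_min, x_max, y_max) representation is an assumption for illustration.

```python
def overlap(a, b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # intersection rectangle (zero area if the boxes do not overlap)
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Identical boxes score 1.0; disjoint boxes score 0.0.
print(overlap((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```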
SOME GENERAL LESSONS • von Ahn &amp; Dabbish (2008) discuss the general approach and some lessons they drew from their work
THREE TEMPLATES • OUTPUT AGREEMENT GAMES • Generalization of ESP • INVERSION-PROBLEM GAMES • INPUT-AGREEMENT GAMES
OUTPUT AGREEMENT GAMES • Two strangers are chosen among all potential players. They cannot see each other or communicate with each other. • In each round, both are given the same input • Game instructions say that players should produce the same output as their partners • Winning condition: they produce the same output, possibly after a few attempts. E.g.: the ESP GAME.
INVERSION PROBLEM GAMES • Two strangers are chosen among all potential players. They cannot see each other or communicate with each other. • In each round, one player is designated as the DESCRIBER whereas the other is designated as the GUESSER. The output from the describer should help the guesser guess the original input • WINNING CONDITION: The guesser correctly guesses the input originally assigned to the describer. E.g.: VERBOSITY. Based on ‘20 Questions’.
INPUT AGREEMENT GAMES • Two strangers are chosen among all potential players. They cannot see each other or communicate with each other. • In each round, both are given input that is known by the game (but not by the players) to be the same or different • Game instructions say that players should produce output describing their input so that they can decide whether their inputs are the same or different • Winning condition: the playing partners correctly decide whether their inputs are the same or different. E.g.: TagATune.
INCREASE ENJOYMENT • Games designed so as to make the task enjoyable • GWAPs by von Ahn et al. attempt to do this by giving players a CHALLENGE: • TIMED RESPONSE • SCORE KEEPING • SKILL LEVELS • HIGH-SCORE LISTS
OUTPUT ACCURACY • Mechanisms to ensure correctness and avoid collusion (e.g., players always producing the same label) • Random matching (players don’t know each other’s identity) • Player testing (assess the quality of a particular player’s input by matching their output against already annotated data; see the sketch below) • Repetition (output only considered correct if many players produce it) • Taboo
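An illustrative sketch of the player-testing idea: occasionally give a player an already-annotated "seed" item and compare their output against the known-good labels to estimate their reliability. The function and the acceptance threshold are hypothetical, not from the paper.

```python
def player_accuracy(outputs: dict[str, str], gold: dict[str, set[str]]) -> float:
    """Fraction of seed items on which the player's label matches a gold label.

    outputs: item_id -> label produced by the player
    gold:    item_id -> set of labels already known to be correct (seed data)
    """
    tested = [item for item in outputs if item in gold]
    if not tested:
        return 1.0  # no seed items seen yet; give the benefit of the doubt
    correct = sum(outputs[item] in gold[item] for item in tested)
    return correct / len(tested)

# A player whose accuracy falls below some threshold (e.g. 0.5) could have
# their contributions discarded or down-weighted.
```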
MISCELLANEOUS • Other useful ideas • Evaluation: • Efficiency: THROUGHPUT (T) • ‘Enjoyability’: AVERAGE LIFETIME PLAY (ALP) • Combined measure: EXPECTED CONTRIBUTION = T × ALP (a toy calculation follows)
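A toy calculation of the combined measure, with made-up numbers rather than figures from von Ahn &amp; Dabbish: throughput T in labels per human-hour, average lifetime play ALP in hours per player, and their product as the expected contribution of a single player.

```python
# Hypothetical figures, for illustration only.
throughput = 200.0        # labels produced per human-hour of play
avg_lifetime_play = 1.5   # hours an average player spends on the game over its lifetime

expected_contribution = throughput * avg_lifetime_play
print(f"Expected contribution: {expected_contribution:.0f} labels per player")  # 300
```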
OTHER GAMES • On gwap.com • TagATune • Elsewhere: • FoldIt • Karaoke Callout • Phetch • Spectral Game