Crowdsourcing for NLP Using Amazon Mechanical Turk and CrowdFlower Matteo Negri and Yashar Mehdad

Crowdsourcing • Wikipedia: Crowdsourcing is the act of outsourcing tasks, traditionally performed by an employee or contractor, to a large group of people or community (a crowd), through an open call.

Crowdsourcing services • Web and Logo Design: 99designs (>72000 designers, from $150) • Brand names: namethis ($99 for the best 3 names after a 48 hour contest/voting session) • Business innovation: Chaordix (engage the crowd via the web to “submit, discuss, refine and rank ideas…”) • Advertising: Poptent (“connects video creators with Top Brands…”) • Software & usability testing: uTest (>18000 professionals to test Web, mobile, gaming and desktop apps) • Brainstorming / feedback: kluster (“brainstorming ideas from trusted people”) • Product redesign: redesignme (“…actively seeks out badly-designed products…users are then invited to complete design challenges”) • … • Data cleansing & entry / content creation: Amazon’s Mechanical Turk, CrowdFlower

MTurk & CF Ipeirotis, 2010. New demographics of Mechanical Turk. • MTurk (www.mturk.com) launched in 2005 • Directly accessible only to US requesters • > 500,000 workers from >100 countries • CF (www.crowdflower.com) launched in 2007 • Channel to MTurk accessible to non-US requesters

MTurk & CF [Diagram labels: Requester, HITs, Completed HITs, Workers] • Basic unit of work: "Human Intelligence Task" (HIT) • Simple, repetitive, hard to automate tasks • Prices from $0.01 to $10 (the end of un-supervised learning?) • Requester • Prepay the money • Publish HITs • Get results • Worker (aka “turker”) • Complete the HITs • Get paid

Sample HITs from MTurk (July 2, 2010) 199,799 HITs • Transcribe this audio into text (audio length: 1h3'41''). $13.37 • Visit the given website and complete the short survey. About 5 minutes to complete. $1.00 • Tweet a specified message on your valid Twitter account with at least 200 followers. $1.00 • Share Your Room Painting Project (photo + description). $1.00 • Sell me your old college/university writing assignments and summaries (400+ words). I am looking for original writing done about university-level topics & readings. $0.50 • Share a 16th birthday party idea. 300+ words. $0.50 • Click a link to a website, enter your zip code, click submit to test (Takes 10 Seconds). $0.50 • Provide on my website quality improvement tip for Singers and aspiring vocalist looking for vocal training tips. $0.40 • How good is your Refrigerator model? Share your experience! $0.25 • Tell us a true, interesting story from your life about acne, pimples, zits, etc., like products you tried, bad dates, embarrassing moments, etc. $0.10 • Download and rate my free Android App. $0.01 • Adult/inappropriate video identification. You will view or scrub this video and decide if it contains adult material. $0.01

Sample NLP HITS 1 • Corpus collection • Given a topic, prepare a brief speech expressing your true opinion on the topic. Next, prepare a second brief speech expressing the opposite of your opinion • Word Sense Disambiguation • Given a text passage containing a target word w, select w’s most appropriate sense from a list • Word similarity • Assign numeric judgments of word similarity for 30 word pairs on a scale of [0,10] • Textual Entailment • Given two sentences, choose whether the second sentence can be inferred from the first. • Answer quality evaluation • Given a question-answer pair, rate the following 13 statements on scale of 1 to 5: “This answer provides enough information for the question”, “this is an easy to read answer”, … • Sentiment/polarity/bias classification • Given a list of short headlines, assign numeric judgments in the interval [0,100] rating the headline for six emotions (anger, disgust, fear, joy, sadness, surprise) and a single numeric rating in the interval [-100,100] to denote the overall positive or negative valence of the emotional content of the headline

Sample NLP HITS 2 • Machine Translation evaluation • Given a source text, rank each of the 5 translations from Best to Worst • Speech transcription • Listen to the utterance by using the audio player embedded in the task web page, and transcribe every audible word. You can replay the audio as many times as necessary to produce a satisfactory transcript. • Temporal ordering of events • Given a verb event pair, make a binary choice on whether the event described by the first verb occurs before or after the second. • Relation extraction • Given a text passage with two highlighted terms, indicate if one of the following relations holds between them: …

Sample NLP HITS 3 JAVASCRIPT API • Word alignment • link words in the source sentence to one or more target words or the empty word.

Popular, simple, fast, cheap, … BUT tricky!!! • How to design HITs? • …to attract turkers • …to collect reliable data • …to boost speed • How to price HITs? • How to ensure quality control? • …to weed out untrustworthy workers • …to weed out spammers/cheaters • …to avoid wasting money

A bunch of hints • Keep your HIT simple and concise • Difficult tasks = low agreement, few reliable results, slow progress • Try different settings before launching a big job • Different definitions of your HIT • Different payment amounts • Make cheating a hard task • Make successful completion with random clicks impossible • Use a gold standard • Use regional qualifications • Define your HIT in the appropriate language • Transform texts into images

The importance of gold data 1 • HIT: Transcribe this audio into text (audio length: 1h3'41''). $13.37 • Submitted answer: “Agfdagfa ah ah ah!” Valid result without gold standard!!! • Using a gold standard is optional but REMEMBER THAT: • You are going to pay only for successfully completed HITs!!! • MTurk: +10% on top of the price of successfully completed HITs • CF: +30% (!) • You need a criterion to discriminate between successfully and unsuccessfully completed HITs • No criterion = ALL results are good (and paid!)
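The fee arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes the surcharge is a flat percentage of the reward of each approved HIT, and the names `FEE_RATES` and `requester_charge` are hypothetical, not part of any MTurk or CrowdFlower API. The rates are the ones quoted on the slide.

```python
# Surcharge rates quoted on the slide, as a fraction of the reward of approved HITs.
FEE_RATES = {"MTurk": 0.10, "CF": 0.30}

def requester_charge(reward, approved_hits, platform):
    """Amount charged to the requester: rewards for approved HITs plus the platform fee."""
    base = reward * approved_hits
    return round(base * (1 + FEE_RATES[platform]), 2)

# The transcription HIT from the slide: one HIT at $13.37.
# With no gold standard there is no rejection criterion, so even a junk
# answer counts as approved and is charged in full.
for platform in FEE_RATES:
    print(platform, requester_charge(13.37, approved_hits=1, platform=platform))
```

Under these assumptions the single junk transcription would cost the requester about $14.71 through MTurk and about $17.38 through CF.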

The importance of gold data 2 HIT: given two English words A and B, decide if they can be synonyms or not Data to be annotated No criterion=ALL results are good (and paid!)…another example

The importance of gold data 2 HIT: given two English words A and B, decide if they can be synonyms or not Valid results without gold standard!!! No criterion=ALL results are good (and paid!)…another example

Adding gold units 1 HIT: given two English words A and B, decide if they can be synonyms or not car automobile Data to be annotated Gold units volume table Sometimes it’s easy: gold units can be merged with the required annotations

Adding gold units 1 HIT: given two English words A and B, decide if they can be synonyms or not car automobile Gold units volume table • Worker #67911: Judgments made: 7, Gold seen: 2 / missed: 1, Trust: 50% • Sometimes it’s easy: gold units can be merged with the required annotations
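The worker statistics on this slide are consistent with a trust score equal to the fraction of gold units answered correctly. The sketch below makes that assumption explicit; the exact formula CrowdFlower uses is not given in the slides, and the function name `trust` is hypothetical.

```python
def trust(gold_seen, gold_missed):
    """Fraction of seen gold units answered correctly (assumed definition of trust).

    Returns None when the worker has not seen any gold unit yet.
    """
    if gold_seen == 0:
        return None
    return (gold_seen - gold_missed) / gold_seen

# Worker #67911 from the slide: 2 gold units seen, 1 missed.
print(f"{trust(2, 1):.0%}")  # prints 50%
```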

Adding gold units 2 • HIT 1: translate the given English sentence into Spanish • HIT 2: summarize a 300-word story • HIT 3: Given a list of headlines, assign a numeric rating in the interval [-100,100] to denote the overall positive or negative valence of the emotional content of the headline • One valid output Vs. multiple valid outputs • Known output Vs. unknown output • Data annotation Vs. survey/content creation • Sometimes it’s harder: gold units cannot be directly merged with the required annotations

Adding gold units 2 HIT 1: translate the given English sentence into Spanish PROBLEM: Since there’s not ONE single good translation, we cannot directly check the quality of turkers’ work through comparison with a gold reference translation Sometimes it’s harder: gold units cannot be directly merged with the required annotations

Adding gold units 2 HIT 1: translate the given English sentence into Spanish • HIT 1.0 (gold units): given two sentences, S1 in English and S2 in Spanish, decide if S2 is a correct translation of S1. • HIT 1.1 (data to be collected): translate the given English sentence S3 into Spanish. • Sometimes it’s harder: gold units cannot be directly merged with the required annotations • Possible solution: a 2-step HIT (validation over gold units + translation)
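One way to read the 2-step HIT is as a filter: a worker's free-form translation is retained only if the same worker answered the accompanying gold validation question correctly. The sketch below illustrates that reading. The gold pairs, the submission records, and the function name `keep_translations` are all hypothetical examples, not data or code from the original experiment.

```python
# Hypothetical gold units for the validation step:
# (English sentence, Spanish sentence) -> is the Spanish a correct translation?
GOLD = {
    ("The house is red.", "La casa es roja."): True,
    ("The house is red.", "El perro es grande."): False,
}

def keep_translations(submissions):
    """Keep a translation only when the worker's gold validation answer is correct."""
    kept = []
    for sub in submissions:
        if GOLD[sub["gold_pair"]] == sub["gold_answer"]:
            kept.append(sub["translation"])
    return kept

submissions = [
    # Correct on the gold pair: the translation is retained.
    {"gold_pair": ("The house is red.", "La casa es roja."),
     "gold_answer": True, "translation": "El gato duerme."},
    # Wrong on the gold pair: the translation is discarded.
    {"gold_pair": ("The house is red.", "El perro es grande."),
     "gold_answer": True, "translation": "asdf"},
]
print(keep_translations(submissions))  # prints ['El gato duerme.']
```

The design choice here is that the gold check guards the collected data indirectly: the translation itself is never compared with a reference, only the worker's reliability on a task with a known answer.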

AMT Vs CF

Next steps • Creation/publication of a job • A simple task: word similarity • Monitoring your job

Terminology • Unit (HIT) • Basic task given to each worker. • Assignment • Number of units each worker will do at a time. • Judgment • Completion of an assignment by an individual worker. • Job • Your published assignments waiting for judgment. • Cost = # Assignments * # Judgments per assignment * Pay per assignment
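The cost formula above translates directly into code. In this minimal sketch the function name `job_cost` and the example figures are hypothetical, and platform fees are not included.

```python
def job_cost(assignments, judgments_per_assignment, pay_per_assignment):
    """Cost = # Assignments * # Judgments per assignment * Pay per assignment."""
    return assignments * judgments_per_assignment * pay_per_assignment

# Hypothetical job: 200 assignments, 5 judgments each, $0.02 per assignment.
print(round(job_cost(200, 5, 0.02), 2))
```

Because the three factors multiply, collecting redundant judgments (for example 5 instead of 1) scales the budget by the same factor, which is worth checking before ordering a large job.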

Creating a new job: word similarity HIT: select from a list of terms the most similar to the one extracted from the given sentence • Task: Given a sentence containing a term t, choose among a list of 3 terms t1, t2, t3 the most similar to t. • Note: One valid output → simple gold standard creation! • Gold units can be easily merged with the required annotations • 1-step HIT

A closer look at

Creating a job

1: upload data

2: define your HIT

3: calibration (optional)

4: ordering Gambit: payments company for social games! Players are paid with “chips” for taking simple, online jobs… • NOTE: • MTurk +10% over the price of successfully completed HITs • CF +30% (!)

Checking a job (progress and results)

Summary page

Preview

Workers NOTE: Only workers having seen at least 4 gold units, with >= 70% Trust are paid (and their work is retained)!
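The payment rule in this note can be expressed as a simple predicate. The thresholds are the ones stated on the slide; the worker records and the name `is_paid` are hypothetical, and trust is assumed to be given as a fraction between 0 and 1.

```python
MIN_GOLD_SEEN = 4   # at least 4 gold units seen
MIN_TRUST = 0.70    # trust of at least 70%

def is_paid(gold_seen, trust):
    """True when the worker meets both thresholds, so the work is paid and retained."""
    return gold_seen >= MIN_GOLD_SEEN and trust >= MIN_TRUST

# Hypothetical workers: id -> (gold units seen, trust)
workers = {"w1": (6, 0.83), "w2": (2, 0.50), "w3": (5, 0.60)}
paid = [w for w, (seen, t) in workers.items() if is_paid(seen, t)]
print(paid)  # prints ['w1']
```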

A trustable worker

Issues • How to design HITS? • …to attract turkers • …to collect reliable data • …to boost speed • How to price HITS? • What can we do with low budget? • Quality control, cheating/spam detection • Experts Vs non experts (correlation between the two groups, what to expect from non experts)

A recent experience…

Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush • T: Wolfgang Amadeus Mozart was born in Salzburg. H: Mozart was born in Austria. • T: Wolfgang Amadeus Mozart was born in Salzburg. H: Mozart nació en Austria. • NAACL 2010 Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk • Joint work with Yashar Mehdad

Monolingual TE Corpus (PASCAL RTE3) CLTE Corpus (English-Spanish) Validated T-H pairs Translation HIT Validation HIT Translated T-H pairs

DAY 1 $24 Naïve methodology No qualification mechanisms • Very fast and cheap: • $12 for 800 translations in 1 hour • $12 for 5*800 validations in ~6 hours • Poor quality of the results (61% rejections) • Need for gold standard units!

DAYS 2-7 $58 Improving validation • Gold units (50 positive/negative examples) • Task definition in Spanish • Better results…still at low cost • 97% accuracy on 20% of the retained translations • +25% in the validation costs • Considerable increase in duration • 4 days for the first iteration (many rejected judgments, automatic pausing mechanism in CF) • Need for qualification mechanisms! More money to boost speed!

DAYS 8-10 $99.75 Improving translation • Gold units (validity check) • Regional qualification, as in MTurk (upon request) • Payment increase • Better results… • Fewer rejections (45%) • Automatic pausing avoided • Faster procedure • Doubling the payment halved the completion time

Summary • 800 English pairs (RTE3 Development Set) • 426 validated English/Spanish pairs in our CLTE Corpus • $99.75 spent to define a reliable and fast procedure • translation/validation cycles • non-redundant acquisitions • systematic use of gold units • simple binary decisions • Cost-effective solution • $30 to create the full corpus of 800 pairs • Some limitations found in the CrowdFlower service • lack of regional qualification (only available upon request) • lack of other qualification mechanisms • automatic pausing mechanisms

MTurk & CF Ipeirotis, 2010. New demographics of Mechanical Turk. • MTurk (www.mturk.com) launched in 2005 • Directly accessible only to US requesters • Workers from >100 countries • > 500,000 workers • ~47% from US (34% from India) • ~68% women • ~52% 22-40 years • ~70% to spend free time fruitfully (~15% for “primary” income purposes) • ~25% for 4-8 hours per week • ~60% earning less than $10 per week • >50% with college education • CF (www.crowdflower.com) launched in 2007 • Channel to MTurk accessible to non-US requesters
