Crowdsourcing • Introduction • Client Motivations • Tasks Categories • Crowd Motivation • Pros & Cons • Quality Management • Scale up with Machine Learning • Workflows for Complex Tasks • Market Evolution • Reputation Systems • ECCO, March 20, 2011 • corina.ciechanow@pobox.com • http://bitsofknowledge.waterloohills.com
Introduction • June 2006: Jeff Howe coined the term in his Wired magazine article "The Rise of Crowdsourcing". • Elements: at least two actors: a Client/Requester and a Crowd or community (an online audience); and a Challenge: what has to be done (a need, a task, etc.) and a reward (money, a prize, other motivators).
Ex: “Adult Websites” Classification • Large number of sites to label • Get people to look at sites and classify them as: • G (general audience) • PG (parental guidance) • R (restricted) • X (porn) [Panos Ipeirotis. WWW2011 tutorial]
Ex: “Adult Websites” Classification • A large number of sites need to be hand‐labeled • Get people to look at sites and classify them as: • G (general audience) • PG (parental guidance) • R (restricted) • X (porn) Cost/Speed Statistics: • Undergrad intern: 200 websites/hr, cost: $15/hr • MTurk: 2500 websites/hr, cost: $12/hr [Panos Ipeirotis. WWW2011 tutorial]
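A rough back-of-the-envelope check on those figures (the numbers come from the slide above; the script is just illustrative arithmetic):

```python
# Cost/speed comparison for the two labeling options quoted above.
options = {
    "Undergrad intern": {"sites_per_hr": 200, "cost_per_hr": 15.0},
    "MTurk": {"sites_per_hr": 2500, "cost_per_hr": 12.0},
}

for name, o in options.items():
    cost_per_site = o["cost_per_hr"] / o["sites_per_hr"]
    print(f"{name}: ${cost_per_site:.4f}/site at {o['sites_per_hr']} sites/hr")

# Undergrad intern: $0.0750/site at 200 sites/hr
# MTurk:            $0.0048/site at 2500 sites/hr  (roughly 15x cheaper, 12x faster)
```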
Client motivation • Need suppliers for: • Mass work, distributed work, or just tedious work • Creative work • Finding specific talent • Testing • Support • Offloading peak demands • Tackling problems that need specific communities or human variety • Any work that can be done cheaper this way.
Client motivation • Need customers! • Need Funding • Need to be Backed up • Crowdsourcing is your business!
Crowd Motivation • Money €€€ • Self-serving purposes (learning new skills, gaining recognition, avoiding boredom, enjoyment, building a network with other professionals) • Socializing, a feeling of belonging to a community, friendship • Altruism (public good, helping others)
Crowd Demography (background defines motivation) • The 2008 survey at iStockphoto indicates that the crowd is quite homogeneous and elite. • Amazon’s Mechanical Turk workers come mainly from two countries: a) USA, b) India
Client Tasks Parameters • 3 main goals for a task to be done: • Minimize Cost (cheap) • Minimize Completion Time (fast) • Maximize Quality (good) • The client has other goals when the crowd is not just a supplier
Pros • Quicker: parallelism reduces time • Cheap, even free • Creativity, innovation • Quality (it depends) • Availability of scarce resources: taps into the ‘long tail’ • Multiple feedback • Allows you to build a community (followers) • Business agility • Scales up!
Cons • Lack of professionalism: unverified quality • Too many answers • No standards • No organisation of answers • Not always cheap: added costs to bring a project to conclusion • Too few participants if the task or pay is not attractive • Unmotivated workers deliver lower-quality work
Cons • Global language barriers. • Different laws in each country add complexity. • No written contracts, so no possibility of non-disclosure agreements. • Hard to maintain a long-term working relationship with workers. • Difficult to manage a large-scale crowdsourced project. • Can be targeted by malicious workers. • Lack of guaranteed investment, thus hard to convince stakeholders.
Quality Management Ex: “Adult Website” Classification • Bad news: Spammers! • Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience) [Panos Ipeirotis. WWW2011 tutorial]
Quality Management: Majority Voting and Label Quality • Ask multiple labelers, keep the majority label as the “true” label • Quality is the probability of being correct
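A minimal sketch of majority-vote aggregation for this kind of labeling task; the worker answers below are hypothetical:

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label among the workers' answers."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical worker answers for one website
answers = ["G", "G", "PG", "G", "R"]
print(majority_label(answers))  # -> "G"
```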
Dealing with Quality • Majority vote works best when workers have similar quality • Otherwise it is better to just pick the vote of the best worker • Or model worker qualities and combine the votes. Vote combination studies [Clemen and Winkler, 1999; Ariely et al., 2000] show that complex models work slightly better than a simple average, but are less robust. • Spammers try to go undetected • Well-meaning workers may have biases that are hard to tell apart from error.
Human Computation Biases • Anchoring Effect: “Humans start with a first approximation (anchor) and then make adjustments to that number based on additional information.” [Tversky & Kahneman, 1974] • Priming: exposure to one stimulus (such as stereotypes) influences the response to another [Shih et al., 1999] • Exposure Effect: familiarity leads to liking... [Stone and Alonso, 2010] • Framing Effect: presenting the same option in different formats leads to different answers. [Tversky and Kahneman, 1981] We need to remove such sequential effects from human computation data…
Dealing with Quality • Use this process to improve quality: 1. Initialize by aggregating labels (using majority vote). 2. Estimate error rates for workers (using the aggregated labels). 3. Update the aggregate labels (using the error rates, weight worker votes according to quality). Note: keep labels for “example data” unchanged. 4. Iterate from step 2 until convergence. • Or use an exploration‐exploitation scheme: Explore: learn about the quality of the workers; Exploit: label new examples using that quality. In both cases there is a significant advantage under bad conditions such as imbalanced datasets and bad workers.
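A minimal sketch of the iterate-until-convergence loop above, in the spirit of Dawid-and-Skene-style estimation. The items, workers, and labels are hypothetical, a single accuracy score stands in for a full per-worker confusion matrix, and the handling of gold “example data” is omitted:

```python
from collections import Counter, defaultdict

# votes[item] = {worker: label} -- hypothetical crowd data
votes = {
    "url1": {"w1": "G", "w2": "G", "w3": "X"},
    "url2": {"w1": "X", "w2": "X", "w3": "X"},
    "url3": {"w1": "PG", "w2": "G", "w3": "X"},
}

# Step 1: initialize aggregate labels with an unweighted majority vote
agg = {item: Counter(v.values()).most_common(1)[0][0] for item, v in votes.items()}

for _ in range(10):  # Step 4: iterate (capped here instead of testing convergence only)
    # Step 2: estimate each worker's accuracy against the current aggregate labels
    correct, total = defaultdict(int), defaultdict(int)
    for item, v in votes.items():
        for worker, label in v.items():
            total[worker] += 1
            correct[worker] += (label == agg[item])
    accuracy = {w: correct[w] / total[w] for w in total}

    # Step 3: recompute aggregate labels, weighting each vote by the worker's accuracy
    new_agg = {}
    for item, v in votes.items():
        weights = defaultdict(float)
        for worker, label in v.items():
            weights[label] += accuracy[worker]
        new_agg[item] = max(weights, key=weights.get)

    if new_agg == agg:  # converged
        break
    agg = new_agg

print(agg)
print(accuracy)
```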
Effect of Payment: Quality • Paying more does not improve quality [Mason and Watts, 2009; AdSafe] • Similar results for bigger tasks [Ariely et al., 2009] [Panos Ipeirotis. WWW2011 tutorial]
Effect of payment on number of tasks • Payment incentives do increase speed, though [Panos Ipeirotis. WWW2011 tutorial]
Optimizing Quality • Quality tends to remain the same, independent of completion time [Huang et al., HCOMP 2010]
Scale Up with Machine Learning • Build an ‘Adult Website’ Classifier • Crowdsourcing is cheap but not free: it cannot scale to the whole web without help • Build automatic classification models using examples from crowdsourced data
Integration with Machine Learning • Humans label training data • Use training data to build model
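A minimal sketch of that hand-off, assuming scikit-learn is available; the page texts and the aggregated crowd labels are hypothetical stand-ins for the real crowdsourced data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical crowd-labeled training data: page text -> aggregated label
pages = [
    "poker bonus casino jackpot ...",
    "kids games puzzles and coloring pages ...",
    "explicit adult content ...",
]
labels = ["R", "G", "X"]

# Humans label the training data; the model then generalizes to the rest of the web
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(pages, labels)

print(model.predict(["free casino games for kids ..."]))
```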
Dealing w/ Quality in Machine Learning • Noisy labels degrade task performance: as labeling quality increases, classification quality increases
Tradeoffs for Machine Learning Models • Get more data → improve model accuracy • Improve data quality → improve classification
Tradeoffs for Machine Learning Models • Get more data: Active Learning, select which unlabeled example to label next [Settles, http://active-learning.net/] • Improve data quality: Repeated Labeling, label an already-labeled example again [Sheng et al. 2008, Ipeirotis et al, 2010]
Model Uncertainty (MU) • Model uncertainty: get more labels for instances that cause model uncertainty • For modeling: why improve training-data quality where the model is already certain? (“Self‐healing” process: [Brodley et al., JAIR 1999], [Ipeirotis et al., NYU 2010]) • For data quality: low‐certainty “regions” may be due to incorrect labeling of the corresponding instances
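A minimal sketch of uncertainty-based selection that serves both ideas, assuming a fitted scikit-learn-style model that exposes predict_proba; the pool is whatever unlabeled (or already-labeled) data is available:

```python
import numpy as np

def most_uncertain(model, pool, k=5):
    """Return indices of the k pool items the model is least sure about
    (smallest margin between the top two predicted class probabilities)."""
    proba = model.predict_proba(pool)
    top_two = np.sort(proba, axis=1)[:, -2:]
    margin = top_two[:, 1] - top_two[:, 0]
    return np.argsort(margin)[:k]

# Active learning: send the selected unlabeled items to the crowd for a first label.
# Repeated labeling: run the same selection over already-labeled items to decide
# which ones deserve an extra label.
```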
Quality Rule of Thumb • With high-quality labelers (80% and above): one worker per case (more data is better) • With low-quality labelers (~60%): multiple workers per case (to improve quality) [Sheng et al., KDD 2008; Kumar and Lease, CSDM 2011]
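A small worked check of this rule of thumb: the probability that a strict majority of n independent workers is correct, at 60% vs 80% individual accuracy (plain binomial arithmetic; odd n so ties cannot occur):

```python
from math import comb

def majority_correct(p, n):
    """P(a strict majority of n workers is correct) when each is correct with prob p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

for p in (0.6, 0.8):
    print(p, [round(majority_correct(p, n), 3) for n in (1, 3, 5, 9)])
# p=0.8: a single label is already right 80% of the time, so more items beats more votes.
# p=0.6: it takes 5-9 workers per item just to push the majority label to ~0.68-0.73.
```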
Complex tasks: Handle answers through a workflow • Q: “My task does not have discrete answers….” • A: Break it into two Human Intelligence Tasks (HITs): a “Create” HIT and a “Vote” HIT • Voting controls the quality of the Create HIT • Redundancy controls the quality of the Vote HIT • Catch: if the “creation” is very good, voting workers just vote “yes” – Solution: add some random noise (e.g. add typos)
Photo description • But a free-form answer can be more complex, not just right or wrong… • TurKit toolkit [Little et al., UIST 2010]: http://groups.csail.mit.edu/uid/turkit/
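A minimal sketch of the TurKit-style improve-then-vote loop; post_improve_hit and post_vote_hit are hypothetical stand-ins for the HIT-posting calls of whatever platform is actually used:

```python
# Hypothetical HIT-posting calls -- replace with the real platform API.
def post_improve_hit(image, current_description):
    """Create/Improve HIT: ask one worker to write or improve the description."""
    raise NotImplementedError

def post_vote_hit(image, old_description, new_description):
    """Vote HIT: ask several workers which version is better; True if the new one wins."""
    raise NotImplementedError

def iterative_description(image, rounds=5):
    """TurKit-style loop: improve, then keep the change only if voters prefer it."""
    description = ""
    for _ in range(rounds):
        candidate = post_improve_hit(image, description)
        if post_vote_hit(image, description, candidate):
            description = candidate
    return description
```

The successively richer description versions on the next slide are the kind of output such a loop produces.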
Description Versions • A partial view of a pocket calculator together with some coins and a pen. • ... • A close‐up photograph of the following items: A CASIO multi‐function calculator. A ball point pen, uncapped. Various coins, apparently European, both copper and gold. Seems to be a theme illustration for a brochure or document cover treating finance, probably personal finance. • … • A close‐up photograph of the following items: A CASIO multi‐function, solar powered scientific calculator. A blue ball point pen with a blue rubber grip and the tip extended. Six British coins; two of £1 value, three of 20p value and one of 1p value. Seems to be a theme illustration for a brochure or document cover treating finance, probably personal finance.
Collective Problem Solving • Exploration/exploitation tradeoff (independence or not): sharing good solutions can accelerate learning, but it can also lead to premature convergence on a suboptimal solution [Mason and Watts, submitted to Science, 2011]
Independence or Not? • Building iteratively (giving up independence) produced better outcomes for the image description task; in the FoldIt game, workers built on each other’s results • But lack of independence may cause high dependence on starting conditions and create groupthink [Little et al, HCOMP 2010]
Group Effect • Individual search strategies affect group success: players who copy each other explore less, which lowers the probability of finding the peak in a given round
Workflow Patterns • Creation: Generate/Create • Find • Improve/Edit/Fix • Quality Control: Vote to accept or reject • Vote up / vote down to generate a ranking • Vote for best / select top‐k • Flow Control: Split task • Aggregate • Iterate
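A minimal sketch of how these patterns compose into a single flow (split, create, vote, aggregate); create_hit, vote_hit, split, and aggregate are hypothetical placeholders, not a real platform API:

```python
# Hypothetical platform helpers -- placeholders, not a real API.
def create_hit(chunk):
    """Ask one worker to produce an answer for this chunk (Generate/Create)."""
    raise NotImplementedError

def vote_hit(chunk, answer):
    """Ask one worker whether the answer is acceptable; returns True/False (Vote)."""
    raise NotImplementedError

def run_workflow(task, split, aggregate, redundancy=3):
    """Split -> Create -> Vote (quality control) -> Aggregate (flow control)."""
    accepted = []
    for chunk in split(task):                                         # Split task
        answer = create_hit(chunk)                                    # Generate / Create
        votes = [vote_hit(chunk, answer) for _ in range(redundancy)]  # Vote accept-reject
        if sum(votes) > redundancy // 2:                              # keep majority-approved answers
            accepted.append(answer)
    return aggregate(accepted)                                        # Aggregate
```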
AdSafe Crowdsourcing Experience • Detect pages that discuss swine flu • A pharmaceutical firm had a drug being used (off-label) to “treat” swine flu • The FDA prohibited pharmaceutical companies from displaying drug ads on pages about swine flu • Two days to comply! • A big fast-food chain does not want its ads to appear: • on pages that discuss the brand (99% negative sentiment) • on pages discussing obesity
AdSafe Crowdsourcing Experience: Workflow to classify URLs • Find URLs for a given topic (hate speech, gambling, alcohol abuse, guns, bombs, celebrity gossip, etc.): http://url‐collector.appspot.com/allTopics.jsp • Classify URLs into appropriate categories: http://url‐annotator.appspot.com/AdminFiles/Categories.jsp • Measure the quality of the labelers and remove spammers: http://qmturk.appspot.com/ • Get humans to “beat” the classifier by providing cases where the classifier fails: http://adsafe‐beatthemachine.appspot.com/
Market Design of Crowdsourcing • Aggregators: • Create a crowd or community. • Create a portal to connect clients to the crowd. • Handle the workflow of complex tasks, such as decomposition into simpler tasks and recomposition of the answers. • Allow anonymity. • Consumers can benefit from a crowd without the need to create it.
Market Design: Crude vs Intelligent Crowdsourcing • Intelligent crowdsourcing uses an organized workflow to tackle the cons of crude crowdsourcing: • The complex task is divided by experts and given to relevant crowds, not to everyone • Individual answers are recomposed by experts into a general answer • Usually covert
Lack of Reputation and the Market for Lemons • “When the quality of a sold good is uncertain and hidden before the transaction, the price falls to the value of the lowest-valued good” [Akerlof, 1970; Nobel prize winner] • Market evolution steps: 1. The employer pays $10 to a good worker, $0.1 to a bad worker. 2. 50% good workers, 50% bad; indistinguishable from each other. 3. The employer offers a price in the middle: $5. 4. Some good workers leave the market (the pay is too low). 5. The employer revises prices downwards as the share of bad workers increases. 6. More good workers leave the market… a death spiral. http://en.wikipedia.org/wiki/The_Market_for_Lemons
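A toy simulation of that death spiral, using the numbers from the steps above; the 50% attrition per round and the $10 reservation price of good workers are illustrative assumptions:

```python
good, bad = 50, 50                   # step 2: 50% good, 50% bad, indistinguishable
good_value, bad_value = 10.0, 0.1    # step 1: what each type is worth to the employer
reservation = 10.0                   # assumed: good workers will not work for less

for round_no in range(1, 6):
    share_good = good / (good + bad)
    offer = share_good * good_value + (1 - share_good) * bad_value  # step 3: price "in the middle"
    print(f"round {round_no}: {share_good:.0%} good workers, employer offers ${offer:.2f}")
    if offer >= reservation:
        break
    good //= 2                       # steps 4-6: some good workers leave, employer re-prices
    if good == 0:
        print("only bad workers remain -- a market for lemons")
        break
```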
Reputation systems • A great number of reputation mechanisms exist • Challenges in the design of reputation systems: • Insufficient participation • Overwhelmingly positive feedback • Dishonest reports • Identity changes • Value imbalance exploitation (“milking the reputation”)
Reputation systems • Dishonest reports: 1. eBay: “Riddle for a PENNY! No shipping – Positive Feedback”. An agreement is set up in order to be given unfairly high ratings. 2. “Bad‐mouthing”: the same setup, used instead to bad-mouth other sellers that they want to drive out of the market. • Design incentive‐compatible mechanisms to elicit honest feedback [Jurca and Faltings 2003: pay a rater if the report matches the next one; Miller et al. 2005: use a proper scoring rule to value the report; Papaioannou and Stamoulis 2005: delay the next transaction over time] [Panos Ipeirotis. WWW2011 tutorial]
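A minimal sketch of the simplest of those ideas, output agreement in the spirit of “pay the rater if the report matches the next one”; real peer-prediction mechanisms (e.g. Miller et al. 2005) use proper scoring rules, which this deliberately leaves out:

```python
import random

def output_agreement_payment(reports, reward=0.05):
    """Pay each rater `reward` iff their report matches a randomly chosen peer's report."""
    payments = {}
    for rater, report in reports.items():
        peer = random.choice([r for r in reports if r != rater])
        payments[rater] = reward if report == reports[peer] else 0.0
    return payments

# Hypothetical reports on one item
print(output_agreement_payment({"w1": "G", "w2": "G", "w3": "X"}))
```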
Reputation systems: Identity changes • “Cheap pseudonyms”: it is easy to disappear and re-register under a new identity at almost zero cost [Friedman and Resnick 2001]. • They introduce opportunities to misbehave without paying reputational consequences. • Remedies: increase the difficulty of online identity changes, or impose upfront costs on new entrants: allow new identities (forget the past) but make them costly to create.
Challenges for Crowdsourcing Markets • Two‐sided opportunistic behavior: 1. In e‐commerce markets, only sellers are likely to behave opportunistically. 2. In crowdsourcing markets, both sides can be fraudulent. • Imperfect monitoring and heavy‐tailed participation: verifying the answers is sometimes as costly as providing them, and sampling often does not work due to the heavy-tailed participation distribution (lognormal, according to self‐reported surveys) [Panos Ipeirotis. WWW2011 tutorial]