Crowdscreen: Algorithms for Filtering Data using Humans
Aditya Parameswaran, Stanford University
(Joint work with Hector Garcia-Molina, Hyunjung Park, Neoklis Polyzotis, Aditya Ramesh, and Jennifer Widom)
Crowdsourcing: A Quick Primer
Asking the crowd for help to solve problems.
Why? Many tasks are done better by humans:
• Is this a photo of a car?
• Pick the "cuter" cat
How? We use an internet marketplace.
• Requester: Aditya • Reward: $1 • Time: 1 day
Crowd Algorithms
• Working on fundamental data processing algorithms that use humans: Max [SIGMOD12], Filter [SIGMOD12], Categorize [VLDB11], Cluster [KDD12], Search, Sort
• Using human unit operations: Predicate Eval., Comparisons, Ranking, Rating
Goal: Design efficient crowd algorithms
Efficiency: Fundamental Tradeoffs
• Which questions do I ask humans?
• Do I ask in sequence or in parallel?
• How much redundancy in questions?
• How do I combine the answers?
• When do I stop?
Three competing dimensions:
• Cost: how much $$ can I spend?
• Latency: how long can I wait?
• Uncertainty: what is the desired quality?
Filter
[Figure: a dataset of items flows through predicates 1 through k; each predicate is a crowd question such as "Is this an image of Paris?", "Is the image blurry?", or "Does it show people's faces?"; items that satisfy all predicates form the filtered dataset]
Single predicate: does item X satisfy the predicate? (Yes/No)
Applications: Content Moderation, Spam Identification, Determining Relevance, Image/Video Selection, Curation, and Management, …
Parameters
• Given:
  • Per-question human error probability (FP/FN)
  • Selectivity
• Goal: Compose filtering strategies, minimizing across all items:
  • Overall expected cost (# of questions)
  • Overall expected error
Our Visualization of Strategies
[Figure: a grid with # of YESs on the vertical axis and # of NOs on the horizontal axis; each point is marked "continue", "decide PASS", or "decide FAIL"]
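To make the grid concrete, here is a minimal sketch in Python of how such a strategy can be represented and executed; the representation (a dict from grid points to decisions) and the ask() callback standing in for a crowd-marketplace query are my own illustrative choices, not the paper's.

import random

def run_strategy(strategy, ask):
    # Walk the grid: start at (0, 0), step up on each Yes answer and
    # right on each No, and stop at the first "decide" point.
    yes = no = 0
    while strategy[(yes, no)] == 'continue':
        if ask():
            yes += 1
        else:
            no += 1
    return strategy[(yes, no)]

# Example: a tiny strategy, run against a simulated worker who
# answers Yes 90% of the time.
strategy = {(0, 0): 'continue', (1, 0): 'pass', (0, 1): 'continue',
            (1, 1): 'pass', (0, 2): 'fail'}
print(run_strategy(strategy, ask=lambda: random.random() < 0.9))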
Common Strategies
• Triangular strategy: always ask X questions, return the most likely answer
• Rectangular strategy: if X YESs return "Pass", if Y NOs return "Fail", else keep asking
• Chopped-off triangle: ask until |#YES - #NO| > X, or at most Y questions
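Each of these shapes is easy to generate in the grid representation above. The constructors below are my own sketches of the three shapes, with ties in the majority vote broken toward "pass".

def rectangular(x, y):
    # Decide PASS after x YESs, FAIL after y NOs, else keep asking.
    return {(ys, ns): ('pass' if ys == x else
                       'fail' if ns == y else 'continue')
            for ys in range(x + 1) for ns in range(y + 1)}

def triangular(x):
    # Always ask exactly x questions, then return the majority answer.
    return {(ys, ns): ('continue' if ys + ns < x else
                       'pass' if ys >= ns else 'fail')
            for ys in range(x + 1) for ns in range(x + 1 - ys)}

def chopped_triangle(x, y):
    # Ask until |#YES - #NO| > x, or at most y questions.
    strat = {}
    for ys in range(y + 1):
        for ns in range(y + 1 - ys):
            if abs(ys - ns) > x:
                strat[(ys, ns)] = 'pass' if ys > ns else 'fail'
            elif ys + ns == y:
                strat[(ys, ns)] = 'pass' if ys >= ns else 'fail'
            else:
                strat[(ys, ns)] = 'continue'
    return strat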
Filtering: Outline • How do we evaluate strategies? • Hasn’t this been done before? • What is the best strategy? (Formulation 1) • Formal statement • Brute force approach • Pruning strategies • Probabilistic strategies • Experiments • Extensions
Evaluating Strategies
Cost = Σ over termination points (x + y) · Pr. of reaching (x, y)
Error = Σ over termination points Pr. of reaching (x, y) and filtering incorrectly
Pr. of reaching (x, y) = Pr. of reaching (x, y - 1) and getting a Yes + Pr. of reaching (x - 1, y) and getting a No
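The recurrence gives a simple dynamic program over the grid. Below is a sketch that evaluates a deterministic strategy, under simplifying assumptions of my own: a single symmetric per-question error rate e (the paper allows separate FP/FN rates) and selectivity s as the prior that an item truly passes.

def evaluate(strategy, m, e, s):
    # Returns (expected # of questions, expected error) for one item.
    cost = err = 0.0
    for truth, prior in ((True, s), (False, 1 - s)):
        p_yes = 1 - e if truth else e        # chance of hearing "Yes"
        reach = {(0, 0): 1.0}                # Pr. of reaching each point
        for n in range(m + 1):               # n = questions asked so far
            for ys in range(n + 1):
                pt = (ys, n - ys)
                p = reach.get(pt, 0.0)
                if p == 0.0:
                    continue
                if strategy[pt] == 'continue':
                    up, right = (ys + 1, n - ys), (ys, n - ys + 1)
                    reach[up] = reach.get(up, 0.0) + p * p_yes
                    reach[right] = reach.get(right, 0.0) + p * (1 - p_yes)
                else:
                    cost += prior * p * n
                    if (strategy[pt] == 'pass') != truth:   # wrong call
                        err += prior * p
    return cost, err

# e.g. evaluate(triangular(5), m=5, e=0.1, s=0.5)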
Hasn't this been done before?
• Solutions from elementary statistics guarantee the same error per item
• Important in contexts like automobile testing and medical diagnosis
• We're worried about aggregate error over all items: a uniquely data-oriented problem
• We don't care if every item is perfect as long as the overall error target is met
• As we will see, this results in $$$ savings
What is the best strategy? (Formulation 1)
Find the strategy with minimum overall expected cost, such that:
• Overall expected error is less than a threshold
• Number of questions per item never exceeds m
Brute Force Approaches
• Try all O(3^p) strategies, p = O(m²): takes too long!
• Try all "hollow" strategies: still too long!
[Figure: example grids of a full strategy and a hollow strategy on the YES/NO axes]
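A back-of-the-envelope count (my own, assuming a budget of m = 6 questions) shows why full enumeration is hopeless:

m = 6
# Grid points (ys, ns) with ys + ns <= m; p grows as O(m^2).
p = sum(1 for ys in range(m + 1) for ns in range(m + 1 - ys))
print(p, 3 ** p)   # 28 points -> 3^28, roughly 2.3 * 10^13 labelings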
Pruning Hollow Strategies
For every hollow strategy, there is a ladder strategy that is as good or better.
[Figure: a ladder-shaped decision boundary on the YES/NO grid]
Other Pruning Examples
[Figure: side-by-side grids of a hollow strategy and the corresponding ladder strategy]
Probabilistic Strategies
• Each grid point (x, y) carries probabilities: continue(x, y), pass(x, y), fail(x, y)
[Figure: example grid labeling each point with its (continue, pass, fail) triple, e.g., (0.5, 0.5, 0) or (0, 0, 1)]
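The dynamic program from before extends directly: instead of branching on a single decision, each point splits its probability mass according to the triple. A sketch, with the same simplifying assumptions as evaluate() above (and assuming the continue probability is 0 once n = m):

def evaluate_prob(strategy, m, e, s):
    # strategy maps each point to a (continue, pass, fail) triple.
    cost = err = 0.0
    for truth, prior in ((True, s), (False, 1 - s)):
        p_yes = 1 - e if truth else e
        reach = {(0, 0): 1.0}
        for n in range(m + 1):
            for ys in range(n + 1):
                pt = (ys, n - ys)
                p = reach.get(pt, 0.0)
                if p == 0.0:
                    continue
                cont, pas, fail = strategy[pt]
                # Mass that stops here, split across the two decisions.
                cost += prior * p * (pas + fail) * n
                err += prior * p * (fail if truth else pas)
                # Remaining mass flows on to the two neighbours.
                up, right = (ys + 1, n - ys), (ys, n - ys + 1)
                reach[up] = reach.get(up, 0.0) + p * cont * p_yes
                reach[right] = reach.get(right, 0.0) + p * cont * (1 - p_yes)
    return cost, err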
Best probabilistic strategy
• Finding the best strategy can be posed as a Linear Program!
• Insight 1: Pr. of reaching (x, y) = (# of paths into (x, y)) × Pr. of one path
• Insight 2: The probability of filtering incorrectly at a point is independent of the number of paths
• Insight 3: At least one of pass(x, y) or fail(x, y) must be 0
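Putting the insights together: because the answer-probability factor at a point is the same for every path into it (Insight 1), the strategy can be encoded with one flow variable per point and action rather than one per path; flow conservation then gives linear constraints, and cost and error become linear in those variables. Below is my own sketch of such an encoding with scipy, again assuming a single symmetric error rate e; the helper name best_strategy_lp and the variable layout are illustrative, not from the paper.

import numpy as np
from scipy.optimize import linprog

def best_strategy_lp(m, e, s, tau):
    # Per grid point: a = mass that continues, bp = mass deciding PASS,
    # bf = mass deciding FAIL (path counts weighted by continue probs).
    pts = [(ys, ns) for ys in range(m + 1) for ns in range(m + 1 - ys)]
    idx = {pt: i for i, pt in enumerate(pts)}
    P = len(pts)
    A0, BP, BF = 0, P, 2 * P       # offsets into the variable vector

    def q(ys, ns, truth):
        # Pr. of one particular sequence of ys Yes / ns No answers;
        # by Insight 1 it is the same for every path into the point.
        py = 1 - e if truth else e
        return py ** ys * (1 - py) ** ns

    cost, err = np.zeros(3 * P), np.zeros(3 * P)
    A_eq, b_eq = np.zeros((P, 3 * P)), np.zeros(P)
    for (ys, ns), i in idx.items():
        w = s * q(ys, ns, True) + (1 - s) * q(ys, ns, False)
        cost[BP + i] = cost[BF + i] = (ys + ns) * w    # questions paid
        err[BP + i] = (1 - s) * q(ys, ns, False)       # wrong PASS
        err[BF + i] = s * q(ys, ns, True)              # wrong FAIL
        # Flow conservation: a + bp + bf = mass flowing in from parents.
        A_eq[i, [A0 + i, BP + i, BF + i]] = 1.0
        if (ys, ns) == (0, 0):
            b_eq[i] = 1.0                              # unit source
        else:
            for parent in ((ys - 1, ns), (ys, ns - 1)):
                if parent in idx:
                    A_eq[i, A0 + idx[parent]] = -1.0
    # No continuing once the per-item budget m is exhausted.
    bounds = [(0.0, 0.0) if k < P and sum(pts[k]) == m else (0.0, None)
              for k in range(3 * P)]
    return linprog(cost, A_ub=err[None, :], b_ub=[tau],
                   A_eq=A_eq, b_eq=b_eq, bounds=bounds)

A point's continue/pass/fail probabilities are recovered by normalizing its three masses by their sum; Insight 3 shows up here, since at an optimal vertex at most one of bp and bf is nonzero at any point.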
Experimental Setup
• Goal: study the cost savings of the probabilistic strategy relative to the others
• Pipeline: parameters → generate strategies → compute cost
• Strategies compared: Probabilistic vs. deterministic (Hollow, Ladder, Rect, Growth, Shrink)
• Two sample plots:
  • Varying false positive error (other parameters fixed)
  • Varying selectivity (other parameters varying)
Other Issues and Factors
• Other formulations
• Multiple filters
• Categorize (output > 2 types)
Ref: "CrowdScreen: Algorithms for Filtering Data with Humans" [SIGMOD 2012]
Natural Next Steps
• Expertise
• Spam Workers
• Task Difficulty
• Latency
• Error Models
• Pricing
→ Algorithms for the skyline of cost, latency, and error
Related Work on Crowdsourcing
• Workflows, Platforms and Libraries: Turkit [Little et al. 2009], HProc [Heymann 2010], CrowdForge [Kittur et al. 2011], Turkomatic [Kulkarni and Can 2011], TurKontrol/Clowder [Dai, Mausam and Weld 2010-11]
• Games: GWAP, Matchin, Verbosity, Input Agreement, Tagatune, Peekaboom [Von Ahn & group 2006-10], Kisskissban [Ho et al. 2009], Foldit [Cooper et al. 2010-11], Trivia Masster [Deutch et al. 2012]
• Marketplace Analysis: [Kittur et al. 2008], [Chilton et al. 2010], [Horton and Chilton 2010], [Ipeirotis 2010]
• Apps: VizWiz [Bigham et al. 2010], Soylent [Bernstein et al. 2010], ChaCha, CollabMap [Stranders et al. 2011], Shepherd [Dow et al. 2011]
• Active Learning: Survey [Settles 2010], [Raykar et al. 2009-10], [Sheng et al. 2008], [Welinder et al. 2010], [Dekel 2010], [Snow et al. 2008], [Shahaf 2010], [Dasgupta, Langford et al. 2007-10]
• Databases: CrowdDB [Franklin et al. 2011], Qurk [Marcus et al. 2011], Deco [Parameswaran et al. 2011], Hlog [Chai et al. 2009]
• Algorithms: [Marcus et al. 2011], [Gomes et al. 2011], [Ailon et al. 2008], [Karger et al. 2011]
Thanks for listening! Questions?
[Image: Schrödinger's cat]