CROWDSOURCING Massimo Poesio Part 4: Dealing with crowdsourced data
THE DATA • The result of crowdsourcing in whatever form is a mass of often inconsistent judgments • Need techniques for identifying reliable annotations and reliable annotators • In the Phrase Detectives context, to discriminate between genuine ambiguity and disagreements due to error
SOME APPROACHES • Majority voting • But: it ignores the substantial differences in behavior between annotators • Alternatives: • Removing bad annotators, e.g. using clustering • Weighting annotators (see the sketch below)
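As an illustration, a minimal Python sketch of plain majority voting versus reliability-weighted voting for a single item; the labels and annotator weights are made up, and in practice the weights would come from one of the models discussed below.

from collections import Counter

def majority_vote(labels):
    # Most frequent label for one item (ties broken arbitrarily).
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, weights):
    # Weight each annotator's label by an estimate of that annotator's reliability.
    totals = Counter()
    for label, w in zip(labels, weights):
        totals[label] += w
    return totals.most_common(1)[0][0]

# Hypothetical item labeled by 5 annotators; the weights are assumed reliabilities.
print(majority_vote(["A", "A", "B", "B", "B"]))                             # -> B
print(weighted_vote(["A", "A", "B", "B", "B"], [0.9, 0.8, 0.3, 0.3, 0.3]))  # -> A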
LATENT MODELS OF ANNOTATION QUALITY • The problem of reaching a conclusion on the basis of judgments by separate experts that may often be in disagreement is a longstanding one in epidemiology • A number of techniques have been developed, including • Dawid and Skene 1979 (also used by Passonneau & Carpenter) • the Latent Annotation model of Uebersax 1994 • Raykar et al 2010 • Carpenter (2008) has also developed an explicit hierarchical Bayesian model
DAWID AND SKENE 1979 • The model consists of a likelihood for • 1. annotations (the labels from the annotators) • 2. categories (the true labels) of the items given • 3. annotator accuracies and biases • 4. prevalence of the labels • Frequentist estimation of 2–4 given 1 • Optional regularization of the estimates (for 3 and 4)
A GENERATIVE MODEL OF THE ANNOTATION TASK • What all of these models do is to provide an EXPLICIT PROBABILISTIC MODEL of the observations in terms of annotators, labels, and items
THE DATA • K possible labels • J annotators • I number of items • N total number of annotations of the I items produced by the J annotators • y_{i,j}: label produced for item i by coder j
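As a concrete (hypothetical) illustration, the data can be held as a long list of (item, annotator, label) triples, which is convenient because not every annotator labels every item:

# One annotation = (item i, annotator j, label k), all zero-indexed.
# Toy data: I = 2 items, J = 3 annotators, K = 2 labels, N = 5 annotations.
annotations = [
    (0, 0, 1),   # y_{0,0} = 1
    (0, 1, 1),   # y_{0,1} = 1
    (0, 2, 0),   # y_{0,2} = 0
    (1, 0, 0),   # y_{1,0} = 0
    (1, 1, 0),   # y_{1,1} = 0
]
I = 1 + max(i for i, _, _ in annotations)   # number of items
J = 1 + max(j for _, j, _ in annotations)   # number of annotators
K = 1 + max(k for _, _, k in annotations)   # number of label categories
N = len(annotations)                        # total number of annotations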
A GENERATIVE MODEL OF THE ANNOTATION TASK • The probabilistic model specifies the probability of a particular label on the basis of PARAMETERS specifying the behavior of the annotators, the prevalence of the labels, etc • In Bayesian models, these parameters are specified in terms of PROBABILITY DISTRIBUTIONS
THE PARAMETERS OF THE MODEL • z_i: the ACTUAL category of item i • Θ_{j,k,k’}: ANNOTATOR RESPONSE • the probability that annotator j labels an item as k’ when it belongs to category k • π_k: PREVALENCE • The probability that an item belongs to category k
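Putting this notation together, the joint probability that a Dawid and Skene-style model assigns to the true categories and the observed labels factors as follows (written assuming every annotator labels every item; with missing labels the inner product runs over the observed (i, j) pairs only):

p(z, y | π, Θ) = ∏_i π_{z_i} · ∏_j Θ_{j, z_i, y_{i,j}}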
DISTRIBUTIONS • Each of the parameters is characterized in terms of a PROBABILITY DISTRIBUTION • When we have some prior information about the data, these distributions can be used to encode it • E.g., the annotators may all be equally good, or there may be a skew • Otherwise, default (uninformative) priors are used
DISTRIBUTIONS • Prevalence of labels (PRIOR) • π ~ Dir(α) • Annotator j’s response to item of category k (PRIOR) • Θ_{j,k} ~ Dir(β_k) • True category of item i (LIKELIHOOD): • z_i ~ Categorical(π) • Label from j for item i (LIKELIHOOD): • y_{i,j} ~ Categorical(Θ_{j,z_i})
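A minimal numpy simulation of this generative story; the sizes of I, J, and K and the priors are arbitrary toy choices (the β prior here favors the diagonal of each Θ_{j,k}, i.e. roughly competent annotators):

import numpy as np

rng = np.random.default_rng(42)

I, J, K = 500, 10, 4              # items, annotators, label categories (toy sizes)
alpha = np.ones(K)                # symmetric Dirichlet prior on prevalence
beta = np.eye(K) * 4.0 + 1.0      # row k is the prior on responses to true category k

pi = rng.dirichlet(alpha)                                  # prevalence:    pi ~ Dir(alpha)
theta = np.array([[rng.dirichlet(beta[k]) for k in range(K)]
                  for _ in range(J)])                      # response: theta[j,k] ~ Dir(beta_k)
z = rng.choice(K, size=I, p=pi)                            # true category: z_i ~ Categorical(pi)
y = np.array([[rng.choice(K, p=theta[j, z[i]])             # label:  y_ij ~ Categorical(theta[j, z_i])
               for j in range(J)] for i in range(I)])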
TYPES OF ANNOTATORS: SPAMMY (RESPONSE TO ALL ITEMS THE SAME)
TYPES OF ANNOTATORS: BIASED (HAS SKEW IN RESPONSE – COMMON IN LOW PREVALENCE DATA)
QUICK INTRO TO DIRICHLET • The Dirichlet distribution is often seen in Bayesian models (e.g., Latent Dirichlet Allocation, LDA) because it is a CONJUGATE PRIOR of the MULTINOMIAL distribution
CONJUGATE PRIOR • In Bayesian inference the objective is to compute a POSTERIOR on the basis of a LIKELIHOOD and a PRIOR • A CONJUGATE PRIOR of a distribution D is a distribution such that, if it is used as the prior with a likelihood of form D, the posterior has the same form as the prior • E.g., 'the Dirichlet is a conjugate prior of the multinomial' means that if the likelihood is multinomial and the prior is Dirichlet, then the posterior is also Dirichlet
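A tiny numerical illustration of Dirichlet–multinomial conjugacy, with made-up counts: the posterior is obtained simply by adding the observed label counts to the prior parameters.

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dir(alpha) prior over 3 labels
counts = np.array([12, 5, 3])       # observed label counts (multinomial likelihood)

alpha_post = alpha + counts         # conjugacy: the posterior is Dir(alpha + counts)
posterior_mean = alpha_post / alpha_post.sum()
print(posterior_mean)               # -> roughly [0.57, 0.26, 0.17]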
CATEGORICAL • The categorical distribution is the generalization to K outcomes of the BERNOULLI distribution, which specifies the probability of a given outcome in a single binary trial • E.g., the probability of getting a head in one coin toss • Cf. the BINOMIAL distribution, which specifies the probability of getting N heads in a sequence of tosses
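A quick numpy illustration of the difference (the label probabilities here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

# Categorical: ONE trial, which of K outcomes occurred?
sense_probs = [0.6, 0.3, 0.1]              # hypothetical prevalence of 3 word senses
one_label = rng.choice(3, p=sense_probs)   # a single draw, e.g. sense 0

# Bernoulli is the K = 2 special case (one coin toss)
one_toss = rng.choice(["head", "tail"], p=[0.5, 0.5])

# Binomial: how many heads in N independent tosses?
n_heads = rng.binomial(n=10, p=0.5)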
PROBABILISTIC INFERENCE • Probabilistic inference techniques are used to INFER the parameters of the model from the data, and hence to compute the quantities of interest • Often: Expectation Maximization (EM) • The EM implementation in R used by Carpenter & Passonneau to estimate the parameters is available from • https://github.com/bob-carpenter/anno
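The repository above contains the R implementation; the following is only a rough, self-contained Python sketch of EM for a Dawid and Skene-style model, intended to make the E-step/M-step structure concrete (the function name, smoothing constant, and initialization by soft majority voting are my own choices, not taken from that code):

import numpy as np

def dawid_skene_em(annotations, I, J, K, n_iter=50, smoothing=0.01):
    # annotations: list of (item i, annotator j, label k) triples, zero-indexed.
    # Initialize the posterior over true categories with soft majority voting.
    post = np.full((I, K), smoothing)
    for i, j, k in annotations:
        post[i, k] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: prevalence pi and annotator response theta[j, k, k'].
        pi = post.mean(axis=0)
        theta = np.full((J, K, K), smoothing)
        for i, j, k in annotations:
            theta[j, :, k] += post[i]
        theta /= theta.sum(axis=2, keepdims=True)

        # E-step: posterior over the true category z_i of each item.
        log_post = np.log(pi) + np.zeros((I, K))
        for i, j, k in annotations:
            log_post[i] += np.log(theta[j, :, k])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, pi, theta

After fitting, post[i] gives the inferred distribution over the true label of item i, pi the estimated prevalence, and theta[j] the estimated confusion matrix of annotator j.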
APPLICATION TO WORD SENSE DISTRIBUTION (CARPENTER & PASSONNEAU, 2013, 2014) • Carpenter and Passonneau used the Dawid and Skene model to compare trained manual annotators with turkers on the word sense annotation of the MASC corpus
THE MASC CORPUS • Manually Annotated SubCorpus (MASC) • 500K word subset of the Open American National Corpus (OANC) • Multiple genres: technical manuals, poetry, news, dialogue, etc. • 16 types of annotation (not all manual) • part of speech, phrases, word sense, named entity, ... • word-sense corpus of 100 words • balanced by genre and part of speech (noun, verb, adjective)
MASC WORDSENSE • 100 words, balanced between adjectives, nouns, and verbs • 1000 sentences for each word • Annotated using the WordNet senses for these words • ~1M tokens
MASC WordSense: annotation using trained annotators • pre-training on 50 items • independent labeling of 1000 items • 100 items labeled by 3 or 4 annotators • agreement reported on these 100 items • only a single round of annotation; most items singly annotated
Annotation using trained annotators • College students from Vassar, Barnard, Columbia • 2–3 years of work on the project • General training plus per-word training • Supervised by • Becky Passonneau • Nancy Ide (maintainer of MASC) • Christiane Fellbaum (maintainer of WordNet)
Annotation using crowdsourcing • 45 randomly selected words, balanced across nouns, verbs, and adjectives, were reannotated using crowdsourcing • 1000 instances per word • 25+ annotators per instance • the high number of annotators per instance makes it possible to • estimate item difficulty • test (and potentially reject) the assumption that labels are independent
Differences from trained situation • Annotators not trained • Not told to look at WordNet • Each HIT: • 10 sentences for the same word • WordNet senses listed under the word
METHODS • Passonneau & Carpenter used their model to • Evaluate prevalence of labels in different ways • Evaluate annotator response
OTHER MODELS • Raykar et al, 2010 • Carpenter, 2008
RAYKAR ET AL 2010 • Simultaneously ESTIMATES THE GROUND TRUTH from noisy labels, produces an ASSESSMENT OF THE ANNOTATORS, and LEARNS A CLASSIFIER • Based on logistic regression • Bayesian (includes priors on the annotators)
ANNOTATORS • Annotator j is characterized by her/his • SENSITIVITY: the ability to recognize positive cases • α_j = P(y_j = 1 | y = 1) • SPECIFICITY: the ability to recognize negative cases • β_j = P(y_j = 0 | y = 0)
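A small sketch of what these quantities look like once (estimated) true labels are available; note that in Raykar et al's approach they are re-estimated jointly with the ground truth inside EM rather than computed against known gold labels. The toy labels below are made up.

import numpy as np

def sensitivity_specificity(y_j, y_true):
    # y_j: annotator j's binary labels; y_true: (estimated) true binary labels.
    y_j, y_true = np.asarray(y_j), np.asarray(y_true)
    alpha_j = (y_j[y_true == 1] == 1).mean()   # sensitivity: P(y_j = 1 | y = 1)
    beta_j  = (y_j[y_true == 0] == 0).mean()   # specificity: P(y_j = 0 | y = 0)
    return alpha_j, beta_j

print(sensitivity_specificity([1, 1, 0, 0, 1, 0], [1, 1, 1, 0, 1, 0]))  # -> (0.75, 1.0)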
RAYKAR ET AL Raykar et al propose a version of the EM algorithm that maximizes the likelihood P(O|θ) of the observed labels, yielding estimates of the sensitivity and specificity of each annotator as well as of the ground truth. Carpenter developed a fully Bayesian version of the approach based on gradient descent.
AMBIGUITY: REFERENT
15.12 M: we’re gonna take the engine E3
15.13 : and shove it over to Corning
15.14 : hook [it] up to [the tanker car]
15.15 : _and_
15.16 : send it back to Elmira
(from the TRAINS-91 dialogues collected at the University of Rochester)
AMBIGUITY: REFERENT About 160 workers at a factory that made paper for the Kent filters were exposed to asbestos in the 1950s. Areas of the factory were particularly dusty where the crocidolite was used. Workers dumped large burlap sacks of the imported material into a huge bin, poured in cotton and acetate fibers and mechanically mixed the dry fibers in a process used to make filters. Workers described "clouds of blue dust" that hung over parts of the factory, even though exhaust fans ventilated the area.
AMBIGUITY: EXPLETIVES 'I beg your pardon!' said the Mouse, frowning, but very politely: 'Did you speak?' 'Not I!' said the Lory hastily. 'I thought you did,' said the Mouse. '--I proceed. "Edwin and Morcar, the earls of Mercia and Northumbria, declared for him: and even Stigand, the patriotic archbishop of Canterbury, found it advisable--"' 'Found WHAT?' said the Duck. 'Found IT,' the Mouse replied rather crossly: 'of course you know what "it" means.'
OTHER DATA: WORD SENSE DISAMBIGUATION (Passonneau et al 2010) And our ideas of what constitutes a FAIR wage or a FAIR return on capital are historically contingent … {sense1, sense1, sense1, sense2, sense2, sense2} … the federal government … is wrangling for its FAIR share of the dividend … {sense1, sense1, sense2, sense2, sense8, sense8}