Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points Andrew Rosenberg and Ed Binkowski 5/4/2004
Overview • Corpus Description • Kappa Shortcomings • Kappa Augmentation • Classification of messages • Next step: Sharpening method HLT/NAACL '04
Corpus Description • 312 email messages exchanged within the Columbia chapter of the ACM. • Annotated by 2 annotators with one or two of the following 10 labels: • question, answer, broadcast, attachment transmission, planning, planning scheduling, planning-meeting scheduling, action item, technical discussion, social chat
Kappa Shortcomings • Kappa is used to determine interannotator reliability and validate gold standard corpora. • p(A) - # observed agreements / # data points • p(E) - # expected agreements / # data points • How do you determine agreement with an optional secondary label?
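For reference, the standard kappa statistic combines these two quantities as kappa = (p(A) - p(E)) / (1 - p(E)). The sketch below computes it for the simple case the slide starts from: two annotators and exactly one label per message. It assumes Cohen's variant, where p(E) comes from each annotator's own label frequencies; the function name and the toy labels in the usage note are illustrative, not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_1, labels_2):
    """Cohen's kappa for two annotators with one label per data point."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_1)
    # p(A): fraction of data points on which the two annotators agree.
    p_a = sum(x == y for x, y in zip(labels_1, labels_2)) / n
    # p(E): chance agreement, from each annotator's label frequencies.
    freq_1, freq_2 = Counter(labels_1), Counter(labels_2)
    p_e = sum(freq_1[label] * freq_2[label] for label in freq_1) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when p(E) = 1")
    return (p_a - p_e) / (1.0 - p_e)
```

For example, `cohen_kappa(["question", "answer", "question", "answer"], ["question", "answer", "answer", "answer"])` has p(A) = 0.75 and p(E) = 0.5. Nothing in this computation says how to score a message where one annotator adds a secondary label, which is the gap the slide points out.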
Kappa Shortcomings (ctd.) • Ignoring the secondary label isn’t acceptable for two reasons. • It is inconsistent with the annotation guidelines. • It ignores partial agreements. • {a,ba} - singleton matches secondary • {ab,ca} - primary matches secondary • {ab,cb} - secondary matches secondary • {ab,ba} - secondary matches primary, and vice versa • Note: The purpose is not to inflate the kappa value, but to accurately assess the data.
Kappa Augmentation • When a labeler employs a secondary label, consider it as a single annotation divided between two categories. • Select a value of p, where 0.5 ≤ p ≤ 1.0, based on how heavily to weight the secondary label. • Singleton annotations are assigned a score of 1.0. • With two labels: • Primary: p • Secondary: 1-p
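The weighting scheme above can be sketched as a small helper that turns one annotation into a mapping from label to weight. The function name and the list representation (primary label first) are assumptions made for illustration.

```python
def annotation_vector(labels, p):
    """Map an annotation (one or two labels, primary first) to label weights."""
    if not 0.5 <= p <= 1.0:
        raise ValueError("p must satisfy 0.5 <= p <= 1.0")
    if len(labels) == 1:
        # Singleton annotation: the whole weight goes to the only label.
        return {labels[0]: 1.0}
    if len(labels) == 2 and labels[0] != labels[1]:
        primary, secondary = labels
        # One annotation divided between two categories.
        return {primary: p, secondary: 1.0 - p}
    raise ValueError("expected one label or two distinct labels")
```

At p = 1.0 the secondary label receives zero weight, which is the same as ignoring it; at p = 0.5 both labels count equally.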
Kappa Augmentation: Counting Agreements • To calculate p(A), sum agreement scores and divide by number of messages. • Partial agreements are counted as follows: • Annotator 1 {a}, Annotator 2 {ba}: Score = 1*(1-p) + 0*p = (1-p) • p(E) is calculated using the relative frequencies of label use based on annotation vectors.
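The worked score above is a dot product of the two annotation vectors, which suggests the following sketch of the augmented statistic. It is a reading of the slide under stated assumptions, not the authors' implementation: p(E) is taken to be the product of each annotator's average annotation vector (a Cohen-style choice), and the names `weights`, `agreement_score`, and `kappa_prime` are hypothetical.

```python
from collections import defaultdict


def weights(labels, p):
    # One label gets weight 1.0; two labels get p (primary) and 1-p (secondary).
    if len(labels) == 1:
        return {labels[0]: 1.0}
    primary, secondary = labels
    return {primary: p, secondary: 1.0 - p}


def agreement_score(ann_1, ann_2, p):
    """Dot product of the two annotation vectors for a single message."""
    v1, v2 = weights(ann_1, p), weights(ann_2, p)
    return sum(w * v2.get(label, 0.0) for label, w in v1.items())


def kappa_prime(anns_1, anns_2, p):
    """Augmented kappa for two annotators over a list of messages."""
    n = len(anns_1)
    # p(A): sum of agreement scores divided by the number of messages.
    p_a = sum(agreement_score(x, y, p) for x, y in zip(anns_1, anns_2)) / n
    # Relative frequency of label use: mean annotation vector per annotator.
    freq_1, freq_2 = defaultdict(float), defaultdict(float)
    for x, y in zip(anns_1, anns_2):
        for label, w in weights(x, p).items():
            freq_1[label] += w / n
        for label, w in weights(y, p).items():
            freq_2[label] += w / n
    # p(E): assumed here to be the dot product of the two frequency vectors.
    p_e = sum(f * freq_2.get(label, 0.0) for label, f in freq_1.items())
    if p_e == 1.0:
        raise ValueError("kappa' is undefined when p(E) = 1")
    return (p_a - p_e) / (1.0 - p_e)
```

With this scoring, `agreement_score(["a"], ["b", "a"], p)` reproduces the slide's 1-p, and the {ab,ba} case scores 2p(1-p), which falls from 0.5 at p = 0.5 to 0 at p = 1.0.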
Classification of messages • This augmentation allows us to classify messages based on their individual kappa’ values at different values of p. • Class 1: high kappa’ at all values of p. • Use in ML experiments. • Class 2: low kappa’ at all values of p. • Discard. • Class 3: high kappa’ at p = 1.0 • Ignore the secondary label. • Class 4: high kappa’ at p = 0.5 • Use to revise annotation manual. • Note: mathematically kappa’ needn’t be monotonic w.r.t. p, but with 2 annotators it is.
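Because kappa’ is monotonic in p for two annotators, "high at all values of p" can be checked at the two endpoints alone. The sketch below encodes the four classes that way. The slides do not say what counts as "high", so the threshold is left as a caller-supplied parameter, and the function name is hypothetical.

```python
def classify_message(kappa_at_half, kappa_at_one, threshold):
    """Return the class (1-4) of a message from its kappa' at p=0.5 and p=1.0.

    Relies on kappa' being monotonic in p (true with 2 annotators), so the
    endpoints bound every value in between. `threshold` is an assumed cutoff
    for "high"; the slides do not specify one.
    """
    high_at_half = kappa_at_half >= threshold
    high_at_one = kappa_at_one >= threshold
    if high_at_half and high_at_one:
        return 1  # reliable at every p: use in ML experiments
    if not high_at_half and not high_at_one:
        return 2  # unreliable at every p: discard
    if high_at_one:
        return 3  # reliable only when the secondary label is ignored
    return 4      # reliable only when both labels count: revise the manual
```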
Corpus Annotation Analysis: Class Distribution
Next step: Sharpening method • How can a gold standard corpus be obtained when an annotation effort yields a low kappa? • In determining interannotator agreement with kappa, etc., two available pieces of information are overlooked: • Some annotators are “better” than others • Some messages are “easier to label” than others • By limiting the contribution of known poor annotators and difficult messages, we gain confidence in the final category assignment of each message.
Sharpening Method (ctd.) • Ranking annotators • “Better” annotators have a higher agreement with the group • Ranking messages • Messages with low variance over annotations are more consistently annotated • To improve confidence in annotations: • Weight annotator contributions, and recompute message rankings. • Weight message contributions, and recompute annotator rankings. • Repeat until convergence.
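The slides describe the sharpening loop only in outline, so the following is one possible reading rather than the authors' method. It assumes one label per annotator per message, scores a message by the weighted share of its most supported label (a stand-in for the variance criterion), and scores an annotator by message-weighted agreement with that consensus label. All names, the scoring choices, and the stopping tolerance are assumptions.

```python
def sharpen(annotations, max_iter=100, tol=1e-9):
    """Alternate between ranking messages and ranking annotators.

    annotations[a][m] is the single label annotator a gave message m.
    Requires at least one annotator, one message, and max_iter >= 1.
    Returns (annotator_weights, message_weights, consensus_labels).
    """
    n_ann, n_msg = len(annotations), len(annotations[0])
    ann_w = [1.0 / n_ann] * n_ann  # start with equally trusted annotators
    for _ in range(max_iter):
        # Rank messages: weighted share of the most supported label.
        consensus, msg_w = [], []
        for m in range(n_msg):
            votes = {}
            for a in range(n_ann):
                label = annotations[a][m]
                votes[label] = votes.get(label, 0.0) + ann_w[a]
            best = max(votes, key=votes.get)
            consensus.append(best)
            msg_w.append(votes[best])  # ann_w sums to 1, so this is a share
        # Rank annotators: message-weighted agreement with the consensus.
        raw = [
            sum(msg_w[m] for m in range(n_msg) if annotations[a][m] == consensus[m])
            for a in range(n_ann)
        ]
        total = sum(raw)
        new_w = [r / total for r in raw]
        converged = max(abs(x - y) for x, y in zip(new_w, ann_w)) < tol
        ann_w = new_w
        if converged:
            break
    return ann_w, msg_w, consensus
```

With three annotators labeling three messages as `[["q", "a", "q"], ["q", "a", "q"], ["q", "b", "a"]]`, the third annotator disagrees with the group on two messages and ends with a lower weight than the other two, while the message all three agree on keeps the highest message weight.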
Thank you. amaxwell@cs.columbia.edu ebinkowski@juno.com