

  1. Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points Andrew Rosenberg and Ed Binkowski 5/4/2004

  2. Overview • Corpus Description • Kappa Shortcomings • Kappa Augmentation • Classification of messages • Next step: Sharpening method

  3. Corpus Description • 312 email messages exchanged among members of the Columbia chapter of the ACM. • Annotated by 2 annotators with one or two of the following 10 labels: • question, answer, broadcast, attachment transmission, planning, planning scheduling, planning-meeting scheduling, action item, technical discussion, social chat

  4. Kappa Shortcomings • Kappa is used to determine interannotator reliability and to validate gold-standard corpora. • p(A): # observed agreements / # data points • p(E): # expected (chance) agreements / # data points • How do you determine agreement when annotations may carry an optional secondary label?
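For reference, these two quantities combine in the standard kappa definition; in the usual notation:

    \kappa = \frac{p(A) - p(E)}{1 - p(E)}

so kappa equals 1 under perfect agreement and 0 when observed agreement is no better than chance.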

  5. Kappa Shortcomings (ctd.) • Ignoring the secondary label isn't acceptable for two reasons. • It is inconsistent with the annotation guidelines. • It ignores partial agreements: • {a,ba} - singleton matches secondary • {ab,ca} - primary matches secondary • {ab,cb} - secondary matches secondary • {ab,ba} - secondary matches primary, and vice versa • Note: the purpose is not to inflate the kappa value, but to assess the data accurately.
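A small sketch that names these partial-agreement cases for a pair of annotations, assuming each annotation is a list with the primary label first; the function name, the exact-match case, and the "no agreement" fallback are illustrative additions, not from the slides:

    def partial_agreement_type(lab1, lab2):
        # lab1, lab2: one- or two-label annotations, primary label first.
        p1, s1 = lab1[0], (lab1[1] if len(lab1) > 1 else None)
        p2, s2 = lab2[0], (lab2[1] if len(lab2) > 1 else None)
        if p1 == p2:
            return "primary matches primary"
        if (s1 is None and p1 == s2) or (s2 is None and p2 == s1):
            return "singleton matches secondary"                 # {a,ba}
        if p1 == s2 and p2 == s1:
            return "secondary matches primary, and vice versa"   # {ab,ba}
        if p1 == s2 or p2 == s1:
            return "primary matches secondary"                   # {ab,ca}
        if s1 is not None and s1 == s2:
            return "secondary matches secondary"                 # {ab,cb}
        return "no agreement"

For example, partial_agreement_type(['a'], ['b', 'a']) returns "singleton matches secondary".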

  6. Kappa Augmentation • When a labeler employs a secondary label, treat it as a single annotation divided between two categories. • Select a value of p, where 0.5 ≤ p ≤ 1.0, based on how heavily to weight the secondary label. • Singleton annotations are assigned a score of 1.0 • Primary label: p • Secondary label: 1 - p
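A minimal sketch of this weighting, again assuming an annotation is a list with the primary label first; the name annotation_weights is illustrative:

    def annotation_weights(labels, p):
        # Singleton annotation -> weight 1.0 on its single category;
        # otherwise primary -> p, secondary -> 1 - p (0.5 <= p <= 1.0).
        if len(labels) == 1:
            return {labels[0]: 1.0}
        primary, secondary = labels
        return {primary: p, secondary: 1.0 - p}

For example, annotation_weights(['b', 'a'], p=0.75) gives {'b': 0.75, 'a': 0.25}.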

  7. Kappa Augmentation: Counting Agreements • To calculate p(A), sum the per-message agreement scores and divide by the number of messages. • Partial agreements are counted as follows: Annotator 1 {a}, Annotator 2 {ba}: Score = 1*(1-p) + 0*p = (1-p) • p(E) is calculated using the relative frequencies of label use, based on the annotation vectors.
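Putting the weighting and the counting together as a runnable sketch; the per-category product-and-sum score is one reading of the {a} vs. {ba} example above, and the relative-frequency chance model for p(E) is the usual two-annotator one, so treat both as assumptions about the method rather than its exact formulation:

    from collections import defaultdict

    def annotation_weights(labels, p):
        # As in the previous sketch: singleton -> 1.0, primary -> p, secondary -> 1 - p.
        if len(labels) == 1:
            return {labels[0]: 1.0}
        return {labels[0]: p, labels[1]: 1.0 - p}

    def agreement_score(lab1, lab2, p):
        # Sum per-category products of the two weight vectors
        # (assumed scoring), e.g. {a} vs. {ba}: 1*(1-p) + 0*p = 1 - p.
        w1, w2 = annotation_weights(lab1, p), annotation_weights(lab2, p)
        return sum(w1.get(c, 0.0) * w2.get(c, 0.0) for c in set(w1) | set(w2))

    def augmented_kappa(ann1, ann2, p):
        # ann1, ann2: parallel lists of annotations (one per message) by the two annotators.
        n = len(ann1)
        p_a = sum(agreement_score(a, b, p) for a, b in zip(ann1, ann2)) / n

        # p(E): chance agreement from each annotator's relative frequency of
        # (fractional) label use over the annotation vectors (assumed chance model).
        freq1, freq2 = defaultdict(float), defaultdict(float)
        for a, b in zip(ann1, ann2):
            for c, w in annotation_weights(a, p).items():
                freq1[c] += w / n
            for c, w in annotation_weights(b, p).items():
                freq2[c] += w / n
        p_e = sum(freq1[c] * freq2[c] for c in set(freq1) | set(freq2))

        return (p_a - p_e) / (1.0 - p_e)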

  8. Classification of messages • This augmentation allows us to classify messages based on their individual kappa' values at different values of p. • Class 1: high kappa' at all values of p. • Use in ML experiments. • Class 2: low kappa' at all values of p. • Discard. • Class 3: high kappa' only at p = 1.0. • Ignore the secondary label. • Class 4: high kappa' only at p = 0.5. • Use to revise the annotation manual. • Note: mathematically, kappa' needn't be monotonic w.r.t. p, but with 2 annotators it is.
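A toy decision rule over the endpoints of p (with two annotators, monotonicity means the endpoints suffice); the per-message kappa' inputs and the 0.7 cut-off are illustrative choices, not values from the slides:

    def classify_message(kappa_at_half, kappa_at_one, threshold=0.7):
        # kappa_at_half: the message's kappa' at p = 0.5; kappa_at_one: at p = 1.0.
        high_half = kappa_at_half >= threshold
        high_one = kappa_at_one >= threshold
        if high_half and high_one:
            return 1   # Class 1: use in ML experiments
        if not high_half and not high_one:
            return 2   # Class 2: discard
        if high_one:
            return 3   # Class 3: ignore the secondary label
        return 4       # Class 4: use to revise the annotation manual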

  9. Corpus Annotation Analysis: Class Distribution

  10. Next step: Sharpening method • How can a gold standard corpus be obtained when an annotation effort yields a low kappa? • In determining interannotator agreement with kappa and similar statistics, two available pieces of information are overlooked: • Some annotators are “better” than others • Some messages are “easier to label” than others • By limiting the contribution of known poor annotators and difficult messages, we gain confidence in the final category assignment of each message.

  11. Sharpening Method (ctd.) • Ranking annotators • “Better” annotators have higher agreement with the group • Ranking messages • Messages with low variance across their annotations are more consistently annotated • To improve confidence in the annotations: • Weight annotator contributions, and recompute message rankings. • Weight message contributions, and recompute annotator rankings. • Repeat until convergence.
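The slides leave the ranking formulas open, so the following is only a sketch of the alternating reweighting loop: it trusts annotators who agree with the group on trusted messages, trusts messages that trusted annotators label consistently, and iterates until the weights stabilize. The agreement matrix, the normalizations, and the names are all illustrative assumptions:

    import numpy as np

    def sharpen(agreement, n_iters=100, tol=1e-6):
        # agreement[i, j]: how well annotator j agrees with the rest of the
        # group on message i (e.g. a mean pairwise agreement score in [0, 1]).
        agreement = np.asarray(agreement, dtype=float)
        n_msgs, n_annotators = agreement.shape
        msg_w = np.full(n_msgs, 1.0 / n_msgs)
        ann_w = np.full(n_annotators, 1.0 / n_annotators)
        for _ in range(n_iters):
            # Re-rank annotators using the current message weights ...
            new_ann = agreement.T @ msg_w
            new_ann /= new_ann.sum()
            # ... then re-rank messages using the new annotator weights.
            new_msg = agreement @ new_ann
            new_msg /= new_msg.sum()
            done = (np.abs(new_ann - ann_w).max() < tol and
                    np.abs(new_msg - msg_w).max() < tol)
            ann_w, msg_w = new_ann, new_msg
            if done:
                break
        return ann_w, msg_w   # higher weight = more trusted

The final message weights would then back the confidence in each message's category assignment, in the spirit of the last bullet above.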

  12. Thank you. amaxwell@cs.columbia.edu ebinkowski@juno.com
