ACM email corpus annotation analysis Andrew Rosenberg 2/26/2004
Overview • Motivation • Corpus Description • Kappa Shortcomings • Kappa Augmentation • Classification of messages • Corpus annotation analysis • Next step: Sharpening method • Summary
Motivation • The ACM email corpus annotation raises two problems. • Because annotators may assign a message one or two labels, there is no clear way to calculate an agreement statistic. An augmentation to the kappa statistic is proposed. • Interannotator reliability is low (K < 0.3). Annotator reeducation and/or a redesign of the annotation materials is most likely necessary. • Hypothetically, the available annotated data can be used to improve category assignment.
Corpus Description • 312 email messages exchanged within the Columbia chapter of the ACM. • Annotated by 2 annotators with one or two of the following 10 labels: • question, answer, broadcast, attachment transmission, planning, planning scheduling, planning-meeting scheduling, action item, technical discussion, social chat
Kappa Shortcomings • Before running ML procedures, we need confidence in the labels assigned to the messages. • In order to compute kappa (below), we need to count the number of agreements. • How do you determine agreement with an optional secondary label? • Ignore the secondary label?
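For reference, kappa is defined in terms of the observed agreement P(A) and the agreement expected by chance P(E):

```latex
\kappa = \frac{P(A) - P(E)}{1 - P(E)}
```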
Kappa Shortcomings (ctd.) • Ignoring the secondary label isn't acceptable, for two reasons. • It is inconsistent with the annotation guidelines. • It ignores partial agreements (in {xy,zw} below, one annotator assigned primary x with secondary y and the other assigned primary z with secondary w; a single letter is a singleton annotation): • {a,ba} - singleton matches secondary • {ab,ca} - primary matches secondary • {ab,cb} - secondary matches secondary • {ab,ba} - secondary matches primary, and vice versa • Note: the purpose is not to inflate the kappa value, but to accurately assess the data.
Kappa Augmentation • When a labeler employs a secondary label, treat it as a single annotation divided between two categories. • Select a value of p, where 0.5 ≤ p ≤ 1.0, based on how heavily to weight the secondary label. • Singleton annotations are assigned a score of 1.0 • Primary: p • Secondary: 1 − p (a sketch of this weighting follows below)
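A minimal sketch of this weighting, assuming each annotation arrives as an ordered list of one or two category names (primary first); the function and variable names are illustrative, not from the original annotation tools:

```python
def annotation_vector(labels, categories, p=0.6):
    """Spread one annotation over the category set.

    A singleton label gets weight 1.0; with two labels, the primary gets p
    and the secondary gets 1 - p (0.5 <= p <= 1.0).
    """
    vec = {c: 0.0 for c in categories}
    if len(labels) == 1:
        vec[labels[0]] = 1.0
    else:
        primary, secondary = labels
        vec[primary] = p
        vec[secondary] = 1.0 - p
    return vec

# Example (hypothetical labels), with p = 0.6:
# annotation_vector(["question", "answer"], ["question", "answer", "broadcast"])
# -> {"question": 0.6, "answer": 0.4, "broadcast": 0.0}
```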
Kappa Augmentation example • [Figure: annotator labels and the resulting annotation matrices with p = 0.6]
Kappa Augmentation example (ctd.) • [Figure: annotation matrices and the resulting agreement matrix]
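As a hypothetical illustration with p = 0.6 (the labels here are made up, and the overlap-of-vectors reading of the agreement matrix is an assumption): if annotator A labels a message question (primary) and answer (secondary) while annotator B labels it question only, A's annotation vector places 0.6 on question and 0.4 on answer, B's places 1.0 on question, and the message contributes 0.6 × 1.0 = 0.6 to the observed agreement.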
Kappa Augmentation example (ctd.) • To calculate P(E), use the relative frequencies of each annotator's label usage. • Kappa' is then computed with the original kappa formula, using the augmented counts.
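A sketch of the corpus-level computation, reusing annotation_vector from the earlier sketch. Treating each message's joint credit for a cell (i, j) as vec_A[i] · vec_B[j] is an assumption; only the marginals-based P(E) is stated explicitly in the slides.

```python
def augmented_kappa(messages, categories, p=0.6):
    """Augmented kappa' for two annotators.

    `messages` is a list of (labels_a, labels_b) pairs, each a list of one
    or two category names (primary first).
    """
    n = len(messages)
    p_a = 0.0                              # observed agreement
    marg_a = {c: 0.0 for c in categories}  # annotator A's label mass
    marg_b = {c: 0.0 for c in categories}  # annotator B's label mass
    for labels_a, labels_b in messages:
        va = annotation_vector(labels_a, categories, p)
        vb = annotation_vector(labels_b, categories, p)
        p_a += sum(va[c] * vb[c] for c in categories)  # diagonal credit
        for c in categories:
            marg_a[c] += va[c]
            marg_b[c] += vb[c]
    p_a /= n
    # P(E) from each annotator's relative label frequencies.
    p_e = sum((marg_a[c] / n) * (marg_b[c] / n) for c in categories)
    return (p_a - p_e) / (1.0 - p_e)
```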
Classification of messages • This augmentation allows us to classify messages based on their individual kappa' values at different values of p. • Class 1: high kappa' at all values of p. • Class 2: low kappa' at all values of p. • Class 3: high kappa' at p = 1.0 (low at p = 0.5). • Class 4: high kappa' at p = 0.5 (low at p = 1.0). • Note: mathematically, kappa' needn't be monotonic w.r.t. p, but with 2 annotators it is. (A classification sketch follows below.)
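A sketch of the four-way classification; the per-message agreement credit is used here as a stand-in for the "individual kappa'", and the 0.5 cutoff is hypothetical (neither is specified in the slides):

```python
def classify_message(labels_a, labels_b, categories, threshold=0.5):
    """Place a message into one of the four classes above."""
    def agreement(p):
        va = annotation_vector(labels_a, categories, p)
        vb = annotation_vector(labels_b, categories, p)
        return sum(va[c] * vb[c] for c in categories)

    high_at_1 = agreement(1.0) >= threshold     # primary-only view
    high_at_half = agreement(0.5) >= threshold  # secondary weighted equally
    if high_at_1 and high_at_half:
        return 1   # consistently high
    if not high_at_1 and not high_at_half:
        return 2   # consistently low
    return 3 if high_at_1 else 4  # 3: low to high, 4: high to low
```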
Corpus Annotation Analysis • Agreement is low at all values of p • K’(p=1.0) = 0.299 • K’(p=0.5) = 0.281 • Other views of the data will provide some insight into how to revise the annotation scheme. • Category distribution • Category co-occurrence • Category confusion • Class distribution • Category by class distribution
Corpus Annotation Analysis: Category by Class Distribution (1/2) • [Charts: Class 1 (consistently high) and Class 2 (consistently low)]
Corpus Annotation Analysis: Category by Class Distribution (2/2) • [Charts: Class 3 (low to high) and Class 4 (high to low)]
Next step: Sharpening method • In determining interannotator agreement with kappa, etc., two available pieces of information are overlooked: • Some annotators are “better” than others • Some messages are “easier to label” than others • By limiting the contribution of known poor annotators and difficult messages, we gain confidence in the final category assignment of each message. • How do we rank annotators? Messages?
Sharpening Method (ctd.) • Ranking annotators • Calculate kappa between each annotator and the rest of the group. • "Better" annotators have higher agreement with the group. • Ranking messages • Compute the variance (or the entropy, −Σ p log p) of the label vector summed over annotators. • Messages with high variance (equivalently, low entropy) are more consistently annotated. (A sketch of both rankings follows below.)
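A sketch of both rankings, reusing annotation_vector and augmented_kappa from the earlier sketches; with two annotators, "the rest of the group" is simply the other annotator, and any further pooling step is an assumption:

```python
from statistics import pvariance

def message_consistency(label_vectors):
    """Variance of the label vector summed over annotators.

    `label_vectors` holds one weight vector (dict over categories) per
    annotator for a single message; higher variance means the mass is
    concentrated on fewer categories, i.e. more consistent annotation.
    """
    categories = label_vectors[0].keys()
    summed = {c: sum(v[c] for v in label_vectors) for c in categories}
    return pvariance(summed.values())

def annotator_scores(annotations, categories, p=0.6):
    """Score each annotator by agreement with the rest of the group.

    `annotations` maps annotator name -> list of label lists, one per message.
    """
    scores = {}
    for name, labels in annotations.items():
        # With two annotators, the "rest of the group" is just the other one.
        other = next(l for n, l in annotations.items() if n != name)
        scores[name] = augmented_kappa(list(zip(labels, other)), categories, p)
    return scores
```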
Sharpening Method (ctd.) • How do we use these ranks? • Weight the annotators based on their rank. • Recompute the message matrix with weighted annotator contributions. • Weight the messages based on their rank. • Recompute the kappa values with weighted message contributions. • Repeat these steps until the change in the weights falls below a threshold. (A sketch of the loop follows below.)
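A sketch of the loop, reusing the helpers above. The per-message-weighted kappa variant, the zero-clipping of negative scores, and the convergence test are assumptions about details the slides leave open:

```python
def weighted_kappa(pairs, categories, msg_w, p=0.6):
    """Augmented kappa' with per-message weights (weighted sums replace counts)."""
    total = sum(msg_w) or 1.0
    p_a = 0.0
    marg_a = {c: 0.0 for c in categories}
    marg_b = {c: 0.0 for c in categories}
    for w, (la, lb) in zip(msg_w, pairs):
        va = annotation_vector(la, categories, p)
        vb = annotation_vector(lb, categories, p)
        p_a += w * sum(va[c] * vb[c] for c in categories)
        for c in categories:
            marg_a[c] += w * va[c]
            marg_b[c] += w * vb[c]
    p_a /= total
    p_e = sum((marg_a[c] / total) * (marg_b[c] / total) for c in categories)
    return (p_a - p_e) / (1.0 - p_e)

def sharpen(annotations, categories, p=0.6, tol=1e-3, max_iter=50):
    """Alternate annotator and message reweighting until the weights settle."""
    names = list(annotations)
    n_msgs = len(next(iter(annotations.values())))
    ann_w = {a: 1.0 for a in names}
    msg_w = [1.0] * n_msgs
    for _ in range(max_iter):
        # 1. Weight annotators by message-weighted agreement with the other
        #    annotator(s); clip negative kappa to zero.
        new_ann = {}
        for a in names:
            other = next(o for o in names if o != a)
            pairs = list(zip(annotations[a], annotations[other]))
            new_ann[a] = max(0.0, weighted_kappa(pairs, categories, msg_w, p))
        # 2. Weight messages by the consistency of their label vectors, with
        #    each annotator's contribution scaled by its new weight.
        new_msg = []
        for i in range(n_msgs):
            vecs = [{c: new_ann[a] * v for c, v in
                     annotation_vector(annotations[a][i], categories, p).items()}
                    for a in names]
            new_msg.append(message_consistency(vecs))
        # Stop once no weight moves by more than tol.
        delta = max(max(abs(new_ann[a] - ann_w[a]) for a in names),
                    max(abs(nw - ow) for nw, ow in zip(new_msg, msg_w)))
        ann_w, msg_w = new_ann, new_msg
        if delta < tol:
            break
    return ann_w, msg_w
```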
Summary • The ACM email corpus annotation raises two problems. • Because annotators may assign a message one or two labels, there is no clear way to calculate an agreement statistic. An augmentation to the kappa statistic is proposed. • Interannotator reliability is low (K < 0.3). Annotator reeducation and/or a redesign of the annotation materials is most likely necessary. • Hypothetically, the available annotated data can be used to improve category assignment.