Conversation Disentanglement in Sports Discourse Anthony Wong 6/01/11
Importance of Topic • What is conversation disentanglement? • A clustering task: dividing a transcript into a number of smaller, separate conversations • Conversation disentanglement has a couple of practical applications: • Summary generation • User-interface systems such as automatic threading
Basis of my Approach • Michael Elsner and Eugene Charniak (2008) • Their model uses lexical and non-lexical features to cluster utterances into threads • Features: time between utterances, same speaker, number of shared words, “content” words
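The features listed on this slide can be illustrated with a minimal sketch of a pairwise feature extractor. This is not the Elsner/Charniak code: the `Utterance` class, the `pair_features` function, the feature names, and the tiny stopword list are all hypothetical and exist only for illustration. "Content" words are approximated here as shared words that are not stopwords, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    time: int      # timestamp as it appears in the transcript (unit assumed)
    speaker: str
    text: str


# Tiny illustrative stopword list (hypothetical); a real system would use a fuller one.
STOPWORDS = {"a", "the", "is", "it", "to", "i", "this", "that", "but", "and", "in", "of", "on"}


def tokens(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (w.strip(".,!?:;'\"-@").lower() for w in text.split())
    return {w for w in words if w}


def pair_features(a, b):
    """Features for deciding whether utterances a and b belong to the same thread."""
    shared = tokens(a.text) & tokens(b.text)
    return {
        "time_gap": abs(b.time - a.time),
        "same_speaker": a.speaker == b.speaker,
        "shared_words": len(shared),
        # Assumption: "content" words are shared words outside the stopword list.
        "shared_content_words": len(shared - STOPWORDS),
    }


a = Utterance(715, "Anthony(RapsFan)", "@Batman: His WP48 is the worst on the team.")
b = Utterance(715, "Arnold", "Holy impossibilities , Batman - that won't happen.")
features = pair_features(a, b)
```

A classifier would take one such feature dictionary per utterance pair; the pairwise decisions are then combined into clusters.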
Proposed Project Overview • Follow the methodology in Elsner and Charniak’s paper • Create and annotate a dataset of sports discourse • Use the existing Elsner/Charniak model to provide baseline classification results and see how well the model adapts to a different chat domain • Test different feature combinations to try to raise performance • ? – Compare results with the Elsner/Charniak paper in some meaningful way
Progress so far • Retrieve and prepare data • Annotate data set • Test existing model as is on my data set • Test out different feature combinations • *Evaluate model performance
Annotating the data
T1 715 KateC : Sam - this is going to be painful, isn't it?
T1 715 SamHolako : I hope not Kate, but Howard, Nelson and Carter have killed the Raptors in the past
T2 715 JaredWade : Classic Frisco. The Minnesota bathroom smells worse, I hear.
T3 715 Anthony(RapsFan) : @Batman: His WP48 is the worst on the team. Andrea is terrible. He scores. That's about it.
T3 715 Arnold : Holy impossibilities , Batman - that won't happen.
T4 715 BretLaGree : Raja Bell and Mike Bibby just held a flop-off in the lane. Bell won.
T5 715 Bobbo : Zach, Go hit up Cinnabun!!! worth the $$...write it off to ESPN anyway
T5 715 ZachHarper : I don't think it works that way
T6 715 Aras : Jared!
T6 715 JaredWade : Aras.
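Each annotated line on this slide appears to follow the pattern thread label, timestamp, speaker, colon, text. A minimal sketch of a parser for that layout follows. The `AnnotatedLine` type, the `parse_line` function, and the regular expression are assumptions inferred from the sample, not part of the Elsner code.

```python
import re
from typing import NamedTuple, Optional


class AnnotatedLine(NamedTuple):
    thread: int
    time: int
    speaker: str
    text: str


# Assumed layout: "T<thread> <time> <speaker> : <text>", speaker without spaces.
LINE_RE = re.compile(r"^T(\d+)\s+(\d+)\s+(\S+)\s+:\s+(.*)$")


def parse_line(line: str) -> Optional[AnnotatedLine]:
    """Return the parsed line, or None if it does not match the assumed layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    thread, time, speaker, text = match.groups()
    return AnnotatedLine(int(thread), int(time), speaker, text)


parsed = parse_line("T3 715 Anthony(RapsFan) : @Batman: His WP48 is the worst on the team.")
```

Returning `None` for lines that do not match keeps unannotated or malformed lines from silently entering the data set.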
Annotating the data • The annotated part of this transcript has 399 lines. • 177 unique threads. • The average conversation length is 2.25423728814. • The median conversation length is 2. • The entropy is 7.0155726118 bits. • The median chat has 0.0 interruptions per line. • The average block of 10 contains 6.25706940874 threads. • The line-averaged conversation density is 2.77944862155.
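Some of these statistics can be recomputed from the thread labels alone. The sketch below covers the thread count, the average and median conversation length, and the entropy. It assumes the entropy is the Shannon entropy of the distribution of lines over threads; that definition is a guess, not taken from the script that produced the slide's numbers. The function name `thread_stats` is hypothetical, and the interruption, block, and density figures are not reproduced.

```python
import math
from collections import Counter
from statistics import median


def thread_stats(labels):
    """Summarize a sequence of per-line thread labels."""
    if not labels:
        raise ValueError("need at least one labeled line")
    sizes = Counter(labels)              # thread label -> number of lines
    n_lines = len(labels)
    n_threads = len(sizes)
    # Assumed definition: entropy of the share of lines falling in each thread.
    entropy = -sum((c / n_lines) * math.log2(c / n_lines) for c in sizes.values())
    return {
        "lines": n_lines,
        "threads": n_threads,
        "mean_length": n_lines / n_threads,
        "median_length": median(sizes.values()),
        "entropy_bits": entropy,
    }


# Thread labels of the ten annotated sample lines (T1 T1 T2 T3 T3 T4 T5 T5 T6 T6).
sample = [1, 1, 2, 3, 3, 4, 5, 5, 6, 6]
stats = thread_stats(sample)
```

As a sanity check on the slide, the reported average conversation length matches 399 lines divided by 177 threads.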
Running Elsner model as is
• T1 715 KateC : Sam - this is going to be painful, isn't it?
• T2 715 SamHolako : I hope not Kate, but Howard, Nelson and Carter have killed the Raptors in the past
• T3 715 JaredWade : Classic Frisco. The Minnesota bathroom smells worse, I hear.
• T4 715 Anthony(RapsFan) : @Batman: His WP48 is the worst on the team. Andrea is terrible. He scores. That's about it.
• T5 715 Arnold : Holy impossibilities , Batman - that won't happen.
• T6 715 BretLaGree : Raja Bell and Mike Bibby just held a flop-off in the lane. Bell won.
• T7 715 Bobbo : Zach, Go hit up Cinnabun!!! worth the $$...write it off to ESPN anyway
• T8 715 ZachHarper : I don't think it works that way
• T9 715 Aras : Jared!
• T9 715 JaredWade : Aras.
Running Elsner model as is • 368 unique threads. • The average conversation length is 1.08423913043. • The median conversation length is 1. • The entropy is 8.48485646504 bits. • The median chat has 0.0 interruptions per line. • The average block of 10 contains 9.52699228792 threads. • The line-averaged conversation density is 1.42355889724.
Editing the model and evaluation • Still in progress • A lot of room for improvement • Many different feature combinations to try • Need to get evaluation code running
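Until the evaluation code is running, a rough comparison between the annotation and the model output can be sketched with a local-agreement style score: for every line and each of its k preceding lines, check whether both labelings agree on "same thread" versus "different thread". This is a simplified reading of that kind of metric, not the Elsner evaluation script. The function name `local_agreement` and the default `k=3` are assumptions.

```python
def local_agreement(gold, pred, k=3):
    """Fraction of nearby line pairs on which two labelings agree (same vs. different thread)."""
    if len(gold) != len(pred):
        raise ValueError("labelings must cover the same lines")
    agree = total = 0
    for i in range(len(gold)):
        # Compare line i with up to k lines immediately before it.
        for j in range(max(0, i - k), i):
            same_gold = gold[i] == gold[j]
            same_pred = pred[i] == pred[j]
            agree += same_gold == same_pred
            total += 1
    return agree / total if total else 1.0


# Thread labels from the two sample slides: hand annotation vs. unmodified model output.
annotated = [1, 1, 2, 3, 3, 4, 5, 5, 6, 6]
model_out = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]
score = local_agreement(annotated, model_out, k=3)   # 21 of 24 pairs agree: 0.875
```

The score only looks at nearby pairs, so it does not require thread labels to match across the two labelings, only the same/different decisions. On the ten sample lines the model scores fairly high despite over-splitting, because most nearby pairs really are in different threads.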
Issues • Documentation for Elsner code is good, but my Python is not • Integration issues between my data and Elsner code • MEGA Model Optimization Package (megam)