InfoMagnets: Making Sense of Corpus Data Jaime Arguello Language Technologies Institute
Topic Segmentation: Helping InfoMagnets Make Sense of Corpus Data Jaime Arguello Language Technologies Institute
Outline • InfoMagnets • Applications • Topic Segmentation • Evaluation of 3 Algorithms • Results • Conclusions • Q/A
InfoMagnets Applications • Behavioral Research • 2 publishable results (submitted to CHI) • CycleTalk Project, LTI • Netscan Group, HCII • Conversational Interfaces • TuTalk (Gweon et al., 2005) • Guide authoring using pre-processed human-human sample conversations • Corpus organization makes authoring conversational agents less intimidating (Rosé, Pai, & Arguello, 2005)
Pre-processing Dialogue • Transcribed conversations → (1) Topic Segmentation → topic “chunks” → (2) Topic Clustering [diagram: chunks A, B, C grouped into topic clusters]
Topic Segmentation • Preprocess for InfoMagnets • Important computational linguistics problem! • Previous Work: • Marti Hearst’s TextTiling (1994) • Beeferman, Berger, and Lafferty (1997) • Barzilay and Lee (2004) NAACL best paper award! • Many others • But we are segmenting dialogue…
Topic Segmentation of Dialogue • Dialogue is Different: • Very little training data • Linguistic phenomena: ellipsis, telegraphic content • And, most importantly: coherence in dialogue is organized around a shared task, not around a single flow of information!
Coherence Defined Over a Shared Task • Multiple topic shifts occur in regions with no intersection of content words
Evaluation of 3 Algorithms • 22 student-tutor pairs • Thermodynamics domain • Conversation via chat interface • One coder • Results shown in terms of Pk (Beeferman, Berger, & Lafferty, 1999) • Significance tests: 2-tailed t-tests
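The Pk metric reported on these slides measures the probability that two points a fixed distance k apart are wrongly judged to be in the same or different segments. A minimal sketch, assuming boundaries are encoded as a 0/1 list where 1 marks a boundary after an utterance, and taking k as half the mean reference segment length by convention:

```python
def pk(reference, hypothesis, k=None):
    # P_k error metric (Beeferman et al., 1999): lower is better, 0.0 is perfect.
    # reference/hypothesis: lists where 1 marks a topic boundary after that utterance.
    n = len(reference)
    if k is None:
        # conventional probe width: half the mean reference segment length
        num_segs = sum(reference) + 1
        k = max(1, round(n / num_segs / 2))
    errors = 0
    for i in range(n - k):
        # are utterances i and i+k in the same segment (no boundary between them)?
        same_ref = sum(reference[i:i + k]) == 0
        same_hyp = sum(hypothesis[i:i + k]) == 0
        if same_ref != same_hyp:
            errors += 1
    return errors / (n - k)
```

Note that the degenerate NONE/ALL/EVEN baselines on the next slide can score surprisingly well under Pk, which is why they are worth reporting.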
3 Baselines • NONE: no topic boundaries • ALL: every utterance marks topic boundary • EVEN: every 13th utterance marks topic boundary • avg topic length = 13 utterances
1st Attempt: TextTiling (Hearst, 1997) • Slide two adjacent “windows” down the text • Calculate cosine correlation at each step • Use correlation values to calculate “depth” • “Depth” values higher than a threshold correspond to topic shifts [diagram: adjacent windows w1 and w2]
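The sliding-window procedure above can be sketched as follows. This is a simplification of Hearst's algorithm (no stemming, stop-word removal, or smoothing), with the window size an illustrative parameter:

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two bag-of-words Counters
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def texttiling_scores(utterances, w=3):
    # for each gap between utterances, compare the w utterances before vs. after
    bows = [Counter(u.lower().split()) for u in utterances]
    scores = []
    for gap in range(1, len(utterances)):
        left = sum(bows[max(0, gap - w):gap], Counter())
        right = sum(bows[gap:gap + w], Counter())
        scores.append(cosine(left, right))
    return scores

def depth_scores(scores):
    # depth at a gap = rise to the nearest peak on the left + rise on the right;
    # gaps whose depth exceeds a threshold are declared topic shifts
    depths = []
    for i, s in enumerate(scores):
        lpeak = s
        for j in range(i - 1, -1, -1):
            if scores[j] >= lpeak:
                lpeak = scores[j]
            else:
                break
        rpeak = s
        for j in range(i + 1, len(scores)):
            if scores[j] >= rpeak:
                rpeak = scores[j]
            else:
                break
        depths.append((lpeak - s) + (rpeak - s))
    return depths
```

On monologue-like text the deepest valleys line up with topic shifts; the next slide shows why this breaks down on task-oriented dialogue.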
TextTiling Results • TextTiling performs worse than baselines • Difference not statistically significant • Why doesn’t it work?
TextTiling Results • Topic boundary set heuristically where correlation is 0 • Bad results, but still valuable!
2nd Attempt: Barzilay and Lee (2004) • Cluster utterances • Treat each cluster as a “state” • Construct HMM • Emissions: state-specific language models • Transitions: based on location and cluster-membership of the utterances • Viterbi re-estimation until convergence
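A drastically simplified sketch of the content-model idea above, not B&L's implementation: states are initialized by evenly slicing the dialogue rather than by their complete-link clustering, emissions are add-one-smoothed unigram language models, and transitions are a single stay/switch probability rather than their position-based estimates:

```python
import math
from collections import Counter

def state_lm(cluster_utts, vocab):
    # add-one-smoothed unigram log-probabilities for one state
    counts = Counter(w for u in cluster_utts for w in u.split())
    total = sum(counts.values())
    return {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}

def viterbi(utts, lms, stay=math.log(0.8), switch=math.log(0.2)):
    # most likely state sequence under the state LMs and a stay/switch transition model
    K = len(lms)
    emit = [[sum(lm[w] for w in u.split()) for lm in lms] for u in utts]
    prev, back = [emit[0][s] for s in range(K)], []
    for t in range(1, len(utts)):
        cur, bp = [], []
        for s in range(K):
            r = max(range(K), key=lambda q: prev[q] + (stay if q == s else switch))
            cur.append(prev[r] + (stay if r == s else switch) + emit[t][s])
            bp.append(r)
        prev = cur
        back.append(bp)
    s = max(range(K), key=lambda q: prev[q])
    path = [s]
    for bp in reversed(back):
        s = bp[s]
        path.append(s)
    return path[::-1]

def segment(utts, K=2, iters=5):
    vocab = {w for u in utts for w in u.split()}
    n = len(utts)
    # initialize state assignments by evenly slicing the dialogue
    path = [min(i * K // n, K - 1) for i in range(n)]
    for _ in range(iters):
        lms = [state_lm([u for u, s in zip(utts, path) if s == k], vocab) for k in range(K)]
        path = viterbi(utts, lms)  # re-estimation step
    # topic boundary wherever the decoded state changes
    return [int(path[i] != path[i - 1]) for i in range(1, n)]
```

The re-estimation loop mirrors the slide: build state-specific language models, Viterbi-decode, and repeat until the assignments stabilize.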
B&L Results • B&L statistically better than TT, but not better than degenerate algorithms
B&L Results • Topic boundaries too fine-grained • Fixed expressions (“ok”, “yeah”, “sure”) • Remember: cohesion is based on the shared task • Are the state-specific language models sufficiently different?
Adding Dialogue Dynamics • Dialogue Act coding scheme • Developed for discourse analysis of human-tutor dialogues • 4 main dimensions: • Action • Depth • Focus • Control • Dialogue Exchange (Sinclair and Coulthard, 1975)
3rd Attempt: Cross-Dimensional Learning • Cross-dimensional learning (Donmez et al., 2004) • Use estimated labels on some dimensions to learn other dimensions • 3 types of features: • Text (discourse cues) • Lexical coherence (binary) • Dialogue-act labels • 10-fold cross-validation • Topic boundaries learned from estimated labels, not hand-coded ones!
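The three feature types might be extracted per utterance roughly as below. CUE_WORDS and the dialogue-act labels here are illustrative placeholders, not the actual coding scheme; the resulting vectors would feed a standard classifier under 10-fold cross-validation:

```python
CUE_WORDS = {"ok", "so", "now", "next", "alright", "anyway"}  # illustrative cue list only

def features(utts, acts, i, win=3):
    # Feature vector for the question "does a topic boundary precede utterance i?"
    # acts holds (possibly machine-estimated) dialogue-act labels, one per utterance.
    words = set(utts[i].lower().split())
    context = set(w for u in utts[max(0, i - win):i] for w in u.lower().split())
    return {
        "cue": int(bool(words & CUE_WORDS)),      # text feature: discourse cue present?
        "coherence": int(bool(words & context)),  # binary lexical coherence with recent context
        "act": acts[i],                           # dialogue-act dimension label
        "prev_act": acts[i - 1] if i > 0 else "START",
    }
```

The cross-dimensional twist is that the "act" and "prev_act" values fed to the boundary classifier are themselves classifier outputs from the other dimensions, not gold annotations.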
X-Dimensional Learning Results • X-DIM statistically better than TT, degenerate algorithms AND B&L!
Future Directions • Merge cross-dimensional learning (w/ dialogue act features) with B&L content modeling HMM approach. • Explore other work in topic segmentation of dialogue
Summary • Introduction to InfoMagnets • Applications • Need for topic segmentation • Evaluation of other algorithms • Novel algorithm using X-dimensional learning w/ statistically significant improvement
Q/A Thank you!