
InfoMagnets: Making Sense of Corpus Data


Presentation Transcript


  1. InfoMagnets: Making Sense of Corpus Data Jaime Arguello Language Technologies Institute

  2. Topic Segmentation: Helping InfoMagnets Make Sense of Corpus Data Jaime Arguello Language Technologies Institute

  3. Outline • InfoMagnets • Applications • Topic Segmentation • Evaluation of 3 Algorithms • Results • Conclusions • Q/A

  4. InfoMagnets

  5. InfoMagnets Applications • Behavioral Research • 2 publishable results (submitted to CHI) • CycleTalk Project, LTI • Netscan Group, HCII • Conversational Interfaces • TuTalk (Gweon et al., 2005) • Guides authoring using pre-processed human-human sample conversations • Corpus organization makes authoring conversational agents less intimidating (Rosé, Pai, & Arguello, 2005)

  6. Pre-processing Dialogue • (1) Topic Segmentation: transcribed conversations are split into topic “chunks” • (2) Topic Clustering: the chunks are grouped into topics [diagram: the two-stage pipeline from raw transcripts to clustered topic chunks A, B, C]

  7. Topic Segmentation • Preprocess for InfoMagnets • Important computational linguistics problem! • Previous Work: • Marti Hearst’s TextTiling (1994) • Beeferman, Berger, and Lafferty (1997) • Barzilay and Lee (2004) NAACL best paper award! • Many others • But we are segmenting dialogue…

  8. Topic Segmentation of Dialogue • Dialogue is Different: • Very little training data • Linguistic Phenomena • Ellipsis • Telegraphic content • And, most importantly… Coherence in dialogue is organized around a shared task, and not around a single flow of information!

  9. Coherence Defined Over Shared Task • Multiple topic shifts occur in regions with no intersection of content words

  10. Evaluation of 3 Algorithms • 22 student-tutor pairs • Thermodynamics domain • Conversation via chat interface • One coder • Results shown in terms of Pk (Beeferman, Berger, & Lafferty, 1999) • Significance tests: 2-tailed t-tests
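
Pk measures the probability that two utterances a fixed distance k apart are classified inconsistently (same topic vs. different topics) by the hypothesized segmentation; lower is better. A minimal sketch, assuming segmentations are represented as per-utterance segment-id sequences (a representational choice of this sketch, not something fixed by the slides):

```python
def pk(ref, hyp, k=None):
    """Pk (Beeferman, Berger, & Lafferty, 1999).

    ref and hyp are sequences of segment ids, one per utterance.
    Returns the fraction of probe pairs (i, i + k) on which the
    hypothesis disagrees with the reference about whether the two
    utterances fall in the same segment.
    """
    if k is None:
        # conventional default: half the mean reference segment length
        k = max(1, round(len(ref) / (2 * len(set(ref)))))
    probes = len(ref) - k
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(probes)
    )
    return errors / probes


# Example: a hypothesis that places the single boundary two utterances early.
ref = [0] * 5 + [1] * 5
hyp = [0] * 3 + [1] * 7
print(pk(ref, hyp))  # 0.0 would mean perfect agreement
```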

  11. 3 Baselines • NONE: no topic boundaries • ALL: every utterance marks topic boundary • EVEN: every 13th utterance marks topic boundary • avg topic length = 13 utterances
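
For concreteness, here is how the three degenerate baselines look in the same segment-id representation used in the Pk sketch above (the representation, again, is this sketch's assumption):

```python
def baseline_none(n):
    # NONE: the whole dialogue is a single segment
    return [0] * n

def baseline_all(n):
    # ALL: every utterance starts its own segment
    return list(range(n))

def baseline_even(n, width=13):
    # EVEN: a boundary every `width` utterances
    # (13 = the corpus's average topic length, per the slide)
    return [i // width for i in range(n)]
```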

  12. 1st Attempt: TextTiling (Hearst, 1997) • Slide two adjacent “windows” (w1, w2) down the text • Calculate cosine correlation at each step • Use correlation values to calculate “depth” • “Depth” values higher than a threshold correspond to topic shifts
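
A minimal sketch of the TextTiling idea described above. The fixed window size and depth cutoff are illustrative assumptions; Hearst's algorithm derives the cutoff from the mean and standard deviation of the depth scores and works over token-sequence blocks rather than whole utterances:

```python
import re
from collections import Counter
from math import sqrt

def cosine(a, b):
    # cosine similarity between two bag-of-words Counters
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def texttiling(utterances, w=10, threshold=0.1):
    bags = [Counter(re.findall(r"[a-z']+", u.lower())) for u in utterances]
    # cosine between the w utterances before and after each candidate gap
    sims = [
        cosine(sum(bags[max(0, g - w):g], Counter()),
               sum(bags[g:g + w], Counter()))
        for g in range(1, len(bags))
    ]
    boundaries = []
    for i, s in enumerate(sims):
        # depth score: how far this valley sits below the nearest
        # peak on each side of it
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1
        right = i
        while right < len(sims) - 1 and sims[right + 1] >= sims[right]:
            right += 1
        if (sims[left] - s) + (sims[right] - s) > threshold:
            boundaries.append(i + 1)  # boundary falls before utterance i + 1
    return boundaries
```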

  13. TextTiling Results • TextTiling performs worse than baselines • Difference not statistically significant • Why doesn’t it work?

  14. TextTiling Results • Topic boundary set heuristically where correlation is 0 • Bad results, but still valuable!

  15. 2nd Attempt: Barzilay and Lee (2004) • Cluster utterances • Treat each cluster as a “state” • Construct HMM • Emissions: state-specific language models • Transitions: based on location and cluster-membership of the utterances • Viterbi re-estimation until convergence
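
A greatly simplified sketch of that loop. It substitutes k-means over bag-of-words vectors for the paper's complete-link clustering and unigram language models for its smoothed bigram models, and it omits the cluster-splitting details, so it illustrates the cluster-as-state / decode / re-estimate cycle rather than reproducing Barzilay and Lee's system:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def content_model_segment(utterances, n_states=8, n_iters=10):
    X = CountVectorizer().fit_transform(utterances).toarray()
    n, V = X.shape
    states = KMeans(n_clusters=n_states, n_init=10).fit_predict(X)
    for _ in range(n_iters):
        # state-specific unigram language models, add-one smoothed
        emit = np.ones((n_states, V))
        for s in range(n_states):
            emit[s] += X[states == s].sum(axis=0)
        emit /= emit.sum(axis=1, keepdims=True)
        # transition probabilities from the current state sequence, smoothed
        trans = np.ones((n_states, n_states))
        for a, b in zip(states[:-1], states[1:]):
            trans[a, b] += 1
        trans /= trans.sum(axis=1, keepdims=True)
        # Viterbi decoding in log space
        log_e = X @ np.log(emit).T  # (n, n_states) utterance log-likelihoods
        log_t = np.log(trans)
        dp = np.full((n, n_states), -np.inf)
        bp = np.zeros((n, n_states), dtype=int)
        dp[0] = log_e[0]
        for t in range(1, n):
            scores = dp[t - 1][:, None] + log_t
            bp[t] = scores.argmax(axis=0)
            dp[t] = scores.max(axis=0) + log_e[t]
        new = np.zeros(n, dtype=int)
        new[-1] = dp[-1].argmax()
        for t in range(n - 2, -1, -1):
            new[t] = bp[t + 1][new[t + 1]]
        if np.array_equal(new, states):
            break  # state assignments stabilized
        states = new
    # place a topic boundary wherever the decoded state changes
    return [t for t in range(1, n) if states[t] != states[t - 1]]
```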

  16. B&L Results • B&L statistically better than TT, but not better than degenerate algorithms

  17. B&L Results • Topic boundaries too fine-grained • Fixed expressions (“ok”, “yeah”, “sure”) • Remember: cohesion is based on the shared task • Are the state-based language models sufficiently different?

  18. Adding Dialogue Dynamics • Dialogue Act coding scheme • Developed for discourse analysis of human-tutor dialogues • 4 main dimensions: • Action • Depth • Focus • Control • Dialogue Exchange (Sinclair and Coulthard, 1975)
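
A tiny illustrative container for one annotation under this scheme; the example labels in the comments are hypothetical, since the slide only names the four dimensions:

```python
from dataclasses import dataclass

@dataclass
class DialogueAct:
    action: str   # e.g. "question" or "explanation" (hypothetical labels)
    depth: str
    focus: str
    control: str
```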

  19. 3rd Attempt: Cross-Dimensional Learning • X-dimensional learning (Donmez et al., 2004) • Use estimated labels on some dimensions to learn other dimensions • 3 types of features: • Text (discourse cues) • Lexical coherence (binary) • Dialogue Act labels • 10-fold cross-validation • Topic boundaries learned on estimated labels, not hand-coded ones!
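
A hedged sketch of one plausible featurization over those three feature types for boundary classification. The cue-word list, the overlap test behind the binary coherence flag, and the choice of classifier are all assumptions of this sketch; the actual method follows Donmez et al. (2004):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

CUE_WORDS = ("ok", "so", "now", "well", "alright")  # hypothetical cue list

def boundary_features(utterances, act_labels):
    # Per utterance: discourse-cue indicators, a binary lexical-coherence
    # flag (word overlap with the previous utterance), and the *estimated*
    # dialogue-act label, as in cross-dimensional learning.
    feats, prev = [], set()
    for utt, act in zip(utterances, act_labels):
        words = set(utt.lower().split())
        row = [int(utt.lower().startswith(c)) for c in CUE_WORDS]
        row.append(int(bool(words & prev)))  # lexical coherence (binary)
        row.append(act)                      # dialogue-act label id
        feats.append(row)
        prev = words
    return feats

# 10-fold cross-validation against the hand-coded topic boundaries:
# scores = cross_val_score(LogisticRegression(max_iter=1000),
#                          boundary_features(utts, estimated_acts),
#                          is_boundary, cv=10)
```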

  20. X-Dimensional Learning Results • X-DIM statistically better than TT, degenerate algorithms AND B&L!

  21. Statistically Significant Improvement

  22. Future Directions • Merge cross-dimensional learning (w/ dialogue act features) with B&L content modeling HMM approach. • Explore other work in topic segmentation of dialogue

  23. Summary • Introduction to InfoMagnets • Applications • Need for topic segmentation • Evaluation of other algorithms • Novel algorithm using X-dimensional learning w/statistically significant improvement

  24. Q/A Thank you!
