Overview of the TDT 2001 Evaluation and Results

Jonathan Fiscus. Gaithersburg Holiday Inn, Gaithersburg, Maryland. November 12-13, 2001.



  1. Overview of the TDT 2001 Evaluation and Results. Jonathan Fiscus. Gaithersburg Holiday Inn, Gaithersburg, Maryland, November 12-13, 2001

  2. Outline • TDT Evaluation Overview • 2001 TDT Evaluation Result Summaries • First Story Detection (FSD) • Topic Detection • Topic Tracking • Link Detection • Other Investigations www.nist.gov/TDT

  3. TDT 101 “Applications for organizing text” Terabytes of Unorganized data • 5 TDT Applications • Story Segmentation • Topic Tracking • Topic Detection • First Story Detection • Link Detection

  4. TDT’s Research Domain • Technology challenge • Develop applications that organize and locate relevant stories from a continuous feed of news stories • Research driven by evaluation tasks • Composite applications built from • Automatic Speech Recognition • Story Segmentation • Document Retrieval

  5. Definitions • A topic is … a seminal event or activity, along with all directly related events and activities. • A story is … a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.

  6. Example Topic Title: Mountain Hikers Lost • WHAT: 35 or 40 young mountain hikers were lost in an avalanche in France around the 20th of January. • WHERE: Orres, France • WHEN: January 1998 • RULES OF INTERPRETATION: • Rule #5. Accidents

  7. TDT 2001 Evaluation Corpus • TDT3 + Supplemental Corpora used for the evaluation*† • TDT3 Corpus • Third consecutive use for evaluations • XXX stories, 4th Qtr. 1998 • Used for Tracking and Link Detection development test • Supplement of 35K stories added to TDT3 • No annotations • Data added from both 3rd and 4th Qtr. 1998 • Not used for FSD tests • LDC Annotations † • 120 fully annotated topics: divided into published and withheld sets • 120 partially annotated topics • FSD used all 240 topics • Topic Detection used the 120 fully annotated topics • Tracking and Link Detection used the 60 fully annotated withheld topics * see www.nist.gov/speech/tests/tdt/tdt2001 for details † see www.ldc.upenn.edu/Projects/TDT3/ for details

  8. TDT3 Topic Division TDT 2000 Systems • Two topic sets: • Published topics • Withheld topics • Selection criteria: • 60 topics per set • 30 of the 1999 topics • 30 of the 2000 topics • Balanced by number of on-topic stories

  9. TDT Evaluation Methodology • Evaluation tasks are cast as detection tasks: • YES there is a target, or NO there is not • Performance is measured in terms of detection cost: “a weighted sum of missed detection and false alarm probabilities” • $C_{Det} = C_{Miss} \cdot P_{Miss} \cdot P_{target} + C_{FA} \cdot P_{FA} \cdot (1 - P_{target})$ • $C_{Miss} = 1$ and $C_{FA} = 0.1$ are preset costs • $P_{target} = 0.02$ is the a priori probability of a target • Detection cost is normalized to generally lie between 0 and 1: $(C_{Det})_{Norm} = C_{Det} / \min\{C_{Miss} \cdot P_{target},\ C_{FA} \cdot (1 - P_{target})\}$ • When based on the YES/NO decisions, it is referred to as the actual decision cost • Detection Error Tradeoff (DET) curves graphically depict the performance tradeoff between $P_{Miss}$ and $P_{FA}$ • Makes use of likelihood scores attached to the YES/NO decisions • Minimum DET point is the best score a system could achieve with proper thresholds
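
The cost formulas on slide 9 can be worked through numerically. The sketch below is only an illustration, not NIST's evaluation software: the function names and the toy scores are hypothetical, and it pools all trials into one pair of error rates, which is an assumption the slide does not state.

```python
C_MISS = 1.0      # preset cost of a missed detection (slide 9)
C_FA = 0.1        # preset cost of a false alarm (slide 9)
P_TARGET = 0.02   # a priori probability of a target (slide 9)


def detection_cost(p_miss, p_fa):
    """C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target)."""
    return C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1.0 - P_TARGET)


def normalized_cost(p_miss, p_fa):
    """Divide C_Det by the cost of the cheaper trivial system (always NO or always YES)."""
    floor = min(C_MISS * P_TARGET, C_FA * (1.0 - P_TARGET))
    return detection_cost(p_miss, p_fa) / floor


def error_rates(scores, is_target, threshold):
    """P_Miss and P_FA when every trial scoring >= threshold is answered YES."""
    n_target = sum(is_target)
    n_nontarget = len(is_target) - n_target
    misses = sum(1 for s, t in zip(scores, is_target) if t and s < threshold)
    false_alarms = sum(1 for s, t in zip(scores, is_target) if not t and s >= threshold)
    return misses / n_target, false_alarms / n_nontarget


def minimum_det_cost(scores, is_target):
    """Smallest normalized cost over every threshold the scores allow."""
    thresholds = sorted(set(scores)) + [float("inf")]
    return min(normalized_cost(*error_rates(scores, is_target, th)) for th in thresholds)


# Hypothetical toy trials: likelihood scores and whether each trial is a true target.
toy_scores = [0.9, 0.8, 0.3, 0.2]
toy_targets = [True, False, True, False]
print(normalized_cost(1.0, 0.0))                  # a system that always answers NO
print(minimum_det_cost(toy_scores, toy_targets))  # best cost over all thresholds
```

With these presets, a system that always answers NO has a normalized cost of 1, which is why useful systems are expected to score below 1.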

  10. TDT: Experimental Control • Good research requires experimental controls • Conditions that affect performance in TDT • Newswire vs. Broadcast News • Manual vs. automatic transcription of Broadcast News • Manual vs. automatic story segmentation • Mono vs. multilingual language material • Topic training amounts and languages • Default automatic English translations of Mandarin vs. native Mandarin orthography • Decision deferral periods

  11. Outline • TDT Evaluation Overview • 2001 TDT Evaluation Result Summaries • First Story Detection (FSD) • Topic Detection • Topic Tracking • Link Detection • Other Investigations

  12. First Story Detection Results • System Goal: To detect the first story that discusses each topic • Evaluating “part” of a Topic Detection system, i.e., when to start a new cluster [Figure: first stories on two topics (Topic 1, Topic 2) versus stories that are not first stories]

  13. 2001 TDT Primary FSD Results: Newswire+BNews ASR, English texts, automatic story boundaries, 10 File Deferral

  14. TDT Topic Detection Task • System Goal: To detect topics in terms of the (clusters of) stories that discuss them. • “Unsupervised” topic training • New topics must be detected as the incoming stories are processed. • Input stories are then associated with one of the topics. [Figure: a story stream separated into Topic 1 and Topic 2]
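
One simple way to read slides 12 and 14 together is single-pass clustering: each incoming story either joins the most similar existing cluster or starts a new one, and starting a new cluster is the first-story decision. The sketch below assumes bag-of-words cosine similarity and a hypothetical threshold of 0.2. It is not the method of any evaluated system, the names `cosine` and `detect_topics` are made up for illustration, and it ignores deferral periods.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_topics(stories, threshold=0.2):
    """Process stories in arrival order; return (cluster_id, is_first_story) per story."""
    centroids = []   # one summed term-count vector per detected topic
    decisions = []
    for text in stories:
        vector = Counter(text.lower().split())
        sims = [cosine(vector, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < threshold:
            centroids.append(Counter(vector))      # new topic: this is a first story
            decisions.append((len(centroids) - 1, True))
        else:
            centroids[best].update(vector)         # fold the story into its topic
            decisions.append((best, False))
    return decisions


# Hypothetical toy stream with two topics.
stream = [
    "hikers lost avalanche france",
    "rescue teams search lost hikers france",
    "tokyo stock markets fall sharply",
    "tokyo stock markets recover",
]
for story, decision in zip(stream, detect_topics(stream)):
    print(decision, story)
```

The threshold controls the tradeoff from slide 9: a low value merges unrelated stories into one cluster, while a high value splits a topic and declares too many first stories.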

  15. Primary Topic Detection Sys. Newswire+BNasr, Multilingual, Auto Boundaries, Deferral=10 [Figure labels: Native Mandarin, Translated Mandarin]

  16. Topic Tracking Task • System Goal: To detect stories that discuss the target topic, in multiple source streams. • Supervised Training • Given Nt sample stories that discuss a given target topic • Testing • Find all subsequent stories that discuss the target topic [Figure: training data (on-topic) followed by test data (unknown)]
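
The tracking setup on slide 16 can be sketched as follows: build a topic model from the Nt on-topic training stories, then emit a YES/NO decision and a score for each later story. This is a minimal sketch under stated assumptions: the topic model is a summed term-count vector, the score is cosine similarity, and the 0.25 threshold and the function names are hypothetical. None of this is taken from an evaluated system.

```python
import math
from collections import Counter


def term_vector(text):
    """Lowercased whitespace tokens as a term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def track_topic(training_stories, test_stories, threshold=0.25):
    """Return a (decision, score) pair for each test story, in arrival order."""
    topic = Counter()
    for text in training_stories:       # the Nt on-topic sample stories
        topic.update(term_vector(text))
    results = []
    for text in test_stories:
        score = cosine(term_vector(text), topic)
        results.append(("YES" if score >= threshold else "NO", score))
    return results


# Nt = 1, matching the "1 Positive-0 Negative" training condition on slide 17.
training = ["hikers lost avalanche france"]
incoming = ["rescue teams search lost hikers france", "tokyo stock markets recover"]
for decision, score in track_topic(training, incoming):
    print(decision, round(score, 3))
```

Keeping the score alongside each decision is what lets an evaluator sweep thresholds and draw a DET curve, as described on slide 9.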

  17. Primary Tracking Results: Newswire+BNman, English Training: 1 Positive-0 Negative

  18. TDT Link Detection Task • System Goal: To detect whether a pair of stories discuss the same topic. (Can be thought of as a “primitive operator” to build a variety of applications)

  19. Primary Link Det. Results: Newswire+BNasr, Deferral=10 • NTU’s thresholding is unusual [Figure labels: Native Mandarin, Translated Mandarin]

  20. Outline • TDT Evaluation Overview • 2001 TDT Evaluation Result Summaries • First Story Detection (FSD) • Topic Detection • Topic Tracking • Link Detection • Other Investigations

  21. Primary Topic Detection Sys. Newswire+BNasr, Multilingual, Auto Boundaries, Deferral=10

  22. Topic Detection: False Alarm Visualization • Systems behave very differently • IMHO a user would not like to use a high FA rate system • Perhaps false alarms should get more weight in the cost function • Outer Circle: Number of stories in a cluster • Light => cluster was mapped to a reference topic • Blue => unmapped cluster • Inner Circle: Number of on-topic stories [Figure: two panels, UMass1 and TNO1-late; axes: Topic ID vs. system clusters, ordered by size]

  23. Topic Detection: 2000 vs. 2001 Index Files. Multilingual Text, Newswire + Broadcast News, Auto Boundaries, Deferral=10 • The 2000 test corpus covered 3 months • The 2001 corpus covered 6 months • 35K more stories • Might affect performance, BUT appears not to.

  24. Topic Detection Evaluation via a Link-Style Metric • Motivation: • There is instability of measured performance during system tuning • Likely to be a direct result of the need to map reference topic clusters to system-defined clusters • We would like to avoid the assumption of independent topics

  25. Topic Detection Evaluation via a Link-Style Metric • Evaluation Criterion: “Does this pair of stories discuss the same topic?” • If a story pair is on the same topic • A missed detection is declared if the system put the stories in different clusters • Otherwise, it’s a correct detection • If a pair of stories is not on the same topic • A false alarm is declared if the system put the stories in the same cluster • Otherwise, it’s a correct non-detection
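
The pairwise criterion on slide 25 can be sketched directly: enumerate every story pair, compare reference topics against system clusters, and count misses and false alarms. The function name and toy labels below are hypothetical, and the sketch assumes each story carries exactly one reference topic label, which is a simplification the slides do not make.

```python
from itertools import combinations


def link_style_errors(reference_topic, system_cluster):
    """Pairwise P_Miss and P_FA for a clustering.

    Both arguments map story id -> label and must cover the same story ids.
    """
    misses = false_alarms = same_topic_pairs = diff_topic_pairs = 0
    for a, b in combinations(sorted(reference_topic), 2):
        same_topic = reference_topic[a] == reference_topic[b]
        same_cluster = system_cluster[a] == system_cluster[b]
        if same_topic:
            same_topic_pairs += 1
            misses += not same_cluster       # on-topic pair split across clusters
        else:
            diff_topic_pairs += 1
            false_alarms += same_cluster     # off-topic pair merged into one cluster
    p_miss = misses / same_topic_pairs if same_topic_pairs else 0.0
    p_fa = false_alarms / diff_topic_pairs if diff_topic_pairs else 0.0
    return p_miss, p_fa


# Hypothetical toy example: story s3 is placed in the wrong cluster.
reference = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
system = {"s1": 0, "s2": 0, "s3": 0, "s4": 1}
print(link_style_errors(reference, system))
```

Because the comparison is made pair by pair, no mapping from reference topics to system clusters is needed, which is the source of instability named on slide 24.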

  26. Link-Based vs. Topic Detection Metrics: Parameter Optimization Sweep • System 1: 62K test stories, 98 topics • System 2: 27K test stories, 31 topics • The link curve is less erratic for System 1 • The link curve is higher: what does this mean?

  27. What can be learned? • Are all the experimental controls necessary? • Tracking performance degrades 50% going from manual to automatic transcription, and an additional 50% going to automatic boundaries • Cross-language issues still not solved • Most systems used only the required deferral period • Progress was modest: did the lack of a new evaluation corpus impede research?

  28. Summary • TDT Evaluation Overview • 2001 TDT Evaluation Results • Evaluating Topic Detection with the Link-based metric is feasible, but inconclusive • The TDT3 corpus annotations are now public!
