
Creating the Annotated TDT-4 Y2003 Evaluation Corpus


Presentation Transcript


  1. Creating the Annotated TDT-4 Y2003 Evaluation Corpus
  Stephanie Strassel, Meghan Glenn
  Linguistic Data Consortium - University of Pennsylvania
  {strassel, mlglenn}@ldc.upenn.edu

  2. Data Collection/Preparation
  • Collection
    • Multiple sources, languages
    • October 2000 – July 2001
  • TDT-4 Corpus V1.0
    • Arabic, Chinese, English only
    • October 2000 – January 2001
  • Collection subsampled for annotation
    • Goal: reduce licensing, transcription, and segmentation costs
    • Broadcast sources: select 4 of 7 or 3 of 5 days, staggering the selection to maximize coverage by day (see the subsampling sketch after this slide)
    • Newswire sources: sampling consistent with previous years
    • No down-sampling of Arabic newswire
  • Reference transcripts
    • Closed-caption text where available
    • Commercial transcription agencies otherwise
    • Spell-check names in English commercial transcripts
    • Provide initial story boundaries & timestamps
  • ASR output & machine translation
  • TDT-4 Corpus V1.1
    • Incorporates patches to Mandarin ASR data to fix encoding; removes empty files
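The staggered broadcast subsampling above can be pictured with a short sketch. This is a minimal illustration, assuming a simple fixed weekly cycle and an offset per source; the function and source names are hypothetical and this is not the LDC's actual selection code.

```python
# Illustrative sketch: keep 4 of every 7 broadcast days per source,
# offsetting each source so that retained days are staggered and every
# calendar day is still covered by some source.
from datetime import date, timedelta

def subsample_days(start: date, end: date, source_index: int,
                   keep: int = 4, cycle: int = 7) -> list[date]:
    """Return the days retained for one broadcast source."""
    kept = []
    day = start
    while day <= end:
        # Offset each source by its index so retained days are staggered.
        position = ((day - start).days + source_index) % cycle
        if position < keep:
            kept.append(day)
        day += timedelta(days=1)
    return kept

# Example: three hypothetical sources over the first two weeks of the epoch.
for i, name in enumerate(["source_a", "source_b", "source_c"]):
    days = subsample_days(date(2000, 10, 1), date(2000, 10, 14), i)
    print(name, [d.isoformat() for d in days])
```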

  3. TDT-4 Corpus Overview

  4. TDT Concepts
  • STORY
    • In TDT2, a story is “a section containing at least two independent declarative clauses on same topic”
    • In TDT3, the definition was modified to capture annotators’ intuitions about what constitutes a story
      • Distinction between a “preview/teaser” and a complete news story
    • TDT4 preserves this content-based story definition
      • Greater emphasis on consistent application of the story definition across the annotation crew
  • EVENT
    • A specific thing that happens at a specific time and place, along with all necessary preconditions and unavoidable consequences
  • TOPIC
    • An event or activity along with all directly related events and activities

  5. Topics for 2003
  • 40 new topics selected, defined, and annotated for the 2003 evaluation
    • 20 from Arabic seed stories
    • 10 each from Mandarin and English
  • Topic selection strategy same as in 2002
  • Arabic topics are somewhat different
    • Despite the same selection strategy
    • First time we’ve had Arabic seed stories
  • The “topic well” is running dry
    • 80 news topics with a high likelihood of cross-language hits from a 4-month span!

  6. Selection Strategy
  • Team leaders examine a randomly-selected seed story
    • Potential seeds balanced across the corpus (source/date/language)
  • Identify the TDT-style seminal event within the story
  • Apply a rule of interpretation to convert the event to a topic
    • 13 rules state, for each type of seminal event, what other types of events should be considered related
  • No requirement that selected topics have cross-language hits
    • But team leaders use knowledge of the corpus to select stories likely to produce hits in other-language sources
  • Handful of “easily confusable” topics

  7. Rules of Interpretation
  1. Elections, e.g. 30030: Taipei Mayoral Elections
     • Seminal events include: a specific political campaign, election day coverage, inauguration, voter turnouts, election results, protests, reaction.
     • Topic includes: the entire process, from announcements of a candidate's intention to run, through the campaign, nominations, and the election process, and through the inauguration and formation of a newly-elected official's cabinet or government.
  2. Scandals/Hearings, e.g. 30038: Olympic Bribery Scandal
  3. Legal/Criminal Cases, e.g. 30003: Pinochet Trial
  4. Natural Disasters, e.g. 30002: Hurricane Mitch
  5. Accidents, e.g. 30014: Nigerian Gas Line Fire
  6. Acts of Violence or War, e.g. 30034: Indonesia/East Timor Conflict
  7. Science and Discovery News, e.g. 31019: AIDS Vaccine Testing Begins
  8. Financial News, e.g. 30033: Euro Introduced
  9. New Laws, e.g. 30009: Anti-Doping Proposals
  10. Sports News, e.g. 31016: ATP Tennis Tournament
  11. Political and Diplomatic Meetings, e.g. 30018: Tony Blair Visits China
  12. Celebrity/Human Interest News, e.g. 31036: Joe DiMaggio Illness
  13. Miscellaneous News, e.g. 31024: South Africa to Buy $5 Billion in Weapons

  8. Topic Research
  • Provides context
  • Annotators specialize in particular topics (of their choosing)
  • Includes timelines, maps, keywords, named entities, and links to online resources for each topic
  • Feeds into annotation queries

  9. Topic Definition
  • Fixed format to enhance consistency
  • Seminal event lists the basic facts – who/what/when/where
  • Topic explication spells out the scope of the topic and potential difficulties
  • Rule of interpretation link
  • Link to additional resources
  • Feeds directly into topic annotation
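The fixed-format topic definition can be sketched as a simple record. This is only an illustrative schema assuming the fields listed above; the field and class names, and the example values, are not the LDC's actual file format.

```python
# Hypothetical record mirroring the fixed-format topic definition fields
# described on this slide; names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class TopicDefinition:
    topic_id: int                    # e.g. 30030
    title: str                       # e.g. "Taipei Mayoral Elections"
    seminal_event: dict              # basic facts: who / what / when / where
    explication: str                 # scope of the topic and known difficulties
    rule_of_interpretation: int      # which of the 13 rules applies
    resources: list[str] = field(default_factory=list)  # links to topic research
```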

  10. Annotation Strategy
  • Overview
    • Search-guided complete annotation
    • Work with one topic at a time
    • Multiple stages for each topic; multiple iterations of each stage
    • Two-way topic labeling decision
  • Topic Labels
    • YES: story discusses the topic in a substantial way
    • NO: story does not discuss the topic at all, or only mentions the topic in passing without giving any information about the topic
    • No BRIEF in TDT-4
    • “Not Easy” label for tricky decisions
      • Triggers additional QC
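As a rough data sketch of the two-way labeling decision plus the "Not Easy" flag: the label meanings below follow the slide, while the class and field names are illustrative assumptions.

```python
# Minimal sketch of a per-story, per-topic judgment under the two-way
# labeling scheme; names are hypothetical, not an LDC data format.
from dataclasses import dataclass
from enum import Enum

class TopicLabel(Enum):
    YES = "discusses the topic in a substantial way"
    NO = "does not discuss the topic, or mentions it only in passing"

@dataclass
class Judgment:
    story_id: str
    topic_id: int
    label: TopicLabel
    not_easy: bool = False   # flags tricky decisions for additional QC
```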

  11. Annotation Search Stages
  • Stage 1: Initial query
    • Submit the seed story or keywords as a query to the search engine
    • Read through the resulting relevance-ranked list
    • Label each story as YES/NO
    • Stop after finding 5-10 on-topic stories, or after reaching the “off-topic threshold” (a small sketch of this stopping rule follows the slide):
      • At least 2 off-topic stories have been read for every 1 on-topic story, AND
      • The last 10 consecutive stories are off-topic
  • Stage 2: Improved query using on-topic stories from Stage 1
    • Issue a new query using the concatenation of all known on-topic stories
    • Read and annotate stories in the resulting relevance-ranked list until reaching the off-topic threshold
  • Stage 3: Text-based queries
    • Issue a new query drawn from the topic research & topic definition documents plus any additional relevant text
    • Read and annotate stories in the resulting relevance-ranked list until reaching the off-topic threshold
  • Stage 4: Creative searching
    • Annotators are instructed to use specialized knowledge and think creatively to find novel ways to identify additional on-topic stories
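The off-topic threshold is a simple stopping rule over the sequence of YES/NO decisions made so far. A minimal sketch, assuming labels arrive as a boolean list (True = on-topic); the function and variable names are illustrative.

```python
# Sketch of the "off-topic threshold": stop reading down the ranked list
# once (a) at least two off-topic stories have been read for every on-topic
# story AND (b) the last ten consecutive stories were all off-topic.
def reached_off_topic_threshold(labels: list[bool]) -> bool:
    """labels = YES/NO decisions so far, in reading order (True = on-topic)."""
    on_topic = sum(labels)
    off_topic = len(labels) - on_topic
    ratio_met = off_topic >= 2 * on_topic
    last_ten_off = len(labels) >= 10 and not any(labels[-10:])
    return ratio_met and last_ten_off
```

In Stage 1 this check is combined with the "5-10 on-topic stories found" condition; in Stages 2 and 3 it is the sole stopping criterion.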

  12. Additional Annotation & QC
  • Top-Ranked Off-Topic Stories (TROTS)
    • Define the search epoch: first 4 on-topic stories, chronologically sorted
    • Find two highly-ranked off-topic documents for each topic-language pair
  • Precision
    • All on-topic (YES) stories reviewed by a senior annotator to identify false alarms
    • All “not easy” off-topic stories reviewed
  • Adjudication
    • Review pooled site results and adjudicate cases of disagreement with LDC annotators’ judgments
    • Pooled 3 sites’ tracking results
    • Reviewed all purported LDC false alarms
    • For purported LDC misses (a selection sketch follows the slide):
      • English and Arabic: reviewed cases where all 3 sites disagreed with LDC
      • Mandarin: reviewed cases where 2 or more sites disagreed with LDC
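The miss-review rule above can be expressed as a small filter over the pooled results. This is a sketch only; the data layout and function name are assumptions, and "3 sites" reflects the pooling described on the slide.

```python
# Sketch of selecting purported LDC misses for adjudication:
# English/Arabic cases need all 3 sites to disagree with LDC,
# Mandarin cases need 2 or more.
def misses_to_review(purported_misses: list[dict]) -> list[dict]:
    """Each dict: {'language': str, 'sites_disagreeing': int}  (out of 3 sites)."""
    selected = []
    for case in purported_misses:
        threshold = 2 if case["language"] == "Mandarin" else 3
        if case["sites_disagreeing"] >= threshold:
            selected.append(case)
    return selected
```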
