Creating the Annotated TDT-4 Y2003 Evaluation Corpus
Stephanie Strassel, Meghan Glenn
Linguistic Data Consortium - University of Pennsylvania
{strassel, mlglenn}@ldc.upenn.edu
Data Collection/Preparation
• Collection
  • Multiple sources, languages
  • October 2000 – July 2001
• TDT-4 Corpus V1.0
  • Arabic, Chinese, English only
  • October 2000 – January 2001
• Collection subsampled for annotation
  • Goal: reduce licensing, transcription and segmentation costs
  • Broadcast sources: select 4 of 7 or 3 of 5 days; stagger selection to maximize coverage by day
  • Newswire sources: sampling consistent with previous years
  • No down-sampling of Arabic newswire
• Reference transcripts
  • Closed-caption text where available
  • Commercial transcription agencies otherwise
  • Spell-check names for English commercial transcripts
  • Provide initial story boundaries & timestamps
• ASR output & machine translation
• TDT-4 Corpus V1.1
  • Incorporates patches to Mandarin ASR data to fix encoding; removes empty files
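The staggered day selection for broadcast sources could be sketched as below. This is a minimal illustration, not the LDC's actual procedure: the function name `stagger_days`, the per-source rotation of starting offsets, and the parameter defaults are all assumptions chosen to show how staggering spreads the selected days across sources so that, collectively, more days of the week are covered.

```python
from itertools import cycle

def stagger_days(sources, days_per_week=7, keep=4):
    """Hypothetical sketch of staggered subsampling: each source keeps
    `keep` consecutive days per week, but the starting day rotates from
    one source to the next so coverage by day is maximized overall."""
    schedule = {}
    offsets = cycle(range(days_per_week))
    for src in sources:
        start = next(offsets)  # rotate the starting day per source
        schedule[src] = sorted((start + i) % days_per_week
                               for i in range(keep))
    return schedule
```

With three sources, each individual source keeps only 4 of 7 days, but the union of their schedules spans six distinct days rather than the same four.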
TDT Concepts
• STORY
  • In TDT-2, a story is "a section containing at least two independent declarative clauses on the same topic"
  • In TDT-3, the definition was modified to capture annotators' intuitions about what constitutes a story
    • Distinction between "preview/teaser" and complete news story
  • TDT-4 preserves this content-based story definition
    • Greater emphasis on consistent application of the story definition across the annotation crew
• EVENT
  • A specific thing that happens at a specific time and place, along with all necessary preconditions and unavoidable consequences
• TOPIC
  • An event or activity, along with all directly related events and activities
Topics for 2003
• 40 new topics selected, defined, and annotated for the 2003 evaluation
  • 20 from Arabic seed stories
  • 10 each from Mandarin, English
• Topic selection strategy same as in 2002
• Arabic topics are somewhat different
  • Despite the same selection strategy
  • First time we've had Arabic seed stories
• The "topic well" is running dry
  • 80 news topics with a high likelihood of cross-language hits, drawn from a 4-month span!
Selection Strategy
• Team leaders examine a randomly selected seed story
  • Potential seeds balanced across the corpus (source/date/language)
• Identify a TDT-style seminal event within the story
• Apply a rule of interpretation to convert the event to a topic
  • 13 rules state, for each type of seminal event, which other types of events should be considered related
• No requirement that selected topics have cross-language hits
  • But team leaders use knowledge of the corpus to select stories likely to produce hits in other-language sources
• Handful of "easily confusable" topics
Rules of Interpretation
1. Elections, e.g. 30030: Taipei Mayoral Elections
   • Seminal events include: a specific political campaign, election day coverage, inauguration, voter turnouts, election results, protests, reaction.
   • Topic includes: the entire process, from announcements of a candidate's intention to run through the campaign, nominations, and election process, and through the inauguration and formation of a newly elected official's cabinet or government.
2. Scandals/Hearings, e.g. 30038: Olympic Bribery Scandal
3. Legal/Criminal Cases, e.g. 30003: Pinochet Trial
4. Natural Disasters, e.g. 30002: Hurricane Mitch
5. Accidents, e.g. 30014: Nigerian Gas Line Fire
6. Acts of Violence or War, e.g. 30034: Indonesia/East Timor Conflict
7. Science and Discovery News, e.g. 31019: AIDS Vaccine Testing Begins
8. Financial News, e.g. 30033: Euro Introduced
9. New Laws, e.g. 30009: Anti-Doping Proposals
10. Sports News, e.g. 31016: ATP Tennis Tournament
11. Political and Diplomatic Meetings, e.g. 30018: Tony Blair Visits China
12. Celebrity/Human Interest News, e.g. 31036: Joe DiMaggio Illness
13. Miscellaneous News, e.g. 31024: South Africa to Buy $5 Billion in Weapons
Topic Research
• Provides context
• Annotators specialize in particular topics (of their choosing)
• Includes timelines, maps, keywords, named entities, and links to online resources for each topic
• Feeds into annotation queries
Topic Definition
• Fixed format to enhance consistency
• Seminal event lists basic facts: who/what/when/where
• Topic explication spells out the scope of the topic and potential difficulties
• Rule of interpretation link
• Link to additional resources
• Feeds directly into topic annotation
Annotation Strategy
• Overview
  • Search-guided complete annotation
  • Work with one topic at a time
  • Multiple stages for each topic; multiple iterations of each stage
  • Two-way topic labeling decision
• Topic Labels
  • YES: story discusses the topic in a substantial way
  • NO: story does not discuss the topic at all, or only mentions it in passing without giving any information about it
  • No BRIEF label in TDT-4
  • "Not easy" label for tricky decisions
    • Triggers additional QC
Annotation Search Stages
• Stage 1: Initial query
  • Submit seed story or keywords as a query to the search engine
  • Read through the resulting relevance-ranked list
  • Label each story YES/NO
  • Stop after finding 5–10 on-topic stories, or after reaching the "off-topic threshold":
    • At least 2 off-topic stories read for every 1 on-topic story, AND
    • The last 10 consecutive stories are off-topic
• Stage 2: Improved query using on-topic stories from Stage 1
  • Issue a new query using the concatenation of all known on-topic stories
  • Read and annotate stories in the resulting relevance-ranked list until reaching the off-topic threshold
• Stage 3: Text-based queries
  • Issue a new query drawn from the topic research & topic definition documents, plus any additional relevant text
  • Read and annotate stories in the resulting relevance-ranked list until reaching the off-topic threshold
• Stage 4: Creative searching
  • Annotators are instructed to use specialized knowledge and think creatively to find novel ways to identify additional on-topic stories
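The two-part stopping rule used in Stages 1–3 can be expressed as a small predicate. This is a sketch of the rule as stated on the slide, assuming labels arrive in ranked order; the function name and representation of judgments are illustrative, not part of the LDC tooling.

```python
def reached_off_topic_threshold(labels):
    """labels: YES/NO judgments in the order the ranked list was read.

    The off-topic threshold (per the stage definitions) holds when:
      1. at least 2 off-topic (NO) stories have been read for every
         1 on-topic (YES) story, AND
      2. the last 10 consecutive judgments are all off-topic.
    """
    n_yes = labels.count("YES")
    n_no = labels.count("NO")
    ratio_met = n_no >= 2 * n_yes
    tail_met = len(labels) >= 10 and all(l == "NO" for l in labels[-10:])
    return ratio_met and tail_met
```

Note that both conditions must hold: a long run of off-topic stories alone does not stop the search if on-topic stories are still being found at a high rate.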
Additional Annotation & QC
• Top-Ranked Off-Topic Stories (TROTS)
  • Define search epoch: the first 4 on-topic stories, chronologically sorted
  • Find two highly ranked off-topic documents for each topic-language pair
• Precision
  • All on-topic (YES) stories reviewed by a senior annotator to identify false alarms
  • All "not easy" off-topic stories reviewed
• Adjudication
  • Review pooled site results and adjudicate cases of disagreement with LDC annotators' judgments
    • Pooled 3 sites' tracking results
    • Reviewed all purported LDC false alarms
    • For purported LDC misses:
      • English and Arabic: reviewed cases where all 3 sites disagreed with LDC
      • Mandarin: reviewed cases where 2 or more sites disagreed with LDC
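The language-dependent trigger for reviewing purported LDC misses could be sketched as follows. The function and its argument shapes are hypothetical; only the decision thresholds (all 3 sites for English/Arabic, 2 or more for Mandarin) come from the slide.

```python
def needs_miss_review(language, site_says_on_topic):
    """Hypothetical sketch of the miss-adjudication trigger.

    site_says_on_topic: one boolean per pooled site, True where that
    site marked the story on-topic while LDC marked it off-topic.
    """
    disagreements = sum(site_says_on_topic)
    if language in ("English", "Arabic"):
        # Review only when all 3 sites disagree with LDC
        return disagreements == len(site_says_on_topic)
    if language == "Mandarin":
        # Review when 2 or more sites disagree with LDC
        return disagreements >= 2
    raise ValueError(f"unexpected language: {language}")
```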