EARS STT Workshop at ICASSP
Christopher Cieri, Mohamed Maamouri, Shudong Huang, James Fiumara, Stephanie Strassel, David Graff, Kevin Walker, Mark Liberman
{ccieri, maamouri, shudong, jfiumara, strassel, graff, walkerk, myl}@ldc.upenn.edu
What Happens Next?
• Collect feedback here
• Check feasibility of new ideas
  • e.g. availability of BN (tran)scripts
• Estimate cost and timeline for wish list
• Sponsors allocate funds
• EARS Board revises priorities
• Re-estimate cost and timeline for task list
• Communicate final plan
• “Start”
What Happened Next?
• Feedback was generally favorable
• Next day, learned of 3-month projects
• Received 25% funding
• Prepared utility thresholds
• Learned of TIDES/EARS end
• Learned that GALE ≠ TIDES + EARS
• Completed existing commitments
  • STT test sets (MT test set)
  • CTS collections
• Adjusted focus to GALE preparation
Broadcast News
• Continue 2004 collection
  • >2000h English: VOA, NBC/MSNBC, CNN, ABC, PBS, PRI, WB17
  • >1000h Chinese: VOA, CCTV, Radio Free Asia (RFA), NTDTV, Tai Yuan
  • >1000h Arabic: VOA, Al Hurra, Al Jazeera, Dubai, Jordan TV, LBC, Nile
• Select 2005 evaluation set, then distribute 2004 data (February 2005)
  • delivery made after eval set picked
• 2005 collection: same sources and volumes
  • add semi-automatic language, source, and program ID to the QC process (see the sketch after this list)
  • harvest (tran)scripts where possible
  • 100 hours of transcribed Chinese BN (commercial, QTr)
  • 100 hours of transcribed Arabic BN (commercial, QTr)
  • collect broadcast conversations: audio and (tran)scripts
• Continue IPR negotiations
• Contribute to experiments
  • Utility of careful vs. commercial vs. QTr vs. CC vs. ROVERized ASR transcripts
• Update pronouncing lexicons with vocabulary from English, Chinese, Arabic
• Continue collection with sources adjusted for GALE
  • Greater focus on broadcast conversation
  • Total: 62.5 hrs/week Arabic, 60 hrs/week Chinese, 75 hrs/week English
  • BC: 2.5 hrs/week Arabic, 15 hrs/week Chinese, 25 hrs/week English
• Acquired IPR for several new programs: 100% of English, 50% of Arabic and Chinese
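A minimal sketch of the semi-automatic language/source/program-ID tagging step added to the QC process, assuming each captured recording carries its source station and that language can often be inferred from the source. All names here (Recording, SOURCE_LANGUAGE, qc_tag) are illustrative assumptions, not LDC code.

```python
# Hypothetical QC tagging step: infer language from the source where possible,
# and flag the recording for manual review when any field cannot be filled in.
from dataclasses import dataclass
from typing import Optional

# Partial source-to-language map drawn from the sources listed above.
SOURCE_LANGUAGE = {
    "VOA": None,            # VOA broadcasts in all three languages; needs manual ID
    "CNN": "English",
    "CCTV": "Chinese",
    "Al Jazeera": "Arabic",
}

@dataclass
class Recording:
    filename: str
    source: str
    program: Optional[str] = None
    language: Optional[str] = None
    needs_review: bool = False

def qc_tag(rec: Recording) -> Recording:
    """Fill in the language where it can be inferred; flag anything incomplete."""
    if rec.language is None:
        rec.language = SOURCE_LANGUAGE.get(rec.source)
    rec.needs_review = rec.language is None or rec.program is None
    return rec

if __name__ == "__main__":
    r = qc_tag(Recording("20050217_1800.wav", source="CCTV", program="Xinwen Lianbo"))
    print(r)   # language inferred as Chinese, needs_review=False
```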
English CTS
• Volume: complement the 2003 collection to provide another 1400 hours (was 850), with subjects making 1-20 ten-minute calls
• Used November 2003 topics
• BBN/WordWave doing transcription
• Complete collection of 1400 hours
• Finalize evaluation set
• Distribute beginning in December as transcripts are ready
• 1400 hours sent to BBN/WordWave for transcription
• 450 hours distributed to sites February 17
Chinese CTS
• New collection at HKUST
  • Target: 200 hours transcribed, gender balanced, regions represented
  • Transcription based on RT03
  • 150 hours delivered to LDC so far
  • regions not balanced across delivery increments
• Select 2005 evaluation & dev/test sets
  • to control demographics across train/test sets (see the sketch after this list)
• Deliver training data once the final increment has arrived and evaluation data has been extracted
• Repeat collection in 2005
  • require gender, age, and regional balance across the collection epoch
  • require word segmentation?
  • Build portable platform?
• HKUST finished collection of 150 hours of CTS
  • ready for release once test set is extracted
  • will deliver 50 more hours at end of March
  • will collect & transcribe another 50 hours through June
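A minimal sketch of demographic-controlled test-set selection of the kind described above, assuming each call is labeled with the speaker's gender and region (the field names and the stratified-sampling approach are assumptions for illustration): calls are bucketed by (gender, region) and the evaluation set is drawn proportionally from every bucket, so train and test share one demographic profile.

```python
import random
from collections import defaultdict

def stratified_eval_split(calls, eval_fraction=0.1, seed=0):
    """calls: list of dicts with 'gender' and 'region' keys (hypothetical schema).
    Returns (train, eval_set) with the same demographic mix in each."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    buckets = defaultdict(list)
    for call in calls:
        buckets[(call["gender"], call["region"])].append(call)
    train, eval_set = [], []
    for _, group in sorted(buckets.items()):
        rng.shuffle(group)
        k = max(1, round(len(group) * eval_fraction))  # at least one per stratum
        eval_set.extend(group[:k])
        train.extend(group[k:])
    return train, eval_set
```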
Arabic CTS
• Fisher protocol, platform in US
• Select 2005 evaluation set from current collection
• Continue collection until current pool is exhausted
• Complete audit and transcription; deliver in December
• Add ‘yellow’ tier (surface phonemic) transcription
• Build portable platform? Begin new dialect?
• Demographics changed since last test sets created
  • new dev/test as well as eval set required
• Finished 50 hours of Levantine Arabic CTS
  • Released 01/15/2005 as LDC2005S07 & LDC2005T03
• 50 more hours of Levantine due March 31, 2005
• 85 hours scheduled June 30, 2005 ???
• Yellow-layer transcription of 15h underway
• RT rates improving: 8-10xRT on the green layer, 15xRT on yellow (assuming the green layer is already done)
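For readers unfamiliar with the notation: an RxRT transcription rate means roughly R person-hours of annotator effort per hour of audio, so labor cost is audio hours times the RT factor. A trivial worked calculation (the specific pairings of hours to rates below are illustrative, not reported figures):

```python
def labor_hours(audio_hours: float, rt_factor: float) -> float:
    """Person-hours implied by an RxRT transcription rate."""
    return audio_hours * rt_factor

print(labor_hours(15, 15))   # the 15h yellow-layer pass at 15xRT: 225 person-hours
print(labor_hours(50, 9))    # 50h of green at the ~9xRT midpoint: ~450 person-hours
```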
STT Test Sets
• None
MDE
• Ported English specification v6.2 to Chinese and Arabic
• Created MDE v7 specification and tool for English
• Created Chinese and Arabic tools
• Created a small pilot data set in each language
• Distributed as LDC2004E47
GALE Preparation
• Created 13 new Fisher English topics designed to elicit ACE-worthy conversations
• Collected 500 conversations; manually selected 25% for transcription; these have been transcribed and are in the ACE annotation pipeline
• LDC staff read DLI DLPT material in Arabic
• LDC staff read WSJ articles
• In preparation for GALE, adding new source types
  • e-lists, blogs, chat, technical reports, GovDocs
• Built general-purpose speech annotation toolkit; ready April 1
Distribution Rules
• Most EARS sites are LDC members
• Those who are not hold data under an evaluation agreement
  • Requires return at end of program
• LDC will offer an extension; sites not part of GALE by June 2005 must return data then
  • Or non-member, non-GALE sites can keep data by becoming LDC members
• Exception: drive arrays of BN data must be returned by both members and non-members not involved in GALE
GALE-related Efforts
• Data scouting in English, Chinese, Arabic
• Exploring new domains
  • Broadcast conversation (roundtables, talk shows, call-ins)
  • Web text (blogs, newsgroups, chat, discussion forums)
• Defining best practices
  • Identifying, harvesting, formatting, licensing
• Researching more economical sources and methods
  • Transcripts, story segmentation
  • Annotation efficiencies
• Local infrastructure in place
  • Annotation toolkit
  • Annotation guidelines & web resources guide
• Scouting teams for English, Chinese
  • Arabic lagging
• Sharable version of tools and docs in progress
• To date:
  • English: 270 sites identified (16 topics)
  • Chinese: 57 sites identified (10 topics)
  • Arabic: 10 sites identified (3 topics)
  • All of these are now, or will soon be, in the ACE annotation pipeline
  • IPR secured under “fair use”
Process
• Use a search engine to find sites for each data type
• Minimum thresholds for each data type/subject
• Tool tallies good/bad sites identified; logs URLs and judgments to a database (see the sketch below)
• Categorize URLs as good or bad for TIDES-type annotation
  • “Bad” URLs are not revisited for a topic
• The top pane of the tool is a web browser
• The left pane shows a tally of the data types found for the annotator’s topic
• The bottom pane is where the annotator enters information for each site found: data type, title, and URL
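A minimal sketch of the scouting log described above, assuming a simple SQLite backing store. The table name, columns, and function names are illustrative assumptions, not the LDC schema; the point is the behavior the slide describes: each judged URL is recorded once per topic so "bad" URLs are never revisited, and per-data-type tallies can be read back for the tool's left-pane counter.

```python
import sqlite3

conn = sqlite3.connect("scouting.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS judgments (
        url       TEXT NOT NULL,
        topic     TEXT NOT NULL,
        data_type TEXT NOT NULL,    -- e.g. blog, newsgroup, discussion forum
        title     TEXT,
        good      INTEGER NOT NULL, -- 1 = usable for annotation, 0 = bad
        PRIMARY KEY (url, topic)    -- one judgment per URL per topic
    )""")

def already_judged(url: str, topic: str) -> bool:
    """True if this URL was already judged for this topic (so it is skipped)."""
    cur = conn.execute("SELECT 1 FROM judgments WHERE url=? AND topic=?",
                       (url, topic))
    return cur.fetchone() is not None

def log_judgment(url, topic, data_type, title, good):
    """Record the annotator's judgment; duplicates are silently ignored."""
    conn.execute("INSERT OR IGNORE INTO judgments VALUES (?,?,?,?,?)",
                 (url, topic, data_type, title, int(good)))
    conn.commit()

def tally(topic: str):
    """Per-data-type counts of good sites, mirroring the left-pane tally."""
    return dict(conn.execute(
        "SELECT data_type, COUNT(*) FROM judgments "
        "WHERE topic=? AND good=1 GROUP BY data_type", (topic,)))
```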
Up-to-the-minute updates: http://www.ldc.upenn.edu/Projects/GALE/Annotation/DataScouting/status.php