
What Happens Next?

EARS STT Workshop at ICASSP. Christopher Cieri, Mohamed Maamouri, Shudong Huang, James Fiumara, Stephanie Strassel, David Graff, Kevin Walker, Mark Liberman {ccieri,maamouri,shudong,jfiumara,strassel,graff,walkerk,myl@ldc.upenn.edu}


Presentation Transcript


  1. EARS STT Workshop at ICASSP Christopher Cieri, Mohamed Maamouri, Shudong Huang, James Fiumara, Stephanie Strassel, David Graff, Kevin Walker, Mark Liberman {ccieri,maamouri,shudong,jfiumara,strassel,graff,walkerk,myl@ldc.upenn.edu}

  2. What Happens Next? • Collect feedback here • Check feasibility of new ideas • e.g. availability of BN (tran)scripts • Estimate cost, timeline for wish list • Sponsors allocate funds • EARS Board revise priorities • Re-estimate cost, timeline for task list • Communicate final plan • “Start”

  3. What Happened Next? • Feedback was generally favorable • Next day learned of 3-month projects • Received 25% funding • Preparation of utility thresholds • Learned of TIDES/EARS end • Learned that GALE <> TIDES+EARS • Completed existing commitments • STT Test Sets (MT Test Set) • CTS Collections • Adjusted focus to GALE preparation

  4. Broadcast News • Continue 2004 collection • >2000h English: VOA, NBC/MSNBC, CNN, ABC, PBS, PRI, WB17 • >1000h Chinese: VOA, CCTV, Radio Free Asia (RFA), NTDTV, Tai Yuan • >1000h Arabic: VOA, Al Hurra, Al Jazeera, Dubai, Jordan TV, LBC, Nile • Select 2005 evaluation set then distribute 2004 data (February 2005) • delivery made after eval set picked • 2005 collection: same sources, volumes • add semi-automatic language, source, program ID to QC process • harvest (tran)scripts where possible • 100 hours of transcribed Chinese BN (commercial, QTr) • 100 hours of transcribed Arabic BN (commercial, QTr) • collect broadcast conversations: audio and (tran)scripts • Continue IPR negotiations • Contribute to experiments • Utility of Careful vs. Commercial vs. QTr. vs. CC. vs. Roverized ASR • Update pronouncing lexicons with vocab from English, Chinese, Arabic • Continue collection with sources adjusted for GALE • Greater focus on broadcast conversation • Total: 62.5 hrs/week of Arabic, 60 hrs/week of Chinese, 75 hrs/week of English • BC: 2.5 hours/week Arabic, 15 hours/week Chinese, 25 hours/week English • Acquired IPR for several new programs: 100% of English, 50% of Arabic, Chinese

  5. English CTS • Volume: complement 2003 collection to provide another 1400 hours (was 850) with subjects making 1-20 10-minute calls • Used November 2003 Topics • BBNT/WordWave doing transcription • Complete collection of 1400 hours • Finalize evaluation set • Distribute beginning in December as transcripts are ready • 1400 hours sent to BBN/WordWave for transcription • 450 hours distributed to sites February 17

  6. Chinese CTS • New Collection at HKUST • Target 200 hours transcribed, gender balance, regions represented • Transcription based upon RT03 • 150 hours delivered to LDC so far • regions not balanced across delivery increments • Select 2005 evaluation & dev/test sets • to control demographics across train/test sets • Deliver training data once final increment has arrived and evaluation data extracted • Repeat collection in 2005 • require gender, age, regional balance across collection epoch • require word segmentation? • Build portable platform? • HKUST finished collection of 150 hours of CTS • ready for release once test set extracted • will deliver 50 more hours at end of March • will collect & transcribe another 50 hours through June

  7. Arabic CTS • Fisher Protocol, platform in US • Select 2005 evaluation set from current collection • Continue collection until current pool sapped • Complete audit and transcription; deliver in December • Add ‘yellow’ tier (surface phonemic) transcription • Build portable platform? Begin new dialect? • Demographics changed since last test sets created • new Dev/Test as well as Eval set required • Finished 50 hours of Levantine Arabic CTS • Released on 01/15/2005 as LDC2005S07 & LDC2005T03 • 50 more hours of Levantine due March 31, 2005 • 85 hours scheduled June 30, 2005 ??? • Yellow layer transcription of 15h underway • RT rates improving: 8-10xRT on green, 15xRT yellow (assuming green)
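
The real-time (RT) rates quoted on this slide can be turned into rough labor estimates. The sketch below is illustrative only: it assumes "NxRT" means N hours of transcriber time per hour of audio, and it reads "(assuming green)" as meaning the yellow rate applies once a green transcript already exists. The function name `annotation_hours` is hypothetical and not part of any LDC tool.

```python
def annotation_hours(audio_hours: float, rt_factor: float) -> float:
    """Estimate transcriber hours, assuming rt_factor work hours per audio hour."""
    if audio_hours < 0 or rt_factor <= 0:
        raise ValueError("audio_hours must be >= 0 and rt_factor must be > 0")
    return audio_hours * rt_factor


# 15 h of audio at the yellow-layer rate quoted on the slide (15xRT).
yellow_effort = annotation_hours(15, 15)

# The same 15 h at the green-layer range quoted on the slide (8-10xRT).
green_low = annotation_hours(15, 8)
green_high = annotation_hours(15, 10)

print(f"yellow: {yellow_effort:.0f} h, green: {green_low:.0f}-{green_high:.0f} h")
```

Under these assumptions, the 15 hours of yellow-layer audio correspond to about 225 transcriber hours on top of the green pass.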

  8. STT Test Sets • None

  9. MDE • Ported English specification v6.2 to Chinese, Arabic • Created MDE v7 specification, tool for English • Created Chinese and Arabic tools • Created small pilot data set in each language • Distributed as: LDC2004E47

  10. GALE Preparation • Created 13 new Fisher English topics designed to elicit ACE-worthy conversations • Collected 500 conversations; manually selected 25% for transcription. ACE transcribed; are in ACE annotation pipeline • LDC staff read DLI DLPT material in Arabic • LDC staff read WSJ articles • In preparation for GALE, adding new source types • e-lists, blogs, chat, technical reports, GovDocs • Built general purpose speech annotation toolkit; ready April 1.

  11. Distribution Rules • Most EARS sites are LDC members • Those who are not have data under evaluation agreement • Require return at end of program • LDC will offer extension; sites not part of GALE by June 2005 must return data then • Or non-member, non-GALE sites can keep data by becoming LDC members • Exception: drive arrays of BN data. These must be returned by both members and non-members not involved in GALE

  12. GALE-related efforts • Data scouting in English, Chinese, Arabic • Exploring new domains • Broadcast conversation (roundtable, talk shows, call-ins) • Web text (blogs, newsgroups, chat, discussion forums) • Defining best practices • Identifying, Harvesting, Formatting, Licensing • Researching more economical sources, methods • Transcripts, story segmentation • Annotation efficiencies • Local infrastructure in place • Annotation toolkit • Annotation guidelines & web resources guide • Scouting teams for English, Chinese • Arabic lagging • Sharable version of tools, docs in progress • To date, • English: 270 sites identified (16 topics) • Chinese: 57 sites identified (10 topics) • Arabic: 10 sites identified (3 topics) • All of these now/soon in ACE annotation pipeline • IPR secured under “fair use”

  13. Documentation

  14. Process • Use search engine to find sites for each type • Minimum thresholds for each data type/subject • Tool tallies good/bad sites identified; logs URLs/judgments to DB • Categorize URLs as good or bad for TIDES-type annotation • “Bad” URLs are not revisited for a topic • The top pane of the tool is occupied by a web browser. • The left side of the web scouting tool shows a tally of the data types found for the annotator’s topic. • The bottom pane of the tool is a window where the annotator inputs information, including data type, title, and URL, for each site that he finds.
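
The bookkeeping described on this slide (log each URL with a good/bad judgment, tally data types per topic, skip URLs already judged bad for a topic, and check minimum thresholds) can be sketched as follows. This is not the LDC scouting tool: the class name `ScoutingLog`, the table layout, the example URLs, and the threshold values are all assumptions made for illustration, with SQLite standing in for whatever database the real tool used.

```python
import sqlite3
from collections import Counter


class ScoutingLog:
    """Hypothetical log of scouting judgments, stored in SQLite."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        # One judgment per (topic, url); the schema is an assumption.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS judgments ("
            "topic TEXT, url TEXT, data_type TEXT, title TEXT, judgment TEXT, "
            "PRIMARY KEY (topic, url))"
        )

    def should_visit(self, topic, url):
        """Return False if the URL was already judged bad for this topic."""
        row = self.db.execute(
            "SELECT judgment FROM judgments WHERE topic = ? AND url = ?",
            (topic, url),
        ).fetchone()
        return row is None or row[0] != "bad"

    def record(self, topic, url, data_type, title, judgment):
        """Store the annotator's entry for one site."""
        if judgment not in ("good", "bad"):
            raise ValueError("judgment must be 'good' or 'bad'")
        self.db.execute(
            "INSERT OR REPLACE INTO judgments VALUES (?, ?, ?, ?, ?)",
            (topic, url, data_type, title, judgment),
        )
        self.db.commit()

    def tally(self, topic):
        """Count good sites per data type for a topic."""
        rows = self.db.execute(
            "SELECT data_type FROM judgments "
            "WHERE topic = ? AND judgment = 'good'",
            (topic,),
        )
        return Counter(row[0] for row in rows)

    def unmet_thresholds(self, topic, minimums):
        """Return how many more good sites each data type still needs."""
        counts = self.tally(topic)
        return {
            data_type: need - counts[data_type]
            for data_type, need in minimums.items()
            if counts[data_type] < need
        }


log = ScoutingLog()
log.record("elections", "http://example.org/a", "blog", "Site A", "good")
log.record("elections", "http://example.org/b", "chat", "Site B", "bad")
print(log.should_visit("elections", "http://example.org/b"))
print(log.tally("elections"))
print(log.unmet_thresholds("elections", {"blog": 2, "chat": 1}))
```

Keying the table on (topic, url) reflects the slide's rule that a bad URL is excluded per topic, so the same URL can still be visited for a different topic.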

  15. Up-to-minute updates http://www.ldc.upenn.edu/Projects/GALE/Annotation/DataScouting/status.php
