Articulatory Feature-based Speech Recognition: A Proposal for the 2006 JHU Summer Workshop on Language Engineering
November 12, 2005

Presentation Transcript


  1. Articulatory Feature-based Speech Recognition: A Proposal for the 2006 JHU Summer Workshop on Language Engineering
     November 12, 2005
     [Figure: example feature tracks for TT-LOC, TB-OPEN, TT-OPEN, VELUM, LIP-OP, GLOTTIS]
     Potential team members to date: Karen Livescu (presenter), Simon King, Florian Metze, Jeff Bilmes, Mark Hasegawa-Johnson, Ozgur Cetin, Kate Saenko
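For concreteness, the feature set named in the figure can be written out as a small inventory. The value sets below are illustrative, loosely following articulatory-phonology-style vocal tract variables; they are an assumption for this sketch, not the inventory fixed by the proposal.

```python
# Hypothetical articulatory feature inventory (value sets are illustrative
# assumptions; the proposal itself does not fix them in this transcript).
FEATURES = {
    "LIP-OP":  ["closed", "critical", "narrow", "wide"],   # lip opening degree
    "TT-LOC":  ["dental", "alveolar", "palato-alveolar"],  # tongue tip location
    "TT-OPEN": ["closed", "critical", "narrow", "wide"],   # tongue tip opening
    "TB-OPEN": ["closed", "critical", "narrow", "wide"],   # tongue body opening
    "VELUM":   ["closed", "open"],                         # nasality
    "GLOTTIS": ["voiced", "voiceless", "aspirated"],       # voicing state
}

if __name__ == "__main__":
    for name, values in FEATURES.items():
        print(f"{name}: {values}")
```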

  2. Dynamic Bayesian network implementation: the context-independent case
     Example DBN with 3 features:
     [Figure: DBN graph; each feature stream has a sub-word index variable ind_i whose state sequence is given by baseform pronunciations]
     Asynchrony between streams 1 and 2 is governed by Pr(async_{1;2} = a) = Pr(|ind_1 - ind_2| = a), with an example probability table (rows and columns index degrees of asynchrony):

              0    1    2    3    4   ...
         0   .7   .2   .1    0    0   ...
         1    0   .7   .2   .1    0   ...
         2    0    0   .7   .2   .1   ...
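The asynchrony mechanism on this slide can be sketched in a few lines of Python. The sketch below assumes the banded table is read as a conditional distribution Pr(next degree of asynchrony | current degree); that reading, the variable names, and the scoring function are all illustrative, since the actual model would be built in a DBN toolkit such as GMTK.

```python
import numpy as np

# Slide 2's banded table, read here (an assumption) as
# Pr(next degree of asynchrony | current degree).
ASYNC_TABLE = np.array([
    [0.7, 0.2, 0.1, 0.0, 0.0],
    [0.0, 0.7, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.7, 0.2, 0.1],
])

def async_log_prob(ind1, ind2):
    """Log-probability of the asynchrony trajectory of two feature streams.

    ind1, ind2: per-frame sub-word position indices for each stream, given
    by its baseform pronunciation.  The degree of asynchrony at a frame is
    |ind_1 - ind_2|.
    """
    degrees = [abs(a - b) for a, b in zip(ind1, ind2)]
    logp = 0.0
    for prev, cur in zip(degrees, degrees[1:]):
        if prev >= ASYNC_TABLE.shape[0] or cur >= ASYNC_TABLE.shape[1]:
            return float("-inf")  # asynchrony beyond the table is disallowed
        p = ASYNC_TABLE[prev, cur]
        logp += np.log(p) if p > 0 else float("-inf")
    return logp

# Fully synchronous streams score higher than a pair where one stream lags.
print(async_log_prob([0, 1, 2, 3], [0, 1, 2, 3]))  # 3 * log(0.7)
print(async_log_prob([0, 1, 2, 3], [0, 0, 1, 2]))  # log(0.2) + 2 * log(0.7)
```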

  3. Combination of articulatory-phonology coarticulation modeling with IPA feature-based acoustic modeling (deterministic mapping)
     [Figure: DBN fragment, with "(rest of model)" above; per-feature states S_GLOT, S_LIP-OPEN, S_TT-OPEN, S_TB-OPEN map deterministically to IPA-style features (voiced, sonorant), observed through soft-evidence variables SE_voiced = 1, SE_sonorant = 1]
     P(SE_sonorant = 1 | sonorant) = P_SVM(acoustics | sonorant)
     • Suggests a potential work plan:
       • 1st half of workshop: sub-teams work in parallel on
         (1) Set of features and classifiers for the acoustic model, using only articulatory "ground truth" and acoustics
         (2) Aspects of hidden structure (asynchrony, substitutions, context dependency), using only articulatory "ground truth" and words
       • 2nd half of workshop: integrate the most successful methods from the 1st half
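The acoustic-model equation on this slide can be sketched as a per-feature SVM whose output is attached to the DBN as soft ("virtual") evidence. In the sketch below, the use of scikit-learn, the synthetic data, and the 39-dimensional MFCC-like features are all assumptions; note also that the slide writes the right-hand side as a likelihood P_SVM(acoustics | sonorant), while a Platt-scaled SVM yields a posterior, which plays the same role up to division by the class prior.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Fake acoustic frames: 39-dim MFCC-like vectors with binary voicing labels
# (purely illustrative stand-ins for real training data).
X = rng.normal(size=(200, 39))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Platt scaling (probability=True) gives per-class posteriors from the SVM.
clf = SVC(kernel="rbf", probability=True).fit(X, y)

def soft_evidence_voiced(frame):
    """Return P(voiced = 1 | acoustics) for one frame, used as the weight on
    the soft-evidence observation SE_voiced = 1 in the DBN."""
    return clf.predict_proba(frame.reshape(1, -1))[0, 1]

print(soft_evidence_voiced(X[0]))
```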

  4. Resources
     • Tools
       • GMTK
       • HTK
       • Intel AVCSR toolkit
     • Data
       • Audio-only:
         • Svitchboard (CSTR, Edinburgh): small-vocab, continuous, conversational
         • PhoneBook: medium-vocab, isolated-word, read
         • (Switchboard rescoring? LVCSR)
       • Audio-visual:
         • AVTIMIT (MIT): medium-vocab, continuous, read, added noise
         • Digit strings database (MIT): continuous, read, naturalistic setting (noise and video background)
         • AVICAR (UIUC)
       • Articulatory measurements:
         • X-ray microbeam database (U. Wisconsin): many speakers, large-vocab, isolated-word and continuous
         • MOCHA (QMUC, Edinburgh): few speakers, medium-vocab, continuous
         • Others?
       • Manual transcriptions: ICSI Berkeley Switchboard transcription project

  5. Questions to address (soon)
     • Audio-only, audio-visual, or both?
       • Audio-only:
         • Better understood by current team members
         • Has more spontaneous speech data
       • Audio-visual:
         • Potentially, many more interesting phenomena in read data
         • Visual observations more closely tied to articulatory features
         • Smaller tasks → faster turnaround time → higher impact?
     • Can we reliably decouple investigation of acoustic modeling and pronunciation modeling?
     • Evaluation via measures other than word error rate:
       • Forced alignments
       • Articulatory tracking
       • Reasonableness of model parameters
     • (Multi-style ASR: train on slow speech, test on fast?)
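As one example of the non-WER evaluations listed above, articulatory tracking can be scored directly against measured articulator trajectories (e.g., X-ray microbeam pellet positions). A minimal sketch, with assumed array names and shapes:

```python
import numpy as np

def tracking_rmse(predicted, measured):
    """Per-articulator RMSE between predicted and measured trajectories.

    predicted, measured: arrays of shape (n_frames, n_articulators),
    assumed to be time-aligned (e.g., via forced alignment) beforehand.
    """
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return np.sqrt(np.mean((predicted - measured) ** 2, axis=0))

# Toy example: 100 frames, 2 articulator coordinates (e.g., tongue tip x/y).
rng = np.random.default_rng(1)
truth = np.cumsum(rng.normal(size=(100, 2)), axis=0)
pred = truth + rng.normal(scale=0.5, size=truth.shape)
print(tracking_rmse(pred, truth))  # roughly 0.5 per coordinate
```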
