Articulatory Feature-based Speech Recognition: A Proposal for the 2006 JHU Summer Workshop on Language Engineering
November 12, 2005
Potential team members to date: Karen Livescu (presenter), Simon King, Florian Metze, Jeff Bilmes, Mark Hasegawa-Johnson, Ozgur Cetin, Kate Saenko
[Figure: DBN over the articulatory features TT-LOC, TT-OPEN, TB-OPEN, VELUM, LIP-OP, GLOTTIS]
Dynamic Bayesian network implementation: The context-independent case
Example DBN with 3 features:
[Figure: DBN fragment with per-feature index (ind) and asynchrony (async) variables]
Pr(async^{1;2} = a) = Pr(|ind^1 - ind^2| = a); the ind variables are given by baseform pronunciations.

         0    1    2    3    4   …
    0   .7   .2   .1    0    0   …
    1    0   .7   .2   .1    0   …
    2    0    0   .7   .2   .1   …
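To make the asynchrony mechanism concrete, here is a minimal Python sketch (not the workshop's GMTK implementation) that scores a joint assignment of two feature streams' baseform indices under Pr(async^{1;2} = a) = Pr(|ind^1 - ind^2| = a). The probability table illustratively reuses the .7/.2/.1 pattern from the table above, and the example index paths are invented for illustration.

```python
# Minimal sketch of the slide's asynchrony model: the degree of asynchrony
# between two feature streams is the absolute difference of their indices
# into the baseform pronunciation, and Pr(async = a) is read from a table.
import math

# Hypothetical distribution over degrees of asynchrony a = |ind^1 - ind^2|;
# the .7/.2/.1 values echo the table above but are illustrative only.
P_ASYNC = {0: 0.7, 1: 0.2, 2: 0.1}

def async_prob(ind1: int, ind2: int) -> float:
    """Pr(async^{1;2} = a) = Pr(|ind^1 - ind^2| = a)."""
    return P_ASYNC.get(abs(ind1 - ind2), 0.0)

def score_index_paths(ind1_path, ind2_path):
    """Log-probability contribution of the per-frame asynchrony factors
    for a joint assignment of the two streams' index variables."""
    logp = 0.0
    for i1, i2 in zip(ind1_path, ind2_path):
        p = async_prob(i1, i2)
        if p == 0.0:
            return float("-inf")  # violates the asynchrony constraint
        logp += math.log(p)
    return logp

# Example: the second stream lags the first by up to one baseform position.
print(score_index_paths([0, 1, 1, 2, 3], [0, 0, 1, 2, 2]))
```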
[Figure: DBN fragment (rest of model elided) in which articulatory state variables S^GLOT, S^LIP-OPEN, S^TT-OPEN, S^TB-OPEN map deterministically to IPA features such as voiced and sonorant; each feature has an observed soft-evidence child, e.g. SE_voiced = 1 and SE_sonorant = 1, with P(SE_sonorant = 1 | sonorant) = P_SVM(acoustics | sonorant)]
(A sketch of this soft-evidence idea follows after this slide.)
Combination of articulatory-phonology-style coarticulation modeling with IPA feature-based acoustic modeling (deterministic mapping)
• Suggests a potential work plan:
  • 1st half of workshop: Sub-teams work in parallel on
    (1) Set of features and classifiers for the acoustic model, using only articulatory "ground truth" and acoustics
    (2) Aspects of hidden structure (asynchrony, substitutions, context dependency), using only articulatory "ground truth" and words
  • 2nd half of workshop: Integrate the most successful methods from the 1st half
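As a concrete illustration of the soft-evidence acoustic model above, here is a minimal Python sketch in which a scikit-learn SVM with Platt scaling stands in for P_SVM. The feature names, toy acoustics, and prior are illustrative assumptions, and for simplicity the Platt-scaled posterior is used directly as the soft-evidence factor on the hidden feature.

```python
# Minimal sketch, assuming the slide's virtual-evidence ("soft evidence")
# formulation: a binary IPA feature (here, sonorant) gets an observed
# child SE with P(SE = 1 | feature = v) set to a classifier score.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy acoustic frames (MFCC-like vectors) with binary "sonorant" labels.
X = rng.normal(size=(200, 13))
y = (X[:, 0] > 0).astype(int)

# probability=True enables Platt-scaled posterior estimates.
clf = SVC(probability=True).fit(X, y)

def virtual_evidence(frame: np.ndarray) -> dict:
    """P(SE_sonorant = 1 | sonorant = v) for v in {0, 1}: the classifier's
    score for each value, used as a factor on the hidden feature."""
    p = clf.predict_proba(frame.reshape(1, -1))[0]
    return {0: p[0], 1: p[1]}

# Combine with a prior on the hidden feature (from the rest of the DBN).
prior = {0: 0.4, 1: 0.6}
ve = virtual_evidence(X[0])
post = {v: prior[v] * ve[v] for v in (0, 1)}
z = sum(post.values())
post = {v: p / z for v, p in post.items()}
print(post)
```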
Resources
• Tools
  • GMTK
  • HTK
  • Intel AVCSR toolkit
• Data
  • Audio-only:
    • Svitchboard (CSTR, Edinburgh): Small-vocab, continuous, conversational
    • PhoneBook: Medium-vocab, isolated-word, read
    • (Switchboard rescoring? LVCSR)
  • Audio-visual:
    • AVTIMIT (MIT): Medium-vocab, continuous, read, added noise
    • Digit strings database (MIT): Continuous, read, naturalistic setting (noise and video background)
    • AVICAR (UIUC)
  • Articulatory measurements:
    • X-ray microbeam database (U. Wisconsin): Many speakers, large-vocab, isolated-word and continuous
    • MOCHA (QMUC, Edinburgh): Few speakers, medium-vocab, continuous
    • Others?
  • Manual transcriptions: ICSI Berkeley Switchboard transcription project
Questions to address (soon)
• Audio-only, audio-visual, or both?
  • Audio-only
    • Better understood by current team members
    • Has more spontaneous speech data
  • Audio-visual
    • Potentially many more interesting phenomena in read data
    • Visual observations more closely tied to articulatory features
• Smaller tasks → faster turnaround time → higher impact?
• Can we reliably decouple investigation of acoustic modeling and pronunciation modeling?
• Evaluation via measures other than word error rate (see the sketch below)
  • Forced alignments
  • Articulatory tracking
  • Reasonableness of model parameters
• (Multi-style ASR: Train on slow, test on fast?)
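As one concrete example of a non-WER measure from the list above, here is a minimal Python sketch of frame-level articulatory tracking accuracy; the feature inventory and the toy sequences are illustrative assumptions, not outputs of any actual system.

```python
# Minimal sketch of frame-level articulatory tracking accuracy as a
# non-WER evaluation measure; feature names and sequences are toy values.

FEATURES = ["LIP-OP", "TT-LOC", "TT-OPEN", "TB-OPEN", "VELUM", "GLOTTIS"]

def frame_accuracy(ref, hyp):
    """Fraction of frames where the decoded feature value matches the
    reference (sequences must be the same length)."""
    assert len(ref) == len(hyp)
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)

def tracking_report(ref_streams, hyp_streams):
    """Per-feature frame accuracy over the streams present in both."""
    return {f: frame_accuracy(ref_streams[f], hyp_streams[f])
            for f in FEATURES if f in ref_streams and f in hyp_streams}

# Toy example: reference values could come from articulatory measurements
# (e.g. the X-ray microbeam data), hypotheses from the model's best path.
ref = {"LIP-OP": [0, 0, 1, 1, 2], "VELUM": [0, 0, 0, 1, 1]}
hyp = {"LIP-OP": [0, 1, 1, 1, 2], "VELUM": [0, 0, 1, 1, 1]}
print(tracking_report(ref, hyp))  # {'LIP-OP': 0.8, 'VELUM': 0.8}
```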