Articulatory Feature-Based Speech Recognition
JHU WS06 Planning Meeting, June 4, 2006
[Title figure: dynamic Bayesian network with variables word; ind1, ind2, ind3; U1, U2, U3; S1, S2, S3; sync1,2; sync2,3]
Project Participants
• Team members: Karen Livescu (MIT), Arthur Kantor (UIUC), Ozgur Cetin (ICSI Berkeley), Partha Lal (Edinburgh), Mark Hasegawa-Johnson (UIUC), Lisa Yung (JHU), Simon King (Edinburgh), Ari Bezman (Dartmouth), Nash Borges (DoD, JHU), Stephen Dawson-Haggerty (Harvard), Chris Bartels (UW), Bronwyn Woods (Swarthmore)
• Satellite members/advisors: Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Ghinwa Choueiter (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Mathew Magimai (ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Erik McDermott (NTT), Daryush Mehta (MIT), Florian Metze (Deutsche Telekom), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT)
Meeting Participants
• Team members: Karen Livescu (MIT), Arthur Kantor (UIUC), Ozgur Cetin (ICSI Berkeley), Partha Lal (Edinburgh), Mark Hasegawa-Johnson (UIUC), Lisa Yung (JHU), Simon King (Edinburgh), Ari Bezman (Dartmouth), Nash Borges (DoD, JHU), Stephen Dawson-Haggerty (Harvard), Chris Bartels (UW), Bronwyn Woods (Swarthmore)
• Satellite members/advisors: Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Ghinwa Choueiter (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Mathew Magimai (ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Erik McDermott (NTT), Daryush Mehta (MIT), Florian Metze (Deutsche Telekom), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT)
This meeting is for:
• Updating each other on the past 1.5 months’ work
• Discussing/deciding on some issues
• Dividing up into initial sub-projects
• Agreeing on
  • A rough plan for the workshop
  • A detailed plan for the next month
  • Undergrad projects & mentors
Broad project goals
• Building complete speech recognizers based on modeling articulatory features (AF) and developing an understanding of the issues involved
  • Parts have been done before: AF classification, combining AF classifiers with phone-based recognizers, feature-based pronunciation modeling
  • Only very recently has work begun to combine the ideas into complete recognizers
• Investigating three main types of models, represented as dynamic Bayesian networks
  • Fully-generative
  • Hybrid
  • Tandem
• Investigating models of articulatory asynchrony and reduction (feature-based pronunciation models)
• Testing recognizers on
  • Small-vocabulary conversational speech (SVitchboard corpus)
  • Audio-visual speech
Other possible goals
• Comparing different types of AF classifiers (NNs, SVMs)
• Investigating embedded training of AF classifiers
• Obtaining improved automatic AF transcription of speech
• Analyzing articulatory feature data
  • Dependence on context, speaker, speaking rate, speaking style, ...
  • Effects of articulatory reduction/asynchrony on recognition accuracy
• Developing a “meta-toolkit” for AF-based recognition
  • Developing/standardizing a set of GMTK wrapper tools (in Perl/Python)
• Applying our models to other languages (Arabic?)
Fully-generative models
• Phone-based: q (phonetic state), o (observation vector)
• AF-based: qi (state of AF i), o (obs vector), with either an unfactored or a factored observation model
• (Note that the state is always factored in the AF-based models above)
Hybrid models • Phone-based p(o|q) ∝ phone classifier output • AF-based p(o|qi) ∝ ith AF classifier output
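A minimal sketch of how a classifier output might be turned into the quantity on this slide. It assumes the common hybrid-system convention of dividing the classifier posterior p(q|o) by the state prior p(q) to get a scaled likelihood; the slide itself only states the proportionality. The function name, the floor value, and all numbers are hypothetical.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Return log(p(q|o) / p(q)) for every state q.

    By Bayes' rule p(o|q) = p(q|o) p(o) / p(q), so this equals log p(o|q)
    up to an additive term that does not depend on q.
    """
    return {
        q: math.log(max(posteriors[q], floor)) - math.log(max(priors[q], floor))
        for q in posteriors
    }

# Toy classifier output for one frame and toy state priors (made-up numbers).
posteriors = {"r": 0.6, "iy": 0.3, "l": 0.1}
priors = {"r": 0.2, "iy": 0.5, "l": 0.3}

scores = scaled_log_likelihoods(posteriors, priors)
best = max(scores, key=scores.get)
```

For the AF-based case the same computation would be applied separately to each AF classifier i.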
Tandem models • Phone-based o = phone classifier output • AF-based oi = ith AF classifier output
Asynchrony modeling
• Coupled hidden Markov model(-like)
• Asynchrony variables
• Single asynchrony variable
Reduction (substitution) modeling
• Context-independent: ui (underlying/target state of AF i), si (surface/actual state of AF i); substitution probabilities (rows: ui, columns: si):
        CL   C    N    M    O    …
   CL   .7   .2   .1   0    0    …
   C    0    .7   .2   .1   0    …
   N    0    0    .7   .2   .1   …
   …    …    …    …    …    …    …
• With dependence of surface value on previous value (can encode, e.g., an articulatory smoothness constraint): variables ui, si
• With dependence of surface value on another context variable (speaker, instantaneous speaking rate, syllable stress, ...): variables ui, context variable, si
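A small sketch of the context-independent substitution model, using only the three table rows shown on the slide (the remaining rows are elided there and are not filled in here). The function name and the example sequences are hypothetical.

```python
# Surface values in the column order of the slide's table.
SURFACE_VALUES = ["CL", "C", "N", "M", "O"]

# P(surface | underlying) for the rows given on the slide only.
SUBSTITUTION = {
    "CL": [0.7, 0.2, 0.1, 0.0, 0.0],
    "C":  [0.0, 0.7, 0.2, 0.1, 0.0],
    "N":  [0.0, 0.0, 0.7, 0.2, 0.1],
}

def surface_prob(underlying, surface):
    """Probability of a surface sequence given an underlying sequence,
    treating each frame independently (the context-independent case)."""
    if len(underlying) != len(surface):
        raise ValueError("sequences must have the same length")
    prob = 1.0
    for u, s in zip(underlying, surface):
        prob *= SUBSTITUTION[u][SURFACE_VALUES.index(s)]
    return prob

# Target CL, C, N realized as CL, N, N: 0.7 * 0.2 * 0.7
prob = surface_prob(["CL", "C", "N"], ["CL", "N", "N"])
```

The two context-dependent variants would replace the lookup by a table that is also indexed by the previous surface value or by the context variable.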
Audio-visual models
• Phoneme-viseme-based: variables qa (phonetic state), qv (viseme state), a (acoustic obs), v (visual obs)
• AF-based: variables qi, qj, qk, a, v
Sub-project preferences from previous meetings
• Fully-generative audio models: Karen, Ozgur, Arthur
• Hybrid models: Simon, Ozgur
• Tandem models: Simon, Ozgur
• AF classifiers, embedded training: Simon, Ozgur, Mark, Karen
• Pronunciation modeling: Karen, Ozgur, Ari
• Audio-visual models: Mark, Karen, Ozgur, Ari
• Possible independent projects (Ari, Bronwyn, Steve): data analysis, articulatory transcription, feature selection, ...
Since the Apr. 23 meeting... • NN AF classifiers trained (Joe, Mathew, Simon) • SVM AF classifier work (Ozgur) • NN AF classifiers for video (Ozgur) • Manual feature transcriptions ongoing (Karen, Xuemin, Lisa) • Transcriber agreement measures (Nash, Lisa) • AVSR work on MIT webcam digits (Kate, Karen) • AVSR work on AVICAR (Mark) • Distributed tools for GMTK training/decoding (Karen, Partha, Chris, Simon) • SVitchboard phone baseline updated to be “WS06-ready” (Karen) • gmtkTie work (Simon)
A tentative plan
• Prior to workshop:
  • Selection of feature sets for pronunciation and observation modeling (done)
  • Selection of corpora for audio-only (done) and audio-visual tasks
  • Baseline phone-based and feature-based results on selected data
  • Trained AF classifiers, with outputs on selected data
• During workshop:
  1. Fully-gen, hybrid, and tandem sub-teams build initial systems and generate articulatory forced alignments
  2. Embedded training & pronunciation modeling experiments using alignments, continued work on (1)
  3. Integrate most successful ideas from (1) & (2)
• In parallel, audio-visual work throughout
Manual articulatory feature transcriptions • A suggestion from ASRU meeting with Mark, Simon, Katrin, Eric, Jeff • Being carried out at MIT with the help of Xuemin and Lisa • Thanks also to Janet, Stefanie, Edward, Daryush, Jim, Nancy for discussions
“Ground-truth” articulatory feature transcriptions
• Will be used for
  • Testing feature classifiers
  • Working on pronunciation modeling separately from observation modeling
• Plan
  • Manually transcribe 50-100 utterances as test data for classifiers
  • Force-align a much larger set using classifiers + word transcripts + “permissive” articulatory model
  • Larger set will serve as “ground truth” for pronunciation modeling work
• Procedure
  • Each transcriber annotates each utterance using a “phone-feature hybrid” transcription
  • Hybrid transcription is converted to all-feature transcription
  • Transcribers compare transcriptions and fix errors (not genuine disagreements)
• Status
  • ~50 “practice” utterances & 35 “official” utterances
  • Should have another ~55 official utterances by end of June
Work on MIT webcam digits (Kate, Karen)
• Webcam connected digits corpus (collected for speaker ID work)
  • 150 sessions x 26 random 20-digit sequences per session
  • 3 environments per session: “office”, “lobby”, “outside” ⇒ Very challenging visually and acoustically
• Visual front end: face tracking → extraction of region of interest (ROI) around lips → computation of discrete cosine transform (DCT) coefficients over ROI
• Word error rates when training on office condition (slightly improved from 1st planning meeting)
• Human video-only performance for a single test speaker:
  • Kate: 16.7% WER, Karen: 8.9% WER (note: test speaker was Karen)
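To make the last front-end step concrete, here is a rough pure-Python sketch of computing DCT coefficients over an ROI and keeping the low-order ones as features. It is not the front end used for this corpus: the unnormalized DCT-II, the choice of keeping the top-left block of coefficients, and the function names are all assumptions for illustration.

```python
import math

def dct2_2d(patch):
    """Unnormalized 2-D DCT-II of a rectangular patch given as a list of rows."""
    rows, cols = len(patch), len(patch[0])
    coeffs = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0.0
            for y in range(rows):
                for x in range(cols):
                    total += (patch[y][x]
                              * math.cos(math.pi * (y + 0.5) * u / rows)
                              * math.cos(math.pi * (x + 0.5) * v / cols))
            coeffs[u][v] = total
    return coeffs

def low_order_features(patch, keep=2):
    """Flatten the keep x keep lowest-frequency coefficients into a feature vector."""
    coeffs = dct2_2d(patch)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]

# A flat (constant-intensity) 4x4 ROI: all energy ends up in the DC coefficient.
flat_roi = [[1.0] * 4 for _ in range(4)]
features = low_order_features(flat_roi)
```

In practice a library routine on the tracked lip region would replace the quadruple loop; the loop is written out here only to keep the sketch self-contained.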
Work on MIT webcam digits (Kate, Karen), continued
• Issues with the corpus
  • Very challenging visually ⇒ perhaps standard video features (DCTs) aren’t good enough
  • Dropped video frames ⇒ video & audio out of sync by non-constant amount
    • Proprietary AVI format means we don’t know which frames were dropped
    • Last few video frames often dropped while speaker is still speaking
  ⇒ We’re dropping this corpus for now
• Possible alternative: CUAVE (Clemson U.)
  • 36 speakers reading isolated and connected digits while still & moving
  • Collected in a studio environment
  • Word alignments available
  • Previous work (Gowdy et al. 2004) shows improvement from modeling audio-visual asynchrony within the word
  • Have audio MFCCs & video DCTs for isolated digits portion from Amar Subramanya @ UW (thanks!)
  • Kate is working on isolated-digit baselines + liptracking for connected-digits front end
Distributed training/decoding (Karen, Simon, Partha, Chris)
• Goal: Use identical training/decoding scripts at all sites, despite different types of distributed computing environment
• Idea: Main scripts generate lists of commands to be run in parallel; user provides script to run each list on his/her setup
• Status:
  • Have emtrain_parallel, viterbi_parallel bash scripts
    • Site-independent (in theory)
    • Gaussian splitting/vanishing by weight threshold or by top/bottom N weights
    • Stops & restarts during training
    • Splitting/vanishing + convergence in one run
    • Train over arbitrary utterance ranges
    • Written in bash
    • Doesn’t allow for multiple split/vanish + convergence phases in one run
    • Uses old masterFile/trainableParameters setup
  • emtrain_parallel has been tested at
    • Edinburgh: Tested successfully on GridEngine; not yet on Condor
    • UW: Tested unsuccessfully (?) on music (pmake) cluster & Condor
    • UIUC: ?
    • JHU: ?
• Try it at home! (see ToolsCode page on wiki)
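The idea on this slide (main scripts emit command lists, each site supplies its own runner) might look roughly like the sketch below. This is not how emtrain_parallel is implemented: the function names are hypothetical, and the `echo` command is a placeholder rather than a real GMTK invocation.

```python
import subprocess

def make_command_list(num_utts, num_jobs, template):
    """Split utterances 0..num_utts-1 into contiguous ranges, one command per range."""
    if num_jobs < 1:
        raise ValueError("num_jobs must be at least 1")
    commands = []
    base, extra = divmod(num_utts, num_jobs)
    start = 0
    for job in range(num_jobs):
        size = base + (1 if job < extra else 0)
        if size == 0:
            continue  # more jobs than utterances: skip empty ranges
        end = start + size - 1
        commands.append(template.format(start=start, end=end))
        start = end + 1
    return commands

def local_runner(command):
    """One possible site-specific runner: run the command in a local shell."""
    return subprocess.run(command, shell=True, check=True).returncode

def dry_run_runner(command):
    """Runner that only reports what would be executed."""
    return "would run: " + command

def dispatch(commands, runner):
    """Hand every command to the user-provided runner (sequential here)."""
    return [runner(command) for command in commands]

commands = make_command_list(10, 3, "echo train utterances {start}:{end}")
log = dispatch(commands, dry_run_runner)
```

A site using GridEngine or Condor would swap in a runner that submits each command to its queue, leaving `make_command_list` untouched.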
Context-independent phone-based model (for decoding)
[Figure: DBN unrolled over frame 0, frame i, and the last frame]
   variable name      values
   word               {“one”, “two”, ...}
   wordTransition     {0,1}
   subWordState       {0,1,2,...}
   stateTransition    {0,1}
   phoneState         {r_1, r_2, r_3, iy_1, ...}
   obs
+ Multiple pronunciations per word
[Figure: DBN unrolled over frame 0, frame i, and the last frame, with <S> and <\S> labels]
   variable name      values
   word               {“i”, “really”, ...}
   wordTransition     {0,1}
   pronVariant        {0,1,2}
   subWordState       {0,1,2,...}
   stateTransition    {0,1}
   phoneState         {r_1, r_2, r_3, iy_1, ...}
   obs
SVitchboard baselines
• Acoustic features: Speaker-normalized PLPs generated with HTK (thanks to Joe for scripts/config)
  • Speaker means/variances computed on all of SWB1
  • Normalized to Fisher global variance for use with NN classifiers
• Language model: Bigram trained using SRILM
• Dictionaries: Based on MIT SLS group’s phone set and (PronLex-like) dictionaries
  • Up to 3 baseform prons per word (mainly unreleased/flapped stops)
  • + some tweaking
• Context-independent 3-state Gaussian mixture phone models
• Example dictionary, as state sequences:
   and: ae_1 ae_2 ae_3 n_1 n_2 n_3 dcl_1 dcl_2
   and: ae_1 ae_2 ae_3 n_1 n_2 n_3 dcl_1 dcl_2 d_1
   i: ay1_1 ay1_2 ay2_1
   oh: ow1_1 ow1_2 ow2_1
   okay: ow1_1 ow1_2 ow2_1 kcl_1 kcl_2 k_1 ey1_1 ey1_2 ey2_1
   really: r_1 r_2 r_3 iy_1 iy_2 iy_3 l_1 l_2 l_3 iy_1 iy_2 iy_3
   really: r_1 r_2 r_3 iy_1 iy_2 iy_3 ax_1 ax_2 ax_3 l_1 l_2 l_3 iy_1 iy_2 iy_3
   right: r_1 r_2 r_3 ay1_1 ay1_2 ay2_1 tcl_1 tcl_2
   right: r_1 r_2 r_3 ay1_1 ay1_2 ay2_1 tcl_1 tcl_2 t_1
   so: s_1 s_2 s_3 ow1_1 ow1_2 ow2_1
   the: dh_1 dh_2 dh_3 ah_1 ah_2 ah_3
   the: dh_1 dh_2 dh_3 iy_1 iy_2 iy_3
   well: w_1 w_2 w_3 eh_1 eh_2 eh_3 l_1 l_2 l_3
   yes: y_1 y_2 y_3 eh_1 eh_2 eh_3 s_1 s_2 s_3
   <SILENCE>: sil_1 sil_2 sil_3
   <S>: sil_1
   <\S>: sil_1
• The same entries as phone sequences:
   and: ae n dcl
   and: ae n dcl d
   i: ay1 ay2
   oh: ow1 ow2
   okay: ow1 ow2 kcl k ey1 ey2
   really: r iy l iy
   really: r iy ax l iy
   right: r ay1 ay2 tcl
   right: r ay1 ay2 tcl t
   so: s ow1 ow2
   the: dh ah
   the: dh iy
   well: w eh l
   yes: y eh s
   <SILENCE>: sil
   <S>: sil
   <\S>: sil
SVitchboard baselines: Experimental setup
• “Grow” the Gaussian mixtures
  • Start with 1 Gaussian per mixture, train until 2% convergence
  • While WER is decreasing on dev set:
    • Split each Gaussian, train until 2% convergence (typically 3 iterations)
    • Test on dev set
  • Train best model, with no splitting/vanishing, until 0.2% convergence
• Tune insertion penalty on dev set
• Score using NIST’s sclite
• Tasks tested on so far: 10-word, 100-word
• Example utterances (10-word, 100-word): so right right okay oh yes okay okay so so right it i guess i'm not right and i guess that okay oh yes
• Example utterances (500-word): exactly he will take it more from exactly and i'm not sure that we could do that in such a way it's i like the idea all right that's better and that's the age they're looking at oh okay i'm really thinking that this guy might be good for us
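The mixture-growing schedule above can be sketched as a short control loop. The callables `train`, `split`, and `dev_wer` are hypothetical stand-ins for the actual training, Gaussian-splitting, and dev-set scoring steps, and the toy WER numbers are made up purely to exercise the loop.

```python
def grow_mixtures(init_model, train, split, dev_wer):
    """Split-and-retrain until dev-set WER stops decreasing, then do a
    final, tighter training pass on the best model without splitting."""
    model = train(init_model, tol=0.02)           # 1 Gaussian, 2% convergence
    best_model, best_wer = model, dev_wer(model)
    while True:
        candidate = train(split(best_model), tol=0.02)
        wer = dev_wer(candidate)
        if wer >= best_wer:                       # WER no longer decreasing
            break
        best_model, best_wer = candidate, wer
    return train(best_model, tol=0.002), best_wer  # 0.2% convergence

# Toy stand-ins: a "model" is just its number of Gaussians per mixture.
train_calls = []

def toy_train(model, tol):
    train_calls.append((model, tol))
    return model

def toy_split(model):
    return model * 2

TOY_DEV_WER = {1: 40.0, 2: 35.0, 4: 33.0, 8: 34.0}  # made-up numbers

final_model, final_wer = grow_mixtures(1, toy_train, toy_split, TOY_DEV_WER.get)
```

With these toy numbers the loop stops after trying 8 Gaussians and returns the 4-Gaussian model; insertion-penalty tuning would follow as a separate step.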
SVitchboard 10-word experiments
• For now, both acoustic and language models trained on SVB sets A, B, C and tested/tuned on D (i.e. all results are development results for now)
[Plot: Performance vs. training phase, ins. penalty = 0]
[Plot: Performance vs. ins. penalty, phase = 9c]
SVitchboard 100-word error rates & timing info
[Plot: Performance vs. training phase, ins. penalty = 0]
[Plot: Performance vs. insertion penalty, phase = 9]
• Plot annotation: (glitch, not actual WER increase)
• Time = overall time on 50-60 nodes, mainly Pentium 4 3.2 GHz
• Total time for this experiment: ~18 hours and counting
SVitchboard baselines: Conclusion • A possible framework for workshop experiments • Can now (soon) download SVitchboard example from the wiki • To do: • Tune parameters (split/vanish schedule, convergence ratios) • Do 5-fold cross-validation as per [King et al. ’05] • Repeat for other vocabulary sizes • Repeat using multiple frames of PLPs + transformation
Clarification: Pronunciation vs. observation modeling
• Language model P(w): w = “makes sense...”
• Pronunciation model P(q|w): q = [ m m m ey1 ey1 ey2 k1 k1 k1 k2 k2 s ... ]
• Observation model P(a|q): a = [acoustic observation sequence]
• Recognition: $w^* = \arg\max_w P(w \mid a) = \arg\max_w P(w) \sum_q P(q \mid w)\, P(a \mid q)$
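A toy illustration of how the three models combine at recognition time: each word is scored by P(w) times the sum over its state sequences q of P(q|w) P(a|q). The pronunciations follow the example dictionary from the baseline slides; every probability below is made up, and P(a|q) is given as a lookup table for one fixed observation sequence a.

```python
# Language model P(w): toy values.
language_model = {"the": 0.6, "so": 0.4}

# Pronunciation model P(q|w): phone sequences per word, toy probabilities.
pron_model = {
    "the": {("dh", "ah"): 0.7, ("dh", "iy"): 0.3},
    "so": {("s", "ow1", "ow2"): 1.0},
}

# Observation model P(a|q) for one fixed acoustic sequence a: toy values.
obs_model = {
    ("dh", "ah"): 0.01,
    ("dh", "iy"): 0.05,
    ("s", "ow1", "ow2"): 0.02,
}

def word_score(word):
    """P(w) * sum over q of P(q|w) * P(a|q)."""
    total = sum(p_q * obs_model[q] for q, p_q in pron_model[word].items())
    return language_model[word] * total

scores = {w: word_score(w) for w in language_model}
best_word = max(scores, key=scores.get)
```

Changing only `pron_model` while holding `obs_model` fixed corresponds to working on pronunciation modeling separately from observation modeling.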
Discussion topics
• SVM AF classifiers? → No
• AVSR corpora? → Mark keeps working on AVICAR, Kate & Karen on CUAVE
• Distributed scripts & Condor → Discuss offline
• Undergrad projects → Talk later today
• Workshop timeline → All set (a few slides down)
Testing pronunciation models • Recognize words from articulatory transcriptions • Do end-to-end recognition, keeping obs model fixed and switching in different pron models; retrain obs model only at the end (i.e. once pron model is fixed) • Re-train obs model each time we switch in a pron model
Sub-project assignments
• Hybrid models: Simon, Steve
• Tandem models & fully-gen models: Simon, Ozgur, Karen, Arthur, Nash
• AF classifiers (before), embedded training (later): Simon, Mark, Karen, Nash
• Data analysis, articulatory transcription: Chris, Ari, Bronwyn, Karen
• Pronunciation modeling (starting w/whatever obs model is most available at start of workshop): Karen, Ozgur, Ari, Chris, Bronwyn, Lisa, Nash, Mark
• Audio-visual models: Mark, Karen, Ozgur, Ari, Partha, Steve
• Undergrad projects:
  • Articulatory transcription for audio & AV speech, possibly time-explicit async modeling later (Ari, Karen)
  • Analysis of context variables for pronunciation modeling (Bronwyn, Mark)
  • Structure learning for asynchrony structures and/or factoring of observation models, possibly join with Ari on time-explicit async modeling later (Steve, Simon)
A rough timeline for the workshop (Jul. 10 to Aug. 17)
• AVSR: work throughout independently of other sub-projects (though may incorporate ideas from data analysis etc.)
• Pron. modeling: w/best available baseline obs model; later retrain with improved obs models, incorporate results from data analysis
• Hybrid, tandem, fully-gen recognizers: train with basic pron model, generate alignments; later embedded training for hybrid/tandem models
• Data analysis: any remaining analysis of manual transcriptions; analyze automatic alignments once available
To do before workshop
• Finish manual transcriptions + basic analysis (Karen, Nash, Lisa)
• Polish tools (Chris, Simon, Partha, Karen)
• Complete SVitchboard baselines (Karen, Chris, Ozgur)
  • Modify Karen’s baselines to use training & pronunciation lattices (Chris)
  • Build triphone models (Ozgur, Chris)
  • Update feature-based generative baseline, + a version with factored obs model (Karen)
  • Tune parameters
  • Do 5-fold cross-validation, probably at start of workshop
  • Repeat on other vocab sizes
• Transfer to JHU (& other sites) (Everyone)
  • Acoustic features: PLP, PLP+LDA, LCBE?; AV features
  • NN/SVM outputs on all of SVitchboard
  • Dictionaries
  • Language models
  • Distributed tools