450 likes | 551 Views
Acoustic Adaptation and Accent Identification in the ICSI MR and FAE Corpora. Javier Macías-Guarasa International Computer Science Institute Berkeley, CA - USA. Overview. Introduction Acoustic adaptation MR SI task MR SD task Accent identification MR SI task FAE task Conclusions
E N D
Acoustic Adaptation and Accent Identification in the ICSIMR and FAECorpora Javier Macías-Guarasa International Computer Science Institute Berkeley, CA - USA
Overview • Introduction • Acoustic adaptation • MR SI task • MR SD task • Accent identification • MR SI task • FAE task • Conclusions • Future work
Introduction (I) • Work on improving WER for non-native speakers in the ICSI MR corpus • General details on the Meeting Recording corpus: • Number of speakers: 61 • Speech segmented:85:08:21 • Number of accents: 15 • ‘Workable’ accents: • American 53:12:35 15m+8f • German 11:37:01 10m+2f • Spanish 04:38:24 4m+1f • British 01:03:45 2m+0f just for reference
Introduction (II) • Initial idea: • Pronunciation modeling for non native speakers • Acoustic adaptation techniques to be tested first: • SRI Decipher system capabilities: • MAP/MLLR/PhoneLoop • Analyze different strategies • Speaker dependent and independent tasks
Introduction (III) • Accent identification: • Needed to effectively use accent-dependent models in a real-world system • Emphasis in ‘practical’ approaches using, again, SRI Decipher capabilities • MR task is a difficult acoustic environment: • Low number of speakers/speech material • Certain speakers dominance (more details?) • FAE task also approached
Introduction (VI)Baseline WERs • Using SRI 2003 system, WER:
Acoustic adaptation (I) • Initial studies with old partitioning shows that global task adaptation through MAP is the best approach: • Accent-dependent MAP adaptation also promising • Initial attempt to do full retraining using 16KHz speech (also 8KHz speech as reference): • Very bad results (more details?) • Worse than baseline!! • Too few speakers in the training set given the task partition (speaker independent)
Acoustic Adaptation (II) Previous work • Interest in language learning tools (CALL) • Standard acoustic adaptation techniques • MAP/MLLR using L1 or L2 speech data • Model interpolation • Clustering • Sufficient for high proficiency speakers • Pronunciation modeling: • Little (if any) success reported
Acoustic Adaptation (IV) Objectives • Strategies for SI task, ¿combined improvement?: • Task MAP adaptation (TaskMAP) • Accent dependent MAP (AccMAP) • TaskMAP followed by AccMAP/MLLR • Strategies for SD task: • Task MAP adaptation (TaskMAP) (includes speaker adaptation) • Per speaker MAP adaptation (SpkMAP)
Full DB MAP(task adaptation) Global MAP models Am DB Am MAP models MAP/MLLR Ge DB Ge MAP models MAP/MLLR SWB models . . . Sp DB Sp MAP models MAP/MLLR ? OR Acoustic adaptation (V) • Strategies for Acoustic adaptation: • Adaptation weights tuned per accent (heldout) • Final phoneloop stage TaskMAP AccMAPTask+AccMAP
Acoustic Adaptation (VII) MR Speaker Independent Task • SI adaptation summary, WER:
Acoustic Adaptation (VIII) MR Speaker Independent Task • SRI 5xRT system: • Using new dictionary and interpolated LMs • Using best map adapted models for mel features • Still some bugs in the process (more details?)
Acoustic Adaptation (X) MR Speaker Dependent Task • SD adaptation summary, WER:
Accent Identification (I) • Background: • Techniques similar to Language Identification • GMM based: • Broad collection of features • GMM tokenizers • Broad phonetic classes + HMMs • LM/AM score comparison • Based on phonotactic characteristics: • PPRLM, PRLM • More complex than LID • Hard to compare rates: No previous work in MR/FAE
Accent Identification (II) Objectives • Strategy: Use SRI Decipher characteristics • Practical approach: Reasonable run times • GMM classification module (for gender detection) • Evaluate standard features and normalizations • Hypothesis driven, phone recognition: • CD/CI models • Recognition using flat Phone LM or flat LM • View as a text classification problem • Phone LM driven: • PRLM/PRLM • Combination using NNs
Accent Identification (III) MR data: MM classification approach • GMM results for MR corpus: • Unbalanced data tested over and under sampling • Use different features & normalization: • No significant differences (except when using voicing features): • lack of data • ~Uniform channels
Accent Identification (IV) MR data: GMM classification approach • GMM results for MR corpus : • As a function of utterance length task AM-GE-SP-BR
Accent Identification (V) MR data: Hypothesis driven approach • Text classification view using MR data: • Input from phone recognition: • From free phone recognition (CD/CI models, full/flat PLM) • Rainbow: CMU tool for text classification • Naive bayes classification technique • N-grams (1..6) • No further restrictions (feature selection, stop list, etc.)
Accent Identification (VI) MR data: Hypothesis driven approach • Text classification view using MR data: • Best results using CI models + flat PLM (bigrams & trigrams) • Chunk based classification rates (simulation):
Accent Identification (VII) MR data: Hypothesis driven approach • Text classification view using MR data: • Utterance based classification rates (simulation): • Need longer sequences!!
Accent Identification (VIII) MR data: Hypothesis driven approach • Text classification view using MR data: • Real partition classification rates: • Worse-than-chance rates if utterance based (pending to do length-dependent AI task)
Accent Identification (IX) Phone LM approach • PRLM: Phone recognition & LM PLM accent 1 LM scoring Score comparison Score comparison . . . Speech Phonerecognizer Decision Score comparison LM scoring AM PLM accent N
Accent Identification (X) Phone LM approach • PRLM: Phone recognition & LM • Tested different AMs for phonetic string generation: • Std forced • Std SWB • MAP adapted per accent • Best is Std SWB • Tested 1-6gram: • Best is trigram • But very poor results
Accent Identification (XI) MR data: Phone LM approach • PRLM: Phone recognition & LM: • As a function of utterance length, task AM-GE-SP-BR: Very bad results
Accent Identification (XII) Phone LM approach • PPRLM: Parallel Phone recognition & LM LM scoring Accent a Phonerecognizer Models A Avg accent a Avg accent a . . . Score comparison Avg accent a Score comparison LM scoring Accent z . . . . . . Speech Decision LM scoring Accent a Score comparison Avg accent a Avg accent z Phonerecognizer Models Z . . . Avg accent a LM scoring Accent z
Accent Identification (XIII) FAE database • Experiments with the FAE database: • 4500 speakers: More acoustic context • 20 seconds per speaker • Proficiency is labeled • Strategy: • Apply standard techniques • Possibly: • Use FAE-generated models in MR data
Accent Identification (XIV) FAE database: GMM classification • GMM: • Gender independent classification (16-2048) • FAE results in GE-SP task: • Norm better than CMN. CMN better than plain features • Pending to test GD models
Accent Identification (XV) FAE database: GMM classification • GMM: • Combining FAE models with MR data: • Using frame_cepstrum + CMN (GMM 256) • Combination is possible, but more experiments are needed!!
Accent Identification (XVI) FAE database: hypothesis driven • Text classification view: • FAE results: • Better than chance but, still, far from useful • Pending to test FAE models in MR data
Accent Identification (XVII) FAE database: Phone LM approach • PRLM/PPRLM: • Pending • GMM better than text based classification. GE-SP task, for example: • GMM: 72.0% • Text-based: 58.9% • Results as a function of speech length to be evaluated
Conclusions • Acoustic adaptation is important to face non-native accents: • MAP adaptation provided best results: • Task adaptation+accent adaptation • Work on tuning adaptation weights for SD & SI task (magnitude differences) • Low proficiency speakers need additional improvements • Non native speech recognition may not be solvable!
Conclusions • Accent identification: • Proved to be more difficult than LID • Different techniques applied: • GMM techniques and text classification techniques showed promising results • Standard PRLM strategy didn’t work as expected (score normalization needed?) • PPRLM to be tested • Integration to be tested
Future work • Finish current experimentation: • Accent identification: • Test features and normalizations in GMM and phone LM based • Test acoustic scores ratios • Test LM scores • Test NN based combination • NonNat speech characterization: • Errors phone/word • Model ‘usage’ distributions
Future work • Pronunciation modeling: • Evaluation of pronunciation variants found in the SRI SWB dictionary for NonNat speech • Rule based: • Rules in German (from Silke Goronzy’s work) • Rules in Spanish • ‘Speaking mode’ probability estimation (accent + …) • Use of new databases (FAE, TED, Fisher)
Future work • A note on work on pronunciation modeling in the MR task: • The MR corpus is not suitable for data-driven pronunciation modeling: • High error rates for non native speakers & limited number of them • Rule based methods are to be tested first • Initial work on evaluating current pronunciation alternatives is needed • I got relevant rules for initial testing in German and Spanish
Thank you!! • To ICSI and the ICSI Speech Group, with special emphasis to: • Morgan • Andreas • Qifeng, Barry, Adam, Yang, Yan, Dave, Jeremy, … • Sven & all international visitors • The FrameNet people (Miriam, Michael & Co.) • Staff, specially Lila, María Eugenia and Diane
MR Partitioning • Speaker independent (SI subtask)
Full retraining • Initial attempt to do full retraining using 16KHz speech: • With old partitioning • Too few speakers in the training set given the task partition (speaker independent)
Speaker dominance • Few speakers concentrate most speech material:
Acoustic adaptation (IV) • SI TaskMAP adaptation, WER: • Optimal map weight ~proportional to size of accented speech subset • Bigger improvements in non native accents • Bigger improvements for bigger data size
Acoustic adaptation (IV) • SI AccMAP adaptation, WER: • Similar trends than TaskMAP, but no further improvements, except german benefits from task data!
Acoustic adaptation (IV) • SI TaskMAP+AccMAP adaptation, WER: • Small improvements over TaskMAP • Also tested MLLR instead (taskMAP+AccMLLR), but no improvements
Gender ID issues • Gender identification: • Per chunk gender ID: • Per utterance gender ID:
Acoustic AdaptationSRI 5xRT System results British 002-mel 002-mel-expanded 005-plp 005-plp-rescored rover STD 91.5 89.3 78.2 80.6 78.7 Adapted 87.9 84.3 78.8 79.7 79.7 BestSimple 87.9 ------------------------------------------------------------------------------ Spanish 002-mel 002-mel-expanded 005-plp 005-plp-rescored rover STD 97.5 94.2 95.9 95.1 93.4 Adapted 88.0 85.1 88.4 88.2 86.6 BestSimple 93.2 ------------------------------------------------------------------------------ German 002-mel 002-mel-expanded 005-plp 005-plp-rescored rover STD 50.7 47.6 44.7 44.5 44.1 Adapted 45.8 45.8 44.4 44.5 44.9 BestSimple 42.3 ------------------------------------------------------------------------------ American 002-mel 002-mel-expanded 005-plp 005-plp-rescored rover STD 36.3 33.2 31.2 30.8 31.0 Adapted 33.6 34.3 33.3 33.1 33.6 BestSimple 30.4