Children’s Speech Recognition for Educational Applications Rohit Kumar 11751 Course Project
Motivation • Literature Survey • Approach • Results • Observation • Conclusions
Motivation • To develop speech interfaces for intelligent tutoring systems for kids • Kids are usually too young to type • Potential for higher interactivity (and engagement in the learning process) through speech • Domain: mathematics problem solving
Literature Survey • Oral reading tutors based on ASR • Mostow et al., 1994 • Cole et al., 1999 • … • Shown to be effective in the learning-to-read domain • In the mathematics domain: not much work • MathTalk: a commercial mathematics learning system that uses off-the-shelf ASR (Ex-Dragon) and is based on speaker adaptation • Its commercial nature makes it difficult to get in-domain language data
Literature Survey • Improvements to ASR for children have been explored outside the reading-tutor area as well • For conversational interfaces • Narayanan and Potamianos, 2002 • Oviatt, 2000 • 100%+ higher WERs for children's speech than for adult speech (1990s). Why? • One reason: 4 kHz is insufficient bandwidth • Children's second and third formants are much higher than adults'
Literature Survey • Several papers on improving speech recognition for children's speech • Frequency warping (or other VTLN) • Das et al., 1996 report a 5.8% absolute WER improvement on an isolated-word recognition task • Elenius, 2004 reports a 4% absolute improvement on connected-digit recognition • Interesting positive correlation between the speaker's height/age and accuracy
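Frequency warping for VTLN can be illustrated with a piecewise-linear warp applied to filterbank centre frequencies before feature extraction. A minimal sketch; the warp factor alpha and the breakpoint f0 below are illustrative choices, not values from the papers cited above:

```python
def warp_frequency(f, alpha, f_max=8000.0, f0=0.85 * 8000.0):
    """Piecewise-linear VTLN warp: scale by alpha below the breakpoint f0,
    then interpolate linearly so that f_max still maps onto f_max."""
    if f <= f0:
        return alpha * f
    # Linear segment from (f0, alpha * f0) up to (f_max, f_max)
    slope = (f_max - alpha * f0) / (f_max - f0)
    return alpha * f0 + slope * (f - f0)

# Children have shorter vocal tracts, so their formants sit higher;
# alpha < 1 compresses the frequency axis toward an adult-trained
# model's space (alpha > 1 would stretch it).
centres = [250.0, 1000.0, 3000.0, 7500.0]   # example filterbank centres (Hz)
warped = [warp_frequency(f, alpha=0.88) for f in centres]
```

In practice the warp factor is chosen per speaker (or per utterance) by maximising likelihood under the acoustic model over a small grid of alpha values.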
Literature Survey • WER trend (Elenius, 2005): • Adult models > • VTLN on children's speech > • MAP adaptation of models using children's speech > • MLLR adaptation of models using children's speech > • Children's models
Why is this problem hard? • Spontaneous speech • Many filler words (~6%) • Children's speech • Often disfluent • Mispronunciations, false starts
Approach • Using existing data and models for experimentation • Adult speech • CMU HUB4 open-source acoustic models • Communicator models • Children's speech • CMU KIDS (Eskenazi, 1997) • In-domain data collection & pre-processing • MathKids • ASR • Sphinx3: semi-continuous and continuous models • MLLR-based adaptation • Language models: in-domain transcripts
CMU KIDS • All kids aged 6–11 • 24 male, 52 female • Eskenazi, 1997 argues that the gender skew may be acceptable because vocal tract length does not differ significantly across genders for children • 5,180 utterances • Cleaned up the corpus to 2,481 utterances (~4 hours), discarding utterances with problematic transcripts (which would have needed additional language resources to model)
MathKids • Collected by me at the Carnegie Science Center • Kids aged 7–11 enrolled in the data collection • 9 participants (3 female, 6 male) • Kids were asked to think aloud while solving fraction-related problems • Nearly 2 hours of spontaneous speech collected • The 15–20 minute recording of each subject was segmented into utterances automatically • CMUSeg (Mike Seltzer): very large segments; may be good for broadcast news, but not for spontaneous speech • Based on speaker segmentation • Sphinx segmenter tool (Ravi Mosur): good for spontaneous speech; eliminates chunks of noise/silence during segmentation (which can occasionally be bad, as we found during transcription) • Based on heuristics on the power contour to mark speech segments
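The power-contour heuristic can be sketched as: frame the signal, compute per-frame log energy, threshold it, and keep sufficiently long runs of high-energy frames as speech segments. A minimal illustration; the frame size, threshold, and minimum-run length are assumptions, not the Sphinx tool's actual parameters:

```python
import numpy as np

def segment_by_energy(samples, rate, frame_ms=25, thresh_db=-30.0, min_frames=8):
    """Mark speech segments wherever frame log-energy stays above a
    threshold (in dB relative to the loudest frame) for at least
    min_frames consecutive frames."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = np.reshape(samples[:n_frames * frame_len], (n_frames, frame_len))
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    active = energy_db > (energy_db.max() + thresh_db)

    segments, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                       # run of speech begins
        elif not on and start is not None:
            if i - start >= min_frames:     # keep only long-enough runs
                segments.append((start * frame_len, i * frame_len))
            start = None
    if start is not None and n_frames - start >= min_frames:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments  # list of (start_sample, end_sample)
```

The min_frames run length is what drops short noise bursts, and is also why genuine short speech chunks can be lost, the failure mode noted above.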
MathKids • After segmentation • Created 534 audio files = ~35 minutes of speech • Manual transcription of the 534 files • Split across 2 people • 1 round only: no checking • Train/test split by random sampling across all speakers • Train set: 429 utterances (~80%) • Test set: 105 utterances (~20%) • Dictionary created from all words in the transcripts (train + test) using CMU LM tools, i.e., CMUdict + Festival LTS • 468 unique words
Acoustic Models • CMU Communicator • Conversational speech • Semi-continuous models • CMU HUB4 • Continuous models (4 Gaussians per state) • KIDS • Semi-continuous models • Semi-continuous models, MLLR-adapted to the MathKids train set • Continuous models (4 Gaussians) • MathKids • Semi-continuous models • Continuous models (4 Gaussians)
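MLLR, used above to adapt the KIDS models to the MathKids train set, estimates an affine transform of the Gaussian means, mu' = A mu + b, shared across many Gaussians so that a small amount of adaptation data suffices. A toy least-squares version of the single-regression-class case, assuming equal occupancy counts and identity covariances (not Sphinx3's actual implementation):

```python
import numpy as np

def estimate_mllr(means, adapted_targets):
    """Least-squares fit of W = [b, A] so that W @ [1; mu] ~= target for
    every Gaussian (equal occupancy and identity covariances assumed)."""
    mu = np.asarray(means)               # (n_gauss, dim)
    tgt = np.asarray(adapted_targets)    # (n_gauss, dim)
    ext = np.hstack([np.ones((mu.shape[0], 1)), mu])  # extended means [1, mu]
    W, *_ = np.linalg.lstsq(ext, tgt, rcond=None)     # solves ext @ W ~= tgt
    return W.T                           # (dim, dim + 1); first column is b

def apply_mllr(W, mean):
    """Transform a single mean vector with the shared MLLR matrix."""
    ext = np.concatenate([[1.0], mean])
    return W @ ext
```

Because one W is shared by all Gaussians in a regression class, even ~35 minutes of in-domain speech can move an out-of-domain model meaningfully, which matches the improvements reported in the observations below.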
Language Models • MathKids train-set transcripts • train-set + test-set vocabulary • 3-gram • Good-Turing • MathKids train-set transcripts • train-set + test-set vocabulary • 2-gram • Good-Turing • MathKids train-set + test-set transcripts • 3-gram • Good-Turing
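Good-Turing discounting, used for all three language models, replaces each raw count r with r* = (r + 1) N_{r+1} / N_r, where N_r is the number of n-gram types seen exactly r times; the mass N_1 / N is reserved for unseen n-grams. A minimal sketch over a bag of bigram counts, with no smoothing of the N_r values themselves (which a real LM toolkit would apply for sparse high counts):

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Map each raw count r to its Good-Turing adjusted count
    r* = (r + 1) * N_{r+1} / N_r, where N_r = number of types seen
    exactly r times. Falls back to the raw count when N_{r+1} is zero."""
    freq_of_freq = Counter(counts.values())          # N_r
    adjusted = {}
    for ngram, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        adjusted[ngram] = (r + 1) * n_r1 / n_r if n_r1 else float(r)
    return adjusted

# Toy bigram counts (hypothetical, in the spirit of the fractions domain)
counts = Counter(["two thirds", "one half", "one half", "plus one",
                  "plus one", "one half"])
adj = good_turing_adjusted(counts)
# Probability mass reserved for unseen bigrams: N_1 / N
unseen_mass = sum(1 for r in counts.values() if r == 1) / sum(counts.values())
```

With only ~35 minutes of transcripts, most n-grams are singletons, so the reserved mass N_1 / N is large, one reason smoothing choice matters at this data scale.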
Testing Configurations of Speakers • Females (3): F1, F2, F3 • Males (4): M1, M2, M3, M5 • All males: M • All females: F • All
Observations • Best performance • AM built on semi-continuous in-domain data (despite its small size) • LM built on train-set + test-set transcripts • If we do not want to include the test set: • Semi-continuous in-domain AM is still best • Sometimes the bigram model is better than the trigram • Adding the test set gives a 1% to 15% absolute improvement • Adult models perform far too poorly for this task • Continuous children's models are also poor • Why? Too little data! • Out-of-domain semi-continuous models are no good either • But MLLR adaptation improves those models • Still not good enough to be the best • However, I would guess that they would do better on unseen data; must verify this
Observations • The above trends are in line with the trend in Elenius, 2005 • WER of adult models > WER of MLLR-adapted models > WER of children's models
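WER, the metric behind all of the comparisons above, is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # delete all of ref[:i]
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # insert all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub,            # substitution (or match)
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the recognizer inserts many words, which is how "100%+ higher WERs" remain meaningful for poor models.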
Observations • Higher WERs for females (correlation: 0.6) • Despite the KIDS corpus having more female data • Could be because there is more male data in the MathKids corpus • Female speech recognition results are usually worse (??) • This result is inconsistent with the one reported in Elenius, 2005 (correlation: 0.11) • Could also be considered inconsistent with the claim in Eskenazi, 1997 that, due to the small vocal tract length difference across genders in kids, the skew in the amount of data from each gender may not matter • Lower WERs for higher age (correlation: 0.54) • Consistent with Elenius, 2005 • Maybe it is better to collect data from the younger kids of the target population than the older kids; must verify
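The correlation figures above (0.6, 0.54) are presumably Pearson correlations between per-speaker WER and a speaker attribute. A sketch with invented per-speaker numbers, not the project's actual results; with WER on one axis, "lower WER for higher age" shows up as a negative coefficient:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / np.sqrt(np.dot(xc, xc) * np.dot(yc, yc)))

# Hypothetical per-speaker values: older kids get lower WER here.
ages = [7, 8, 9, 10, 11, 11, 9]
wers = [0.62, 0.55, 0.50, 0.44, 0.40, 0.38, 0.52]
r = pearson(ages, wers)   # negative for this made-up trend
```

With only 7 speakers, correlations of this size carry wide confidence intervals, which supports the "must verify" caveats in the slide above.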
Conclusions & Lessons • Given the amount of data, semi-continuous models are a good choice • Wasted a lot of time with continuous models, as they are the default in Sphinx3 • MLLR adaptation can lead to significant improvements in recognition performance • Data collection for recognition takes a lot of time • Can't get kids to talk about mathematics for too long in think-aloud mode • The mathematics problems given to them should match their age
Future Work • Experimentation with feature transformation • Frequency-warping-based VTLN • Given the difference between male and female results • Improvement of the language model through • More data • A corpus created from a domain-specific semantic grammar • Interpolation with transcripts of other conversational corpora (could not find one of these easily!) • Improvement of the acoustic model through • More data (in-domain) • Checking transcriptions • Modeling disfluencies • Different models for different genders