This project explores bootstrapping new lexicons using phonological features and ASWUs for ASR systems in multilingual environments. The proposal aims to derive lexicons from acoustic data, word-level transcripts, and phonological feature detector outputs for improved ASR performance. Leveraging techniques from previous works, the system starts with a basic dictionary and iteratively refines pronunciations utilizing feature detectors and automatic metrics for splitting or deleting phones.
What I did on my Summer “Vacation” Jeremy Morris 10/06/2006
Summer at AFRL - DAGSI • AFRL • Air Force Research Labs • Wright-Patterson AFB, Dayton OH • DAGSI Student/Faculty Research Fellowship program • Dayton Area Graduate Studies Institute • Effort to encourage collaboration between Ohio universities and AFRL
Summer at AFRL – SCREAM Lab • SCREAM Lab • Speech and Communication Research, Engineering, Analysis and Modeling Lab • Interest in a wide variety of speech research issues for the military • Speech-to-speech translation, rapid development of speech recognition systems, etc.
Summer at AFRL – Why us? • SCREAM Lab members were interested in collaborating with OSU • SCREAM Lab working on research in using phonological features in speech recognition • Perceived overlap with ASAT project
Review – Phonological Features • For the ASAT Project, we have been using phonological feature detectors • We train detectors on a particular phonological feature • e.g. manner or place for consonant, height, frontness, etc. for vowels • We then combine these features together for ASR purposes
Phonological Features (cont.) • SCREAM Lab very interested in phonological feature detectors • Need for quick development of new ASR systems for new languages • A full set of phonological feature detectors would allow reuse of acoustic data for training across new languages • Multi-lingual detectors are clearly needed to get full coverage of all features
Phonological Features (cont.) • Our phonological feature detectors • Monolingual (English only) • Trained using a set of multi-layer perceptron neural networks • Output a set of phonological feature class probabilities • SCREAM lab feature detectors • Monolingual and multilingual • Trained using Gaussian Mixture Models • Output a set of likelihoods • Based on work by Tanja Schultz (CMU)
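The contrast on this slide (MLP posteriors vs. GMM likelihoods) can be made concrete with a minimal sketch. The function names and shapes here are illustrative assumptions, not the actual OSU or SCREAM Lab code: an MLP-style detector ends in a softmax over feature classes, while a GMM-style detector scores a frame with a mixture log-likelihood.

```python
import numpy as np

def mlp_posteriors(logits):
    """OSU-style detector output: softmax over phonological feature
    classes, giving a probability for each class (sums to 1)."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def gmm_loglik(x, means, variances, weights):
    """SCREAM-style detector output: log-likelihood of one frame under a
    diagonal-covariance Gaussian mixture model for a feature class."""
    x = np.asarray(x)
    comps = [
        np.log(w) - 0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
        for m, v, w in zip(means, variances, weights)
    ]
    return np.logaddexp.reduce(comps)
```

The practical difference is that posteriors from different detectors can be combined directly as probabilities, while likelihoods need a common scale (or conversion via priors) before combination.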
Summer at AFRL - Proposal • Besides acoustic models, new ASR systems for new languages have other needs • An ASR system needs a lexicon mapping phones to words • Normally hand-constructed • Requires time and expertise
Summer at AFRL - Proposal • Our proposal: look at methods of bootstrapping new lexicons from: • Acoustic data • Word-level transcripts • Phonological feature detector outputs • How? • Start by looking at work on deriving Acoustic Sub-Word Units
Summer at AFRL - Proposal • Acoustic Sub-Word Units (ASWUs) • Similar to phones in that they are smaller pieces of words • BUT – automatically derived from acoustics instead of manually defined • Used to derive both a sub-word unit set and a lexicon for that set simultaneously • Research in this area has been mainly to improve ASR performance
Summer at AFRL - Proposal • Can we use these methods along with phonological features as inputs to induce new lexicons? • Using phonological features, the sub-word units may be mappable to standard IPA phone labels
Summer at AFRL - Proposal • The proposed system is inspired by the ASWU approach of Singh et al. (2002) • Notable for not requiring word boundaries to be marked for training • Start with a basic dictionary (including a starting phoneset size) • Train a set of acoustic models on the training data with that dictionary • Alter the basic dictionary in a manner that improves your pronunciations • Repeat until a stopping criterion is reached
Summer at AFRL - Proposal • Start with a basic dictionary • Start with an assumption that the number of phones in a word is related to the number of letters in the orthography • Basic dictionary maps each word to the sequence of letters in that word: ABLE → A B L E, BANNED → B A N N E D
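The letter-based starting dictionary above is a one-liner to build; this tiny sketch (the function name is my own, not from the project) makes the mapping explicit:

```python
def basic_dictionary(words):
    """Map each word to the sequence of letters in its orthography,
    treating each distinct letter as an initial 'phone' label."""
    return {w.upper(): list(w.upper()) for w in words}

# basic_dictionary(["able", "banned"]) gives
# {'ABLE': ['A', 'B', 'L', 'E'], 'BANNED': ['B', 'A', 'N', 'N', 'E', 'D']}
```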
Summer at AFRL - Proposal • Train a set of acoustic models • Using the basic dictionary, map words in the transcript to these “pronunciations” • Train an HMM-model using the output of the feature detectors as its input, and the above mapping as training labels
Summer at AFRL - Proposal • Alter the basic dictionary • Using some metric, find a candidate “phone” to be modified • We’ve looked at a couple of metrics – more on this later • Once the phone is identified, see if the phone should be “split” or “deleted” • A “split” indicates that the given phone label actually represents two different sounds, and so should be replaced with two different phone labels • A “delete” indicates that for a particular word or words the model fits better if that phone label is removed from the pronunciation
Summer at AFRL - Proposal • Split example (E split into E and E1): BE → B E, DEVELOP → D E1 V E1 L O P • Delete examples: ABLE → A B L E becomes ABLE → A B L; ABANDONED → A B A N D O N D (E deleted)
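The split and delete alterations illustrated above amount to enumerating candidate pronunciations; the forced alignment described on the next slide then picks the best one. A minimal sketch, with function names of my own choosing:

```python
from itertools import product

def split_candidates(pron, phone, new_phone):
    """All variants where each occurrence of `phone` is either kept or
    relabelled as `new_phone` (one variant is the original itself)."""
    positions = [i for i, p in enumerate(pron) if p == phone]
    variants = []
    for labels in product([phone, new_phone], repeat=len(positions)):
        variant = list(pron)
        for i, label in zip(positions, labels):
            variant[i] = label
        variants.append(variant)
    return variants

def delete_candidates(pron, phone):
    """Variants with a single occurrence of `phone` removed."""
    return [pron[:i] + pron[i + 1:] for i, p in enumerate(pron) if p == phone]
```

For DEVELOP with the phone E, `split_candidates` yields four variants (E/E, E/E1, E1/E, E1/E1), matching the slide's D E1 V E1 L O P example as one possibility.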
Summer at AFRL - Proposal • For splits, all possible alterations are added to temporary lexicon • For deletes, we alter the HMM to add a possible deletion arc for the phone • After lexicon or HMM is altered, word transcript is force aligned using new possible pronunciations • Best pronunciations are pulled from this alignment and used to build new lexicon • Steps are repeated using the new lexicon in place of the basic lexicon
Summer at AFRL - Proposal • How do we determine the candidate “phone label” to alter? • Initially, we modelled each phone with two Gaussians in the HMM • Compared the two Gaussians to each other using their KL divergence • Took the phone label with the largest KL divergence as the one to alter • The idea was that each Gaussian described a cluster – the further the centers were from each other, the more probable it was that they described two different phones
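For diagonal-covariance Gaussians, the KL divergence used here has a closed form; since KL is asymmetric, comparing two cluster Gaussians is typically done with the symmetrized sum. A sketch under the assumption of diagonal covariances (the slide does not specify the covariance structure):

```python
import numpy as np

def kl_gaussians(mu0, var0, mu1, var1):
    """Closed-form KL divergence D(N0 || N1) for two
    diagonal-covariance Gaussians, summed over dimensions."""
    mu0, var0, mu1, var1 = map(np.asarray, (mu0, var0, mu1, var1))
    return 0.5 * np.sum(
        np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0
    )

def symmetric_kl(mu0, var0, mu1, var1):
    """Symmetrized KL, suitable for ranking phone labels by how far
    apart their two cluster Gaussians sit."""
    return kl_gaussians(mu0, var0, mu1, var1) + kl_gaussians(mu1, var1, mu0, var0)
```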
Summer at AFRL - Proposal • KL-divergence metric did not work well • System would pick candidates that a human would find unreasonable (such as “F” or “Q”) • System would split or delete these phones multiple times, continually returning to the same phone label
Summer at AFRL - Proposal • Why did the KL divergence perform this way? • Suspicion: large variations between the two Gaussians in dimensions that do not matter for that phone pushed up the scores (e.g. vowel features for consonants) • Splitting these phones only allowed the coverage to spread wider, drawing the system back to those phones
Summer at AFRL - Proposal • What next? • Tried a Mahalanobis distance metric, also with poor results • Returned to the acoustic sub-word unit papers for inspiration • Instead of looking at cluster statistics, multiple papers use an average frame likelihood metric for each phone cluster to determine the candidate phone for altering • Have started moving my code to this framework – preliminary passes show promise, but no results quite yet
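The average frame likelihood metric from those papers is straightforward to compute from a forced alignment: average the per-frame log-likelihoods over all frames assigned to each phone label, and take the worst-fitting label as the candidate. A minimal sketch (function names are mine; selecting the minimum-average phone is one plausible reading of "determine candidate phone"):

```python
def avg_frame_loglik(frame_logliks, phone_of_frame):
    """Average per-frame log-likelihood for each phone label, given a
    forced alignment assigning every frame to a phone."""
    totals, counts = {}, {}
    for ll, ph in zip(frame_logliks, phone_of_frame):
        totals[ph] = totals.get(ph, 0.0) + ll
        counts[ph] = counts.get(ph, 0) + 1
    return {ph: totals[ph] / counts[ph] for ph in totals}

def candidate_phone(frame_logliks, phone_of_frame):
    """The phone whose frames fit the model worst (lowest average
    log-likelihood) -- the candidate for splitting or deletion."""
    avgs = avg_frame_loglik(frame_logliks, phone_of_frame)
    return min(avgs, key=avgs.get)
```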
Conclusion – It’s 75 miles to Dayton • Advice for those thinking of doing work at WPAFB • Working in the SCREAM Lab was great • Hundreds of processors, tons of multi-lingual corpora • Friendly people, decent work environment (if a bit dark) • Many hoops to jump through, even just for a summer student • ID badges, computer usage training, etc. • Sometimes feels like you’re working at a corporation… • until the guys in uniform come around • The base is built like a campus crossed with a prison • cinderblock is the building material of choice. • Don’t forget your ID Badge • It’s 75 miles from Columbus to Dayton