This project explores bootstrapping new lexicons using phonological features and ASWUs for ASR systems in multilingual environments. The proposal aims to derive lexicons from acoustic data, word-level transcripts, and phonological feature detector outputs for improved ASR performance. Leveraging techniques from previous works, the system starts with a basic dictionary and iteratively refines pronunciations utilizing feature detectors and automatic metrics for splitting or deleting phones.
What I did on my Summer “Vacation” Jeremy Morris 10/06/2006
Summer at AFRL - DAGSI • AFRL • Air Force Research Labs • Wright-Patterson AFB, Dayton OH • DAGSI Student/Faculty Research Fellowship program • Dayton Area Graduate Studies Institute • Effort to encourage collaboration between Ohio universities and AFRL
Summer at AFRL – SCREAM Lab • SCREAM Lab • Speech and Communication Research, Engineering, Analysis and Modeling Lab • Interest in a wide variety of speech research issues for the military • Speech-to-speech translation, rapid development of speech recognition systems, etc.
Summer at AFRL – Why us? • SCREAM Lab members were interested in collaborating with OSU • SCREAM Lab working on research in using phonological features in speech recognition • Perceived overlap with ASAT project
Review – Phonological Features • For the ASAT Project, we have been using phonological feature detectors • We train detectors on a particular phonological feature • e.g. manner or place for consonant, height, frontness, etc. for vowels • We then combine these features together for ASR purposes
Phonological Features (cont.) • SCREAM Lab very interested in phonological feature detectors • Need for quick development of new ASR systems for new languages • A full set of phonological feature detectors would allow reuse of acoustic data for training across new languages • Multi-lingual detectors are clearly needed to get full coverage of all features
Phonological Features (cont.) • Our phonological feature detectors • Monolingual (English only) • Trained using a set of multi-layer perceptron neural networks • Output a set of phonological feature class probabilities • SCREAM lab feature detectors • Monolingual and multilingual • Trained using Gaussian Mixture Models • Output a set of likelihoods • Based on work by Tanja Schultz (CMU)
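The contrast on this slide (MLP posteriors vs. GMM likelihoods) can be made concrete with a minimal sketch. The function names and shapes here are illustrative assumptions, not the actual OSU or SCREAM Lab code: an MLP-style detector ends in a softmax over feature classes, while a GMM-style detector scores a frame with a mixture log-likelihood.

```python
import numpy as np

def mlp_posteriors(logits):
    """OSU-style detector output: softmax over phonological feature
    classes, giving a probability for each class (sums to 1)."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def gmm_loglik(x, means, variances, weights):
    """SCREAM-style detector output: log-likelihood of one frame under a
    diagonal-covariance Gaussian mixture model for a feature class."""
    x = np.asarray(x)
    comps = [
        np.log(w) - 0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
        for m, v, w in zip(means, variances, weights)
    ]
    return np.logaddexp.reduce(comps)
```

The practical difference is that posteriors from different detectors can be combined directly as probabilities, while likelihoods need a common scale (or conversion via priors) before combination.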
Summer at AFRL - Proposal • Besides acoustic models, new ASR systems for new languages have other needs • An ASR system needs a lexicon mapping phones to words • Normally hand-constructed • Requires time and expertise
Summer at AFRL - Proposal • Our proposal: look at methods of bootstrapping new lexicons from: • Acoustic data • Word-level transcripts • Phonological feature detector outputs • How? • Start by looking at work on deriving Acoustic Sub-Word Units
Summer at AFRL - Proposal • Acoustic Sub-Word Units (ASWUs) • Similar to phones in that they are smaller pieces of words • BUT – automatically derived from acoustics instead of manually defined • Used to derive both a sub-word unit set and a lexicon for that set simultaneously • Research in this area has been mainly to improve ASR performance
Summer at AFRL - Proposal • Can we use these methods along with phonological features as inputs to induce new lexicons? • Using phonological features, the sub-word units may be mappable to standard IPA phone labels
Summer at AFRL - Proposal • The proposed system is inspired by the ASWU approach of Singh et al. (2002) • Notable for not requiring word boundaries to be marked for training • Start with a basic dictionary (including a starting phoneset size) • Train a set of acoustic models on the training data with that dictionary • Alter the basic dictionary in a manner that improves your pronunciations • Repeat until a stopping criterion is reached
Summer at AFRL - Proposal • Start with a basic dictionary • Start with an assumption that the number of phones in a word is related to the number of letters in the orthography • Basic dictionary maps each word to the sequence of letters in that word: ABLE → A B L E, BANNED → B A N N E D
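The letter-based starting dictionary above is a one-liner to build; this tiny sketch (the function name is my own, not from the project) makes the mapping explicit:

```python
def basic_dictionary(words):
    """Map each word to the sequence of letters in its orthography,
    treating each distinct letter as an initial 'phone' label."""
    return {w.upper(): list(w.upper()) for w in words}

# basic_dictionary(["able", "banned"]) gives
# {'ABLE': ['A', 'B', 'L', 'E'], 'BANNED': ['B', 'A', 'N', 'N', 'E', 'D']}
```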
Summer at AFRL - Proposal • Train a set of acoustic models • Using the basic dictionary, map words in the transcript to these “pronunciations” • Train an HMM-model using the output of the feature detectors as its input, and the above mapping as training labels
Summer at AFRL - Proposal • Alter the basic dictionary • Using some metric, find a candidate “phone” to be modified • We’ve looked at a couple of metrics – more on this later • Once the phone is identified, see if the phone should be “split” or “deleted” • A “split” indicates that the given phone label actually represents two different sounds, and so should be replaced with two different phone labels • A “delete” indicates that for a particular word or words the model fits better if that phone label is removed from the pronunciation
Summer at AFRL - Proposal • Split example (E split into E and E1): BE → B E, DEVELOP → D E1 V E1 L O P • Delete examples: ABLE → A B L E becomes ABLE → A B L; ABANDONED → A B A N D O N D (E deleted)
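The split and delete alterations illustrated above amount to enumerating candidate pronunciations; the forced alignment described on the next slide then picks the best one. A minimal sketch, with function names of my own choosing:

```python
from itertools import product

def split_candidates(pron, phone, new_phone):
    """All variants where each occurrence of `phone` is either kept or
    relabelled as `new_phone` (one variant is the original itself)."""
    positions = [i for i, p in enumerate(pron) if p == phone]
    variants = []
    for labels in product([phone, new_phone], repeat=len(positions)):
        variant = list(pron)
        for i, label in zip(positions, labels):
            variant[i] = label
        variants.append(variant)
    return variants

def delete_candidates(pron, phone):
    """Variants with a single occurrence of `phone` removed."""
    return [pron[:i] + pron[i + 1:] for i, p in enumerate(pron) if p == phone]
```

For DEVELOP with the phone E, `split_candidates` yields four variants (E/E, E/E1, E1/E, E1/E1), matching the slide's D E1 V E1 L O P example as one possibility.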
Summer at AFRL - Proposal • For splits, all possible alterations are added to temporary lexicon • For deletes, we alter the HMM to add a possible deletion arc for the phone • After lexicon or HMM is altered, word transcript is force aligned using new possible pronunciations • Best pronunciations are pulled from this alignment and used to build new lexicon • Steps are repeated using the new lexicon in place of the basic lexicon
Summer at AFRL - Proposal • How do we determine the candidate “phone label” to alter? • Initially, we modelled each phone with two Gaussians in the HMM • Compared the two Gaussians to each other using their KL divergence • Took the phone label with the largest KL divergence as the one to alter • The idea was that each Gaussian described a cluster – the further the centers were from each other, the more probable it was that they described two different phones
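For diagonal-covariance Gaussians, the KL divergence used here has a closed form; since KL is asymmetric, comparing two cluster Gaussians is typically done with the symmetrized sum. A sketch under the assumption of diagonal covariances (the slide does not specify the covariance structure):

```python
import numpy as np

def kl_gaussians(mu0, var0, mu1, var1):
    """Closed-form KL divergence D(N0 || N1) for two
    diagonal-covariance Gaussians, summed over dimensions."""
    mu0, var0, mu1, var1 = map(np.asarray, (mu0, var0, mu1, var1))
    return 0.5 * np.sum(
        np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0
    )

def symmetric_kl(mu0, var0, mu1, var1):
    """Symmetrized KL, suitable for ranking phone labels by how far
    apart their two cluster Gaussians sit."""
    return kl_gaussians(mu0, var0, mu1, var1) + kl_gaussians(mu1, var1, mu0, var0)
```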
Summer at AFRL - Proposal • KL-divergence metric did not work well • System would pick candidates that a human would find unreasonable (such as “F” or “Q”) • System would split or delete these phones multiple times, continually returning to the same phone label
Summer at AFRL - Proposal • Why did the KL divergence perform this way? • Suspicion: large variations between the two Gaussians in dimensions that do not matter for that phone pushed up the scores (e.g. vowel features for consonants) • Splitting these phones only allowed the coverage to spread wider, drawing the system back to those phones
Summer at AFRL - Proposal • What next? • Tried a Mahalanobis distance metric, also with poor results • Returned to the acoustic sub-word unit papers for inspiration • Instead of looking at cluster statistics, multiple papers use an average frame likelihood metric for each phone cluster to determine the candidate phone for altering • Have started moving my code to this framework – preliminary passes show promise, but no results quite yet
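The average frame likelihood metric from those papers is straightforward to compute from a forced alignment: average the per-frame log-likelihoods over all frames assigned to each phone label, and take the worst-fitting label as the candidate. A minimal sketch (function names are mine; selecting the minimum-average phone is one plausible reading of "determine candidate phone"):

```python
def avg_frame_loglik(frame_logliks, phone_of_frame):
    """Average per-frame log-likelihood for each phone label, given a
    forced alignment assigning every frame to a phone."""
    totals, counts = {}, {}
    for ll, ph in zip(frame_logliks, phone_of_frame):
        totals[ph] = totals.get(ph, 0.0) + ll
        counts[ph] = counts.get(ph, 0) + 1
    return {ph: totals[ph] / counts[ph] for ph in totals}

def candidate_phone(frame_logliks, phone_of_frame):
    """The phone whose frames fit the model worst (lowest average
    log-likelihood) -- the candidate for splitting or deletion."""
    avgs = avg_frame_loglik(frame_logliks, phone_of_frame)
    return min(avgs, key=avgs.get)
```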
Conclusion – It’s 75 miles to Dayton • Advice for those thinking of doing work at WPAFB • Working in the SCREAM Lab was great • Hundreds of processors, tons of multi-lingual corpora • Friendly people, decent work environment (if a bit dark) • Many hoops to jump through, even just for a summer student • ID badges, computer usage training, etc. • Sometimes feels like you’re working at a corporation… • until the guys in uniform come around • The base is built like a campus crossed with a prison • cinderblock is the building material of choice. • Don’t forget your ID Badge • It’s 75 miles from Columbus to Dayton