Language-Independent Phone Recognition Jui-Ting Huang, Mark Hasegawa-Johnson jhuang29@illinois.edu University of Illinois at Urbana-Champaign
Motivation • A*STAR challenge • Audio task: audio retrieval given IPA queries or waveforms • Need to transcribe the database/queries • Multilingual database and queries • We may encounter unseen (untrained) languages (Tamil, Malay…) • Therefore: phone-based recognition instead of word-based recognition
Training data • 10 languages, 11 corpora • Arabic, Croatian, English, Japanese, Mandarin, Portuguese, Russian, Spanish, Turkish, Urdu • 95 hours of speech • Sampled from a larger set of corpora • Mixed styles of speech: broadcast, read, and spontaneous
Phone set • Phonetic symbols: Worldbet • An ASCII encoding of the IPA, plus additional symbols for multiple languages • Convenient for use with HTK • 205 phones in total • 196 distinct phones from the 10 languages • Non-speech “phones”: • vocalic pause, nasalized pause, short pause, silence, noise, comma, period, question mark
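The Worldbet-to-IPA relationship above can be sketched as a lookup table. This is a minimal illustration with only a handful of symbols I am confident about; the full Worldbet table covers many more, and the slide's actual inventory is not reproduced here.

```python
# Illustrative subset of Worldbet-to-IPA correspondences (not the full
# 205-phone inventory from the slides).
WORLDBET_TO_IPA = {
    "A": "ɑ",    # open back unrounded vowel
    "E": "ɛ",    # open-mid front unrounded vowel
    "S": "ʃ",    # voiceless postalveolar fricative
    "tS": "tʃ",  # voiceless postalveolar affricate
    "N": "ŋ",    # velar nasal
}

def worldbet_to_ipa(symbols):
    """Map a list of Worldbet symbols to IPA, keeping unknowns as-is."""
    return [WORLDBET_TO_IPA.get(s, s) for s in symbols]

print(worldbet_to_ipa(["tS", "A", "N"]))  # ['tʃ', 'ɑ', 'ŋ']
```

The ASCII-only left-hand side is what makes the symbols safe to use directly in HTK label and dictionary files.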
Acoustic model • Context-dependent triphone modeling • cross-word triphones • Punctuation marks and lexical stress are also treated as context • Language diacritics are created for each triphone • In total, 141,530 distinct triphones • Examples: ^’-A+b%ted ^’-A+b’%ted >A+cm%cmn … • Spectral features: 39-dim PLP, cepstral mean/variance normalization per speaker • Modeling: HMMs with {11,13,15,17}-mixture Gaussians
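The example labels above look like HTK-style triphones. As a sketch only, assuming the standard HTK L-C+R convention and that a trailing "%tag" is the language diacritic mentioned on this slide (neither assumption is confirmed by the slides), a label could be parsed like this:

```python
# Hedged sketch: parse an assumed HTK-style triphone label "L-C+R%lang"
# into its parts. The "%lang" diacritic interpretation is a guess.
def parse_triphone(label):
    """Split a label into (left, center, right, lang); missing parts are None."""
    lang = None
    if "%" in label:
        label, lang = label.rsplit("%", 1)  # assumed language diacritic
    left = right = None
    if "-" in label:
        left, label = label.split("-", 1)
    if "+" in label:
        label, right = label.split("+", 1)
    return left, label, right, lang

print(parse_triphone("^'-A+b%ted"))  # ("^'", 'A', 'b', 'ted')
```

Under this reading, stress marks and punctuation symbols simply appear inside the context fields, which is how they can act as extra context for clustering.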
Acoustic model (triphone clustering I) • State tying for triphone models • ensures that all state distributions can be robustly estimated • acoustically similar states of these triphones are clustered and tied • Number of states: 424,573 -> 19,485 [4.6% of the total] • Decision-tree-based clustering • Questions ask about the left and right contexts of each triphone • Each question splits the pooled triphones into two acoustically different subsets
Acoustic model (triphone clustering II) • Categories for decision-tree questions • Right or left context • Distinctive phone features (manner/place of articulation) • Language identity • Lexical stress • Punctuation mark
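A single decision-tree question of the kind listed above can be sketched as a two-way partition of a triphone pool. The nasal class below is a made-up illustration; in real tree building (e.g. HTK's tied-state clustering), the question chosen at each node is the one maximizing the likelihood gain of the split.

```python
# Sketch of one context question: "is the left context a nasal?"
# The phone class is illustrative, not the system's actual question set.
NASALS = {"m", "n", "N"}  # Worldbet nasal consonants (illustrative)

def split_by_left_context(triphones, phone_class):
    """Partition (left, center, right) triples by left-context membership."""
    yes, no = [], []
    for tri in triphones:
        (yes if tri[0] in phone_class else no).append(tri)
    return yes, no

pool = [("m", "A", "t"), ("s", "A", "t"), ("n", "A", "k")]
yes, no = split_by_left_context(pool, NASALS)
print(len(yes), len(no))  # 2 1
```

Questions about language identity, stress, or punctuation work the same way: each is just a yes/no membership test on some field of the context.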
Language model • Triphone bigram language model • equivalent to a monophone quad-gram • Language-independent model • pool the phone-level transcriptions from all corpora together • Vocabulary size: top 60K most frequent triphones (since the full 140K is too many!) • The remaining infrequent triphones are mapped back to their center monophones
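The vocabulary cutoff described above can be sketched as follows: keep the top-K most frequent triphone tokens and replace the rest with their center monophone before training the bigram. The `center_of` helper reuses the assumed L-C+R label convention; K is 60K in the slides but tiny here for the demo.

```python
from collections import Counter

def center_of(triphone):
    """Extract the center monophone from an assumed L-C+R(%lang) label."""
    label = triphone.split("%", 1)[0]  # drop assumed language diacritic
    if "-" in label:
        label = label.split("-", 1)[1]
    if "+" in label:
        label = label.split("+", 1)[0]
    return label

def apply_vocab_cutoff(tokens, top_k):
    """Keep the top_k most frequent tokens; back off the rest to monophones."""
    counts = Counter(tokens)
    keep = {t for t, _ in counts.most_common(top_k)}
    return [t if t in keep else center_of(t) for t in tokens]

toks = ["s-A+t", "s-A+t", "m-i+n"]
print(apply_vocab_cutoff(toks, top_k=1))  # ['s-A+t', 's-A+t', 'i']
```

Bigram counts over the resulting mixed triphone/monophone stream then give the language-independent LM.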
Recognition results • Test set: 50 sentences per corpus
Future work • Preparation of training data • unify non-speech tags across corpora • add more training data • For the language-independent task • language model: interpolation between language-specific LMs • For the language-dependent task • multilingual AM + language-specific LM (word-level recognition)
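The LM interpolation mentioned under future work is just a weighted mixture of language-specific probabilities. A minimal sketch, with made-up placeholder weights; in practice the weights would be tuned on held-out data (e.g. with EM):

```python
# Sketch of linear LM interpolation: P(w|h) = sum_i lambda_i * P_i(w|h).
# The probabilities and weights below are illustrative placeholders.
def interpolate(probs, weights):
    """Combine per-language LM probabilities with weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(lam * p for lam, p in zip(weights, probs))

# e.g. two language-specific LMs assign different probabilities
p = interpolate([0.10, 0.30], [0.6, 0.4])
print(round(p, 2))  # 0.18
```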