Language-Independent Phone Recognition

Jui-Ting Huang, Mark Hasegawa-Johnson
jhuang29@illinois.edu
University of Illinois at Urbana-Champaign

Motivation
• A*STAR challenge
• Audio task: audio retrieval given IPA queries or waveforms
• Need to transcribe the database and the queries
• Multilingual database and queries
• We might encounter unseen (untrained) languages (Tamil, Malay…)
• Phone-based recognition instead of word-based recognition

Training data
• 10 languages, 11 corpora
  • Arabic, Croatian, English, Japanese, Mandarin, Portuguese, Russian, Spanish, Turkish, Urdu
• 95 hours of speech
  • Sampled from a larger set of corpora
• Mixed styles of speech: broadcast, read, and spontaneous

Summary of corpora

Phone set
• Phonetic symbols: Worldbet
  • An ASCII encoding of the IPA, plus additional symbols to cover multiple languages
  • Convenient to use with HTK
• 205 phones in total
  • 196 distinct phones from the 10 languages
  • Non-speech “phones”: vocalic pause, nasalized pause, short pause, silence, noise, comma, period, question mark

IPA chart (consonants)

IPA chart (vowels)

Worldbet chart (consonants)

Worldbet chart (vowels)

Acoustic model
• Context-dependent triphone modeling
  • Cross-word triphones
  • Punctuation marks and lexical stress are also treated as context
  • Language diacritics are created for each triphone
  • Example triphone labels: ^’-A+b%ted, ^’-A+b’%ted, >A+cm%cmn, …
  • In total, 141530 distinct triphones
• Spectral features: 39-dimensional PLP, with cepstral mean and variance normalization per speaker
• Modeling: HMMs with Gaussian mixtures of {11, 13, 15, 17} components
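The cross-word triphone expansion can be illustrated with a minimal sketch. It assumes HTK's `left-center+right` label notation, which the example labels on this slide appear to follow. The function name `to_triphones`, the optional `%`-suffixed tag, and the sample phone strings are hypothetical, and the stress and punctuation contexts used in the actual system are not modeled.

```python
def to_triphones(phones, tag=None):
    """Expand a phone sequence into HTK-style l-c+r triphone labels.

    `tag` is an optional suffix appended after '%'. This mimics the shape
    of the example labels, but its exact meaning is an assumption here.
    """
    labels = []
    for i, center in enumerate(phones):
        label = center
        if i > 0:                      # add left context if one exists
            label = f"{phones[i - 1]}-{label}"
        if i < len(phones) - 1:        # add right context if one exists
            label = f"{label}+{phones[i + 1]}"
        if tag:
            label = f"{label}%{tag}"
        labels.append(label)
    return labels


# Cross-word expansion: the words are concatenated before expansion,
# so the context of a word-final phone reaches into the next word.
words = [["h", "E", "l", "oU"], ["w", "3r", "l", "d"]]
flat = [p for word in words for p in word]
print(to_triphones(flat, tag="eng"))
```

With word-internal triphones the expansion would instead be applied to each word separately, so `oU` in the first word would have no right context.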

Acoustic model (triphone clustering I)
• State tying for triphone models
  • To ensure that all state distributions can be robustly estimated, acoustically similar states of these triphones are clustered and tied
  • Number of states: 424573 -> 19485 (4.6% of the original total)
• Decision-tree-based clustering
  • Questions are asked about the left and right contexts of each triphone
  • Each question splits the pooled triphones into two acoustically different subsets
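A single split in decision-tree clustering can be sketched as follows. The slides do not state the splitting criterion, so this sketch assumes the likelihood-gain criterion commonly used for tree-based state tying: every candidate question divides the pooled states in two, and the question that most increases the log-likelihood of the data under single-Gaussian models is chosen. For brevity the Gaussians are one-dimensional rather than 39-dimensional, and the state statistics and question names are invented for illustration.

```python
import math


def pooled_loglik(states):
    """Log-likelihood of all frames under one ML-fitted Gaussian (1-D)."""
    n = sum(s["n"] for s in states)
    mean = sum(s["n"] * s["mean"] for s in states) / n
    second = sum(s["n"] * (s["var"] + s["mean"] ** 2) for s in states) / n
    var = second - mean ** 2
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def best_question(states, questions):
    """Return (question name, likelihood gain) for the best left-context split."""
    base = pooled_loglik(states)
    best = None
    for name, phone_set in questions.items():
        yes = [s for s in states if s["left"] in phone_set]
        no = [s for s in states if s["left"] not in phone_set]
        if not yes or not no:          # the question does not split the pool
            continue
        gain = pooled_loglik(yes) + pooled_loglik(no) - base
        if best is None or gain > best[1]:
            best = (name, gain)
    return best


# Hypothetical statistics for one state of center phone A in four left contexts:
# n = frame count, mean/var = parameters of the state's Gaussian.
states = [
    {"left": "m", "n": 100, "mean": 1.0, "var": 0.5},
    {"left": "n", "n": 80, "mean": 1.2, "var": 0.5},
    {"left": "p", "n": 120, "mean": -1.0, "var": 0.5},
    {"left": "t", "n": 90, "mean": -0.8, "var": 0.5},
]
questions = {"L_Nasal": {"m", "n"}, "L_Bilabial": {"m", "p"}}
print(best_question(states, questions))
```

In this toy data the nasal question wins because it separates the two groups of means. A full tree applies the same step recursively to each subset until the gain or the frame count falls below a threshold.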

Acoustic model (triphone clustering II)
• Categories for decision tree questions
  • Right or left context
  • Distinctive phone features (manner/place of articulation)
  • Language identity
  • Lexical stress
  • Punctuation mark
• Example triphone labels: ^’-A+b%ted, ^’-A+b’%ted, >A+cm%cmn, …

Language model
• Triphone bigram language model
  • Equivalent to a monophone 4-gram
• Language-independent model
  • Phone-level transcriptions from all corpora are pooled together
• Vocabulary: the 60K most frequent triphones (the full set of about 140K is too large)
  • The remaining infrequent triphones are mapped back to their center monophones
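The vocabulary limit and the back-off to center monophones can be sketched as below. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, the labels are plain `l-c+r` strings without stress, punctuation, or language marks, and the toy vocabulary size of 2 stands in for 60K.

```python
from collections import Counter


def center_phone(triphone):
    """Strip the left and right context from an l-c+r label."""
    core = triphone.split("-", 1)[-1]   # drop "l-" if present
    return core.split("+", 1)[0]        # drop "+r" if present


def limit_vocabulary(sentences, max_size):
    """Keep the max_size most frequent triphones; map the rest to monophones."""
    counts = Counter(t for sent in sentences for t in sent)
    keep = {t for t, _ in counts.most_common(max_size)}
    return [[t if t in keep else center_phone(t) for t in sent]
            for sent in sentences]


def bigram_counts(sentences):
    """Count adjacent token pairs, the raw statistics for a bigram LM."""
    pairs = Counter()
    for sent in sentences:
        pairs.update(zip(sent, sent[1:]))
    return pairs


sentences = [
    ["a+b", "a-b+c", "b-c"],
    ["a+b", "a-b+c", "b-c+d", "c-d"],
]
limited = limit_vocabulary(sentences, max_size=2)
print(limited)
print(bigram_counts(limited))
```

A bigram over triphones such as `a-b+c` followed by `b-c+d` spans the four phones a, b, c, d, which is why it matches a monophone 4-gram in context width.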

Recognition results
• Test set: 50 sentences per corpus

Future work
• Preparation of training data
  • Unify non-speech tags across corpora
  • Add more training data
• For the language-independent task
  • Language model: interpolation between language-specific LMs
• For the language-dependent task
  • Multilingual AM + language-specific LM (word-level recognition)
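One common way to interpolate language-specific LMs is linear interpolation, written out below for the bigram case. The slides do not say which form is intended, so this is an assumption. The symbols are introduced here for illustration: $P_\ell$ is the bigram model trained on language $\ell$, $\lambda_\ell$ is its weight, $L$ is the number of languages, and $t_i$ is the $i$-th triphone token.

```latex
P_{\text{interp}}(t_i \mid t_{i-1})
  = \sum_{\ell=1}^{L} \lambda_\ell \, P_\ell(t_i \mid t_{i-1}),
\qquad \lambda_\ell \ge 0, \quad \sum_{\ell=1}^{L} \lambda_\ell = 1
```

The weights are typically tuned on held-out data. Equal weights give each language the same influence regardless of how much training text it contributes, unlike simply pooling the transcriptions.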
