Language Modeling using PLSA-Based Topic HMM
Atsushi Sako, Tetsuya Takiguchi and Yasuo Ariki
Department of System and Engineering, Kobe University
Interspeech 2007, August 27-31, 2007, Antwerp, Belgium
Motivations
• Background
  • Large volumes of multimedia content require automatic meta-data extraction, for which sophisticated ASR is important
• Domain
  • Sports-related live speech (especially baseball)
  • Commentary on the radio
• Approach
  • Topic-based approach
Approach
• Topic-based approach
  • Several topic-based n-gram models
  • Transition probabilities between the n-gram models
• Related studies
  • Probabilistic Latent Semantic Analysis (PLSA) [Hofmann '99]
  • PLSA-based language model [Gildea '99]
PLSA-based language model
• Probabilistic Latent Semantic Analysis decomposes the observed word distributions as
  P(w|d) = Σz P(w|z) P(z|d)
  where P(w|d) is the observed word distribution per document, P(w|z) the word distribution per latent topic, and P(z|d) the topic distribution per document (a minimal EM sketch follows below)
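To make the decomposition concrete, here is a minimal EM sketch of PLSA. This is an illustration only, not the authors' implementation; the function name `plsa` and all variable names are assumptions.

```python
# Minimal PLSA sketch (hypothetical helper, not the paper's code):
# EM estimation of P(w|z) and P(z|d) from a word-document count matrix.
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """counts: (n_words, n_docs) co-occurrence counts n(w, d)."""
    rng = np.random.default_rng(seed)
    n_words, n_docs = counts.shape
    p_w_z = rng.random((n_words, n_topics))        # P(w|z)
    p_z_d = rng.random((n_topics, n_docs))         # P(z|d)
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z|w,d) proportional to P(w|z) P(z|d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]     # (W, Z, D)
        post = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate from expected counts n(w,d) P(z|w,d)
        expected = counts[:, None, :] * post              # (W, Z, D)
        p_w_z = expected.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = expected.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d
```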
PLSA-based language model: unigram rescaling
• PLSA can estimate only unigram probabilities, so the PLSA unigram is combined with a trigram by unigram rescaling (see the sketch below)
• P(z|d) is estimated from the recognized history, based on the PLSA decomposition
• Drawback: this cannot consider topic transitions
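A hedged sketch of unigram rescaling in the style of Gildea and Hofmann: the trigram P(w|h) is scaled by the ratio of the PLSA unigram P(w|d) to the background unigram P(w) and renormalized over the vocabulary. Function and variable names are illustrative, and some formulations also raise the ratio to a tuning exponent, which is omitted here.

```python
import numpy as np

def unigram_rescale(p_w_given_h, p_w_unigram, p_w_z, p_z_d):
    """Rescale trigram probabilities P(w|h) by the PLSA unigram.

    p_w_given_h : (V,)   trigram probabilities P(w|h) for one history h
    p_w_unigram : (V,)   background unigram P(w)
    p_w_z       : (V, Z) PLSA word distributions P(w|z)
    p_z_d       : (Z,)   topic distribution P(z|d) for the current context
    """
    p_w_plsa = p_w_z @ p_z_d                  # P(w|d) = sum_z P(w|z) P(z|d)
    scores = p_w_given_h * p_w_plsa / p_w_unigram
    return scores / scores.sum()              # renormalize over the vocabulary
```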
Baseball live speech
• Features
  • The same topics recur: pitching, ball counts, chat, etc.
  • Utterances follow a similar order, since the speech depends on the progress of the game
  • For example: describing the batter → pitching → pitching results → chat with the commentator → …
Proposed Method
• Differences from the history-based approach
  • Topic transition probabilities are considered
  • The recognition history is not used to estimate P(z|d)
• How to estimate P(z|d)? With a Topic HMM, which provides
  • a typical topic distribution per state
  • topic transition probabilities
Topic HMM
• An ergodic HMM whose states output topic distributions
• Each state carries a typical topic distribution, a state-dependent unigram probability, and a topic-based trigram; transition probabilities connect the states (a data-structure sketch follows below)
• How to build it?
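As a rough picture of what each state carries, a minimal data-structure sketch; the class and field names are assumptions, not from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TopicHMMState:
    """One state of the ergodic Topic HMM (illustrative names only)."""
    mean_topic_dist: np.ndarray   # typical topic distribution P(z|s), shape (Z,)
    unigram: np.ndarray           # state-dependent unigram P(w|s), shape (V,)
    trigram: dict                 # state-dependent topic-based trigram P(w|h,s)

@dataclass
class TopicHMM:
    states: list                  # list of TopicHMMState
    trans: np.ndarray             # transition probabilities A[i, j] = P(s_j | s_i)
```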
Topic HMM: decomposition by PLSA
• The word-utterance co-occurrence probability matrix is decomposed by PLSA into
  • a word distribution per latent topic, P(w|z)
  • a topic distribution per utterance, P(z|d)
• Because the utterances form a time sequence, the topic distributions per utterance form a time sequence as well (see the snippet below)
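Continuing the `plsa` sketch above: if each training utterance is one column of the co-occurrence matrix, the columns of the estimated P(z|d) directly give the time sequence of topic vectors. Here `word_utt_counts` is a hypothetical (vocabulary × utterances) count matrix, and the topic count of 30 is just a value in the range the experiments explore.

```python
# word_utt_counts: (V, T) word-utterance co-occurrence counts (hypothetical)
p_w_z, p_z_d = plsa(word_utt_counts, n_topics=30)
# One (Z,) topic-distribution vector per utterance, in time order.
topic_sequence = [p_z_d[:, t] for t in range(p_z_d.shape[1])]
```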
Topic HMM: learning process
• Each topic distribution is a vector in topic space
• Similar distributions are grouped into clusters
• The HMM is learned with each cluster as a state
• The mean vector of a state's cluster serves as its typical topic distribution (a clustering sketch follows below)
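The slide does not name a clustering algorithm; as one plausible reading, k-means over the per-utterance topic vectors followed by transition counting along the time sequence would yield the state means and the ergodic transition matrix. All names, the state count, and the add-one smoothing are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the per-utterance topic vectors; each cluster becomes an HMM state,
# its centroid the state's typical topic distribution. (Centroids of vectors
# that each sum to 1 still sum to 1, so they remain valid distributions.)
X = np.stack(topic_sequence)                   # (T, Z)
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
state_means = km.cluster_centers_              # typical topic distribution per state

# Estimate ergodic transition probabilities by counting state changes
# between consecutive utterances.
n_states = km.n_clusters
trans = np.ones((n_states, n_states))          # add-one smoothing (assumption)
for a, b in zip(km.labels_[:-1], km.labels_[1:]):
    trans[a, b] += 1
trans /= trans.sum(axis=1, keepdims=True)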
Brief summary
• The goal is P(w|h,d), obtained by unigram rescaling of the trigram with the PLSA unigram over latent topics
• Instead of the recognition history, the Topic HMM supplies P(z|d) as the mean vector of the current state (the combined formula is reconstructed below)
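Putting the pieces together, a hedged LaTeX reconstruction of the combined model implied by the slides: the trigram is rescaled by the PLSA unigram, with the topic distribution supplied by the current Topic HMM state s rather than the recognition history.

```latex
% Reconstruction from the slide's components, not copied from the paper:
P(w \mid h, d) \;\propto\; P(w \mid h)\,
  \frac{\sum_{z} P(w \mid z)\, P(z \mid s)}{P(w)}
```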
Experiments
Corpus
• Commentary speech on the radio
• Vocabulary size: 3k
• Amount of data: 80k words for the language model, 9k utterances for the Topic HMM
• Unit of an utterance
  • Training set: segmented by periods (sentence boundaries)
  • Test set: segmented by signal level and zero-crossings
Experimental conditions
• Language model (baseline)
  • Trigram; vocabulary size: 3k; training data: 80k words
• Topic HMM
  • Training data: 9k utterances
  • Dimension of the topic distribution (number of latent topics): variable parameter
  • Number of HMM states: variable parameter
  • Experiments were run while varying these parameters
Experimental conditions
• Acoustic model: HMM, adapted by MLLR+MAP
Experimental results
• Results in word accuracy
• [Figure: word accuracy plotted against the number of latent topics of PLSA and the number of Topic HMM states (10-60); legend: transition probability, typical topic distribution; accuracy improves from 67.8% to 69.9%]
Summary
• Proposed the Topic HMM
  • Typical topic distributions as states
  • Topic transition probabilities
• Improved word accuracy from 66.5% to 69.9% (+3.4 points)
  • The contribution of topic transitions was 0.5 to 1%
• Future work
  • Automatic determination of the Topic HMM topology
  • Experiments in other task domains
Effect of transition probability
• [Figure: word accuracy (66.5-70.5%) vs. topic transition weight, for #T=15, #T=30 and #T=60]
• The fewer the states, the more effective the transition probability
• Improvement of about 0.5 to 1.0%
Prospect
• Example of improvement: is it the effect of the LM transition probability?
  • Previous utterance: threw the first ball to the batter
  • Trigram: "Tamura rin" (meaning "name and average")
  • Proposed: "Karaburi" (meaning "strike out", a swing and a miss)