Features for Factored Language Models for Code-Switching Speech
Heike Adel, Katrin Kirchhoff, Dominic Telaar, Ngoc Thang Vu, Tim Schlippe, Tanja Schultz
Outline
• Introduction
  • Motivation
  • SEAME Corpus
  • Main Contributions
• Factored Language Models
• Features for Code-Switching Speech
• Experiments
• Conclusion
  • Summary
Motivation
• Code-Switching (CS) = speech with more than one language
• Occurs in multilingual communities and among immigrants
• Challenges: multilingual models and CS training data are necessary
SEAME Corpus
• SEAME = South East Asia Mandarin-English
• Conversational speech, recorded from Singaporean and Malaysian speakers by [1]
• Originally used in the research project "Code-Switch" (NTU and KIT)
• Challenges:
  • frequent CS per utterance (2.6 switches on average)
  • short monolingual segments (mostly less than 1 sec, 2-4 words)
  • little LM training data (575k words)
[Chart: language distribution of the corpus (English, Mandarin, particles, other)]
[1] Lyu, D.C. et al., 2010
Main Contributions
• Investigation of different features for Code-Switching speech
• Integration of factored language models into a dynamic one-pass decoder
Factored Language Models (FLMs) [2]
• Idea: a word is a bundle of features (factors)
• Well suited e.g. in the case of
  • rich morphology
  • little training data => applicable to the CS task
• Generalized backoff: any factor may be dropped, not just the most distant one, e.g. F | F1 F2 F3 backs off in parallel to F | F1 F2, F | F2 F3 and F | F1 F3, then to F | F1, F | F2 and F | F3, and finally to F
[2] Bilmes, J. and Kirchhoff, K., 2003
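To make the generalized backoff concrete, here is a minimal Python sketch (not the SRILM implementation; unsmoothed maximum-likelihood estimates stand in for the discounting a real FLM would use). When the full factor context is too rare, each factor is dropped in turn and the child estimates are averaged, which is just one of several possible combination strategies:

```python
from collections import defaultdict

# Illustrative sketch of generalized parallel backoff (toy count tables,
# unsmoothed ML estimates). In an FLM a word is a bundle of factors
# (word, POS tag, cluster, ...); when the full context (F1, F2, F3) is too
# rare, ANY factor may be dropped, and the backoff paths are combined.

counts = defaultdict(int)          # counts[(context, f)] -> occurrences
context_counts = defaultdict(int)  # context_counts[context] -> occurrences

def ml_prob(f, context):
    """Maximum-likelihood estimate of P(f | context), or None if unseen."""
    if context_counts[context] == 0:
        return None
    return counts[(context, f)] / context_counts[context]

def backoff_prob(f, context, min_count=1):
    """P(f | context) with generalized backoff: if the context is too rare,
    drop each factor in turn and average the child estimates (mean
    combination is one of several strategies an FLM toolkit offers)."""
    if context_counts[context] >= min_count:
        return ml_prob(f, context)
    if not context:                                # reached the unigram node
        return ml_prob(f, ())
    children = [context[:i] + context[i + 1:] for i in range(len(context))]
    probs = [backoff_prob(f, c, min_count) for c in children]
    probs = [p for p in probs if p is not None]
    return sum(probs) / len(probs) if probs else None

# Toy data: observe two factored contexts (F1, F2, F3) -> f
for ctx, f in [(("a", "X", "c1"), "w"), (("a", "Y", "c1"), "w")]:
    counts[(ctx, f)] += 1
    context_counts[ctx] += 1
    counts[((), f)] += 1
    context_counts[()] += 1

print(backoff_prob("w", ("a", "Z", "c1")))  # unseen full context -> backs off
```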
Features: Words, POS, LID
• Problems:
  • POS tagging of CS speech is challenging
  • the accuracy of the POS tagger is unknown
  => a different clustering method may be more robust
Features: Brown Word Clusters
• Clusters based on word distributions in text [3]
• Classes are chosen to minimize the loss of average mutual information
• Best number of classes in terms of PPL: 70
• So far: clusters based on syntax or word distributions => next step: semantic features
[3] Brown, P.F. et al., 1992
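As a complement, a small Python sketch of the objective behind Brown clustering (toy code, not the SRILM implementation): given a class assignment, it computes the average mutual information between adjacent classes, the quantity whose loss the greedy merges try to minimize.

```python
import math
from collections import Counter

def class_ami(tokens, assign):
    """Average mutual information between adjacent classes.
    tokens: list of words; assign: dict word -> class id."""
    pairs = Counter((assign[a], assign[b]) for a, b in zip(tokens, tokens[1:]))
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pairs.items():
        left[c1] += n
        right[c2] += n
    ami = 0.0
    for (c1, c2), n in pairs.items():
        # p(c1,c2) * log2( p(c1,c2) / (p(c1) * p(c2)) )
        ami += (n / total) * math.log2(n * total / (left[c1] * right[c2]))
    return ami

toks = "the cat sat on the mat the dog sat on the rug".split()
assign = {w: 0 if w == "the" else 1 for w in set(toks)}  # toy 2-class split
print(class_ami(toks, assign))
```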
Features: Open Class Words
• Definition: content words, e.g. nouns, verbs, adverbs
• "open" because the class can be extended with new words, e.g. "Bollywood"
• => open class words indicate the semantics of a sentence
Features: Open Class Word Clusters
• Idea: semantic clusters in comparison to distribution-based clusters (OC Brown clusters)
[Diagram: OC word 1 ... OC word 9 grouped into semantic clusters Topic a, Topic b and Topic c]
Features: Semantic OC Word Clusters
• Clustering of open class word vectors
• RNNLMs learn syntactic and semantic similarities [4] and represent words as vectors
• => apply clustering to these word vectors
  • k-means clustering
  • spectral clustering
[4] Mikolov, T. et al., 2013
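A sketch of this step, assuming the open class word vectors have already been exported from a trained RNNLM (random vectors stand in for them here, and scikit-learn's KMeans and SpectralClustering stand in for whatever implementations were actually used):

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

rng = np.random.default_rng(0)
words = ["movie", "film", "actor", "eat", "drink", "cook"]
vectors = rng.normal(size=(len(words), 50))  # stand-in for RNNLM embeddings

# k-means directly on the word vectors
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# spectral clustering on a (shifted, non-negative) similarity matrix
sim = vectors @ vectors.T
sim = sim - sim.min()  # SpectralClustering expects non-negative affinities
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(sim)

for w, k, s in zip(words, kmeans_labels, spectral_labels):
    print(f"{w:8s} kmeans={k} spectral={s}")
```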
Features: Semantic OC Word Clusters
• Experiments with different
  • clustering methods: Brown, k-means, spectral clustering
  • monolingual and bilingual clusters
    • monolingual clusters: based on English and Mandarin Gigaword data (2005)
    • bilingual clusters: based on CS text or mixed lines of Gigaword data
  • numbers of clusters
• Lowest perplexity: 247.24 (unclustered OC words: 247.18), obtained with spectral clustering, bilingual clusters and 800 OC word clusters
FLMs: Decoding Experiments
• Interpolation weight of FLM and n-gram
[Plot: decoding results as a function of the FLM weight]
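The quantity tuned in this plot is the weight of a standard linear interpolation of the two models' probabilities; a minimal sketch (the integration into the one-pass decoder itself is more involved):

```python
def interp_prob(p_flm: float, p_ngram: float, lam: float) -> float:
    """P(w|h) = lam * P_FLM(w|h) + (1 - lam) * P_ngram(w|h)."""
    return lam * p_flm + (1.0 - lam) * p_ngram

print(interp_prob(0.02, 0.01, 0.4))  # lam = 0.4 -> 0.014
```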
FLMs: Decoding Experiments (2)
• Decoding results [table did not survive extraction]
Conclusion
• Summary
  • Best features in terms of FLM perplexity: words + POS + Brown clusters + OC words
  • Relative PPL reduction of up to 10.8% (eval)
  • Best features in terms of MER: words + POS + Brown clusters (+ OC clusters)
  • Relative MER reduction of up to 3.4% (eval)
Thank you for your attention!
Features: Semantic OC Word Clusters (2)
• Monolingual clusters: based on English and Mandarin Gigaword data (2005)
• Factors for FLMs: words, part-of-speech tags, Brown word clusters and open class word clusters
[Chart: PPL for different numbers of monolingual OC word clusters, compared to unclustered OC words; baseline 3-gram: 268.39]
Features: Semantic OC Word Clusters (3)
• Bilingual clusters: based on CS text or mixed lines of Gigaword data
• Factors for FLMs: words, part-of-speech tags, Brown word clusters and open class word clusters
[Chart: PPL for bilingual OC word clusters with 250 to 6000 classes, compared to unclustered OC words; baseline 3-gram: 268.39]
Factored Language Models
• Features investigated in this study:
  • words
  • part-of-speech tags
  • language information
  • Brown word clusters
  • open class words
  • open class word clusters
POS Tagging of CS Speech
• "Matrix language" = Mandarin, "embedded language" = English
• Analysis of the CS text: language islands (> 2 embedded words) go to the POS tagger for English; the remaining text goes to the POS tagger for Mandarin
• Post-processing: English segments in the remaining text (see the sketch below)
• Approach of Schultz, T. et al.: Detecting code-switch events based on textual features, 2010
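A minimal Python sketch of this pipeline, under simplifying assumptions: tokens are classified as embedded (English) purely by script, tag_en and tag_man are placeholders for real monolingual POS taggers, and the post-processing pass is omitted:

```python
import re

def is_english(tok: str) -> bool:
    """Crude script-based language check (assumption, not the paper's method)."""
    return re.fullmatch(r"[A-Za-z']+", tok) is not None

def tag_en(tokens):   # placeholder for a real English POS tagger
    return [(t, "EN-POS") for t in tokens]

def tag_man(tokens):  # placeholder for a real Mandarin POS tagger
    return [(t, "MAN-POS") for t in tokens]

def tag_cs(tokens, island_len=3):
    """Send English 'language islands' (> 2 consecutive embedded words) to the
    English tagger and everything else to the Mandarin tagger. In the paper a
    post-processing step then revisits English words left in the Mandarin
    part; here short English stretches simply stay with the Mandarin tagger."""
    tagged, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and is_english(tokens[j]):
            j += 1
        if j - i >= island_len:          # island of > 2 embedded words
            tagged += tag_en(tokens[i:j])
            i = j
        else:
            tagged += tag_man(tokens[i:i + 1])
            i += 1
    return tagged

print(tag_cs("我 想 看 that new Bollywood movie 明天".split()))
```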
Features: Brown Word Clusters
• Clusters based on word distributions in text
• Find classes which minimize the loss of average mutual information
• Tool: SRILM toolkit
• Number of classes determined based on PPL results on the SEAME development set
[Plot: perplexity vs. number of classes on the SEAME development set]
Features: Brown Word Clusters (2)
• Brown word clusters: 70 classes
• So far: clusters based on syntax or word distributions => next step: semantic features
Features: Open Class Word Clusters
• Idea:
  • comparing bilingual and monolingual clusters
  • data: English and Chinese Gigaword data [2], CS corpus
  • Brown clustering of open class words
  • clustering of open class word vectors
[Diagram: OC word 1 ... OC word 9 grouped into semantic clusters Topic a, Topic b and Topic c]
[2] fifth edition; English: LDC2011T07, Chinese: LDC2011T13
Open Class Word Clusters (2)
• Spectral clustering with Graclus [4]
• Coarsening: start from the similarity graph (5k nodes) and repeatedly merge nodes
• Base clustering: bisection clustering of the coarsest graph
• Uncoarsening: weighted kernel k-means clustering (initialized with the bisection clustering results) => k classes
[4] Dhillon, I.S. et al.: Weighted graph cuts without eigenvectors: A multilevel approach
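A toy sketch of the multilevel idea (coarsen the graph, cluster the small graph, project labels back). Real Graclus uses greedy matching, region-growing bisection and weighted kernel k-means refinement; none of these are reproduced faithfully here, each phase is reduced to its simplest form:

```python
import numpy as np

def coarsen(W):
    """Merge each node with its most similar unmatched neighbour."""
    n = len(W)
    matched, groups = set(), []
    for i in np.argsort(-W.sum(axis=1)):      # heavier nodes first
        if i in matched:
            continue
        nbrs = [j for j in np.argsort(-W[i]) if j != i and j not in matched]
        group = [i] + nbrs[:1]
        matched.update(group)
        groups.append(group)
    m = len(groups)
    Wc = np.zeros((m, m))
    for a, ga in enumerate(groups):
        for b, gb in enumerate(groups):
            Wc[a, b] = W[np.ix_(ga, gb)].sum()
    return Wc, groups

def bisect(W):
    """Cheap base clustering: split nodes by degree (a crude stand-in for
    the region-growing bisection used by Graclus)."""
    deg = W.sum(axis=1)
    return (deg > np.median(deg)).astype(int)

def graclus_like(W):
    Wc, groups = coarsen(W)                   # 1. coarsen
    labels_c = bisect(Wc)                     # 2. cluster the coarse graph
    labels = np.zeros(len(W), dtype=int)      # 3. project labels back
    for g, lab in zip(groups, labels_c):
        labels[g] = lab
    return labels                             # (kernel k-means refinement omitted)

W = np.random.default_rng(1).random((8, 8))
W = (W + W.T) / 2                             # toy symmetric similarity graph
print(graclus_like(W))
```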
CS Trigger Ability of Different Factors
• CS-rate: [equation and results did not survive extraction]
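The slide's measurements did not survive extraction, but the quantity can still be sketched. Assuming "CS-rate" means how often the next word is in the other language given the current factor value (my reading, not confirmed by the slide):

```python
from collections import Counter

def cs_rates(tokens):
    """tokens: list of (word, factor_value, language) triples.
    Returns, per factor value, switches / occurrences (assumed definition)."""
    seen, switched = Counter(), Counter()
    for (w, fac, lang), (w2, fac2, lang2) in zip(tokens, tokens[1:]):
        seen[fac] += 1
        if lang2 != lang:
            switched[fac] += 1
    return {fac: switched[fac] / n for fac, n in seen.items()}

toy = [("我", "PN", "MAN"), ("想", "VV", "MAN"), ("watch", "VB", "EN"),
       ("movie", "NN", "EN"), ("明天", "NT", "MAN")]
print(cs_rates(toy))  # {'PN': 0.0, 'VV': 1.0, 'VB': 0.0, 'NN': 1.0}
```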
FLMs: Significance Tests
• Perplexity results on the eval set [table did not survive extraction]
FLMs: Significance Tests (2)
• Mixed error rate results on the eval set [table did not survive extraction]