
Features for Factored Language Models for Code-Switching Speech



Presentation Transcript


  1. Features for Factored Language Models for Code-Switching Speech Heike Adel, Katrin Kirchhoff, Dominic Telaar, Ngoc Thang Vu, Tim Schlippe, Tanja Schultz

  2. Outline • Introduction • Motivation • SEAME Corpus • Main Contributions • Factored Language Models • Features for Code-Switching Speech • Experiments • Conclusion • Summary

  3. Motivation • Code-Switching (CS) = speech with more than one language • Exists in multilingual communities or among immigrants • Challenges: multilingual models and CS training data necessary

  4. SEAME corpus • SEAME = South East Asia Mandarin-English • Conversational speech, recorded from Singaporean and Malaysian speakers by [1] • Originally used in the research project ‘Code-Switch‘ (NTU and KIT) • Challenges: • frequent CS per utterance (average: 2.6 switches) • short monolingual segments (mostly less than 1 sec, 2-4 words) • little LM training data (575k words) • [Chart: token distribution across Mandarin, English, particles, and other] [1] Lyu, D.C. et al., 2010

  5. Main contributions • Investigation of different features for Code-Switching speech • Integration of factored language models into a dynamic one-pass decoder

  6. Factored Language Models (FLMs) [2] • Idea: word = feature bundle • Good e.g. in the case of • rich morphology • little training data => applicable to CS task • Generalized backoff: F | F1 F2 F3 → {F | F1 F2, F | F1 F3, F | F2 F3} → {F | F1, F | F2, F | F3} → F [2] Bilmes, J. and Kirchhoff, K., 2003
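The generalized backoff idea can be made concrete with a small sketch. The count tables, the parent factors, and the averaging rule below are illustrative assumptions only; real FLMs are estimated with dedicated tools (e.g. SRILM's fngram utilities), which support further ways of combining backoff paths.

```python
# A toy sketch of generalized parallel backoff, assuming hand-made count
# tables. Each parent factor is a (name, value) pair such as ("W", "ni")
# (previous word) or ("P", "PRON") (previous POS tag).

def flm_prob(word, parents, counts, totals, alpha=0.4):
    """P(word | parents), backing off over subsets of the parent factors.
    On a zero count, drop one parent at a time and average the resulting
    estimates (one simple combination rule among several)."""
    ctx = frozenset(parents)
    if counts.get((ctx, word), 0) > 0:
        return counts[(ctx, word)] / totals[ctx]
    if not parents:
        return 1e-6  # placeholder floor for unseen words
    subsets = [parents - {p} for p in parents]
    return alpha * sum(flm_prob(word, s, counts, totals, alpha)
                       for s in subsets) / len(subsets)

# Hypothetical counts: "hao" is unseen with both parents present, so the
# model backs off to P(hao | W=ni) and P(hao | P=PRON) and averages them.
W, P = ("W", "ni"), ("P", "PRON")
counts = {(frozenset({W}), "hao"): 5, (frozenset({P}), "hao"): 9}
totals = {frozenset({W}): 10, frozenset({P}): 30}
print(flm_prob("hao", {W, P}, counts, totals))  # 0.4 * (0.5 + 0.3) / 2
```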

  7. Features: Words, POS, LID • Problems: • POS tagging of CS speech: challenging • Accuracy of POS tagger: unknown • => a different clustering method may be more robust
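To make the word/POS/LID factor bundles concrete, here is a minimal sketch producing SRILM-style factored text, where each token carries its factors as colon-joined tag-value fields. The tag names, the example utterance, and its annotations are made up for illustration.

```python
# Convert tagged CS tokens into factored-text bundles "W-word:P-pos:L-lid".
# The POS and language-ID labels here are hypothetical placeholders.

def to_factored(tokens):
    """tokens: list of (word, pos, lid) triples -> one factored-text line."""
    return " ".join(f"W-{w}:P-{p}:L-{l}" for w, p, l in tokens)

line = [("我", "PRON", "MAN"), ("check", "VERB", "EN"), ("一下", "PART", "MAN")]
print(to_factored(line))
# W-我:P-PRON:L-MAN W-check:P-VERB:L-EN W-一下:P-PART:L-MAN
```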

  8. Features: Brown Word Clusters • Clusters based on word distributions in text [3] • minimize average mutual information loss • Best number of classes in terms of PPL: 70 • So far: clusters based on syntax or word distributions => next step: semantic features [3] Brown, P.F. et al., 1992
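As a rough illustration of the clustering objective, the toy sketch below scores every candidate class merge by the average mutual information (AMI) that remains afterwards. The tiny corpus and the exhaustive pair search are simplifications of the actual greedy algorithm in [3], which is run with optimized tools in practice.

```python
# Brown-clustering merge criterion on a toy corpus: greedily merge the
# pair of classes whose merge loses the least average mutual information
# over class bigrams (classes start out as single words).
import math
from collections import Counter
from itertools import combinations

def ami(bigrams, total):
    """Average mutual information of a table of class-bigram counts."""
    left, right = Counter(), Counter()
    for (a, b), n in bigrams.items():
        left[a] += n
        right[b] += n
    return sum(n / total * math.log(n * total / (left[a] * right[b]))
               for (a, b), n in bigrams.items())

def merged(bigrams, x, y):
    """Class-bigram counts after relabelling class y as class x."""
    out = Counter()
    for (a, b), n in bigrams.items():
        out[(x if a == y else a, x if b == y else b)] += n
    return out

text = "i eat rice i drink tea we eat rice we drink tea".split()
bigrams = Counter(zip(text, text[1:]))
total = sum(bigrams.values())

best = max(combinations(sorted(set(text)), 2),
           key=lambda p: ami(merged(bigrams, *p), total))
print(best)  # ('i', 'we'): similar contexts, so this merge loses least AMI
```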

  9. Features: Open Class Words • Definition: content words, e.g. nouns, verbs, adverbs • “open” because the class can be extended with new words, e.g. “Bollywood” • => open class words indicate the semantics of a sentence
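A minimal sketch of extracting open class words from a POS-tagged utterance, assuming a Universal-POS-style tag set (the tagger and tag set used in the paper may differ):

```python
# Keep only content (open class) words; function words and particles drop out.
OPEN_CLASSES = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}

def open_class_words(tagged):
    return [w for w, pos in tagged if pos in OPEN_CLASSES]

utt = [("then", "ADV"), ("我", "PRON"), ("watch", "VERB"),
       ("Bollywood", "PROPN"), ("movie", "NOUN"), ("lah", "PART")]
print(open_class_words(utt))  # ['then', 'watch', 'Bollywood', 'movie']
```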

  10. Features: Open Class Word Clusters • Idea: semantic clusters in comparison to distribution-based clusters (OC Brown clusters) • [Diagram: open class words 1-9 partitioned into semantic topic clusters a, b, and c]

  11. Features: Semantic OC Word Clusters • Clustering of open class word vectors • RNNLMs learn syntactic and semantic similarities [4] • RNNLMs represent words as vectors => apply clustering to these word vectors • k-means clustering • spectral clustering [4] Mikolov, T. et al., 2013
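A sketch of both clustering options using scikit-learn; the word list and random vectors below are placeholders for the RNNLM word vectors used in the paper:

```python
# Cluster open class word vectors with k-means and spectral clustering.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

rng = np.random.default_rng(0)
words = ["movie", "film", "makan", "eat", "taxi", "bus"]
vectors = rng.normal(size=(len(words), 50))  # stand-in RNNLM word vectors

kmeans_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
spectral_ids = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                                  n_neighbors=3, random_state=0).fit_predict(vectors)
for w, k, s in zip(words, kmeans_ids, spectral_ids):
    print(f"{w}\tkmeans={k}\tspectral={s}")
```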

  12. Features: Semantic OC Word Clusters • Experiments with different • clustering methods: Brown, k-means, spectral clustering • monolingual clusters: based on English and Mandarin Gigaword data (2005) • bilingual clusters: based on CS text and mixed lines of Gigaword data • numbers of clusters => Lowest perplexity (247.24; unclustered OC words: 247.18): spectral clustering, bilingual clusters, 800 OC word clusters

  13. FLMs: Decoding Experiments • Interpolation weight of FLM and n-gram • [Plot: perplexity as a function of the FLM interpolation weight]
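The interpolation being tuned is a linear mixture, P(w|h) = λ·P_FLM(w|h) + (1−λ)·P_ngram(w|h). The sketch below sweeps λ over made-up per-word probabilities to show how such a perplexity curve is produced:

```python
# Sweep the FLM interpolation weight and report perplexity.
import math

flm_probs   = [0.02, 0.10, 0.005, 0.30]   # hypothetical P_FLM(w_i | h_i)
ngram_probs = [0.04, 0.06, 0.020, 0.10]   # hypothetical P_ngram(w_i | h_i)

def perplexity(lam):
    logp = sum(math.log(lam * pf + (1 - lam) * pn)
               for pf, pn in zip(flm_probs, ngram_probs))
    return math.exp(-logp / len(flm_probs))

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"FLM weight {lam:.2f}: PPL {perplexity(lam):.1f}")
```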

  14. FLMs: Decoding Experiments (2) • Decoding results

  15. Conclusion • Summary • Best features in terms of FLM perplexity: words + POS + Brown clusters + OC words • Relative PPL reduction of up to 10.8% (eval) • Best features in terms of MER (mixed error rate): words + POS + Brown clusters (+ OC clusters) • Relative MER reduction of up to 3.4% (eval)

  16. Thank you for your attention!

  17. References [1] Lyu, D.C., Tan, T.P., Chng, E.S., Li, H.: SEAME: a Mandarin-English Code-Switching Speech Corpus in South-East Asia. Proc. Interspeech, 2010. [2] Bilmes, J.A., Kirchhoff, K.: Factored Language Models and Generalized Parallel Backoff. Proc. HLT-NAACL, 2003.

  18. References (2) [3] Brown, P.F., deSouza, P.V., Mercer, R.L., Della Pietra, V.J., Lai, J.C.: Class-Based n-gram Models of Natural Language. Computational Linguistics, 18(4), 1992. [4] Mikolov, T., Yih, W., Zweig, G.: Linguistic Regularities in Continuous Space Word Representations. Proc. NAACL-HLT, 2013.
