Lexicon Optimization Approaches for Language Models of Agglutinative Language
Mijit Ablimit, Tatsuya Kawahara, Askar Hamdulla
Media Archiving Laboratory, Kyoto University; Xinjiang University
Outline
• Introduction
• Review of lexicon optimization methods
• Morphological segmenter for Uyghur language; ASR results based on various morphological units
• Morpheme concatenation approaches by comparing two layers of ASR results
• Discriminative lexicon optimization approaches
Introduction to Uyghur
• The Uyghur (Uyghur: ئۇيغۇر, Uyghur; Simplified Chinese: 维吾尔) are a Turkic ethnic group living in Eastern and Central Asia.
• Today the Uyghur people live primarily in the Xinjiang Uyghur Autonomous Region of the People's Republic of China.
Uyghur language
• The Uyghur language belongs to the Turkic language family of the Altaic language group.
• Present-day Uyghur is written in Arabic letters with some modifications.
• Writing direction: right to left.
• Sentences in Uyghur consist of words separated by spaces or punctuation marks.
Uyghur Language (Morphological structure)
• Uyghur is an agglutinative language in which words are formed by attaching suffixes to a stem (or root).
• Each word consists of a root (or stem) followed by zero or more suffixes, with no separator between them. The longest suffix chains have about 10 suffixes, possibly more.
• A few words take a prefix before the stem; only 7 prefixes are in use (it is difficult to find more).
• Thus the morpheme structure of a Uyghur word is "prefix + stem (or root) + suffix1 + suffix2 + …".
Uyghur Language (Morphological structure)
• Suffixes that make semantic changes to a root are derivational suffixes. Suffixes that make syntactic changes to a root are inflectional suffixes.
• A root combined with derivational suffixes becomes a stem, so the root set is included in the stem set. The words "stem" and "root" are sometimes used interchangeably.
• Uyghur has a number of phonetic phenomena, such as vowel weakening and phonetic harmony.
Uyghur Language (Morphological structure)
• A morpheme is the smallest meaning-bearing unit.
• Here, "morpheme" refers to any one of prefix, stem, or suffix.
• A morpheme consists of one or more syllables.
• A syllable is a sequence of phonemes.
Uyghur-to-English character transcription rules for internal processing
1 Introduction
The practical formulation for ASR:
• P(U|X): acoustic model (AM)
• P(X|S): lexical model
• P(S): language model (LM)
• U is the sequence of acoustic features of the speech
• S is the corresponding text
• X is the phoneme sequence
The text S is represented as a sequence of smaller units t, and the LM predicts each unit from the previous n-1 units.
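The three models above combine in the usual noisy-channel way. The block below writes that decomposition out as a sketch: it is the standard textbook form, assumed here rather than copied from the slide, and N (the number of units in S) is a symbol introduced only for illustration.

```latex
\hat{S} = \arg\max_{S} P(S \mid U)
        = \arg\max_{S} \sum_{X} P(U \mid X)\, P(X \mid S)\, P(S)

P(S) \approx \prod_{i=1}^{N} P\bigl(t_i \mid t_{i-n+1}, \dots, t_{i-1}\bigr)
```

In practice decoders usually replace the sum over X with a maximum (Viterbi approximation). The choice of unit t (word, morpheme, syllable) is exactly what lexicon optimization varies.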
1 Introduction
Problems with agglutinative languages
• Morpheme units provide an optimal lexicon:
  • Good coverage and a small vocabulary size.
  • They preserve linguistic information, including both word and morpheme boundaries.
  • They are applicable to NLP tasks such as information retrieval and machine translation.
• But morphemes are too short and easily confused.
• Previous approaches use rule-based or data-driven optimization:
  • Based on frequency and length.
  • Splitting less frequent units while keeping frequent units.
  • May not outperform the word lexicon.
1 Introduction
Problems with agglutinative languages
• Unsupervised segmentation methods are reported to give better ASR performance for agglutinative languages, but:
  • They do not carry linguistic information.
  • They depend on the training data.
• An ASR system is related to both text and speech:
  • Previous methods are based on text only.
  • They rarely consider the phonological similarity that matters to the ASR system.
• There are co-articulation problems related to unit selection:
  • Phonetic changes: phonetic harmony or disharmony.
  • Morphological changes: omission, insertion.
2 Review of Lexicon optimization methods
Evaluations for ASR system
• Error rates: Word Error Rate (WER), Morpheme Error Rate (MER), Syllable Error Rate (SER), and Character Error Rate (CER).
• Lexicon size.
• Perplexity: derived from the entropy on the test corpus S, where Nt is the number of unit tokens and Nw is the number of word tokens.
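The error rates listed above all share one computation and differ only in the unit the transcripts are split into. The sketch below assumes the usual definition (substitutions, deletions, and insertions from a Levenshtein alignment, divided by the reference length); the function names and the deliberately truncated hypothesis are illustrative, not taken from the slides.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (sub/del/ins all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Unit error rate: WER, MER, SER or CER depending on how inputs are split."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# The same recognition error scored at word level and at morpheme level.
ref_words = "müshükning kǝlginini korgǝn".split()
hyp_words = "müshükning kǝlgini korgǝn".split()   # illustrative wrong hypothesis
ref_morphs = "müshük ning kǝl gǝn i ni kor gǝn".split()
hyp_morphs = "müshük ning kǝl gǝn i kor gǝn".split()

wer = error_rate(ref_words, hyp_words)
mer = error_rate(ref_morphs, hyp_morphs)
```

One wrong word out of three counts as a single missing morpheme out of eight, which is why error rates over different units are not directly comparable.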
2 Review of Lexicon optimization methods
Data-driven approaches
• Merge short, frequently co-occurring units while splitting infrequent and long units.
• Use morphological attributes.
• The Mutual Bigram (MB) measure can provide a threshold.
• These methods can reduce WER or lexicon size, but they tend to over-generalize.
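A thresholded merge of this kind can be sketched as follows. The slides do not define MB, so this example assumes it is the geometric mean of the forward and reverse bigram probabilities of a unit pair; the function names, the toy data, and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def mutual_bigram_scores(words):
    """Score each adjacent unit pair (a, b) inside a word.

    Assumed definition: MB(a, b) = sqrt(P(b | a) * P(a | b)),
    estimated from counts as c(a, b) / sqrt(c_left(a) * c_right(b)).
    """
    pair_count, left_count, right_count = Counter(), Counter(), Counter()
    for units in words:
        for a, b in zip(units, units[1:]):
            pair_count[(a, b)] += 1
            left_count[a] += 1    # a seen as the left unit of some pair
            right_count[b] += 1   # b seen as the right unit of some pair
    return {
        (a, b): c / math.sqrt(left_count[a] * right_count[b])
        for (a, b), c in pair_count.items()
    }


def merge_candidates(words, threshold):
    """Unit pairs whose MB score reaches the threshold, in sorted order."""
    scores = mutual_bigram_scores(words)
    return sorted(pair for pair, s in scores.items() if s >= threshold)


# Toy corpus: each word is already segmented into morphemes.
toy_words = [["kor", "gǝn"], ["kor", "gǝn"], ["kor", "di"], ["kǝl", "gǝn"]]
to_merge = merge_candidates(toy_words, threshold=0.6)
```

Because the score depends only on co-occurrence counts, a pair can pass the threshold without being acoustically confusable, which is one way such methods end up over-generalizing.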
2 Review of Lexicon optimization methods
Statistical modeling approaches
• T. Shinozaki and S. Furui (2002) built evaluation functions based on WER and unit properties and trained them iteratively on ASR results.
• Perplexity is calculated for basic phoneme units as a criterion for building new units (K. Hwang, 1997).
• Merging is based on the largest log-likelihood increase.
• Sub-word units are used to detect OOV words (Choueiter 2009, Parada 2011). Other methods used to solve this problem:
  • Sub-word modeling.
  • Online learning of OOV words.
2 Review of Lexicon optimization methods
Unsupervised lexicon extraction
• Discover the lexicon from untagged corpora (Goldsmith 2001, Brent 1999, Jurafsky 2000, Chen 1998, Goldwater 2009, etc.).
• A Bayesian framework is used for unit selection to overcome the overfitting problem of Maximum Likelihood estimation (Goldwater, Brent, etc.).
• Maximum a Posteriori (MAP) estimation is used for different structures (Creutz et al. 2005); frequency and length properties are utilized.
2 Review of Lexicon optimization methods
Discriminative language modeling approach
• Discriminative approaches are used to optimize the language model (Brian Roark, Murat Saraclar et al. 2004), not the lexicon.
• ASR outputs are used as positive and negative samples.
• Various features are trained on ASR results.
• The discriminative lexicon optimization method proposed in this research is inspired by this method.
2 Review of Lexicon optimization methods
Lexicon optimization approaches for various languages
• Much research on lexicon optimization has been done for agglutinative languages such as Japanese, Korean, and Turkish, and for other highly inflectional languages such as Finnish, German, Estonian, Arabic, and Thai.
• Agglutinative languages have a clear morphological structure compared to fusional languages:
  • Stem + suf1 + suf2 + … + sufn
  • It is possible to extract linguistic sub-word units with high accuracy.
• For all of these inflectional languages, the methods investigated are data-driven or rely on frequency and length.
2 Review of Lexicon optimization methods
Lexicon optimization approaches for Turkish
• Murat Saraclar (2007) and K. Hacioglu (2003) compared ASR results based on various morphological units:
  • Linguistic morphemes are extracted with two-level morphology.
  • Statistical morphemes are extracted with the Morfessor program.
  • A mutual probability threshold is used.
  • The stem-ending approach used frequency-based optimization and gave improved results.
• K. Oflazer et al. (2005) investigated multi-layer sub-word optimization:
  • Stem-ending units or syllable-based units are adopted for OOV words.
• Discriminative language modeling has been applied to Turkish, with a reported improvement of 1%.
2 Review of Lexicon optimization methods
Lexicon optimization approaches for Japanese and Korean
• Japanese has no word delimiters. Both supervised and unsupervised lexicon extraction have been investigated:
  • L.M. Tomokiyo and K. Ries (1998) investigated a concatenative approach based on perplexity.
  • R.K. Ando and Lillian Lee (2000) investigated segmentation based on the three types of Japanese characters (kanji, hiragana, katakana), whose transitions often indicate word boundaries.
  • M. Nagata (1997) investigated unsupervised lexicon extraction that starts from a small number of words and augments them iteratively on a raw corpus.
  • A Bayesian model (Mochihashi 2009) has been applied to Japanese and Chinese lexicon extraction.
• Lexicon optimization is widely used for Korean:
  • O.W. Kwon, K. Hwang et al. (2003) concatenated morphemes or pseudo-morphemes based on morphological and phonological properties.
  • O.W. Kwon (2000) combined rule-based and statistical approaches.
2 Review of Lexicon optimization methods
Lexicon optimization approaches for other inflectional languages
• The Morfessor program has been used to investigate many inflectional languages, such as Finnish, Turkish, Arabic, Estonian, and Basque.
• Morphological and phonological criteria, along with statistical methods, have been applied to many languages, such as German, Arabic, and Thai.
• Improved ASR results are reported.
1 Introduction
Our approaches to lexicon optimization
• 1. Linguistic morpheme unit segmentation.
  • Preserves linguistic information, which is convenient for downstream NLP processing such as IR and MT.
  • Supervised segmentation is necessary.
• 2. Feature extraction from two layers of ASR results.
  • Directly linked to the ASR accuracy of word and morpheme units.
  • Features such as frequency, length, and morphological attributes are used to extract useful samples.
• 3. Discriminative approach to morpheme concatenation.
  • Automatically learns the effective features.
  • Machine learning methods such as SVM and perceptron are used.
1 Introduction Flow chart of the proposed lexicon optimization
Morpheme segmentation in Uyghur language and Baseline ASR results
3 Sub-word segmentation in Uyghur language and Baseline ASR results
Outline
• Morphological segmenters for Uyghur language
• Statistical properties based on various units
• ASR results based on various morphological units
3.1 Morphological segmenters for Uyghur language
Uyghur language and morphology
• Uyghur is an agglutinative language belonging to the Turkic language family of the Altaic language group.
• Example sentence: Müshükning kǝlginini korgǝn chashqan hoduqup qachti.
  (The mouse who saw the cat coming was startled and escaped.)
• Words are separated naturally.
• Morpheme sequence, in the format "prefix + stem + suffix1 + suffix2 + …":
  Müshük+ning kǝl+gǝn+i+ni kor+gǝn chashqan hoduq+up qach+ti.
  Morphemes suffer from phonological and morphological changes.
• Syllable sequence, in the format "CV[CC]" (C: consonant; V: vowel):
  Mü+shük+ning kǝl+gi+ni+ni kor+gǝn chash+qan ho+du+qup qach+ti.
3.1 Morphological segmenters for Uyghur language
Inducing Uyghur morphemes
• Stem + suffix structure, with few prefixes (7 in this research).
• Stems are longer, independent linguistic units that remain fairly unchanged after suffixation.
• Two types of suffixes: derivational and inflectional.
• The stem-suffix boundary is chosen as the primary target of segmentation:
  • First, split the word into stem and word-ending.
  • Second, split the word-ending into single suffixes.
3.1 Morphological segmenters for Uyghur language
Phonological and morphological changes in surface forms
• Insertion, deletion, phonetic harmony, and disharmony (assimilation or weakening).
• Tense, person, genitive, and so on are all expressed by suffixes.
• 1) Assimilation: the surface form is recovered to the standard forms.
  almiliring = alma+lar+ing
• 2) Morphological change, that is, deletion and insertion.
  oghli = oghul + i ; binaying = bina+[y]+ing
• 3) Phonetic harmony.
  Kyotodin = Kyoto + din; Newyorktin = Newyork + tin
• 4) Ambiguity.
  berish = bar(go/have)+ish; berish = bǝr(give)+ish
3.1 Morphological segmenters for Uyghur language
Inducing Uyghur morphemes: rule-based segmentation results
• We segmented a list of 18,400 words into morphemes; the accuracy is around 92%.
• The method is vocabulary-based, so it is difficult to generalize to running text.
• When it is incorporated into specific applications, such as a spell checker, dictionary, or search engine, some revisions are necessary.
• The syllable structure is C+V+C+C (V is a vowel, C is a consonant), except for some syllables imported from Chinese or European languages.
• Rule-based syllable segmentation accuracy is higher than 99%.
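A rule of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' segmenter: it assumes the word is already split into phonemes (so digraphs such as "sh" or "ng" are single items), that every syllable has exactly one vowel, and that at most one consonant precedes the vowel. The vowel inventory and the function name are hypothetical, and loanword syllables are not handled.

```python
# Assumed vowel inventory in the Latin transcription used on these slides.
VOWELS = {"a", "ǝ", "e", "i", "o", "u", "ö", "ü"}


def syllabify(phonemes):
    """Split a phoneme list into syllables of the shape C V [C C].

    Each non-initial syllable starts at the single consonant directly
    before its vowel; any remaining consonants close the previous syllable.
    """
    vowel_idx = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not vowel_idx:
        return ["".join(phonemes)] if phonemes else []
    starts = [0]
    for v in vowel_idx[1:]:
        # Take one onset consonant if there is one, else start at the vowel.
        starts.append(v - 1 if phonemes[v - 1] not in VOWELS else v)
    bounds = starts + [len(phonemes)]
    return ["".join(phonemes[a:b]) for a, b in zip(bounds, bounds[1:])]


syllables = syllabify(["m", "ü", "sh", "ü", "k", "n", "i", "ng"])
```

Working on phoneme lists rather than raw strings sidesteps the question of whether a letter pair such as "ng" is one phoneme or two, which a real segmenter would have to resolve first.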
3.1 Morphological segmenters for Uyghur language
Supervised morpheme segmentation: statistical modeling
• A statistical model can be trained in a fully supervised way from a text and its manual segmentation.
• A text corpus of 10025 sentences on general topics was prepared together with its manual segmentation.
• More than 30K stems were prepared independently and used for the segmentation task.
3.1 Morphological segmenters for Uyghur language
Probabilistic model for morpheme segmentation
• An intra-word bi-gram probabilistic formulation is used.
• Surface realization is considered, and the standard morpheme format is exported.
• For a candidate word, all possible segmentations are extracted and their probabilities are computed to select the best result.
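The enumerate-then-score procedure can be sketched as follows. This is a minimal illustration, not the authors' model: the probability table is invented toy data, the boundary markers, floor value, and function names are hypothetical, and the mapping from surface forms back to standard morphemes is left out.

```python
import math

BOS, EOS = "<w>", "</w>"  # hypothetical word-boundary markers

# Toy intra-word bigram probabilities P(m_i | m_{i-1}); all values invented.
BIGRAM = {
    (BOS, "hoduq"): 0.02, ("hoduq", "up"): 0.30, ("up", EOS): 0.60,
    (BOS, "ho"): 0.01, ("ho", "duq"): 0.001, ("duq", "up"): 0.05,
    (BOS, "qach"): 0.03, ("qach", "ti"): 0.40, ("ti", EOS): 0.70,
}
FLOOR = 1e-8  # probability assigned to unseen bigrams
UNITS = {u for pair in BIGRAM for u in pair} - {BOS, EOS}


def segmentations(word):
    """All ways to split the word into known units."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in UNITS:
            results.extend([head] + tail for tail in segmentations(word[end:]))
    return results


def log_prob(seg):
    """Log probability of one segmentation under the intra-word bigram."""
    path = [BOS] + seg + [EOS]
    return sum(math.log(BIGRAM.get(p, FLOOR)) for p in zip(path, path[1:]))


def best_segmentation(word):
    """Highest-scoring segmentation; the unsplit word if none is possible."""
    candidates = segmentations(word)
    return max(candidates, key=log_prob) if candidates else [word]


best = best_segmentation("hoduqup")
```

Exhaustive enumeration is affordable here because a single word has few candidate splits; a dynamic-programming search would give the same answer if words were longer.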
3.1 Morphological segmenters for Uyghur language
Morpheme segmentation accuracy and coverage
• Word coverage is 86.85%. Morpheme coverage is 98.44%.
• The morpheme segmentation accuracy is 97.66%.
3.2 Statistical properties based on various units
n-gram language models based on various morphological units: corpora preparation
• A corpus of about 630k sentences on general topics, from sources such as newspapers, novels, and science books.
• The surface forms of morphemes are kept unchanged.
• Linguistic information, such as word and morpheme boundaries, is preserved.
3.2 Statistical properties based on various units Tri-gram language models based on various units
3.2 Statistical properties based on various units
Vocabulary comparison of various units
• Vocabulary size explodes when words are used as the unit.
3.2 Statistical properties based on various units
Perplexity comparison of various n-grams, normalized by words
• Perplexities for the various unit sets converge to similar results.
• Longer units give a slight gain at smaller n.
• Morphemes slightly outperform words because of their small OOV rate.
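Normalizing by words is what makes perplexities over different units comparable. The sketch below assumes the standard definition, in which the total log probability of the test corpus is divided by the number of word tokens (Nw) instead of the number of unit tokens (Nt); the function name and the toy probabilities are illustrative only.

```python
import math


def perplexity(unit_log_probs, n_norm):
    """exp of the negative total log probability divided by n_norm.

    unit_log_probs: natural-log probabilities of each unit token.
    n_norm: Nt for per-unit perplexity, Nw for word-normalized perplexity.
    """
    return math.exp(-sum(unit_log_probs) / n_norm)


# Toy test text: 2 words segmented into 4 morphemes, each predicted with p = 0.1.
log_probs = [math.log(0.1)] * 4
n_t, n_w = len(log_probs), 2

pp_unit = perplexity(log_probs, n_t)   # per-morpheme perplexity
pp_word = perplexity(log_probs, n_w)   # word-normalized perplexity
```

The two values are related by pp_word = pp_unit ** (Nt / Nw), so a sub-word model with a low per-unit perplexity can still be worse than a word model once both are put on the word scale.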
3.3 ASR results based on various morphological units
Uyghur ASR experiments: Uyghur acoustic model
• The training speech corpus is selected from general topics and used to build the Uyghur acoustic model (AM).
• The test corpus is 550 different sentences from news; each sentence is read by at least one male and one female speaker.
3.3 ASR results based on various morphological units
ASR systems based on various morphological units
Four different language models are built:
1) Word-based model
2) Morpheme-based model
3) Stem-suffix (word endings) based model
4) Syllable-based model
• The syllable vocabulary is 6.58k and the syllable error rate is 28.73%.
• The word-based ASR result is automatically segmented into morphemes and syllables. The corresponding MER is 18.88% and SER is 15.42%.
3 Sub-word segmentation in Uyghur language and Baseline ASR results
Conclusion
• Supervised morphological unit segmentation achieved 97.6% accuracy for the Uyghur language.
• Morphemes provide syntactic and semantic information, which is convenient for ASR and NLP research.
• Uyghur LVCSR systems based on various linguistic units were built.
• Longer units (words) outperform the sub-word units in the ASR application.
Morpheme Concatenation Approach based on feature extraction from two layers of ASR results
4 Morpheme Concatenation Approach based on feature extraction from two layers of ASR results
Outline
• Corpora and Baseline systems
• Problematic sample extraction
• Experimental results