Bang-Xuan Huang Department of Computer Science & Information Engineering

Syntactic and Sub-lexical Features for Turkish Discriminative Language Models
ICASSP 2010
Ebru Arısoy, Murat Saraçlar, Brian Roark, Izhak Shafran
Bang-Xuan Huang
Department of Computer Science & Information Engineering, National Taiwan Normal University

Outline
• Introduction
• Sub-lexical language models
• Feature sets for DLM
  • Morphological Features
  • Syntactic Features
  • Sub-lexical Features
• Experiments
• Conclusions and Discussion

Introduction
• In this paper we make use of both sub-lexical recognition units and discriminative training in Turkish language models.
• Turkish is an agglutinative language: most words are formed by joining morphemes together.
• Its agglutinative nature leads to a high number of out-of-vocabulary (OOV) words, which degrade ASR accuracy.
• To handle the OOV problem, vocabularies composed of sub-lexical units have been proposed for agglutinative languages.
[Diagram labels: article; sentence (syntactic), e.g. "We need to hold a meeting this afternoon"; word (lexical)]

Introduction
• DLM is a complementary approach to the baseline language model.
• In contrast to the generative language model, it is trained on acoustic sequences with their transcripts to optimize discriminative objective functions, using both positive examples (reference transcriptions) and negative examples (recognition errors).
• DLM is a feature-based language modeling approach. Each candidate hypothesis in the DLM training data is therefore represented as a feature vector computed from the acoustic input, x, and the candidate hypothesis, y.
[Figure: the candidate hypotheses for a sentence x (e.g., from an N-best list or lattice) are each mapped to a feature vector.]
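The feature-based view can be made concrete with a small reranking sketch. The slides do not name the training algorithm, so the perceptron-style update below is an assumption, as are the n-gram feature templates and the hypothetical names `ngram_features`, `score`, `rerank`, and `perceptron_update`. Tokens may be words or morphs, and each hypothesis carries a baseline recognizer score.

```python
from collections import Counter


def ngram_features(tokens, n_max=2):
    """Count unigram and bigram features of one hypothesis (assumed templates)."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            feats[" ".join(padded[i:i + n])] += 1
    return feats


def score(weights, tokens, baseline_score, w0=1.0):
    """Linear model: scaled baseline score plus weighted feature counts."""
    feats = ngram_features(tokens)
    return w0 * baseline_score + sum(weights.get(f, 0.0) * c for f, c in feats.items())


def rerank(weights, nbest):
    """Return the (tokens, baseline_score) pair with the highest model score."""
    return max(nbest, key=lambda h: score(weights, h[0], h[1]))


def perceptron_update(weights, nbest, oracle_tokens):
    """One perceptron step: reward the oracle's features, penalize the current best's."""
    best_tokens, _ = rerank(weights, nbest)
    if list(best_tokens) != list(oracle_tokens):
        for f, c in ngram_features(oracle_tokens).items():
            weights[f] = weights.get(f, 0.0) + c
        for f, c in ngram_features(best_tokens).items():
            weights[f] = weights.get(f, 0.0) - c
    return weights


# Toy 2-best list: the baseline prefers the first hypothesis.
nbest = [(["a", "b"], -1.0), (["a", "c"], -1.5)]
weights = perceptron_update({}, nbest, oracle_tokens=["a", "c"])
best_after_update = rerank(weights, nbest)[0]
```

Here the oracle is the lowest-error hypothesis in the list (the positive example) and the current top-scoring hypothesis plays the role of the negative example.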

Sub-lexical models
• In this approach, the recognition lexicon is composed of sub-lexical units instead of words.
• Both grammatically derived units (stems, affixes, or their groupings) and statistically derived units (morphs) have been proposed as lexical items for Turkish ASR.
• Morphs are learned statistically from words by the Morfessor algorithm. Morfessor uses the Minimum Description Length principle to learn a sub-word lexicon in an unsupervised manner.

Feature sets for DLM
• Morphological Features
• Syntactic Features
• Sub-lexical Features
  • Clustering of sub-lexical units
    • Brown et al.’s algorithm
    • Minimum edit distance (MED)
  • Long distance triggers

Feature sets for DLM
• Root (base form), e.g. able => dis-able, en-able, un-able, comfort-able-ly, ...
• Inflectional groups (IG)
• Brown et al.’s algorithm: semantically based, syntactically based
• Minimum edit distance (MED)
  • The minimum number of edits (insertion, deletion, substitution) needed to transform one string into another.
  • Ex: intention -> execution
    del ‘i’ => ntention
    sub ‘n’ to ‘e’ => etention
    sub ‘t’ to ‘x’ => exention
    ins ‘u’ => exenution
    sub ‘n’ to ‘c’ => execution
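The minimum edit distance can be computed with the standard dynamic program. The sketch below assumes unit cost for insertion, deletion, and substitution, which matches the five-edit worked example on the slide; the function name `min_edit_distance` is illustrative, and how the distance is then used to cluster morphs is not covered here.

```python
def min_edit_distance(source, target):
    """Levenshtein distance with unit costs, keeping one DP row at a time."""
    # prev[j] = distance between the empty prefix of source and target[:j]
    prev = list(range(len(target) + 1))
    for i, s in enumerate(source, 1):
        curr = [i]  # distance between source[:i] and the empty string
        for j, t in enumerate(target, 1):
            curr.append(min(
                prev[j] + 1,             # delete s
                curr[j - 1] + 1,         # insert t
                prev[j - 1] + (s != t),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]


distance = min_edit_distance("intention", "execution")  # 5
```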

Feature sets for DLM
• Long distance triggers
  • Considering initial morphs as stems and non-initial morphs as suffixes, we assume that the existence of a morph can trigger another morph in the same sentence.
  • We extract all the morph pairs between the morphs of any two words in a sentence as the candidate morph triggers.
  • Among the possible candidates, we try to select only the pairs whose morphs occur together for a special function.
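Candidate extraction can be sketched as follows, assuming each sentence is given as a list of words and each word as a list of morphs. The pair direction (a morph in an earlier word triggers a morph in a later word) and the frequency threshold in `select_triggers` are assumptions made for illustration; the slides do not state the actual selection criterion. All function names and the placeholder morph labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def candidate_triggers(sentence):
    """All (morph, morph) pairs drawn from two different words of one sentence."""
    pairs = set()
    for earlier_word, later_word in combinations(sentence, 2):
        for a in earlier_word:
            for b in later_word:
                pairs.add((a, b))
    return pairs


def count_triggers(corpus):
    """Number of sentences in which each candidate pair occurs."""
    counts = Counter()
    for sentence in corpus:
        counts.update(candidate_triggers(sentence))
    return counts


def select_triggers(counts, min_count=2):
    """Stand-in selection rule: keep pairs seen in at least min_count sentences."""
    return {pair for pair, c in counts.items() if c >= min_count}


# Placeholder morph labels, not real Turkish segmentations.
corpus = [
    [["stemA", "+sfx1"], ["stemB", "+sfx2"]],
    [["stemA"], ["stemB", "+sfx3"]],
]
selected = select_triggers(count_triggers(corpus))
```

Pairs of morphs inside the same word are deliberately excluded, since the slide describes triggers between the morphs of two words.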

Experiments

Conclusions and Discussion
• The main contributions of this paper are: (i) syntactic information is incorporated into Turkish DLM; (ii) the effect of language modeling units on DLM is investigated; (iii) morpho-syntactic information is explored when using sub-lexical units.
• It is shown that DLM with basic features yields more improvement for morphs than for words.
• Our final observation is that the high number of features is masking the expected gains of the proposed features, mostly due to the sparseness of the observations per parameter.
• This will make feature selection a crucial issue for our future research.

Weekly report
• Generate word graph
• Recognition result

MDLM-D + prior

MDLM-F vs MDLM-D + prior
