
Investigating Linguistic Knowledge In A Maximum Entropy Token-Based Language Model


Presentation Transcript


Investigating Linguistic Knowledge In A Maximum Entropy Token-Based Language Model
Jia Cui, Yi Su, Keith Hall, Frederick Jelinek @clsp.jhu

Example: A sentence in bigram METLM
but_IN stocks_VBZ kept_VBN </s> <s> but_CC stocks_NNS kept_VBD falling_VBG

Abstract: We propose METLM (maximum entropy token-based language model), a novel language model capable of incorporating various types of linguistic information encoded in the form of tokens, i.e., (word, label) tuples. Using tokens as hidden states, the model is effectively a hidden Markov model (HMM) with maximum entropy (ME) transition distributions. We investigated different types of labels with a wide range of linguistic implications. These models outperform Kneser-Ney smoothed n-gram models both in perplexity on standard datasets and in word error rate for a large-vocabulary speech recognition system.

ME Training With Labeled Training Data:
• Example token sequence: But_CC stocks_NNS kept_VBD falling_VBG
• Feature templates (see the sketch below the transcript): WW: kept falling; W-T: falling-VBG; WT: kept VBG; TWT: NNS kept VBG; AA: WT, TW, TT, W-T, T, ...

Data Sparseness and Sharing:
• "Colin plays chess" has no Google results, but it is possible.
• Examples: Colin plays chess; he plays chess; Colin takes basketball; Colin plays ...
• Data sharing depends on knowledge:
  • Lexical: "Colin plays" and "he plays" share the word "plays".
  • Syntactic: "takes" and "plays" are both VERBS.
  • Semantic: "basketball" and "chess" are both SPORTS.
  • ...
• Word/label-based features will not increase data sparseness.

Word Classes/Labels:
• PI-CLS: word classification using the algorithm proposed by Brown et al., 1992
• PD-CLS: position-dependent word classes, classifying words at three positions simultaneously; classes generated by Ahmad Emami
• Proximity-based word classes: word distances computed by Dekang Lin (stock, C1={cost, currency, credit, salary, refund, hourly})
• Dependency-based word classes: word distances computed by Dekang Lin (stock, C2={bond, stock, cash, capacity, decoration})
• Topic-based word classes: distances computed by Yonggang Deng (stock, C3={indexes, exchange, Chicago, crash, broker, unfolded})

Perplexity Experiments:
• Data: Treebank WSJ, 24 sections
• Development set: sections 0-19, 41K sentences, 1M words
• Held-out set: sections 20-21, 4.3K sentences, 110K words
• Test set: sections 22-23, 4.2K sentences, 106K words
• 10K vocabulary baseline

Experiments on an ASR System:
• Fisher data, 22M training data, dialog speech, 4167 reference sentences
• Lattice re-scoring: 2.7M predictions
• Use dominant POS tags
• Basic features + AA, WTW, WWT, WWTW, WTWW

Conclusions:
• Addresses data sharing in both history and future.
• Enables unlabeled training.
• Effectively applies syntactic word labels to improve WER.
• Provides a platform to integrate different word labels.
• Computationally expensive.
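The feature templates on the slide combine surface words (W) with their labels (T), so that, for instance, WT pairs the previous word with the current label. The sketch below shows how such templates could be instantiated for the slide's example token sequence; it is a minimal illustration under that reading of the template names, not code from the paper, and the function name and output encoding are assumptions.

```python
# Minimal sketch (not code from the paper) of the slide's feature templates:
# W = word, T = POS label. Template names follow the slide; the helper and
# its string encoding of features are illustrative assumptions.

def token_features(tokens, i):
    """Active features for predicting token i = (word_i, tag_i) from its history."""
    w, t = tokens[i]
    pw, pt = tokens[i - 1]
    feats = {
        "WW=%s %s" % (pw, w),    # word bigram,          e.g. "kept falling"
        "W-T=%s-%s" % (w, t),    # word with its label,  e.g. "falling-VBG"
        "WT=%s %s" % (pw, t),    # word -> label,        e.g. "kept VBG"
        "TW=%s %s" % (pt, w),    # label -> word
        "TT=%s %s" % (pt, t),    # label bigram
        "T=%s" % t,              # label unigram
    }
    if i >= 2:                   # one extra label of history for the TWT template
        feats.add("TWT=%s %s %s" % (tokens[i - 2][1], pw, t))  # e.g. "NNS kept VBG"
    return feats

# The labeled sentence from the slide.
sent = [("<s>", "<s>"), ("But", "CC"), ("stocks", "NNS"),
        ("kept", "VBD"), ("falling", "VBG"), ("</s>", "</s>")]
print(sorted(token_features(sent, 4)))
```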
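Since the model is described as an HMM with ME transition distributions, a toy log-linear transition over a few candidate tokens, together with the generic perplexity formula (exponent of the average negative log probability), might look like the following. All feature weights and the candidate set are invented purely for illustration and are not values from the paper.

```python
import math

# Toy log-linear ("maximum entropy") transition distribution over tokens,
# P(token_i | token_{i-1}) = exp(w . f) / Z, plus a generic perplexity
# computation. Weights and candidates are made up for the example.

weights = {
    "WW=kept falling": 1.2,
    "WT=kept VBG": 0.9,      # shared by every VBG continuation of "kept"
    "TT=VBD VBG": 0.6,
    "WW=kept rising": 0.7,
    "W-T=falling-VBG": 0.4,
    "W-T=rising-VBG": 0.2,
    "W-T=prices-NNS": 0.1,
    "TT=VBD NNS": -0.3,
}

def active_features(pw, pt, w, t):
    # A subset of the slide's templates: WW, W-T, WT, TT.
    return ["WW=%s %s" % (pw, w), "W-T=%s-%s" % (w, t),
            "WT=%s %s" % (pw, t), "TT=%s %s" % (pt, t)]

def transition_probs(pw, pt, candidates):
    """Normalize exp(score) over the candidate tokens."""
    scores = {(w, t): sum(weights.get(f, 0.0) for f in active_features(pw, pt, w, t))
              for w, t in candidates}
    z = sum(math.exp(s) for s in scores.values())
    return {tok: math.exp(s) / z for tok, s in scores.items()}

def perplexity(probs):
    """Exponent of the average negative log probability of the predictions."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

p = transition_probs("kept", "VBD", [("falling", "VBG"), ("rising", "VBG"), ("prices", "NNS")])
print(p)
print(perplexity(list(p.values())))
```

Note how the label-valued features (WT, TT) give probability mass to continuations whose exact word bigram was never observed, which is the data-sharing point made on the slide with the "Colin plays chess" example.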
