Statistical language modeling combining n-gram and dependency grammar

Eran Chinthaka, Ikhyun Park

Introduction
• Statistical language models and n-grams
• Problems with n-gram models
  • Data sparseness
  • Long-distance dependencies
• Proposed solution
  • Use a hybrid of n-gram and dependency grammar models as the language model

System Architecture
[Figure: system architecture diagram. Labeled components: Training Data, Process, Evaluator, Test Data (Good and Bad), Optimal Parameters, Perplexity]

Experimental Setup
• Data
  • Brown Corpus
    • 28,671 training sentences
    • 9,557 development sentences
    • 9,556 test sentences
• Tools
  • Smoother and language model builder
    • CMU-Cambridge Statistical Language Modeling Toolkit v2 (http://www.speech.cs.cmu.edu/SLM/toolkit.html)
  • Dependency parser
    • Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml)

Sentence Evaluation
• N-gram score
• Dependency score
• Combined score
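The formulas for the three scores did not survive on this slide. As a purely illustrative sketch, one common way to combine two model scores is a weighted sum of their log-probabilities, with the weight tuned on development data. The function name `combined_score`, the `weight` parameter, and the interpolation form itself are assumptions, not the authors' stated method.

```python
def combined_score(ngram_logprob, dep_logprob, weight=0.5):
    """Weighted sum of two sentence-level log-probabilities.

    ASSUMPTION: linear interpolation in log space; the slides do not
    state how the n-gram and dependency scores are actually combined.
    weight = 0 uses only the n-gram score, weight = 1 only the
    dependency score.
    """
    return (1.0 - weight) * ngram_logprob + weight * dep_logprob


# Hypothetical log-probabilities for one sentence under each model.
score = combined_score(ngram_logprob=-10.0, dep_logprob=-20.0, weight=0.25)
print(score)
```

Under this assumption, the weight would be one of the parameters to optimize on the development sentences.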

Smoothing – Absolute Discounting
• N-gram language model
[Equation: piecewise (if/else) definition of the discounted n-gram probability]
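For readers unfamiliar with absolute discounting, here is a minimal sketch of the textbook interpolated form for a bigram model: a fixed discount is subtracted from every seen bigram count, and the freed probability mass is redistributed according to the unigram distribution. This is not necessarily the exact variant shown on the slide or the one implemented by the toolkit; the function name, the `discount` value of 0.75, and the toy sentences are assumptions.

```python
from collections import Counter, defaultdict


def train_bigram_absolute_discounting(sentences, discount=0.75):
    """Return (prob, vocab) for an interpolated absolute-discounting bigram model.

    ASSUMPTION: textbook interpolated form with a single discount and a
    maximum-likelihood unigram as the lower-order distribution.
    """
    unigrams = Counter()
    bigrams = Counter()
    followers = defaultdict(set)   # distinct words seen after each history
    history = Counter()            # how often each word occurs as a history
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            history[prev] += 1
            followers[prev].add(word)
    total = sum(unigrams.values())

    def prob(word, prev):
        p_unigram = unigrams[word] / total
        h = history[prev]
        if h == 0:
            # Unseen history: fall back to the unigram estimate.
            return p_unigram
        discounted = max(bigrams[(prev, word)] - discount, 0.0) / h
        # Mass freed by discounting every seen continuation of prev.
        freed_mass = discount * len(followers[prev]) / h
        return discounted + freed_mass * p_unigram

    return prob, set(unigrams)


prob, vocab = train_bigram_absolute_discounting(
    ["the cat sat", "the dog sat", "the cat ran"]
)
print(prob("cat", "the"), prob("dog", "the"), prob("ran", "the"))
```

The unseen bigram "the ran" still receives non-zero probability, which is the point of smoothing for sparse data.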

Smoothing – Absolute Discounting
• Dependency language model
[Equation: piecewise (if/else) definition of the discounted dependency probability]

Assessment
• Perplexity (n-gram only)
• Perplexity (combined): inappropriate
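As a reminder of what the perplexity assessment measures, the sketch below computes it as 2 raised to the negative average log2 probability per predicted token. Perplexity presupposes a normalized probability for each token. The `perplexity` function, the sentence-boundary markers, and the toy uniform model are assumptions for illustration only.

```python
import math


def perplexity(sentences, prob):
    """Perplexity of a bigram model prob(word, prev) over a list of sentences."""
    log_sum = 0.0
    n_tokens = 0
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_sum += math.log2(prob(word, prev))
            n_tokens += 1
    return 2 ** (-log_sum / n_tokens)


def uniform_model(word, prev):
    """Toy model (hypothetical): every word has probability 1/8."""
    return 1.0 / 8.0


print(perplexity(["a b c"], uniform_model))
```

A model that spreads probability uniformly over 8 choices has perplexity 8, so lower values indicate a model that is less "surprised" by the test sentences.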

Assessment
• Classification of sentences (good vs. bad)
• Bad sentence generation
  • Shuffle good sentences
  • E.g.:
    Good: The election will be Dec. 4 from 8 a.m. to 8 p.m. .
    Bad: The election will be 8 8 from 4 a.m. to Dec. p.m. .
  • Shuffle degree = 7 (number of lost bigrams)
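The shuffle degree of the example can be checked with a short sketch that counts the bigrams of the good sentence that no longer occur in the shuffled one. The function names and whitespace tokenization are assumptions; the slides do not show how the measure was implemented.

```python
from collections import Counter


def bigrams(tokens):
    """Adjacent token pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def shuffle_degree(good, bad):
    """Number of bigrams of the good sentence that are lost in the bad one."""
    lost = Counter(bigrams(good.split())) - Counter(bigrams(bad.split()))
    return sum(lost.values())


good = "The election will be Dec. 4 from 8 a.m. to 8 p.m. ."
bad = "The election will be 8 8 from 4 a.m. to Dec. p.m. ."
print(shuffle_degree(good, bad))  # 7
```

The seven lost bigrams are (be, Dec.), (Dec., 4), (4, from), (from, 8), (8, a.m.), (to, 8), and (8, p.m.).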

Results
• Distribution of sentences
[Figure: distribution of sentences under the n-gram model and the dependency model; avg. shuffle: 12.357225]

Results
• Classification (n-gram vs. n-gram + dependency)
[Figure: false reject against false accept for the n-gram and n-gram + dependency models; avg. shuffle: 12.357225; annotation: not improved]
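The false accept and false reject rates in such a comparison can be computed by thresholding sentence scores, as in the sketch below. It assumes that higher scores mean "more likely a good sentence" and that a sentence is accepted when its score reaches the threshold; the function name and the sample scores are hypothetical, not the experiment's data.

```python
def error_rates(good_scores, bad_scores, threshold):
    """Return (false_accept_rate, false_reject_rate) at a score threshold.

    ASSUMPTION: a sentence is accepted as good when score >= threshold.
    """
    false_reject = sum(s < threshold for s in good_scores) / len(good_scores)
    false_accept = sum(s >= threshold for s in bad_scores) / len(bad_scores)
    return false_accept, false_reject


# Hypothetical sentence scores (e.g. log-probabilities).
good_scores = [-1.0, -2.0, -3.0, -8.0]
bad_scores = [-9.0, -7.0, -2.5, -10.0]
print(error_rates(good_scores, bad_scores, threshold=-5.0))  # (0.25, 0.25)
```

Sweeping the threshold over the range of scores traces out one curve per model, so two models can be compared by which curve lies closer to the origin.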

Discussion
• Why no improvement?
  • Insufficient feature exploration
  • Statistical nature of the dependency parser
• Any ideas?

Thank you!
