Learn about statistical language models, probabilities of word sequences, N-gram models, smoothing techniques, and applications in NLP like speech recognition and machine translation. Discover how language models estimate probabilities and overcome data sparseness challenges. Explore the Zero-Frequency Problem, phonetic tree models, and the significance of n-grams in predicting symbols within text. Dive into whole-sentence language models and their versatility in incorporating various computational features.
Language Model for Machine Translation
Jang, HaYoung
What is a Language Model? • Probability distribution over strings of text • How likely is a string in a given “language”? • Probabilities depend on what language we’re modeling
• p1 = P(“a quick brown dog”)
• p2 = P(“dog quick a brown”)
• p3 = P(“быстрая brown dog”)
• p4 = P(“быстрая собака”) (Russian for “a quick dog”)
In a language model for English: p1 > p2 > p3 > p4
In a language model for Russian: p1 < p2 < p3 < p4
Language Model from Wikipedia • A statistical language model assigns a probability to a sequence of words P(w1..n) by means of a probability distribution. • Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval. Estimating the probability of sequences can become difficult because phrases or sentences in a corpus can be arbitrarily long, so some sequences are never observed during training of the language model (the data sparseness problem). For that reason these models are often approximated using smoothed N-gram models. • In speech recognition and in data compression, such a model tries to capture the properties of a language and to predict the next word in a speech sequence. • When used in information retrieval, a language model is associated with each document in a collection. With query Q as input, retrieved documents are ranked by the probability that the document's language model would generate the terms of the query, P(Q|Md).
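In formulas (a standard textbook rendering, not taken verbatim from the slides), the probability P(w1..n) above is factored with the chain rule and then approximated with an N-gram Markov assumption:

% Chain-rule factorization of the word sequence w_1 .. w_n
P(w_{1..n}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})

% N-gram approximation: condition each word only on its N-1 predecessors,
% e.g. a bigram model (N = 2) uses P(w_i | w_{i-1})
P(w_{1..n}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})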
Unigram Language Model • Colored balls are randomly drawn from an urn (with replacement); a text of M words is modeled the same way.
[Figure: an urn of 9 colored balls and a drawn sequence of four balls]
P(sequence) = P(ball1) P(ball2) P(ball3) P(ball4) = (4/9) (2/9) (4/9) (3/9)
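A minimal sketch of this urn model as a unigram language model; the toy corpus and function names below are illustrative, not from the slides:

from collections import Counter

def train_unigram(corpus_tokens):
    # Maximum-likelihood estimate: P(w) = count(w) / total number of tokens
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sentence_prob(model, tokens):
    # Unigram assumption: tokens are drawn independently, like balls from an urn
    p = 1.0
    for w in tokens:
        p *= model.get(w, 0.0)  # unseen tokens get probability 0 (see next slide)
    return p

# Toy "urn": 9 draws, mirroring the 4/9, 2/9, 3/9 proportions on the slide
corpus = ["red"] * 4 + ["blue"] * 2 + ["green"] * 3
model = train_unigram(corpus)
print(sentence_prob(model, ["red", "blue", "red", "green"]))  # (4/9)(2/9)(4/9)(3/9)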
Zero-Frequency Problem • Suppose some event is not in our observation S • Model will assign zero probability to that event
[Figure: a model M over three ball types with P = 1/2, 1/4, 1/4, and an observed sequence S containing a fourth, unseen type]
P(S | M) = (1/2) (1/4) 0 (1/4) = 0
Smoothing • The solution: “smooth” the word probabilities P(w)
[Figure: maximum likelihood estimate vs. smoothed probability distribution, plotted over words w]
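The slide does not commit to a particular smoothing method; the sketch below uses add-one (Laplace) smoothing, one of the simplest choices, with an illustrative vocabulary:

from collections import Counter

def laplace_unigram(corpus_tokens, vocab):
    # Add-one smoothing: every vocabulary word gets at least count 1,
    # so no event receives zero probability
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

corpus = ["red"] * 4 + ["blue"] * 2 + ["green"] * 3
vocab = {"red", "blue", "green", "yellow"}   # "yellow" is unseen in training
model = laplace_unigram(corpus, vocab)
print(model["yellow"])   # 1/13 instead of 0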
Phonetic Tree with n-gram Model
[Figure: a phonetic tree for “Tell the …”, with branch probabilities (e.g. 1, 0.5, 0.1, 0.02) shown under a unigram, a bigram, and a trigram model]
n-grams • n-gram: a sequence of n symbols • n-gram language model: a model that predicts a symbol in a sequence given its n-1 predecessors • Why use them? To estimate the probability of a symbol in unknown text from the frequency of its occurrence in known text
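A small sketch of the n = 2 case (a bigram model), predicting a symbol from its single predecessor; the toy corpus echoes the “Tell the …” example from the phonetic-tree slide and is purely illustrative:

from collections import Counter, defaultdict

def train_bigram(tokens):
    # Count each symbol together with its one predecessor (n-1 = 1 for bigrams)
    following = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        following[prev][cur] += 1
    return following

def predict_next(model, prev):
    # Most likely next symbol given its predecessor, with its MLE probability
    counts = model[prev]
    word, c = counts.most_common(1)[0]
    return word, c / sum(counts.values())

tokens = "tell the truth tell the time tell the truth".split()
model = train_bigram(tokens)
print(predict_next(model, "the"))   # ('truth', 2/3)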
Problems with n-grams • More possible n-grams than can ever be observed in training data • Sensitivity to the genre of the training text (e.g. newspaper articles vs. personal letters) • Fixed n-gram vocabulary: any additions require re-compiling the n-gram model
Whole-Sentence Language Model • The main advantage of the whole-sentence maximum entropy (WSME) model is its ability to freely incorporate arbitrary computational features into a single statistical model. The features can be: • Traditional N-gram features (bigram, trigram) • Long-distance N-grams (triggers, distance-2 n-grams) • Class-based N-grams • Syntactic features (PCFG, link grammar, dependency information) • Other features (sentence length, dialogue features, etc.)
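For reference, the whole-sentence maximum entropy model is usually written in the exponential form below (standard in the WSME literature, not spelled out on the slide); it makes clear why arbitrary sentence-level features f_i can be added freely:

% P_0(s): a baseline model of sentence s (typically an N-gram model)
% f_i(s): arbitrary feature functions of the whole sentence
% \lambda_i: feature weights estimated from data; Z normalizes over sentences
P(s) = \frac{1}{Z} \, P_0(s) \, \exp\Big( \sum_i \lambda_i f_i(s) \Big)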