Language Model
CSC4170 Web Intelligence and Social Computing, Tutorial 8
Tutor: Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk
Outline
• Language models
  • Finite automata and language models
  • Types of language models
  • Multinomial distributions over words
• Query likelihood model
• Application
• Q&A
• Reference
Language Models (LMs)
• How can we come up with good queries?
  • Think of words that would likely appear in a relevant document.
• Idea of LM:
  • A document is a good match to a query if the document model is likely to generate the query.
Language Models (LMs)
• Generative model:
  • Recognizes or generates strings.
  • The full set of strings that can be generated is called the language of the automaton.
• Language model:
  • A function that puts a probability measure over strings drawn from some vocabulary.
Language Models (LMs)
• Example 1: calculate the probability of a word sequence.
  • Multiply the probabilities that the model gives to each word in the sequence, together with the probability of continuing or stopping after producing each word.
  • P(frog said that toad likes frog) = (0.01 * 0.03 * 0.04 * 0.01 * 0.02 * 0.01) * (0.8 * 0.8 * 0.8 * 0.8 * 0.8 * 0.2) = 0.000000000001573
    (the model continues with probability 0.8 after each of the first five words and stops with probability 0.2 after the last)
  • Most of the time, we will omit the STOP and (1 − STOP) probabilities.
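A minimal Python sketch of this calculation. The word probabilities and the 0.8/0.2 continue/stop values come from the example above; the function and variable names are illustrative assumptions:

```python
# Minimal sketch of a unigram model with an explicit continue/stop decision.
# Word probabilities and 0.8/0.2 are taken from the example above.
word_prob = {"frog": 0.01, "said": 0.03, "that": 0.04,
             "toad": 0.01, "likes": 0.02}
P_CONTINUE, P_STOP = 0.8, 0.2

def sequence_probability(words):
    """P(w1 ... wn) under the unigram model, including continue/stop terms."""
    p = 1.0
    for i, w in enumerate(words):
        p *= word_prob[w]
        # Continue with 0.8 after every word except the last, then stop with 0.2.
        p *= P_CONTINUE if i < len(words) - 1 else P_STOP
    return p

print(sequence_probability("frog said that toad likes frog".split()))
# ~1.573e-12, matching the calculation above
```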
Language Models (LMs)
• Example 2: given a string s and two language models M1 and M2, compare the probability each model assigns to s; here P(s|M1) > P(s|M2), so M1 is the better explanation of s.
Language Models (LMs)
• Basic LM using the chain rule:
  • P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1 t2) P(t4|t1 t2 t3)
• Unigram language model:
  • Throws away all conditioning context: P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)
  • Most used in Information Retrieval.
• Bigram language model:
  • Conditions on the previous term: P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
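The contrast between the two models is easy to see in code. Below is a minimal sketch that estimates both from raw counts; the toy corpus is an illustrative assumption, not part of the tutorial:

```python
# Sketch: unigram vs. bigram estimates from raw counts of a toy corpus.
from collections import Counter

corpus = "the frog said that the toad likes the frog".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_unigram(seq):
    # No conditioning context: multiply standalone term probabilities.
    p = 1.0
    for w in seq:
        p *= unigrams[w] / N
    return p

def p_bigram(seq):
    # First term unconditioned; every later term conditioned on its predecessor.
    p = unigrams[seq[0]] / N
    for prev, w in zip(seq, seq[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

print(p_unigram(["the", "frog"]), p_bigram(["the", "frog"]))
```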
Language Models (LMs)
• Unigram LM:
  • Bag-of-words model.
  • Multinomial distribution over words:
    P(d) = ( Ld! / (tf(t1,d)! * tf(t2,d)! * ... * tf(tM,d)!) ) * P(t1)^tf(t1,d) * P(t2)^tf(t2,d) * ... * P(tM)^tf(tM,d)
  • The first factor is the multinomial coefficient; it can be left out in practical calculations.
  • Ld is the length of document d, M is the size of the vocabulary, and tf(t,d) is the frequency of term t in d.
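A minimal sketch of this computation, with illustrative term probabilities; the `include_coefficient` flag makes the point that the multinomial coefficient can be dropped:

```python
# Sketch: P(d) as a multinomial over words. Term probabilities are
# illustrative assumptions.
from collections import Counter
from math import factorial, prod

term_prob = {"frog": 0.01, "said": 0.03, "that": 0.04,
             "toad": 0.01, "likes": 0.02}

def multinomial_prob(doc_tokens, include_coefficient=True):
    tf = Counter(doc_tokens)
    p = prod(term_prob[t] ** c for t, c in tf.items())
    if include_coefficient:
        # Multinomial coefficient Ld! / (tf(t1,d)! * ... * tf(tM,d)!);
        # usually left out in practical calculations.
        p *= factorial(len(doc_tokens)) / prod(factorial(c) for c in tf.values())
    return p

print(multinomial_prob("frog said that toad likes frog".split()))
```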
Query Likelihood Model
• Query likelihood model:
  • Rank documents by P(d|q), the likelihood that document d is relevant to the query.
  • Using Bayes' rule: P(d|q) = P(q|d) P(d) / P(q)
  • P(q) is the same for all documents, and P(d) is treated as uniform across all d, so ranking by P(d|q) is equivalent to ranking by P(q|d).
Query Likelihood Model
• Multinomial + unigram:
  • P(q|Md) = Kq * ∏ over terms t of P(t|Md)^tf(t,q), where Kq is the multinomial coefficient for the query q and can be ignored for ranking.
• Retrieval based on a language model (see the sketch below):
  • Infer a LM Md for each document d.
  • Estimate P(q|Md).
  • Rank the documents according to these probabilities.
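To make the procedure concrete, here is a sketch of the ranking loop with the estimator of P(t|Md) left pluggable; the names `score`, `rank`, and `estimate` are assumptions for illustration:

```python
# Sketch of the retrieval loop: score each document by P(q|Md) under its own
# unigram model, then sort. `estimate` is any P(t|Md) estimator, e.g. the
# smoothed estimators sketched further below.
from collections import Counter

def score(query_tokens, doc_tokens, estimate):
    tf_d = Counter(doc_tokens)
    p = 1.0
    for t in query_tokens:
        p *= estimate(t, tf_d, len(doc_tokens))
    return p

def rank(query, docs, estimate):
    scored = [(score(query.split(), d.split(), estimate), d) for d in docs]
    return sorted(scored, reverse=True)
```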
Query Likelihood Model
• Estimating the query generation probability:
  • Maximum Likelihood Estimation (MLE) + unigram LM: the MLE estimate is P(t|Md) = tf(t,d) / Ld
• Limitations (see the sketch below):
  • If we estimate P(t|Md) = 0 for some term, documents give a query nonzero probability only if all of the query terms appear in the document.
  • Words that do occur are also poorly estimated: the probability of a word occurring once in the document is overestimated, because its one occurrence was partly by chance.
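A sketch of the MLE estimator, in the pluggable shape used by the ranking loop above, and a demonstration of the zero-probability problem (the demo document is d1 from the example further below; names are illustrative):

```python
# Sketch: MLE estimator and the zero-probability problem.
from collections import Counter

def mle(t, tf_d, L_d):
    return tf_d[t] / L_d  # tf(t,d) / Ld

doc = "xyzzy reports a profit but revenue is down".split()
tf_d = Counter(doc)
print(mle("revenue", tf_d, len(doc)))    # 1/8: term occurs in the document
print(mle("decreases", tf_d, len(doc)))  # 0.0: one missing term zeroes P(q|Md)
```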
Query Likelihood Model
• Estimating the query generation probability:
  • Maximum Likelihood Estimation (MLE) + unigram LM.
• Smoothing:
  • Use the whole collection (model Mc) to smooth the document model.
  • Linear interpolation (Jelinek-Mercer smoothing): P(t|d) = λ P(t|Md) + (1 − λ) P(t|Mc)
  • Bayesian smoothing (Dirichlet prior): P(t|d) = (tf(t,d) + α P(t|Mc)) / (Ld + α)
Query Likelihood Model
• Query likelihood model with linear interpolation:
  P(q|d) ∝ ∏ over t in q of ( λ P(t|Md) + (1 − λ) P(t|Mc) )
• Query likelihood model with Bayesian smoothing:
  P(q|d) ∝ ∏ over t in q of ( (tf(t,d) + α P(t|Mc)) / (Ld + α) )
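A sketch of both smoothed estimators, in the same pluggable shape as the ranking loop above; the toy collection and the default parameter values (λ = 0.5, α = 1000) are illustrative assumptions:

```python
# Sketch: Jelinek-Mercer and Dirichlet-smoothed P(t|Md) estimators.
from collections import Counter

collection = ("xyzzy reports a profit but revenue is down "
              "quorus narrows quarter loss but revenue decreases further").split()
c_tf, c_len = Counter(collection), len(collection)

def jelinek_mercer(t, tf_d, L_d, lam=0.5):
    # Linear interpolation of the document and collection models.
    return lam * tf_d[t] / L_d + (1 - lam) * c_tf[t] / c_len

def dirichlet(t, tf_d, L_d, alpha=1000):
    # Bayesian smoothing with a Dirichlet prior scaled by alpha.
    return (tf_d[t] + alpha * c_tf[t] / c_len) / (L_d + alpha)
```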
Query Likelihood Model
• Example using unigram + MLE + linear interpolation:
  • d1: Xyzzy reports a profit but revenue is down
  • d2: Quorus narrows quarter loss but revenue decreases further
  • λ = 1/2, query: revenue down
  • P(q|d1) = [(1/8 + 2/16)/2] * [(1/8 + 1/16)/2] = 1/8 * 3/32 = 3/256 ≈ 0.0117
  • P(q|d2) = [(1/8 + 2/16)/2] * [(0/8 + 1/16)/2] = 1/8 * 1/32 = 1/256 ≈ 0.0039
  • Ranking: d1 > d2
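This ranking can be reproduced in a few lines; the sketch below hard-codes the two documents and λ = 1/2 from the slide, lowercasing tokens for matching:

```python
# Sketch reproducing the example: Jelinek-Mercer smoothing with lambda = 1/2.
from collections import Counter

d1 = "xyzzy reports a profit but revenue is down".split()
d2 = "quorus narrows quarter loss but revenue decreases further".split()
c_tf, c_len = Counter(d1 + d2), len(d1 + d2)

def score(query, doc, lam=0.5):
    tf, L = Counter(doc), len(doc)
    p = 1.0
    for t in query:
        p *= lam * tf[t] / L + (1 - lam) * c_tf[t] / c_len
    return p

query = "revenue down".split()
print(score(query, d1))  # 3/256 ~ 0.0117
print(score(query, d2))  # 1/256 ~ 0.0039 -> ranking: d1 > d2
```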
Application
• Community-based Question Answering (CQA) systems:
  • Question search: given a queried question, find a semantically equivalent question.
• General search engines:
  • Given a query, rank documents.
Reference
• Multinomial distribution: http://en.wikipedia.org/wiki/Multinomial_distribution
• Likelihood function: http://en.wikipedia.org/wiki/Likelihood
• Maximum likelihood: http://en.wikipedia.org/wiki/Maximum_likelihood