Language Model
CSC4170 Web Intelligence and Social Computing, Tutorial 8
Tutor: Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk
Outline
• Language models
  • Finite automata and language models
  • Types of language models
  • Multinomial distributions over words
• Query likelihood model
• Application
• Q&A
• Reference
Language Models (LMs)
• How can we come up with good queries?
  • Think of words that would likely appear in a relevant document.
• Idea of LM:
  • A document is a good match to a query if the document model is likely to generate the query.
Language Models (LMs)
• Generative model:
  • Recognizes or generates strings.
  • The full set of strings that can be generated is called the language of the automaton.
• Language model:
  • A function that puts a probability measure over strings drawn from some vocabulary.
Language Models (LMs)
• Example 1: calculate the probability of a word sequence.
  • Multiply the probabilities that the model gives to each word in the sequence, together with the probability of continuing or stopping after producing each word.
  • P(frog said that toad likes frog) = (0.01 * 0.03 * 0.04 * 0.01 * 0.02 * 0.01) * (0.8 * 0.8 * 0.8 * 0.8 * 0.8 * 0.2) = 0.000000000001573
    (the model continues with probability 0.8 after each of the first five words and stops with probability 0.2 after the last)
  • Most of the time, we will omit the STOP and (1 − STOP) probabilities.
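A minimal Python sketch of this calculation. The word probabilities and the 0.8/0.2 continue/stop values come from the example above; the function and variable names are illustrative assumptions:

```python
# Minimal sketch of a unigram model with an explicit continue/stop decision.
# Word probabilities and 0.8/0.2 are taken from the example above.
word_prob = {"frog": 0.01, "said": 0.03, "that": 0.04,
             "toad": 0.01, "likes": 0.02}
P_CONTINUE, P_STOP = 0.8, 0.2

def sequence_probability(words):
    """P(w1 ... wn) under the unigram model, including continue/stop terms."""
    p = 1.0
    for i, w in enumerate(words):
        p *= word_prob[w]
        # Continue with 0.8 after every word except the last, then stop with 0.2.
        p *= P_CONTINUE if i < len(words) - 1 else P_STOP
    return p

print(sequence_probability("frog said that toad likes frog".split()))
# ~1.573e-12, matching the calculation above
```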
Language Models (LMs)
• Example 2: given a string s and two language models M1 and M2, compare the probability each model assigns to s; here P(s|M1) > P(s|M2), so M1 is the better explanation of s.
Language Models (LMs)
• Basic LM using the chain rule:
  • P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1 t2) P(t4|t1 t2 t3)
• Unigram language model:
  • Throws away all conditioning context: P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)
  • Most used in Information Retrieval.
• Bigram language model:
  • Conditions on the previous term: P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
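The contrast between the two models is easy to see in code. Below is a minimal sketch that estimates both from raw counts; the toy corpus is an illustrative assumption, not part of the tutorial:

```python
# Sketch: unigram vs. bigram estimates from raw counts of a toy corpus.
from collections import Counter

corpus = "the frog said that the toad likes the frog".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_unigram(seq):
    # No conditioning context: multiply standalone term probabilities.
    p = 1.0
    for w in seq:
        p *= unigrams[w] / N
    return p

def p_bigram(seq):
    # First term unconditioned; every later term conditioned on its predecessor.
    p = unigrams[seq[0]] / N
    for prev, w in zip(seq, seq[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

print(p_unigram(["the", "frog"]), p_bigram(["the", "frog"]))
```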
Language Models (LMs)
• Unigram LM:
  • Bag-of-words model.
  • Multinomial distribution over words:
    P(d) = ( Ld! / (tf(t1,d)! * tf(t2,d)! * ... * tf(tM,d)!) ) * P(t1)^tf(t1,d) * P(t2)^tf(t2,d) * ... * P(tM)^tf(tM,d)
  • The first factor is the multinomial coefficient; it can be left out in practical calculations.
  • Ld is the length of document d, M is the size of the vocabulary, and tf(t,d) is the frequency of term t in d.
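A minimal sketch of this computation, with illustrative term probabilities; the `include_coefficient` flag makes the point that the multinomial coefficient can be dropped:

```python
# Sketch: P(d) as a multinomial over words. Term probabilities are
# illustrative assumptions.
from collections import Counter
from math import factorial, prod

term_prob = {"frog": 0.01, "said": 0.03, "that": 0.04,
             "toad": 0.01, "likes": 0.02}

def multinomial_prob(doc_tokens, include_coefficient=True):
    tf = Counter(doc_tokens)
    p = prod(term_prob[t] ** c for t, c in tf.items())
    if include_coefficient:
        # Multinomial coefficient Ld! / (tf(t1,d)! * ... * tf(tM,d)!);
        # usually left out in practical calculations.
        p *= factorial(len(doc_tokens)) / prod(factorial(c) for c in tf.values())
    return p

print(multinomial_prob("frog said that toad likes frog".split()))
```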
Query Likelihood Model
• Query likelihood model:
  • Rank documents by P(d|q), the likelihood that document d is relevant to the query.
  • Using Bayes' rule: P(d|q) = P(q|d) P(d) / P(q)
  • P(q) is the same for all documents, and P(d) is treated as uniform across all d, so ranking by P(d|q) is equivalent to ranking by P(q|d).
Query Likelihood Model
• Multinomial + unigram:
  • P(q|Md) = Kq * ∏ over terms t of P(t|Md)^tf(t,q), where Kq is the multinomial coefficient for the query q and can be ignored for ranking.
• Retrieval based on a language model (see the sketch below):
  • Infer a LM Md for each document d.
  • Estimate P(q|Md).
  • Rank the documents according to these probabilities.
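To make the procedure concrete, here is a sketch of the ranking loop with the estimator of P(t|Md) left pluggable; the names `score`, `rank`, and `estimate` are assumptions for illustration:

```python
# Sketch of the retrieval loop: score each document by P(q|Md) under its own
# unigram model, then sort. `estimate` is any P(t|Md) estimator, e.g. the
# smoothed estimators sketched further below.
from collections import Counter

def score(query_tokens, doc_tokens, estimate):
    tf_d = Counter(doc_tokens)
    p = 1.0
    for t in query_tokens:
        p *= estimate(t, tf_d, len(doc_tokens))
    return p

def rank(query, docs, estimate):
    scored = [(score(query.split(), d.split(), estimate), d) for d in docs]
    return sorted(scored, reverse=True)
```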
Query Likelihood Model
• Estimating the query generation probability:
  • Maximum Likelihood Estimation (MLE) + unigram LM: the MLE estimate is P(t|Md) = tf(t,d) / Ld
• Limitations (see the sketch below):
  • If we estimate P(t|Md) = 0 for some term, documents give a query nonzero probability only if all of the query terms appear in the document.
  • Words that do occur are also poorly estimated: the probability of a word occurring once in the document is overestimated, because its one occurrence was partly by chance.
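A sketch of the MLE estimator, in the pluggable shape used by the ranking loop above, and a demonstration of the zero-probability problem (the demo document is d1 from the example further below; names are illustrative):

```python
# Sketch: MLE estimator and the zero-probability problem.
from collections import Counter

def mle(t, tf_d, L_d):
    return tf_d[t] / L_d  # tf(t,d) / Ld

doc = "xyzzy reports a profit but revenue is down".split()
tf_d = Counter(doc)
print(mle("revenue", tf_d, len(doc)))    # 1/8: term occurs in the document
print(mle("decreases", tf_d, len(doc)))  # 0.0: one missing term zeroes P(q|Md)
```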
Query Likelihood Model
• Estimating the query generation probability:
  • Maximum Likelihood Estimation (MLE) + unigram LM.
• Smoothing:
  • Use the whole collection (model Mc) to smooth the document model.
  • Linear interpolation (Jelinek-Mercer smoothing): P(t|d) = λ P(t|Md) + (1 − λ) P(t|Mc)
  • Bayesian smoothing (Dirichlet prior): P(t|d) = (tf(t,d) + α P(t|Mc)) / (Ld + α)
Query Likelihood Model
• Query likelihood model with linear interpolation:
  P(q|d) ∝ ∏ over t in q of ( λ P(t|Md) + (1 − λ) P(t|Mc) )
• Query likelihood model with Bayesian smoothing:
  P(q|d) ∝ ∏ over t in q of ( (tf(t,d) + α P(t|Mc)) / (Ld + α) )
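A sketch of both smoothed estimators, in the same pluggable shape as the ranking loop above; the toy collection and the default parameter values (λ = 0.5, α = 1000) are illustrative assumptions:

```python
# Sketch: Jelinek-Mercer and Dirichlet-smoothed P(t|Md) estimators.
from collections import Counter

collection = ("xyzzy reports a profit but revenue is down "
              "quorus narrows quarter loss but revenue decreases further").split()
c_tf, c_len = Counter(collection), len(collection)

def jelinek_mercer(t, tf_d, L_d, lam=0.5):
    # Linear interpolation of the document and collection models.
    return lam * tf_d[t] / L_d + (1 - lam) * c_tf[t] / c_len

def dirichlet(t, tf_d, L_d, alpha=1000):
    # Bayesian smoothing with a Dirichlet prior scaled by alpha.
    return (tf_d[t] + alpha * c_tf[t] / c_len) / (L_d + alpha)
```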
Query Likelihood Model
• Example using unigram + MLE + linear interpolation:
  • d1: Xyzzy reports a profit but revenue is down
  • d2: Quorus narrows quarter loss but revenue decreases further
  • λ = 1/2, query: revenue down
  • P(q|d1) = [(1/8 + 2/16)/2] * [(1/8 + 1/16)/2] = 1/8 * 3/32 = 3/256 ≈ 0.0117
  • P(q|d2) = [(1/8 + 2/16)/2] * [(0/8 + 1/16)/2] = 1/8 * 1/32 = 1/256 ≈ 0.0039
  • Ranking: d1 > d2
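This ranking can be reproduced in a few lines; the sketch below hard-codes the two documents and λ = 1/2 from the slide, lowercasing tokens for matching:

```python
# Sketch reproducing the example: Jelinek-Mercer smoothing with lambda = 1/2.
from collections import Counter

d1 = "xyzzy reports a profit but revenue is down".split()
d2 = "quorus narrows quarter loss but revenue decreases further".split()
c_tf, c_len = Counter(d1 + d2), len(d1 + d2)

def score(query, doc, lam=0.5):
    tf, L = Counter(doc), len(doc)
    p = 1.0
    for t in query:
        p *= lam * tf[t] / L + (1 - lam) * c_tf[t] / c_len
    return p

query = "revenue down".split()
print(score(query, d1))  # 3/256 ~ 0.0117
print(score(query, d2))  # 1/256 ~ 0.0039 -> ranking: d1 > d2
```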
Application
• Community-based Question Answering (CQA) systems:
  • Question search: given a queried question, find a semantically equivalent question.
• General search engines:
  • Given a query, rank documents.
Reference
• Multinomial distribution: http://en.wikipedia.org/wiki/Multinomial_distribution
• Likelihood function: http://en.wikipedia.org/wiki/Likelihood
• Maximum likelihood: http://en.wikipedia.org/wiki/Maximum_likelihood