Language Models for Information Retrieval Andy Luong and Nikita Sudan
Outline • Language Model • Types of Language Models • Query Likelihood Model • Smoothing • Evaluation • Comparison with other approaches
Language Model • A language model is a function that puts a probability measure over strings drawn from some vocabulary.
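Language Models • Rank documents by P(q | Md), the probability that the document's language model Md generates the query q, instead of by P(R = 1 | q, d), the probability that d is relevant to q.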
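Example • Doc1: “frog said that toad likes frog” • Doc2: “toad likes frog” • M1 (estimated from Doc1): P(frog) = 1/3; P(said) = P(that) = P(toad) = P(likes) = 1/6 • M2 (estimated from Doc2): P(toad) = P(likes) = P(frog) = 1/3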
Example Continued • q = “frog likes toad” • P(q | M1) = (1/3)*(1/6)*(1/6)*0.8*0.8*0.2 ≈ 0.0012 • P(q | M2) = (1/3)*(1/3)*(1/3)*0.8*0.8*0.2 ≈ 0.0047 • P(q | M1) < P(q | M2), so Doc2 ranks above Doc1
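The arithmetic in the example above can be checked with a short sketch. It assumes the factors 0.8, 0.8 and 0.2 are continue/stop probabilities (continue after each of the first two query terms, stop after the third); the slide does not label them, so this reading is an assumption. The helper names `mle_unigram` and `query_prob` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def mle_unigram(text):
    """Maximum likelihood unigram model: count / document length."""
    tokens = text.split()
    counts = Counter(tokens)
    return {term: Fraction(count, len(tokens)) for term, count in counts.items()}


def query_prob(query, model, p_stop=Fraction(1, 5)):
    """Unsmoothed P(q | M), with an assumed stop probability of 0.2."""
    terms = query.split()
    prob = Fraction(1)
    for term in terms:
        # Unseen terms get probability 0 (no smoothing here).
        prob *= model.get(term, Fraction(0))
    # Continue after every term except the last, then stop.
    prob *= (1 - p_stop) ** (len(terms) - 1) * p_stop
    return prob


m1 = mle_unigram("frog said that toad likes frog")
m2 = mle_unigram("toad likes frog")
p1 = query_prob("frog likes toad", m1)
p2 = query_prob("frog likes toad", m2)
print(float(p1), float(p2))
```

Exact fractions are used so the result can be compared directly with the hand calculation; the continue/stop factor is the same for both documents, so it does not affect the ranking.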
Types of Language Models • Chain rule • Unigram LM • Bigram LM
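The formulas for these three labels did not survive in this transcript. As a sketch, the standard textbook forms for a four-term sequence are given below; the sequence length and the subscripts "uni" and "bi" are notational assumptions, not taken from the slide.

```latex
\begin{align*}
\text{Chain rule:} \quad & P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1 t_2)\,P(t_4 \mid t_1 t_2 t_3) \\
\text{Unigram LM:} \quad & P_{\mathrm{uni}}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2)\,P(t_3)\,P(t_4) \\
\text{Bigram LM:} \quad & P_{\mathrm{bi}}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2)\,P(t_4 \mid t_3)
\end{align*}
```

The unigram model drops all conditioning context, while the bigram model conditions each term only on the one before it.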
Multinomial distribution • Frequency • Order constraint • M is the size of the term vocabulary
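The equation on this slide is missing from the transcript. A plausible reconstruction, assuming the slide showed the usual multinomial form of the unigram model, is sketched below. Here tf_{t_i,d} is the frequency of term t_i in document d and L_d is the document length; reading "Frequency" as the exponents and "Order constraint" as the multinomial coefficient is an assumption.

```latex
\[
P(d) = \frac{L_d!}{tf_{t_1,d}!\; tf_{t_2,d}! \cdots tf_{t_M,d}!}
       \prod_{i=1}^{M} P(t_i)^{tf_{t_i,d}},
\qquad L_d = \sum_{i=1}^{M} tf_{t_i,d}
\]
```

The coefficient counts the orderings of the same bag of words. It is identical for every model scoring a fixed query, so it does not affect ranking.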
Query Likelihood Model • Infer LM for each document • Estimate P(q | Md(i)) • Rank documents based on probabilities
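The three steps above can be sketched as follows. This is a minimal, unsmoothed illustration that assumes whitespace tokenization and maximum likelihood estimates; the function names are hypothetical.

```python
from collections import Counter
from math import prod


def unigram_model(text):
    """Step 1: infer a unigram LM for one document (MLE estimates)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def query_likelihood(query, model):
    """Step 2: estimate P(q | Md) as a product of term probabilities."""
    return prod(model.get(term, 0.0) for term in query.lower().split())


def rank_documents(query, docs):
    """Step 3: rank documents by descending query likelihood."""
    scored = [
        (name, query_likelihood(query, unigram_model(text)))
        for name, text in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    "Doc1": "frog said that toad likes frog",
    "Doc2": "toad likes frog",
}
ranking = rank_documents("frog likes toad", docs)
print(ranking)
```

Without smoothing, any query term missing from a document drives that document's score to zero, which is the problem the smoothing slides address.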
Smoothing • Basic Intuition • New word or unseen word in the document • P( t | Md) = 0 • Zero probabilities will make P ( q | Md) = 0 • Why else should we smooth?
Smoothing Continued • Non-occurring term • Probability bound • Linear interpolation language model
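The formulas behind these labels are missing from the transcript. A sketch of the standard forms follows, assuming cf_t is the number of occurrences of term t in the whole collection, T is the total number of tokens in the collection, M_c is the collection model, and 0 < λ < 1; these symbols are assumptions.

```latex
\begin{align*}
\text{Bound for a non-occurring term:} \quad & \hat{P}(t \mid M_d) \le \frac{cf_t}{T} \\
\text{Linear interpolation:} \quad & \hat{P}(t \mid d) = \lambda\, \hat{P}_{\mathrm{mle}}(t \mid M_d) + (1 - \lambda)\, \hat{P}_{\mathrm{mle}}(t \mid M_c)
\end{align*}
```

A large λ trusts the document model more; a small λ pulls every estimate toward the collection model.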
Example • Doc1: “frog said that toad likes frog” • Doc2: “toad likes frog” • Collection model (both documents, 9 tokens): P(frog) = 1/3; P(said) = P(that) = 1/9; P(toad) = P(likes) = 2/9
Example Continued • q = “frog said”, λ = ½ • P(q | M1) = [(1/3 + 1/3)*(1/2)] * [(1/6 + 1/9)*(1/2)] = 5/108 ≈ 0.046 • P(q | M2) = [(1/3 + 1/3)*(1/2)] * [(0 + 1/9)*(1/2)] = 1/54 ≈ 0.019 • P(q | M1) > P(q | M2)
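The smoothed example can be reproduced with a small sketch of linear interpolation between the document model and the collection model, using λ = 1/2 as on the slide. The helper names are hypothetical, and the collection model is assumed to be estimated from the two documents concatenated.

```python
from collections import Counter
from fractions import Fraction


def mle(tokens):
    """Maximum likelihood unigram estimates for a token list."""
    counts = Counter(tokens)
    return {term: Fraction(count, len(tokens)) for term, count in counts.items()}


def smoothed_query_prob(query, doc_model, coll_model, lam=Fraction(1, 2)):
    """P(q | d) with linear interpolation smoothing."""
    prob = Fraction(1)
    for term in query.split():
        p_doc = doc_model.get(term, Fraction(0))
        p_coll = coll_model.get(term, Fraction(0))
        prob *= lam * p_doc + (1 - lam) * p_coll
    return prob


docs = {
    "Doc1": "frog said that toad likes frog",
    "Doc2": "toad likes frog",
}
tokens = {name: text.split() for name, text in docs.items()}
# Collection model: all tokens from all documents pooled together.
coll_model = mle([t for toks in tokens.values() for t in toks])
scores = {
    name: smoothed_query_prob("frog said", mle(toks), coll_model)
    for name, toks in tokens.items()
}
print({name: float(score) for name, score in scores.items()})
```

Doc2 never contains “said”, yet it still receives a nonzero score because the collection model contributes 1/9 for that term.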
Evaluation • Precision = |relevant documents ∩ retrieved documents| / |retrieved documents| • Recall = |relevant documents ∩ retrieved documents| / |relevant documents|
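Both measures can be computed directly from sets of document identifiers. The sketch below uses made-up identifiers purely for illustration, and returning 0.0 for an empty set is a convention chosen here, not something the slide specifies.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against division by zero when either set is empty.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 4 documents retrieved, 3 relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```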
Tf-Idf • The importance of a term increases proportionally to the number of times it appears in the document but is offset by how frequently the term occurs across the corpus.
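A minimal sketch of one common tf-idf variant follows: raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The slide does not say which weighting variant was intended, so this choice is an assumption.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc name: {term: tf * log(N / df)}} for a dict of texts."""
    tokenized = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    weights = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        weights[name] = {
            term: tf[term] * math.log(n_docs / df[term]) for term in tf
        }
    return weights


weights = tf_idf({
    "Doc1": "frog said that toad likes frog",
    "Doc2": "toad likes frog",
})
print(weights["Doc1"])
```

On the two example documents, “frog” appears in both, so its weight is zero even though it is the most frequent term in Doc1; only “said” and “that” help to distinguish Doc1.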
Pros and Cons • Pro: “Mathematically precise, conceptually simple, computationally tractable and intuitively appealing.” • Con: Relevance is not modeled explicitly
Query vs. Document Model • (a) Query Likelihood • (b) Document Likelihood • (c) Model Comparison