Empirical Development of an Exponential Probabilistic Model Using Textual Analysis to Build a Better Model Jaime Teevan & David R. Karger CSAIL (LCS+AI), MIT
Goal: Better Generative Model • Generative v. discriminative model • Applies to many applications • Information retrieval (IR) • Relevance feedback • Using unlabeled data • Classification • Assumptions explicit
Using a Model for IR • Define model • Learn parameters from query • Rank documents • A better model improves applications • Trickles down to improve retrieval • Classification, relevance feedback, … • Corpus-specific models
Overview • Related work • Probabilistic models • Example: Poisson Model • Compare model to text • Hyper-learning the model • Exponential framework • Investigate retrieval performance • Conclusion and future work
Related Work • Using text for retrieval algorithm • [Jones, 1972], [Greiff, 1998] • Using text to model text • [Church & Gale, 1995], [Katz, 1996] • Learning model parameters • [Zhai & Lafferty, 2002] Hyper-learn the model from text!
Probabilistic Models • Rank documents by RV = Pr(rel|d) • Naïve Bayesian models
Probabilistic Models • Rank documents by RV = Pr(rel|d) • Naïve Bayesian models: RV = Pr(rel|d) ∝ Pr(d|rel) = ∏ over features t of Pr(dt|rel), where dt = # occurrences of term t in the document and the features are all words • Open assumptions • Feature definition • Feature distribution family • Defines the model!
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents • Poisson Model • θ: specifies the term's distribution • Pr(dt|rel) = θ^dt e^(−θ) / dt!
Example Poisson Distribution [Plot: Pr(dt|rel) vs. the number of times the term occurs exactly dt times, for θ = 0.0006; the curve drops to Pr(dt|rel) ≈ 1E-15 within a few occurrences.]
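To make the example concrete, here is a minimal sketch (in Python; the code is not part of the original slides) of the Poisson term model and the probabilities it assigns for a small θ:

```python
import math

def poisson_pmf(d_t, theta):
    """Pr(d_t | rel) under the Poisson model: theta^d_t * e^(-theta) / d_t!."""
    return (theta ** d_t) * math.exp(-theta) / math.factorial(d_t)

# With theta = 0.0006 almost all probability mass sits on zero occurrences,
# so even a handful of occurrences gets a vanishingly small probability.
print(poisson_pmf(0, 0.0006))   # ~0.9994
print(poisson_pmf(3, 0.0006))   # already tiny (~3.6e-11)
```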
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents • Learn a θ for each term • Maximum likelihood θ • Term's average number of occurrences • Incorporate prior expectations
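A sketch of this parameter-learning step (Python; the prior values are illustrative assumptions, not the paper's settings): the maximum-likelihood θ is simply the term's average count over the labeled relevant documents, and a prior can be folded in as pseudo-observations.

```python
def estimate_theta(counts, prior_mean=0.0006, prior_docs=1.0):
    """Estimate a term's Poisson rate from its counts in labeled relevant docs.

    counts: the term's occurrence count in each labeled relevant document.
    The prior acts as `prior_docs` pseudo-documents with average count
    `prior_mean` (illustrative values, not the paper's).  Without the prior
    this reduces to the maximum-likelihood estimate: the term's average
    number of occurrences.
    """
    return (sum(counts) + prior_docs * prior_mean) / (len(counts) + prior_docs)
```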
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents • For each document, find RV = ∏ over words t of Pr(dt|rel) • Sort documents by RV
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents • For each document, find RV = ∏ over words t of Pr(dt|rel) • Sort documents by RV • Which step goes wrong?
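Before looking at where the model breaks down, here is a minimal sketch of the ranking step itself (Python, not from the slides): compute each document's relevance value as the product of per-term probabilities, done in log space for numerical stability, and sort.

```python
import math

def rank_documents(docs, thetas):
    """Rank documents by RV = product over terms t of Pr(d_t | rel), using
    per-term Poisson models.  docs is a list of {term: count} dicts and
    thetas maps each term to its learned rate; both are assumed inputs.
    """
    def log_rv(doc):
        score = 0.0
        for term, theta in thetas.items():
            d_t = doc.get(term, 0)
            # log of the Poisson pmf: d_t*log(theta) - theta - log(d_t!)
            score += d_t * math.log(theta) - theta - math.log(math.factorial(d_t))
        return score

    order = sorted(range(len(docs)), key=lambda i: log_rv(docs[i]), reverse=True)
    return order  # document indices, best first
```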
Using a Naïve Bayesian Model • Define model • Learn parameters from query • Rank documents • Pr(dt|rel) = θ^dt e^(−θ) / dt!
How Good is the Model? [Plot: the Poisson fit with θ = 0.0006, Pr(dt|rel) vs. the number of times the term occurs exactly dt times, shown out to 15 occurrences.]
How Good is the Model? [Plot: the same Poisson curve compared against observed term occurrences in text, annotated "Misfit!"]
Hyper-learning a Better Fit Through Textual Analysis Using an Exponential Framework
Hyper-Learning Framework • Need a framework for hyper-learning • Candidate distribution families: mixtures, Poisson, Bernoulli, Normal
Hyper-Learning Framework • Need a framework for hyper-learning • Goal: same benefits as the Poisson Model • One parameter • Easy to work with (e.g., prior) • Framework: one-parameter exponential families
Exponential Framework • Well understood, learning easy • [Bernardo & Smith, 1994], [Gous, 1998] • Pr(dt|rel) = f(dt) g(θ) e^(θ h(dt)) • Functions f(dt) and h(dt) specify the family • E.g., Poisson: f(dt) = (dt!)^−1, h(dt) = dt • Parameter θ specifies the term's distribution
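A sketch of the exponential framework in code (Python; normalizing over a truncated support is an assumption of this sketch, not something the slides specify). Plugging in f(dt) = 1/dt! and h(dt) = dt recovers the Poisson family, with the natural parameter θ equal to the log of the Poisson rate.

```python
import math

def exp_family_pmf(d_t, theta, f, h, support=range(0, 100)):
    """Pr(d_t | rel) = f(d_t) * g(theta) * exp(theta * h(d_t)).
    g(theta) is just the normalizer, computed here by summing over a
    truncated support (an assumption made for this sketch).
    """
    z = sum(f(k) * math.exp(theta * h(k)) for k in support)  # = 1 / g(theta)
    return f(d_t) * math.exp(theta * h(d_t)) / z

# Poisson as a special case: f(k) = 1/k!, h(k) = k, theta = log(rate).
poisson_f = lambda k: 1.0 / math.factorial(k)
poisson_h = lambda k: float(k)
print(exp_family_pmf(0, math.log(0.0006), poisson_f, poisson_h))  # ~0.9994
```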
Using a Hyper-learned Model • Define model • Learn parameters from query • Rank documents
Using a Hyper-learned Model • Hyper-learn model • Learn parameters from query • Rank documents
Using a Hyper-learned Model • Hyper-learn model • Learn parameters from query • Rank documents • Want “best” f(dt) and h(dt) • Iterative hill climbing • Local maximum • Poisson starting point
Using a Hyper-learned Model • Hyper-learn model • Learn parameters from query • Rank documents • Data: TREC query result sets • Past queries to learn about future queries • Hyper-learn and test with different sets
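A rough sketch of the hyper-learning step (Python; heavily simplified): tabulate h at each count, start from the Poisson choice h(dt) = dt, and greedily perturb entries, keeping any change that raises the training log-likelihood on the result sets. The `log_likelihood` callback, step size, and table size are assumptions of this sketch; the actual fitting procedure is more involved.

```python
def hyper_learn_h(log_likelihood, max_count=50, step=0.1, max_iters=100):
    """Hill-climb a tabulated h(0..max_count), starting from the Poisson h(k)=k.
    log_likelihood(h) is assumed to score a candidate h on the training
    result sets; the climb stops at a local maximum.
    """
    h = [float(k) for k in range(max_count + 1)]  # Poisson starting point
    best = log_likelihood(h)
    for _ in range(max_iters):
        improved = False
        for k in range(len(h)):
            for delta in (step, -step):
                candidate = h[:k] + [h[k] + delta] + h[k + 1:]
                score = log_likelihood(candidate)
                if score > best:
                    h, best, improved = candidate, score, True
        if not improved:
            break  # local maximum reached
    return h
```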
Recall the Poisson Distribution [Plot: Pr(dt|rel) vs. the number of times the term occurs exactly dt times, shown out to 15 occurrences.]
Poisson Starting Point - h(dt) [Plot: h(dt) vs. dt for the Poisson starting point, where h(dt) = dt is a straight line; Pr(dt|rel) = f(dt) g(θ) e^(θ h(dt)).]
Hyper-learned Model - h(dt) [Plot: the hyper-learned h(dt) vs. dt; Pr(dt|rel) = f(dt) g(θ) e^(θ h(dt)).]
Poisson Distribution [Plot: Pr(dt|rel) vs. the number of times the term occurs exactly dt times, shown out to 15 occurrences.]
Hyper-learned Distribution [Plot: Pr(dt|rel) vs. the number of times the term occurs exactly dt times, shown out to 15 occurrences.]
Hyper-learned Distribution [Plot: the same distribution, shown out to 5 occurrences.]
Hyper-learned Distribution [Plot: the same distribution, shown out to 30 occurrences.]
Hyper-learned Distribution [Plot: the same distribution, shown out to 300 occurrences; the hyper-learned model is heavy tailed.]
Performing Retrieval • Hyper-learn model • Learn parameters from query • Rank documents
Performing Retrieval (labeled docs) • Hyper-learn model • Learn parameters from query • Rank documents • Learn θ for each term: Pr(dt|rel) = f(dt) g(θ) e^(θ h(dt))
Learning θ • Sufficient statistics • Summarize all observed data • τ1: # of observations • τ2: Σ over observations d of h(dt) • Incorporating a prior is easy • Map τ1 and τ2 → θ • Experiments: 20 labeled documents
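A sketch of this step (Python; the pseudo-observation prior and the `solve_theta` mapping are assumptions of the sketch): accumulate the two sufficient statistics, fold in the prior, and map them to θ.

```python
def learn_theta(counts, h, solve_theta, prior_tau1=1.0, prior_tau2=0.0):
    """Learn one term's theta from its counts in the labeled documents.

    tau1 = number of observations, tau2 = sum of h(d_t) over the observations;
    the prior enters as pseudo-observations (prior_tau1, prior_tau2).
    solve_theta maps the observed mean of h(d_t) to theta and depends on the
    chosen family (an assumed callback in this sketch).
    """
    tau1 = prior_tau1 + len(counts)
    tau2 = prior_tau2 + sum(h(d_t) for d_t in counts)
    return solve_theta(tau2 / tau1)
```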
Results: Labeled Documents [Precision-recall plot for retrieval with labeled documents.]
Results: Labeled Documents [Second precision-recall plot for retrieval with labeled documents.]
Performing Retrieval (short query) • Hyper-learn model • Learn parameters from query • Rank documents
Retrieval: Query • Query = single labeled document • Vector space-like equation: RV = Σ over terms t in the document of a(t,d) + Σ over terms q in the query of b(q,d) • Problem: the document portion dominates • Solution: use only the query portion • Another solution: normalize
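A sketch of the query-based scoring (Python; the a and b term-contribution functions are placeholders for whatever the model derives, and the query-only switch mirrors the two fixes just mentioned):

```python
def query_rv(doc, query_terms, a, b, query_only=True):
    """RV = sum over terms t in the document of a(t, doc)
         + sum over terms q in the query of b(q, doc).
    Because the document sum tends to dominate a short query, the
    query_only flag keeps just the query portion (one of the two fixes
    mentioned above; normalizing is the other).
    """
    query_part = sum(b(q, doc) for q in query_terms)
    if query_only:
        return query_part
    doc_part = sum(a(t, doc) for t in doc)
    return doc_part + query_part
```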
Retrieval: Query [Precision-recall plots for the query-based retrieval experiments.]
Conclusion • Probabilistic models • Example: Poisson Model (a bad model of text) • Hyper-learning the model • Exponential framework (easy to work with) • Learned a better model (heavy tailed!) • Investigated retrieval performance (better …)
Future Work • Use model better • Use for other applications • Other IR applications • Classification • Correct for document length • Hyper-learn on different corpora • Test if learned model generalizes • Different for genre? Language? People? • Hyper-learn model better
Questions? Contact us with questions: Jaime Teevan teevan@ai.mit.edu David Karger karger@theory.lcs.mit.edu