Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval Ben Carterette and Praveen Chandar Dept. of Computer and Information Science, University of Delaware, Newark, DE (CIKM '09) Date: 2010/05/03 Speaker: Lin, Yi-Jhen Advisor: Dr. Koh, Jia-Ling
Agenda • Introduction - Motivation, Goal • Faceted Topic Retrieval - Task, Evaluation • Faceted Topic Retrieval Models - 4 kinds of models • Experiment & Results • Conclusion
Introduction - Motivation • Modeling documents as independently relevant does not necessarily provide the optimal user experience.
Introduction - Motivation • A traditional evaluation measure would reward System 1 since it has higher recall, but we actually prefer System 2 since it covers more information: System 2 is better!
Introduction • Novelty and diversity become the new definitions of relevance and the new evaluation measures. • They can be achieved by retrieving documents that are relevant to the query but cover different facets of the topic. We call this faceted topic retrieval!
Introduction - Goal • A faceted topic retrieval system must be able to find a small set of documents that covers all of the facets • 3 documents that cover 10 facets are preferable to 5 documents that cover the same 10 facets
Faceted Topic Retrieval - Task Define the task in terms of • Information need : A faceted topic retrieval information need is one that has a set of answers – facets – that are clearly delineated • How that need is best satisfied : Each answer is fully contained within at least one document
Faceted Topic Retrieval - Task • Information need • Facets (a set of answers): shift to coal; invest in next generation technologies; shift to biodiesel; invest in renewable energy sources; increase use of renewable energy sources; double ethanol in gas supply
Faceted Topic Retrieval • Input: a query (a short list of keywords) • Output of our system: a ranked list of documents D1, D2, ..., Dn that contains as many unique facets as possible
Faceted Topic Retrieval -Evaluation • S-recall • S-precision • Redundancy
Evaluation – an example for S-recall and S-precision • Total: 10 facets (assume the facets in the documents do not overlap)
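To make the measures above concrete, here is a minimal Python sketch of S-recall and S-precision, assuming facet judgments are given as a mapping from each document to the set of facets it contains. The function and variable names are illustrative, and the minimum rank needed to reach a recall level is approximated greedily (computing it exactly is a set-cover problem).

```python
# Minimal sketch of the subtopic-based measures, assuming facet judgments are
# given as a dict mapping each document id to the set of facets it contains.
# Names (facets_of, s_recall, s_precision) are illustrative.

def s_recall(ranking, facets_of, n_facets, k):
    """Fraction of all facets covered by the top-k ranked documents."""
    covered = set()
    for doc in ranking[:k]:
        covered |= facets_of.get(doc, set())
    return len(covered) / n_facets

def min_rank(facets_of, n_facets, target_recall):
    """Greedy approximation of the smallest number of documents that could
    reach the target recall (the exact optimum is a set-cover problem)."""
    covered, rank = set(), 0
    remaining = dict(facets_of)
    while len(covered) < target_recall * n_facets and remaining:
        best = max(remaining, key=lambda d: len(remaining[d] - covered))
        covered |= remaining.pop(best)
        rank += 1
    return rank

def s_precision(ranking, facets_of, n_facets, target_recall):
    """Ratio of the (approximately) optimal rank to the rank at which the
    system actually reaches the target recall."""
    for k in range(1, len(ranking) + 1):
        if s_recall(ranking, facets_of, n_facets, k) >= target_recall:
            return min_rank(facets_of, n_facets, target_recall) / k
    return 0.0
```

For the example above, a system whose top two documents together cover 4 of the 10 facets would have S-recall 0.4 at rank 2.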
Faceted topic retrieval models • 4 kinds of models - MMR (Maximal Marginal Relevance) - Probabilistic Interpretation of MMR - Greedy Result Set Pruning - A Probabilistic Set-Based Approach
1. MMR • Rank greedily by maximal marginal relevance:
MMR(Di) = λ · sim1(Di, Q) − (1 − λ) · max_{Dj ∈ S} sim2(Di, Dj)
where Q is the query, S is the set of documents already selected, sim1 scores relevance to the query, and sim2 scores similarity between documents; at each step the document with the highest MMR score is added to S.
2. Probabilistic Interpretation of MMR • Rank documents by expected cost over the four joint relevance/novelty outcomes, with costs c1 through c4. Setting c1 = 0 (a relevant, novel document costs nothing to retrieve) and c3 = c4 (a non-relevant document costs the same whether or not it is novel), the ranking is driven only by each document's probability of relevance and probability of novelty.
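A minimal sketch of the greedy MMR re-ranking loop, assuming a relevance score rel(d) and a pairwise similarity sim(di, dj) are supplied externally (e.g., query-likelihood scores and cosine similarity, as in the experiments later); the names and the default λ are illustrative.

```python
# Minimal sketch of greedy MMR re-ranking. rel(d) and sim(d1, d2) are
# assumed to be provided by the retrieval system (e.g. query likelihood
# and cosine similarity); lam trades off relevance against novelty.

def mmr_rerank(candidates, rel, sim, lam=0.5, k=10):
    """Greedily build a ranking that trades off relevance against
    maximum similarity to the documents already selected."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(d):
            novelty_penalty = max((sim(d, s) for s in selected), default=0.0)
            return lam * rel(d) - (1 - lam) * novelty_penalty
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```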
3. Greedy Result Set Pruning • First, rank without considering novelty (in order of relevance) • Second, step down the list of documents and prune documents with similarity greater than some threshold θ • I.e., at rank i, remove any document Dj, j > i, with sim(Dj, Di) > θ
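A minimal sketch of the pruning step, assuming the input list is already ordered by relevance and a pairwise similarity function (e.g., cosine) is available.

```python
# Minimal sketch of greedy result set pruning over a relevance-ordered list.

def prune_ranking(ranking, sim, theta):
    """Walk down the relevance-ordered list and drop any document that is
    too similar (> theta) to a document kept earlier in the list."""
    kept = []
    for d in ranking:
        if all(sim(d, prev) <= theta for prev in kept):
            kept.append(d)
    return kept
```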
4. A Probabilistic Set-Based Approach • P(Fj ∈ Di): the probability that document Di contains facet Fj • The probability that a facet Fj occurs in at least one document in a set D is
P(Fj ∈ D) = 1 − ∏_{Di ∈ D} (1 − P(Fj ∈ Di))
• The probability that all of the facets in a set F are captured by the documents D is
P(F ⊆ D) = ∏_{Fj ∈ F} [ 1 − ∏_{Di ∈ D} (1 − P(Fj ∈ Di)) ]
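A minimal sketch of the two coverage probabilities, assuming p[j][i] holds P(Fj ∈ Di); how those per-document probabilities are estimated is the subject of Sections 4.1 and 4.2.

```python
# Minimal sketch of the set-based coverage probabilities. p[j][i] is assumed
# to hold P(F_j in D_i), the probability that document D_i contains facet F_j.
from math import prod

def p_facet_covered(p_ji):
    """P(F_j in D): probability that facet j occurs in at least one
    document of the set, given its per-document probabilities p_ji."""
    return 1.0 - prod(1.0 - p for p in p_ji)

def p_all_facets_covered(p):
    """P(F subset of D): probability that every facet is covered by the set,
    assuming independence across facets and documents."""
    return prod(p_facet_covered(row) for row in p)
```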
4. A Probabilistic Set-Based Approach • 4.1 Hypothesizing Facets • 4.2 Estimating Document-Facet Probabilities • 4.3 Maximizing Likelihood
4.1 Hypothesizing Facets Two unsupervised probabilistic methods: • Relevance modeling • Topic modeling with LDA Instead of extracting facets directly from any particular word or phrase, we build a "facet model" P(w|F)
4.1 Hypothesizing Facets • Since we do not know the facet terms or the set of documents relevant to each facet, we estimate them from the retrieved documents • Obtain m facet models from the top m retrieved documents by taking each document along with its k nearest neighbors as the basis for a facet model
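A minimal sketch of this facet-hypothesizing step, assuming the retrieved documents are available as raw text; the use of TF-IDF vectors and cosine similarity to find the k nearest neighbors is an illustrative choice, not one prescribed by the slide.

```python
# Minimal sketch: each of the top-m retrieved documents plus its k nearest
# neighbors (cosine similarity over TF-IDF vectors, an illustrative choice)
# forms the document set behind one facet model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def hypothesize_facet_sets(retrieved_texts, m=10, k=5):
    """Return m lists of document indices; each list seeds one facet model."""
    X = TfidfVectorizer(stop_words="english").fit_transform(retrieved_texts)
    sims = cosine_similarity(X)
    facet_sets = []
    for i in range(min(m, len(retrieved_texts))):
        # k nearest neighbors of document i (excluding itself), plus i
        neighbors = [j for j in np.argsort(-sims[i]) if j != i][:k]
        facet_sets.append([i] + neighbors)
    return facet_sets
```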
Relevance modeling • Estimate m "facet models" P(w|Fj) from the set of retrieved documents using the so-called RM2 approach:
P(w|Fj) ∝ P(w) · ∏_k Σ_{D ∈ DFj} P(fk|D) P(D|w), with P(D|w) = P(w|D) P(D) / P(w)
• DFj: the set of documents relevant to facet Fj • fk: facet terms
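A minimal sketch of an RM2-style estimate of P(w|Fj) for one facet, following the standard RM2 estimator with the facet terms playing the role of query terms. The inputs (smoothed per-document unigram models, a collection model P(w), and a heuristic choice of facet terms such as the seed document's highest-probability terms) are assumptions for illustration, not details taken from this slide.

```python
# Minimal sketch of an RM2-style facet model. doc_lms is a list of smoothed
# unigram language models P(w|D) (dicts over a shared vocabulary) for the
# documents in D_Fj; p_w is the collection model P(w); facet_terms is a
# heuristic set of facet terms f_k (an assumption for illustration).
import math

def rm2_facet_model(doc_lms, p_w, facet_terms, vocab):
    """P(w|F_j) proportional to P(w) * prod_k sum_D P(f_k|D) P(D|w),
    with P(D|w) = P(w|D) P(D) / P(w) and a uniform prior P(D)."""
    n_docs = len(doc_lms)
    log_scores = {}
    for w in vocab:
        log_score = math.log(p_w[w])
        for f in facet_terms:
            s = sum(lm[f] * lm[w] * (1.0 / n_docs) / p_w[w] for lm in doc_lms)
            log_score += math.log(s + 1e-12)
        log_scores[w] = log_score
    # normalize the log scores into a distribution over the vocabulary
    m = max(log_scores.values())
    total = sum(math.exp(v - m) for v in log_scores.values())
    return {w: math.exp(v - m) / total for w, v in log_scores.items()}
```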
Topic modeling with LDA • The probabilities P(w|Fj) and P(Fj) can be found through expectation maximization
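A minimal sketch of the LDA alternative using scikit-learn's variational implementation (the slide does not name a toolkit); the topic-word distributions stand in for P(w|Fj) and the per-document topic mixtures for P(Fj|D).

```python
# Minimal sketch of hypothesizing facets with LDA. Topic-word distributions
# play the role of P(w|F_j); per-document topic mixtures give P(F_j|D).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_facet_models(retrieved_texts, n_facets=50):
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(retrieved_texts)
    lda = LatentDirichletAllocation(n_components=n_facets, random_state=0)
    doc_topic = lda.fit_transform(counts)  # rows approximate P(F_j | D_i)
    # normalize topic-word pseudo-counts into distributions ~ P(w | F_j)
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    return topic_word, doc_topic, vectorizer.get_feature_names_out()
```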
4.2 Estimating Document-Facet Probabilities • Both the facet relevance model and the LDA model produce generation probabilities P(Di|Fj) • P(Di|Fj): the probability that sampling terms from the facet model Fj will produce document Di
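A minimal sketch of that generation probability under a unigram facet model, assuming term_counts holds the term frequencies of Di and facet_lm is a smoothed distribution P(w|Fj).

```python
# Minimal sketch of the document-facet generation probability (in log space):
# log P(D_i | F_j) = sum over terms w of count(w, D_i) * log P(w | F_j).
import math

def log_p_doc_given_facet(term_counts, facet_lm, epsilon=1e-12):
    """Log likelihood of document D_i under the unigram facet model F_j."""
    return sum(c * math.log(facet_lm.get(w, 0.0) + epsilon)
               for w, c in term_counts.items())
```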
4.3 Maximizing Likelihood • Define the likelihood of covering the facets with a selected subset of documents, encoded by indicators yi ∈ {0, 1}:
L(y) = ∏_j [ 1 − ∏_i (1 − P(Fj ∈ Di))^{yi} ]
• Constraint: Σ_i yi = K • K: the hypothesized minimum number of documents required to cover the facets • Maximizing L(y) is an NP-hard problem • Approximate solution: for each facet Fj, take the document Di with maximum P(Fj ∈ Di)
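A minimal sketch of the approximate solution, again assuming p[j][i] holds P(Fj ∈ Di); the order in which facets are processed (most confidently coverable first) is an illustrative heuristic, not specified on the slide.

```python
# Minimal sketch of the greedy approximation: for each facet, pick the
# document most likely to contain it, stopping once K documents are selected.
# A document chosen for several facets is selected only once.

def greedy_cover(p, K):
    """Return up to K document indices chosen to cover the facets."""
    # for each facet, its best document and the corresponding probability
    best = [(max(range(len(row)), key=lambda i: row[i]), max(row)) for row in p]
    # handle the most confidently coverable facets first (heuristic order)
    order = sorted(range(len(p)), key=lambda j: -best[j][1])
    selected = []
    for j in order:
        doc = best[j][0]
        if doc not in selected:
            selected.append(doc)
        if len(selected) == K:
            break
    return selected
```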
Experiment - Data • For each query (a short list of keywords), a query-likelihood language model retrieves the top 130 documents D1, ..., D130 from the TDT5 corpus (278,109 documents)
Experiment - Data • Two assessors judged the top 130 retrieved documents for each of the 60 queries • 44.7 relevant documents per query on average • Each document contains 4.3 facets on average • 39.2 unique facets per query on average (roughly one unique facet per relevant document) • Agreement: 72% of all relevant documents were judged relevant by both assessors
Experiment - Data • A sample TDT5 topic definition: the query and its facet judgments
Experiment – Retrieval Engines Using the Lemur toolkit • LM baseline: a query-likelihood language model • RM baseline: pseudo-relevance feedback with a relevance model • MMR: query-similarity scores from the LM baseline and cosine similarity for novelty • AvgMix (Prob MMR): the probabilistic MMR model using query-likelihood scores from the LM baseline and the AvgMix novelty score • Pruning: removing documents from the LM baseline ranking based on cosine similarity • FM: the set-based facet model
Experiment – Retrieval Engines • FM: the set-based facet model • FM-RM: each of the top m documents and its K nearest neighbors become the basis for a facet model P(w|Fj); then compute the probability P(Di|Fj) • FM-LDA: use LDA to discover subtopics zj and obtain P(zj|D); we extract 50 subtopics
Experiments - Evaluation • Use five-fold cross-validation to train and test the systems • 48 queries in four folds are used to train model parameters • The trained parameters are used to obtain ranked results on the remaining 12 queries • We report S-recall at the minimum optimal rank, along with redundancy and MAP
Conclusion • We defined a type of novelty retrieval task called faceted topic retrieval: retrieving the facets of an information need within a small set of documents • We presented two novel models: one that prunes a retrieval ranking and one that is a formally motivated probabilistic model • Both models are competitive with MMR and outperform another probabilistic model