Cluster-Based Retrieval Using Language Models

Xiaoyong Liu, W. Bruce Croft • Center for Intelligent Information Retrieval, University of Massachusetts • SIGIR’04

Abstract • It remains inconclusive whether cluster-based retrieval improves retrieval effectiveness over document-based retrieval. • We propose two new models for cluster-based retrieval and evaluate them on several TREC collections. • We show that cluster-based retrieval can perform consistently across collections of realistic size. • Significant improvements over document-based retrieval can be obtained in a fully automatic manner.

Introduction (1/3) • Cluster hypothesis • Similar documents will match the same information needs. • Document-based retrieval • The IR system matches the query against documents. • Cluster-based retrieval • Documents are grouped into clusters, and the IR system returns a list of documents based on the clusters that they come from. • If the retrieval system can find good clusters, retrieval performance can be improved over document-based retrieval.

Introduction (2/3) • Static clustering • All documents in the collection are clustered, independent of the user’s query. • Query-specific clustering • Only the documents returned by a document-based retrieval for the query are clustered. • Document clustering has been an important tool for Web search engines, used for organizing and browsing.

Introduction (3/3) • There are no conclusive findings on whether document clustering can be used to improve retrieval results, especially on test collections of realistic size and without relevance information. • Language modeling approach • A theoretically attractive and potentially very effective probabilistic framework for studying IR problems.

Cluster-Based Retrieval (1/2) • Using clustering to filter non-relevant documents • Using clustering to identify a subset of documents that are likely to be relevant. • The most common approach • Ranking clusters • Using clusters as a form of document smoothing • Differences between representations of individual documents are smoothed out.

Cluster-Based Retrieval (2/2) • Static clustering • Has the potential to outperform document-based retrieval for precision-oriented searches. • Query-specific clustering • The cluster hypothesis still holds. • Used to improve the ranking of relevant documents.

Cluster-Based Language Models • (J. Allan, 1998) used cluster-based language models in research on Topic Detection and Tracking (TDT). • (W. Croft, 1999) used them for collection selection in distributed retrieval. • As a filtering tool • Limited smoothing

Language Models for IR • Build a language model D for each document in the collection. • Rank the documents according to how likely the query Q is to be generated by each document model, i.e. P(Q|D). • Query terms are assumed to be independent; λ is a smoothing parameter. • For different smoothing methods, λ takes different forms. • The query-likelihood (QL) model or the relevance model (RM)
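The ranking step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes linear interpolation between the document's maximum-likelihood estimate and a collection model as the smoothing method, and the function name `query_log_likelihood`, the default `lam=0.5`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """Return log P(Q|D), assuming independent query terms.

    Each term probability is lam * P_ML(term|doc)
    + (1 - lam) * P_ML(term|collection). The interpolation
    form and lam=0.5 are assumptions made for illustration.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        p_coll = coll_tf[term] / len(collection)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # Term unseen in the whole collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)  # independence turns the product into a sum of logs
    return score

# Toy data (hypothetical): two tokenized documents form the collection.
doc_a = ["cluster", "retrieval", "model"]
doc_b = ["language", "model", "smoothing"]
collection = doc_a + doc_b
query = ["cluster", "model"]

ranked = sorted(
    [("doc_a", doc_a), ("doc_b", doc_b)],
    key=lambda item: query_log_likelihood(query, item[1], collection),
    reverse=True,
)
print([name for name, _ in ranked])
```

Working in log space avoids numerical underflow when queries are long; the ranking is unchanged because the logarithm is monotonic.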

Cluster-Based Retrieval Using Language Models • Building language models for clusters and ranking the clusters (CQL) • Using cluster models to smooth document models (CBDM)
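The two models can be illustrated with a short sketch. The exact formulas are not reproduced in this transcript, so the forms below are assumptions: CQL is taken to be a cluster language model smoothed with the collection model, and CBDM is taken to be a document model smoothed with a mixture of its cluster model and the collection model. The weights `lam` and `beta` and all function names are hypothetical.

```python
from collections import Counter

def p_ml(term, tokens):
    """Maximum-likelihood estimate of a term in a bag of tokens."""
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0

def cql_term_prob(term, cluster, collection, lam=0.5):
    """Assumed CQL form: the cluster model smoothed with the collection model."""
    return lam * p_ml(term, cluster) + (1 - lam) * p_ml(term, collection)

def cbdm_term_prob(term, doc, cluster, collection, lam=0.5, beta=0.5):
    """Assumed CBDM form: the document model smoothed with a background
    that mixes the cluster model (weight beta) and the collection model."""
    background = beta * p_ml(term, cluster) + (1 - beta) * p_ml(term, collection)
    return lam * p_ml(term, doc) + (1 - lam) * background

# Toy data (hypothetical): the document never mentions "cluster",
# but other documents in its cluster do.
doc = ["language", "model"]
cluster = doc + ["cluster", "model"]
collection = cluster + ["sports", "news", "game", "score"]

with_cluster = cbdm_term_prob("cluster", doc, cluster, collection)
without_cluster = cbdm_term_prob("cluster", doc, cluster, collection, beta=0.0)
print(with_cluster, without_cluster)
```

In this toy case the cluster raises the probability of a term the document lacks but its neighbours contain, which is the smoothing effect the slide describes. Setting `beta=0.0` falls back to collection-only smoothing.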

Clustering Algorithms • The cosine measure is used as document similarity. • K-means for static clustering. • Five hierarchical agglomerative algorithms for query-specific clustering. • Single linkage • Complete linkage • Group average • Centroid • Ward’s method
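As a rough illustration of how the cosine measure feeds the agglomerative methods, the sketch below computes cosine similarity over raw term counts and then the inter-cluster similarity for three of the five linkage criteria. Raw term-frequency weighting, the function names, and the toy documents are assumptions; centroid and Ward’s method are omitted for brevity.

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between two tokenized documents (raw term counts)."""
    va, vb = Counter(doc_a), Counter(doc_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def linkage_similarity(cluster_a, cluster_b, method="group_average"):
    """Similarity between two clusters under a given linkage criterion."""
    sims = [cosine(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":         # most similar cross-cluster pair
        return max(sims)
    if method == "complete":       # least similar cross-cluster pair
        return min(sims)
    if method == "group_average":  # mean over all cross-cluster pairs
        return sum(sims) / len(sims)
    raise ValueError(f"unknown linkage method: {method}")

# Toy data (hypothetical): d1 and d2 are identical, d3 shares no terms.
d1 = ["cluster", "retrieval"]
d2 = ["cluster", "retrieval"]
d3 = ["football"]
for method in ("single", "complete", "group_average"):
    print(method, linkage_similarity([d1], [d2, d3], method))
```

An agglomerative algorithm would repeatedly merge the pair of clusters with the highest `linkage_similarity` until a stopping condition is met.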

Experimental Methods (1/2) • Data • Six data sets from TREC

Experimental Methods (2/2) • Parameter Selection • The AP collection is used as the training collection. • The parameters for FR are tuned on FR itself. • Two sets of experiments • CQL for query-specific clustering (top 1000) • CBDM

Experiment Results of CQL for Query-Specific Clustering

Experiment Results of CBDM for Static Clustering • Selecting the suitable number of clusters

Experiment Results of CBDM for Query-Specific Clustering • CBDM with static clustering is more effective. • The first-stage retrieval results may be biased toward one particular interpretation of the query.

Conclusions • We propose two language models for cluster-based retrieval, one for ranking clusters and the other for using clusters to smooth documents. • We show that cluster-based retrieval is feasible in the language-modeling framework. • Cluster-based retrieval can be more effective than document-based retrieval. • Using clusters to smooth documents is generally more effective than directly ranking clusters.

Future Work • To investigate whether clusters generated on one collection can be used for other collections. • To investigate methods for automatic selection of model parameters, e.g. the gap statistic for estimating K in K-means.
