Cluster-Based Retrieval Using Language Models. Xiaoyong Liu, W. Bruce Croft. Center for Intelligent Information Retrieval, University of Massachusetts. SIGIR ’04.
Abstract • It remains inconclusive whether cluster-based retrieval improves retrieval effectiveness over document-based retrieval. • We propose two new models for cluster-based retrieval and evaluate them on several TREC collections. • We show that cluster-based retrieval can perform consistently across collections of realistic size. • Significant improvements over document-based retrieval can be obtained in a fully automatic manner.
Introduction (1/3) • Cluster hypothesis • Similar documents tend to match the same information needs. • Document-based retrieval • The IR system matches the query against individual documents. • Cluster-based retrieval • Documents are grouped into clusters, and the IR system returns a list of documents based on the clusters they come from. • If the retrieval system can find good clusters, retrieval performance can improve over document-based retrieval.
Introduction (2/3) • Static clustering • All documents in the collection are clustered, independent of the user’s query. • Query-specific clustering • The documents to be clustered are taken from the results of a document-based retrieval for the query. • Document clustering has been an important tool in Web search engines for organizing and browsing.
Introduction (3/3) • There are no conclusive findings on whether document clustering can be used to improve retrieval results, especially on test collections of realistic size and without relevance information. • Language modeling approach • A theoretically attractive and potentially very effective probabilistic framework for studying IR problems.
Cluster-Based Retrieval (1/2) • Using clustering to filter non-relevant documents • Using clustering to identify a subset of documents that are likely to be relevant. • The most common approach • Ranking clusters • Using clusters as a form of document smoothing • Differences between representations of individual documents are smoothed out.
Cluster-Based Retrieval (2/2) • Static clustering • Has the potential to outperform document-based retrieval for precision-oriented searches. • Query-specific clustering • The cluster hypothesis still holds. • Can improve the ranking of relevant documents.
Cluster-Based Language Models • (J. Allan, 1998) used cluster-based language models in research on Topic Detection and Tracking (TDT). • (W. Croft, 1999) used them for collection selection in distributed retrieval. • As a filtering tool • Limited smoothing
Language Models for IR • Build a language model D for each document in the collection. • Rank the documents according to how likely the query Q is to be generated by each document model, i.e. P(Q|D). • Assume the query terms are independent. • λ is a smoothing parameter. • For different smoothing methods, λ takes different forms. • The query-likelihood (QL) model or the relevance model (RM)
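A common way to write the query-likelihood model under the independence assumption is shown below. This is a sketch in standard notation, not necessarily the notation used on the slide: q_1 ... q_m are the query terms, P_ML is a maximum-likelihood estimate, and Coll denotes the whole collection (these symbol names are assumptions).

```latex
P(Q \mid D) = \prod_{i=1}^{m} P(q_i \mid D),
\qquad
P(w \mid D) = \lambda \, P_{ML}(w \mid D) + (1 - \lambda) \, P_{ML}(w \mid Coll)
```

With Jelinek-Mercer smoothing, λ is a constant; with Dirichlet smoothing in this form, λ = |D| / (|D| + μ), so longer documents rely less on the collection model.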
Cluster-Based Retrieval Using Language Models • Building language models for clusters • CQL: • Using cluster models to smooth documents • CBDM:
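The two models can be sketched in Python as follows. The sketch assumes the usual interpolated forms: for CQL, P(w|Cluster) = λ P_ML(w|Cluster) + (1 − λ) P_ML(w|Coll); for CBDM, P(w|D) = λ P_ML(w|D) + (1 − λ)[β P_ML(w|Cluster) + (1 − β) P_ML(w|Coll)]. The function names, the toy collection, and the values λ = β = 0.5 are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def ml_prob(word, counts):
    """Maximum-likelihood estimate P_ML(word | text) from a Counter of term counts."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

def cql_word_prob(word, cluster, coll, lam=0.5):
    """CQL (assumed form): cluster model smoothed with the collection model."""
    return lam * ml_prob(word, cluster) + (1 - lam) * ml_prob(word, coll)

def cbdm_word_prob(word, doc, cluster, coll, lam=0.5, beta=0.5):
    """CBDM (assumed form): document model smoothed with its cluster,
    which is in turn smoothed with the collection."""
    cluster_part = beta * ml_prob(word, cluster) + (1 - beta) * ml_prob(word, coll)
    return lam * ml_prob(word, doc) + (1 - lam) * cluster_part

def log_likelihood(query, word_prob):
    """Sum of log P(q_i | model), assuming independent query terms."""
    total = 0.0
    for q in query:
        p = word_prob(q)
        total += math.log(p) if p > 0 else float("-inf")
    return total

# Toy collection: two fruit documents and two car documents (hypothetical data).
docs = {
    "d1": Counter("apple banana apple".split()),
    "d2": Counter("banana cherry".split()),
    "d3": Counter("car engine car".split()),
    "d4": Counter("engine wheel".split()),
}
clusters = {"c1": docs["d1"] + docs["d2"], "c2": docs["d3"] + docs["d4"]}
doc_cluster = {"d1": "c1", "d2": "c1", "d3": "c2", "d4": "c2"}
coll = sum(docs.values(), Counter())

query = ["cherry"]

# CBDM: score each document, smoothing with the cluster it belongs to.
cbdm_scores = {
    name: log_likelihood(
        query,
        lambda w, d=d, c=clusters[doc_cluster[name]]: cbdm_word_prob(w, d, c, coll),
    )
    for name, d in docs.items()
}
# CQL: score each cluster directly.
cql_scores = {
    name: log_likelihood(query, lambda w, c=c: cql_word_prob(w, c, coll))
    for name, c in clusters.items()
}
ranking = sorted(cbdm_scores, key=cbdm_scores.get, reverse=True)
```

In this toy example, d1 does not contain the query term but shares a cluster with d2, which does, so CBDM scores d1 above the car documents. Collection-only smoothing would give d1, d3, and d4 the same score.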
Clustering Algorithms • Cosine similarity is used as the document similarity measure. • K-means for static clustering. • Five hierarchical agglomerative algorithms for query-specific clustering. • Single linkage • Complete linkage • Group average • Centroid • Ward’s method
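As a rough illustration of static clustering with the cosine measure, the following sketch implements cosine similarity over sparse term-count vectors and a minimal K-means loop. It is a simplified assumption of how such a pipeline might look, not the authors' implementation: seeding with the first k documents, the iteration cap, and the toy documents are all illustrative choices.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dict-like)."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds counts term by term
    return {t: c / len(vectors) for t, c in total.items()}

def kmeans(docs, k, iterations=10):
    """Assign each document to the most cosine-similar centroid until stable.

    Seeds are simply the first k documents (an illustrative, deterministic choice).
    Returns a list with the cluster index of each document.
    """
    centroids = [dict(d) for d in docs[:k]]
    assignment = [None] * len(docs)
    for _ in range(iterations):
        new_assignment = [
            max(range(k), key=lambda j: cosine(d, centroids[j])) for d in docs
        ]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = centroid(members)
    return assignment

# Hypothetical toy documents: two about fruit, two about cars.
toy_docs = [
    Counter("apple banana apple".split()),
    Counter("car engine car".split()),
    Counter("banana cherry".split()),
    Counter("engine wheel".split()),
]
labels = kmeans(toy_docs, k=2)
```

The query-specific setting would instead run an agglomerative algorithm over the top-ranked documents, repeatedly merging the two closest clusters under the chosen linkage criterion.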
Experimental Methods (1/2) • Data • Six data sets from TREC
Experimental Methods (2/2) • Parameter Selection • The AP collection is used as the training collection. • The parameters for FR are tuned on FR itself. • Two sets of experiments • CQL for query-specific clustering (top 1000) • CBDM
Experiment Results of CBDM for Static Clustering • Selecting the suitable number of clusters
Experiment Results of CBDM for Query-Specific Clustering • CBDM with static clustering is more effective. The first-stage retrieval results may be biased toward one particular interpretation of the query.
Conclusions • We propose two language models for cluster-based retrieval, one for ranking clusters and the other for using clusters to smooth documents. • We show that cluster-based retrieval is feasible in the language-modeling framework. • Cluster-based retrieval can be more effective than document-based retrieval. • Using clusters to smooth documents is generally more effective than directly ranking clusters.
Future Work • To investigate whether clusters generated on one collection can be used for other collections. • To investigate methods for automatic selection of model parameters, e.g. the gap statistic for estimating K in K-means.