Learning to Estimate Query Difficulty, Including Applications to Missing Content Detection and Distributed Information Retrieval. Elad Yom-Tov, Shai Fine, David Carmel, Adam Darlow. IBM Haifa Research Labs. SIGIR 2005
Abstract • Novel learning methods are used for estimating the quality of results returned by a search engine in response to a query. • Estimation is based on the agreement between the top results of the full query and the top results of its sub-queries. • Quality estimates are useful for several applications, including improving retrieval, detecting queries for which no relevant content exists in the document collection, and distributed information retrieval.
Introduction (1/2) • Many IR systems suffer from radical variance in performance across queries. • Estimating query difficulty is an attempt to quantify the quality of the results returned by a given system for a query. • Reasons for query difficulty estimation • Feedback to the user • The user can rephrase "difficult" queries. • Feedback to the search engine • To invoke alternative retrieval strategies for different queries • Feedback to the system administrator • To identify queries related to a specific subject and expand the document collection accordingly • For distributed information retrieval
Introduction (2/2) • The observation and motivation: • Queries answered well are those whose query terms agree on most of the returned documents. • Agreement is measured by the overlap between the top results. • Difficult queries are those where: • The query terms cannot agree on the top results. • Most of the terms do agree, except for a few outliers. • A TREC query for example: "What impact has the chunnel (the Channel Tunnel) had on the British economy and/or the life style of the British"
Related Work (1/2) • In the Robust track of TREC 2004, systems were asked to rank the topics by predicted difficulty. • The goal is eventually to use such predictions for topic-specific processing. • Prediction methods suggested by the participants: • Measuring clarity based on the system's score of the top results • Analyzing the ambiguity of the query terms • Learning a predictor using old TREC topics as training data • (Ounis, 2004) showed that an IDF-based predictor is positively correlated with query precision. • (Diaz, 2004) used the temporal distribution together with the content of the documents to improve the prediction of average precision (AP) for a query.
Related Work (2/2) • The Reliable Information Access (RIA) workshop investigated the reasons for system performance variance across queries. • 10 failure categories were identified. • 4 of which are due to emphasizing only partial aspects of the query. • One of the conclusions of this workshop: “…comparing a full topic ranking against ranking based on only one aspect of the topic will give a measure of the importance of that aspect to the retrieved set”
Estimating Query Difficulty • Query terms are defined as the keywords and the lexical affinities of the query. • Features used for learning: • The overlap between the top results of each sub-query and those of the full query, measured by the κ statistic (see the sketch below) • The rounded logarithm of the document frequency, log(DF), of each sub-query • Two challenges for learning: • The number of sub-queries is not constant, so a canonical representation is needed. • The sub-queries are not ordered.
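The slides do not spell out how the overlap is computed. Below is a minimal sketch of a chance-corrected overlap (Cohen's κ) between the top-N lists of the full query and one sub-query, under the assumption that each document in the collection is implicitly labeled retrieved/not-retrieved by each list:

```python
def kappa_overlap(full_top, sub_top, num_docs):
    """Chance-corrected overlap between two top-N result lists.

    Each of the num_docs documents is implicitly labeled retrieved /
    not retrieved by each list; kappa corrects the raw agreement for
    the agreement expected by chance.
    """
    full, sub = set(full_top), set(sub_top)
    both = len(full & sub)                   # retrieved by both lists
    neither = num_docs - len(full | sub)     # retrieved by neither list
    observed = (both + neither) / num_docs   # raw agreement p_o
    p_full, p_sub = len(full) / num_docs, len(sub) / num_docs
    expected = p_full * p_sub + (1 - p_full) * (1 - p_sub)  # chance p_e
    return (observed - expected) / (1 - expected)
```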
Query Estimator Using a Histogram (1/2) • The basic procedure: • Find the top N results for the full query and for each sub-query. • Build a histogram of the overlaps, h(i,j), to form a feature vector. • Values of log(DF) are split into 3 discrete values {0-1, 2-3, 4+}. • h(i,j) counts the sub-queries with log(DF) index i and overlap j. • The rows of h(i,j) are concatenated into a feature vector. • Compute the linear weight vector c for prediction. • An example, for a query with 4 sub-queries: log(DF(n)) = [0 1 1 2], overlap = [2 0 0 1] → h = [0 0 1 2 0 0 0 1 0]
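As a sanity check, this small sketch reproduces the slide's example. Note that in the example the log(DF) values appear to index the histogram rows directly (all values fall in 0..2), rather than through the 3-way binning:

```python
import numpy as np

def histogram_features(log_dfs, overlaps, n_rows=3, n_cols=3):
    """h(i, j) counts sub-queries with log(DF) index i and overlap j;
    the rows of h are concatenated into one feature vector."""
    h = np.zeros((n_rows, n_cols), dtype=int)
    for i, j in zip(log_dfs, overlaps):
        h[i, j] += 1
    return h.reshape(-1)  # concatenate rows into one vector

# Reproduces the slide's example:
print(histogram_features([0, 1, 1, 2], [2, 0, 0, 1]))
# -> [0 0 1 2 0 0 0 1 0]
```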
Query Estimator Using a Histogram (2/2) • Two additional features • The score of the top-ranked document • The number of words in the query • Estimate the linear weight vector c via the Moore-Penrose pseudo-inverse: c = (HH^T)^(-1) H t^T, where H is the matrix whose columns are the feature vectors of the training queries and t is the vector of the target measure (P@10 or MAP) of the training queries (H and t can be modified according to the objective; see the sketch below).
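A minimal sketch of the estimation step with made-up dimensions and data. np.linalg.lstsq computes the same Moore-Penrose least-squares solution as the closed form c = (HH^T)^(-1) H t^T, and is numerically more stable than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.random((11, 200))  # columns: feature vectors of 200 training queries (made up)
t = rng.random(200)        # target measure (P@10 or MAP) per training query (made up)

# Least-squares solution of H^T c ~= t, i.e. c = (H H^T)^(-1) H t.
c, *_ = np.linalg.lstsq(H.T, t, rcond=None)

predicted = H.T @ c        # predicted difficulty for the training queries
```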
Query Estimator Using a Modified Decision Tree (1/2) • Useful when data are sparse, i.e., when queries are too short. • A binary decision tree • Pairs of the overlap and log(DF) of the sub-queries form the features. • Each node consists of a weight vector, a threshold, and a score. • An example tree is shown in the slides (figure not reproduced here); a traversal sketch follows.
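The slide names the per-node ingredients (weight vector, threshold, score) but not how they combine. One plausible reading, sketched below under that assumption, is that each node projects the next sub-query's (overlap, log(DF)) pair onto its weight vector and branches on the threshold, returning the current node's score when the pairs run out:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    weights: tuple       # applied to one (overlap, log DF) pair
    threshold: float
    score: float         # difficulty estimate if traversal stops here
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def predict(root, pairs):
    """Feed sub-query (overlap, log DF) pairs through the tree one at
    a time; when the pairs (or the branches) run out, return the
    current node's score."""
    node = root
    for overlap, log_df in pairs:
        projection = node.weights[0] * overlap + node.weights[1] * log_df
        child = node.left if projection <= node.threshold else node.right
        if child is None:
            break
        node = child
    return node.score
```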
Query Estimator Using a Modified Decision Tree (2/2) • The concept of a random forest • Better estimators can be obtained by training a multitude of trees, each in a slightly different manner or on different data. • The AdaBoost algorithm is applied to resample the training data (a generic sketch follows).
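The slide only says AdaBoost is used to resample the training data; below is a generic AdaBoost-style reweighting round, not the paper's exact update rule (which the slide does not give):

```python
import numpy as np

def boosting_round(weights, badly_estimated, rng):
    """One AdaBoost-style round (assumes 0 < weighted error < 0.5):
    keep the weight of badly estimated queries, shrink the weight of
    well-estimated ones, renormalize, then resample the training set
    from the updated distribution for the next tree."""
    eps = np.average(badly_estimated, weights=weights)  # weighted error
    beta = eps / (1.0 - eps)
    weights = weights * np.where(badly_estimated, 1.0, beta)
    weights = weights / weights.sum()
    n = len(weights)
    sample = rng.choice(n, size=n, p=weights)  # indices for the next tree
    return sample, weights
```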
Experiment and Evaluation (1/2) • The IR system is Juru. • Two document collections • TREC-8: 528,155 documents, 200 topics • WT10G: 1,692,096 documents, 100 topics • Four-fold cross-validation • Prediction quality is measured by Kendall's τ coefficient.
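Kendall's τ compares how the estimator ranks the topics against their actual ranking. A minimal check with made-up numbers, assuming scipy is available:

```python
from scipy.stats import kendalltau

# Made-up values: predicted difficulty vs. actual P@10 for five topics.
predicted = [0.32, 0.10, 0.55, 0.21, 0.48]
actual    = [0.30, 0.00, 0.60, 0.40, 0.50]

tau, p_value = kendalltau(predicted, actual)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.2f})")
```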
Experiment and Evaluation (2/2) • Compared with several other algorithms • Estimation based on the score of the top result • Estimation based on the average score of the top ten results • Estimation based on the standard deviation of the IDF values of the query terms • Estimation based on learning an SVM for regression
Application 1: Improving IR Using Query Estimation (1/2) • Selective automatic query expansion • Adding terms to the query based on frequently appearing terms in the top retrieved documents • Only works for easy queries • The same features are used to train an SVM classifier (sketched below). • Deciding which part of the topic should be used • TREC topics contain two parts: a short title and a longer description. • Some topics that are not answered well by the description part are better answered by the title part. • Difficult topics use the title part; easy topics use the description.
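A sketch of the selective-expansion decision, assuming scikit-learn's SVC stands in for the paper's SVM; the training data and the choose_query helper are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up training data: the estimator's per-query feature vectors,
# labeled 1 if automatic query expansion improved that query, else 0.
rng = np.random.default_rng(0)
X_train = rng.random((200, 11))
y_train = rng.integers(0, 2, 200)

clf = SVC(kernel="rbf").fit(X_train, y_train)

def choose_query(features, original_query, expanded_query):
    """Run the expanded query only when the classifier predicts
    expansion will help; otherwise fall back to the original."""
    if clf.predict([features])[0] == 1:
        return expanded_query
    return original_query
```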
Application 2: Detecting Missing Content (1/2) • Missing content queries (MCQs) are those that have no relevant documents in the collection. • Experiment method • 166 MCQs were created artificially from 400 TREC queries. • The 400 queries come from 200 TREC topics, each consisting of a title and a description. • Ten-fold cross-validation • A tree-based classifier is trained to separate MCQs from non-MCQs. • A query difficulty estimator may or may not be used as a pre-filter that removes easy queries before the MCQ classifier (see the sketch below).
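A sketch of the two-stage pipeline with made-up data: a decision tree separates MCQs from non-MCQs, optionally after the difficulty estimator filters out queries it already predicts to be easy. The easy_threshold parameter is hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up training data: feature vectors per query plus an MCQ label.
rng = np.random.default_rng(0)
X_train = rng.random((400, 11))
is_mcq = rng.integers(0, 2, 400)

mcq_clf = DecisionTreeClassifier(max_depth=5).fit(X_train, is_mcq)

def detect_mcq(features, predicted_precision, easy_threshold=0.5):
    """Pre-filter: a query the estimator predicts to answer well
    presumably has relevant content, so skip the classifier."""
    if predicted_precision > easy_threshold:
        return False
    return bool(mcq_clf.predict([features])[0])
```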
Application 3: Merging the Results of Distributed Retrieval (1/2) • It is difficult to rerank documents from different datasets since the scores are local to each specific dataset. • CORI (W. Croft, 1995) is one of the state-of-the-art algorithms for distributed retrieval, using an inference network for collection ranking. • Applying the estimator to this problem: • A query estimator is trained for each dataset. • The estimated difficulty is used to weight the scores. • The weighted scores are merged to build the final ranking (see the sketch below). • Ten-fold cross-validation • Only minimal information is supplied by the search engine.
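A sketch of the merge step. The slide says the estimated difficulty weights the local scores; the multiplicative combination below is an assumption:

```python
def merge_results(results_per_dataset, difficulty_per_dataset):
    """Merge ranked lists from several collections into one ranking.

    results_per_dataset: {dataset: [(doc_id, local_score), ...]}
    difficulty_per_dataset: {dataset: estimated difficulty of the
        query on that dataset}, used to weight the local scores.
    """
    merged = []
    for dataset, results in results_per_dataset.items():
        weight = difficulty_per_dataset[dataset]
        merged.extend((doc_id, weight * score) for doc_id, score in results)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```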
Application 3: Merging the Results of Distributed Retrieval (2/2) • Selective weighting • All queries are clustered (2-means) based on their difficulty estimates for each of the datasets. • In one cluster the variance of the estimates is small → unweighted scores are better for the queries in this cluster. • The difficulty estimates amount to noise when there is little variance among them.
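A sketch of the selective-weighting step with made-up estimates, assuming scikit-learn's KMeans for the 2-means clustering:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: one row per query, one column per dataset, holding the
# estimated difficulty of that query on that dataset.
rng = np.random.default_rng(0)
estimates = rng.random((50, 4))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(estimates)

# The low-variance cluster is the one whose queries are merged with
# unweighted scores (the estimates carry little signal there).
variances = [estimates[labels == k].var() for k in (0, 1)]
low_variance_cluster = int(np.argmin(variances))
```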
Conclusions and Future Work • Two methods for learning an estimator of query difficulty are described. • The learned estimator predicts the expected precision of the query by analyzing the overlap between the results of the full query and the results of its sub-queries. • We show that such an estimator can be used for several applications. • Our results show that the quality of query prediction strongly depends on the query length. • One direction for future work is to look for additional features that do not depend on the query length. • Whether more training data can be accumulated in an automatic or semi-automatic manner is also left for future research.