390 likes | 522 Views
A Novel TF-IDF Weighting Scheme for Effective Ranking. Jiaul H. Paik Indian Statistical Institute, Kolkata, India jia.paik@gmail.com SIGIR’2013 Presenter: SHIH,KAI-WUN. Outline. Introduction State Of The Art Proposed Work Experimental Setup Results Conclusion. Introduction(1/4).
E N D
A Novel TF-IDF Weighting Scheme for Effective Ranking Jiaul H. Paik Indian Statistical Institute, Kolkata, India jia.paik@gmail.com SIGIR’2013 Presenter: SHIH,KAI-WUN
Outline • Introduction • State Of The Art • Proposed Work • Experimental Setup • Results • Conclusion
Introduction(1/4) • Almost all retrieval models integrate three major variables to determine the degree of importance of a term for a document: (i) within document term frequency, (ii) document length and (iii) the specificity of the term in the collection. • Retrieval models can be broadly classified into two major families based on their term weight estimation principle . • Vector space model • Probabilistic model
Introduction(2/4) • Most of the existing models (possibly all) employ a single term frequency normalization mechanism that does not take into account various aspects of a term’s saliency in a document. • Another major limitation of the present models is that they do not balance well in preferring short and long documents.
Introduction(3/4) • This article proposes a term weighting scheme that can address these weaknesses in an effective way. • It introduces a two aspect term frequency normalization scheme, that combines relative tf weighting and the tf normalization based on document length. • It uses the query length information to emphasize the appropriate component. • It modifies the usual term discrimination measure (idf) by integrating mean term frequency of a term in the set of documents the term is contained in. • Finally, we use asymptotically bounded function to transform the tf factors that better handles the term coverage issues in the documents and also helps to combine the two tf factors more easily.
Introduction(4/4) • The experimental results show that the proposed weighting function consistently and significantly outperforms five state of the art retrieval models measured in terms of the standard metrics such as MAP and NDCG. • The experimental outcomes also attest that the model achieves significantly better precision than all the other models when measured in terms of a recently popularized metric, namely, expected reciprocal rank (ERR).
State Of The Art(1/2) • Three widely used models in IR are the vector space model the probabilistic models and the inference network based model. • In vector space model, queries and documents are represented as the vector of terms. To compute a score between a document and a query the model measures the similarity between the query and document vector using cosine function. • Three main factors that come into play to compute the weight of a term are: (i) frequency of the term in the document, (ii) document frequency of the term in the collection and (iii) the length of the document that contains the term.
State Of The Art(2/2) • The key part of the probabilistic models is to estimate the probability of relevance of the documents for a query. • Probabilistic language modeling technique is another effective ranking model that is widely used today. • However, smoothing is crucial and it has very similar effect as the parameter that controls the length normalization components in pivoted normalization. • Relatively recent, another probabilistic model is proposed in [3] that computes the weight of a term by measuring the informative content of a term by computing the amount of divergence of the term frequency distribution from the distribution based on a random process.
Proposed Work -Preliminaries(1/3) • Before we give the main motivation behind our work, let us first revisit the three key hypotheses. • Term Frequency Hypothesis (TFH): The weight of term in a document should increase with the increase in term frequency (TF). However, it seems unlikely that the importance of a term grows linearly with the increase in TF. Therefore, researchers have used dampened TF instead of the raw TF for ranking.
Proposed Work -Preliminaries(2/3) • Advanced TF Hypothesis (AD-TFH): The modified term frequency hypothesis captures the intuition that the rate of change of weight of a term should decrease with the larger TF. Formally, we hypothesize that, a function Ft(TF), that maps the original TF to the resultant value (which will be used for final weighting), should possess the following two properties. (a) F’t (TF) > 0 (b) F’’t (TF) < 0
Proposed Work -Preliminaries(3/3) • Document Length Hypothesis (DLH): This hypothesis captures the relationship between the term frequency and the document length. Long documents tend to use a term repeatedly, thus term frequency can be higher in a long document. Thus, it is necessary to regulate the TF value in accordance with the document length. • Term Discrimination Hypothesis (TDH): If a query contains more than one term, then a good weighting scheme should prefer a document that contains the rare term (in the collection).
Proposed Work -Two Aspects of TF(1/4) • we consider the following two aspects: • Relative Intra-document TF (RITF) : In this factor, the importance of a term is measured by considering its frequency relative to the average TF of the document. Thus, a natural formulation for this could be where TF(t,D) and Avg.TF(t,D) denote the frequency of the term t in D and average term frequency of D respectively.
Proposed Work -Two Aspects of TF(2/4) • However, Equation 2 may too much prefer excessively long documents, since the denominator is close to 1 for a very long document . Hence, we use the following function.
Proposed Work -Two Aspects of TF(3/4) • Relative Intra-document TF (RITF) : We assume that the appropriate length of a document should be the average document length of the collection and the frequency of the terms of an average length document should remain unchanged. Thus, a reasonable starting point could be , where ADL(C) is the average document length of the collection and len(D) is the length of the document D.
Proposed Work -Two Aspects of TF(4/4) • But once again, it seems unlikely that the increase in term frequency follows a linear relationship with the document length, and thus the above formula over-penalizes the long documents. To overcome this bias, we employ the following function to achieve the length dependent normalization. Equation 4 still punishes the long documents but with diminishing effect.
Proposed Work -Motivation(1/3) • In order to motivate the use of two TF factors, let us consider the following two somewhat toy examples. Example 1 Let D1 and D2 be two documents of equal lengths, with the following statistics. • len(D1) =20, # distinct term of D1 = 5, TF(t,D1)=4 • len(D2) =20, # distinct term of D2 = 15, TF(t,D2)=4 In both of the cases, LRTF considers t equally important. A little thought will convince us that this is not appropriate, since the focus of the document D1 seems to be divided equally among 5 terms and therefore t should not be considered salient, while t seems to be important for D2. • Thus, in the later case RITF seems to be a better choice to LRTF.
Proposed Work -Motivation(2/3) • Let us now turn to the other direction and consider the second example. Example 2 Let D1 and D2 be two documents with the following statistics. • len(D1) =20, # distinct term of D1 = 15, TF(t,D1)=4 • len(D2) = 200, # distinct term of D2 = 150, TF(t,D2)=4 For this instance however, RITF considers the term t equally important for both D1 and D2, which is not right, since D2 contains more distinct terms and thus seems to cover many other topics (also possibly uses t repeatedly). • Therefore, in this case, the use of LRTF seems to be a potential choice over RITF.
Proposed Work -Motivation(3/3) • Motivated by the above examples, our main goal now is to integrate the above two factors into our weighting scheme. • However, we do not use the TF factors as defined in the Equations 3 and 4. We transform these TF values for our final use that in some sense makes use of the hypothesis ADTFH.
Proposed Work -Transforming TF Factors • We transform the TF factors using a function f(x) that possesses the following properties: (i) vanishes at 0, (ii) satisfies the two conditions of AD-TFH hypothesis (f’(x) > 0 and f’’(x) < 0), and (iii) asymptotically upper bounded to 1. • One of the simplest functions that satisfies the above properties is . • Using this function, we now transform the two TF factors as follows:
Proposed Work -Combining Two TF Factors(1/5) • Now the key question that we face: how should we combine BRITF(t,D) and BLRTF(t,D)? A natural way to do this is as follows: where 0 < w < 1. • The next important issues that arise out of Equation 7 are the following: • Should we prefer BRITF(t,D) (w > 0.5)? • Should we prefer BLRTF(t,D) (w < 0.5)?
Proposed Work -Combining Two TF Factors(2/5) • In order to answer these questions, we now analyze the properties of the two TF components. From Equation 5, it is clear that BRITF(t,D) has a tendency to prefer long documents, since for long documents the denominator part of RITF(t,D) is close to 1, and TF is usually larger. On the other hand, BLRTF(t,D) tends to prefer short documents, since LRTF(t,D) → 0 as len(D)→∞. Therefore, when a query is long, BRITF(t,D) heavily prefers extremely long documents, since the number of matches is more or less proportional to the length of the document.
Proposed Work -Combining Two TF Factors(3/5) • On the contrary, since BLRTF(t,D) prefers short documents it can penalize extremely long documents when it faces longer queries, and thus it is preferable when longer queries are encountered. Another interesting property of BRITF(t,D) is that it emphasizes on the number of matches, since the main component of this formula RITF(t,D) heavily punishes the term frequency, and thus important for the short queries. • Hence, the foregoing discussion suggests that, for short queries BRITF(t,D) should be preferred, while for longer queries, BLRTF(t,D) should be given more weight.
Proposed Work -Combining Two TF Factors(4/5) • The value of w should decrease with the increase in query length, while it must lie between [0-1]. • Specifically, we characterize the query length factor (QLF(Q)) by the following variables. (i) QLF(Q) = 1 for |Q| = 1, (ii) QLF(Q) < 0 and (iii) 0 < QLF(Q) < 1.
Proposed Work -Combining Two TF Factors(5/5) • Numerous different functions can be constructed that satisfy the above conditions. We used the following three different functions. • The first function descends more rapidly than the second function, while the second function descends more rapidly than the third function.
Proposed Work -Term Discrimination Factor(1/2) • Inverse document frequency (IDF) is a well known measure that serves the above purpose. We use the following standard idf measure. • We hypothesize that the term discrimination is a combination of the above two factors. • In particular, we hypothesize that if two terms have equal document frequencies, then the term discrimination should increase with the increase in average elite set term frequency. • The average elite set term frequency (AEF) is defined as , where CTF(t, C) denotes the total occurrence of the term t in the entire collection.
Proposed Work -Term Discrimination Factor(2/2) • However, the combination of raw AEF with IDF may disturb the overall term discrimination value, since the IDF values are obtained by dampening through log function. • Hence, we employ a slowly increasing function to transform the AEF values for this combination. • Once again, we use the function f(x) = x/(1+x) to transform the AEF values for the final use. The final term discrimination value of term t is computed as
Proposed Work -Final Formula • Integrating the above factors we now obtain the following final scoring formula. • Again, since TFF(qi,D) < 1, we obtain the following relationship. • Therefore, we can easily modify Equation 13 to get the normalized similarity scores (0 < Sim(Q,D) < 1) as follows:
Experimental Setup - Data • Table 1 summarizes the statistics on test collections used in our experiments. The experiments are conducted on a large number of standard test collections, that vary both by type, the size of the document collections and the number of queries.
Experimental Setup - Evaluation Measures and IR System(1/2) • All our experiments are carried out using TERRIER1 retrieval system (version 3.5). • We use title field of the topics (note that two million query data contain more than 1000 queries that contain more than 5 terms). • From all the collections we removed stopwords during indexing. Documents and queries are stemmed using Porter stemmer.
Experimental Setup - Evaluation Measures and IR System(2/2) • We use the following metrics to evaluate the systems. • Mean Average Precision (MAP) • Normalized DCG at k (NDCG@k) • Expected Reciprocal Rank at k (ERR@k): ERR@k is computed as follows: where R(g) = , hg is the highest grade and g1, g2 . . . gkare the relevance grades associated with the top k documents. • The first two metrics are used to reflect the overall performance of the systems, while the last evaluation measure reflects better the precision of search results.
Experimental Setup – Baselines(1/3) • The details of the baselines are given below. • Pivoted length normalized TF-IDF model: This model is one of the best performing TF-IDF formula in the vector space model framework. The value of the parameter s is set to 0.05. • Lemur TF-IDF model: This model is another TF-IDF model that uses Robertson’s tf and the standard idf. The parameter of this model is set to its default value, 0.75.
Experimental Setup – Baselines(2/3) • Classical Probabilistic model (BM25): The main differences with this model and the previous model are that BM25 uses query term frequency in a different way and the idfalso differs with the standard one. The parameters of this model is set to k1 = 1.2, b = 0.6 and k3 = 1000. • Dirichlet smooth language model (LM): Language model is another probabilistic model that performs very effectively. For this model we set the value of Dirichlet smoothing (μ) to 1700.
Experimental Setup – Baselines(3/3) • Divergence from Randomness model (PL2): Finally, PL2 represents the recently proposed non-parametric probabilistic model from divergence from randomness(DFR) family. Similar to the previous models, its performance also depends on a parameter value (c in the formula). We conduct experiments for this model by setting c = 13.
Results - Comparison with TF-IDF Models(1/2) • In this section we focus on to compare the performance of the proposed model (MATF) with the Lemur TF-IDF and and Pivoted TF-IDF models.
Results - Comparison with TF-IDF Models(2/2) • The performances of MATF and the two TF-IDF models on two million query data, measured by statistical average precision, are shown in Table 3.
Results - Comparison with Probabilistic Models(1/2) • In the last section we compare the performance of our model with two TF-IDF models. In this section we compare the performance of MATF with three well known state of the art probabilistic retrieval models.
Results - Comparison with Probabilistic Models(2/2) • Table 5 compares the performance of four models for million query collections measured in terms of statistical average precision.
Results - Analysis • Table 6 presents the experimental results on two million query data. The results seem to confirm our aforesaid assumption. LRTF always performs better than RITF on both of the collections, while RITF does better for short queries.
Conclusion • In this paper, we present a novel TF-IDF term weighting scheme. • Experiments carried out on a set of news and web collections show that the proposed model outperforms two well known state of the art TF-IDF baselines with significantly large margin, when measured in terms of MAP and NDCG. • Moreover, the proposed model is also significantly better than all of the five baselines in improving precision.