Advanced Information-Retrieval Models Hsin-Hsi Chen
Mathematical Models for IR
• Boolean model: compare Boolean query statements with the term sets used to identify document content.
• Probabilistic model: compute the relevance probabilities for the documents of a collection.
• Vector-space model: represent both queries and documents by term sets, and compute global similarities between queries and documents.
Basic Vector Space Model
• Term-vector representation of documents Di = (ai1, ai2, …, ait) and queries Qj = (qj1, qj2, …, qjt).
• t distinct terms are used to characterize content.
• Each term i is identified with a term vector Ti.
• The T vectors are assumed to be linearly independent, so any vector can be represented as a linear combination of the t term vectors.
• The rth document Dr can then be written as the document vector Dr = ar1T1 + ar2T2 + … + artTt.
Document representation in vector space [figure: a document vector in a two-dimensional vector space]
Similarity Measure
• Similarity is measured by the inner product of two vectors: x • y = |x| |y| cos θ.
• Document–query similarity is measured between a document vector and a query vector: sim(Di, Qj) = Di • Qj.
• Two questions remain: how to determine the vector components, and how to determine the term correlations?
Similarity Measure (Continued) • vector components
Similarity Measure (Continued)
• Term correlations Ti • Tj are not available, so assume the term vectors are orthogonal: Ti • Tj = 0 (i ≠ j) and Ti • Tj = 1 (i = j).
• That is, assume that terms are uncorrelated.
• The same measure applies to similarity between documents.
Sample query–document similarity computation
• D1 = 2T1 + 3T2 + 5T3; D2 = 3T1 + 7T2 + 1T3; Q = 0T1 + 0T2 + 2T3
• Similarity computations for uncorrelated terms:
  sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
  sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
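Under the orthogonality assumption the similarity reduces to the dot product of the coefficient vectors; a minimal Python sketch reproducing the numbers above:

```python
def sim(d, q):
    # Orthogonal (uncorrelated) terms: similarity is just the inner
    # product of the coefficient vectors.
    return sum(di * qi for di, qi in zip(d, q))

D1 = [2, 3, 5]   # D1 = 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]   # D2 = 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]   # Q  = 0T1 + 0T2 + 2T3
print(sim(D1, Q), sim(D2, Q))   # 10 2
```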
Sample query–document similarity computation (Continued)
• Term-correlation matrix:
        T1    T2    T3
  T1    1     0.5   0
  T2    0.5   1    -0.2
  T3    0    -0.2   1
• Similarity computations for correlated terms:
  sim(D1, Q) = (2T1 + 3T2 + 5T3) • (0T1 + 0T2 + 2T3)
             = 4T1•T3 + 6T2•T3 + 10T3•T3 = 6·(-0.2) + 10·1 = 8.8
  sim(D2, Q) = (3T1 + 7T2 + 1T3) • (0T1 + 0T2 + 2T3)
             = 6T1•T3 + 14T2•T3 + 2T3•T3 = 14·(-0.2) + 2·1 = -0.8
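With a term-correlation matrix C, where C[i][j] = Ti • Tj, the similarity expands into a double sum over components; a sketch reproducing the correlated-term results above:

```python
def sim_correlated(d, q, C):
    # sim(D, Q) = sum over i, j of d_i * (Ti . Tj) * q_j
    return sum(d[i] * C[i][j] * q[j]
               for i in range(len(d)) for j in range(len(q)))

C = [[1.0, 0.5, 0.0],      # term-correlation matrix Ti . Tj
     [0.5, 1.0, -0.2],
     [0.0, -0.2, 1.0]]
D1 = [2, 3, 5]; D2 = [3, 7, 1]; Q = [0, 0, 2]
print(round(sim_correlated(D1, Q, C), 6))   # 8.8
print(round(sim_correlated(D2, Q, C), 6))   # -0.8
```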
Advantages of similarity coefficients
• The documents can be arranged in decreasing order of similarity with the query.
• The size of the retrieved set can be adapted to the user's requirements by retrieving only the top few items.
• Items retrieved early in a search may help generate improved query formulations through relevance feedback.
Association Measures
• Inner product
• Dice coefficient
• Jaccard coefficient
• Cosine coefficient
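The formulas for these coefficients did not survive extraction; a sketch assuming the standard definitions over term-weight vectors X and Y (inner product X·Y, Dice 2(X·Y)/(|X|²+|Y|²), Jaccard (X·Y)/(|X|²+|Y|²−X·Y), cosine (X·Y)/(|X||Y|)):

```python
import math

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def dice(x, y):
    # 2 (X . Y) / (|X|^2 + |Y|^2)
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def jaccard(x, y):
    # (X . Y) / (|X|^2 + |Y|^2 - X . Y)
    return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

def cosine(x, y):
    # (X . Y) / (|X| |Y|)
    return inner(x, y) / math.sqrt(inner(x, x) * inner(y, y))

D1 = [2, 3, 5]; Q = [0, 0, 2]    # vectors from the earlier example
print(inner(D1, Q))              # 10
print(round(jaccard(D1, Q), 4))  # 0.3125
```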
Vector Modifications
• How can we generate a query statement that reflects the information need?
• How can we generate improved query formulations? The relevance-feedback process.
• Basic ideas:
  • Documents relevant to a particular query resemble each other.
  • The reformulated query is expected to retrieve additional relevant items that are similar to the originally identified relevant items.
Relevance-feedback process
• Maximize the average query–document similarity for the relevant documents.
• Minimize the average query–document similarity for the nonrelevant documents.
• Here R and N−R are the assumed numbers of relevant and nonrelevant documents with respect to the query.
Relevance-feedback process (Continued)
• Problem: the sets of relevant and nonrelevant documents with respect to the queries are not known.
• Approximation: use the subsets R' of the R relevant items and N' of the N−R nonrelevant documents actually identified by the user.
The parameters α and β
• Equal weight: α = 0.5 and β = 0.5
• Positive relevance feedback: α = 1 and β = 0
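A sketch of the reformulation step, assuming the usual Rocchio-style form Q' = Q + α·centroid(relevant) − β·centroid(nonrelevant); the document and query values are taken from the earlier example, and the exact weighting in the original slides may differ:

```python
def rocchio(query, relevant, nonrelevant, alpha=0.5, beta=0.5):
    # Q' = Q + alpha * centroid(relevant) - beta * centroid(nonrelevant)
    t = len(query)
    q_new = list(query)
    if relevant:
        for i in range(t):
            q_new[i] += alpha * sum(d[i] for d in relevant) / len(relevant)
    if nonrelevant:
        for i in range(t):
            q_new[i] -= beta * sum(d[i] for d in nonrelevant) / len(nonrelevant)
    return q_new

Q = [0, 0, 2]
rel = [[2, 3, 5]]       # D1 judged relevant
nonrel = [[3, 7, 1]]    # D2 judged nonrelevant
print(rocchio(Q, rel, nonrel))  # [-0.5, -2.0, 4.0]
```

With α = 1 and β = 0 (positive feedback) only the relevant centroid is added and nonrelevant documents are ignored.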
The parameters α and β (Continued)
• The "dec hi" method: use all relevant information, but subtract only the highest-ranked nonrelevant document.
• Feedback with query splitting solves two problems: (1) the relevant documents identified do not form a tight cluster; (2) nonrelevant documents are scattered among certain relevant ones. [figure: homogeneous groups of relevant items]
Residual Collection with Partial Rank Freezing
• The previously retrieved items identified as relevant are kept "frozen", and the previously retrieved nonrelevant items are simply removed from the collection. (Example: assume 10 documents are relevant.)
Evaluating relevance feedback: test-and-control collection evaluation [figure: relevance feedback is applied on the test collection; refer to CLIR]
Document-Space Modification
• The relevance-feedback process improves the query formulation without modifying the document vectors.
• Document-space modification instead improves the document indexes.
• The relevant documents then resemble each other more closely than before, and can be retrieved more easily by a similar query.
[figure: relevant documents resemble each other more closely than before; nonrelevant documents are shifted away from the query]
Document-Space Modification (Continued)
• Add the terms of the query vector to the documents previously identified as relevant, and subtract the query terms from the documents previously identified as nonrelevant.
• This operation can also remove unimportant items from the collection, or shift them to an auxiliary portion.
• Problem: relevance assessments by users are subjective.
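A minimal sketch of this update, assuming a simple additive shift with a small step size; the step parameter and the nonnegativity clamp are illustrative choices, not from the slides:

```python
def modify_documents(query, docs, relevant_ids, nonrelevant_ids, step=0.1):
    # Shift relevant documents toward the query and nonrelevant ones away.
    # A small step keeps each iteration's change to term weights small.
    for doc_id, d in docs.items():
        if doc_id in relevant_ids:
            sign = step
        elif doc_id in nonrelevant_ids:
            sign = -step
        else:
            continue
        for i, q in enumerate(query):
            d[i] = max(0.0, d[i] + sign * q)   # keep weights nonnegative
    return docs

docs = {"D1": [2.0, 3.0, 5.0], "D2": [3.0, 7.0, 1.0]}
modify_documents([0, 0, 2], docs, {"D1"}, {"D2"})
print(docs)   # {'D1': [2.0, 3.0, 5.2], 'D2': [3.0, 7.0, 0.8]}
```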
Document-Space Modification(Continued) • Only small modifications of the term weights are allowed at each iteration. • Document-space-modification methods are difficult to evaluate in the laboratory, where no users are available to dynamically control the space modifications by submitting queries.
Automatic Document Classification
• Searching vs. browsing
• Disadvantages of inverted-index files:
  • information pertaining to a document is scattered among many different inverted-term lists
  • information relating to different documents with similar term assignments is not in close proximity in the file system
• Approaches:
  • inverted-index files (for searching) + clustered document collection (for browsing)
  • clustered file organization (for searching and browsing)
Typical Clustered File Organization [figure: documents grouped into clusters and superclusters, with centroids, supercentroids, and a hypercentroid covering the complete space]
Search Strategy for Clustered Documents [figure: a typical search path descends from the highest-level centroid through supercentroids and centroids down to individual documents]
Cluster Generation vs. Cluster Search
• The cluster structure is generated only once.
• Cluster maintenance can be carried out at relatively infrequent intervals.
• The cluster-generation process may therefore be slower and more expensive.
• Cluster search operations may have to be performed continually.
• Cluster search operations must therefore be carried out efficiently.
Hierarchical Cluster Generation
• Two strategies:
  • pairwise item similarities
  • heuristic methods
• Models:
  • Divisive clustering (top down): the complete collection is assumed to represent one complete cluster, which is then subsequently broken down into smaller pieces.
  • Agglomerative clustering (bottom up): individual item similarities are used as a starting point, and a gluing operation collects similar items, or groups, into larger groups.
[figure: a document–term matrix; term clustering takes the column viewpoint, document clustering the row viewpoint]
A Naive Program for Hierarchical Agglomerative Clustering
1. Compute all pairwise document–document similarity coefficients (N(N−1)/2 coefficients).
2. Place each of the N documents into a class of its own.
3. Form a new cluster by combining the most similar pair of current clusters i and j; update the similarity matrix by deleting the rows and columns corresponding to i and j; calculate the entries in the row corresponding to the new cluster i+j.
4. Repeat step 3 while the number of clusters left is greater than 1.
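The four steps can be sketched in Python. For simplicity this version recomputes cluster similarities with a linkage function instead of updating the matrix in place, which yields the same merges; linkage=max corresponds to single-link and linkage=min to complete-link (the example matrix is the one used in the worked example below):

```python
def naive_hac(sims, items, linkage=max):
    # Step 1 input: sims holds all N(N-1)/2 pairwise similarity
    # coefficients, keyed by a sorted pair of item names.
    clusters = [frozenset(x) for x in items]          # step 2
    merges = []
    while len(clusters) > 1:                          # step 4
        # Step 3: combine the most similar pair of current clusters.
        s, a, b = max((linkage(sims[tuple(sorted((i, j)))]
                               for i in clusters[a] for j in clusters[b]),
                       a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        merged = clusters[a] | clusters[b]
        merges.append((s, "".join(sorted(merged))))
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (a, b)] + [merged]
    return merges

sims = {("A","B"): .3, ("A","C"): .5, ("A","D"): .6, ("A","E"): .8,
        ("A","F"): .9, ("B","C"): .4, ("B","D"): .5, ("B","E"): .7,
        ("B","F"): .8, ("C","D"): .3, ("C","E"): .5, ("C","F"): .2,
        ("D","E"): .4, ("D","F"): .1, ("E","F"): .3}
print(naive_hac(sims, "ABCDEF"))   # single-link merge sequence
```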
How to Combine Clusters? • Single-link clustering • Each document must have a similarity exceeding a stated threshold value with at least one other document in the same class. • similarity between a pair of clusters is taken to be the similarity between the most similar pair of items • each cluster member will be more similar to at least one member in that same cluster than to any member of another cluster
How to Combine Clusters? (Continued)
• Complete-link clustering
  • Each document has a similarity to all other documents in the same class that exceeds the threshold value.
  • The similarity between the least similar pair of items from the two clusters is used as the cluster similarity.
  • Each cluster member is more similar to the most dissimilar member of its own cluster than to the most dissimilar member of any other cluster.
How to Combine Clusters? (Continued) • Group-average clustering • a compromise between the extremes of single-link and complete-link systems • each cluster member has a greater average similarity to the remaining members of that cluster than it does to all members of any other cluster
Example for Agglomerative Clustering
• Six items A–F, giving 6(6−1)/2 = 15 pairwise similarities, processed in decreasing order.
Single-link Clustering

Similarity matrix:
     A    B    C    D    E    F
A    .    .3   .5   .6   .8   .9
B    .3   .    .4   .5   .7   .8
C    .5   .4   .    .3   .5   .2
D    .6   .5   .3   .    .4   .1
E    .8   .7   .5   .4   .    .3
F    .9   .8   .2   .1   .3   .

Step 1. Pair AF, similarity 0.9: merge A and F at level 0.9.
  sim(AF, X) = max(sim(A, X), sim(F, X))

Updated matrix:
      AF   B    C    D    E
AF    .    .8   .5   .6   .8
B     .8   .    .4   .5   .7
C     .5   .4   .    .3   .5
D     .6   .5   .3   .    .4
E     .8   .7   .5   .4   .

Step 2. Pair AE, similarity 0.8: E joins the AF cluster at level 0.8.
  sim(AEF, X) = max(sim(AF, X), sim(E, X))
Single-link Clustering (Continued)

      AEF  B    C    D
AEF   .    .8   .5   .6
B     .8   .    .4   .5
C     .5   .4   .    .3
D     .6   .5   .3   .

Step 3. Pair BF, similarity 0.8: B joins the AEF cluster at level 0.8.
  (Note that E and B attach at the same level.)
  sim(ABEF, X) = max(sim(AEF, X), sim(B, X))

      ABEF  C    D
ABEF  .     .5   .6
C     .5    .    .3
D     .6    .3   .

Step 4. Pair BE, similarity 0.7: B and E are already in the same cluster, so nothing changes.
  sim(ABDEF, X) = max(sim(ABEF, X), sim(D, X))
Single-link Clustering (Continued)

        ABDEF  C
ABDEF   .      .5
C       .5     .

Step 5. Pair AD, similarity 0.6: D joins the cluster at level 0.6.
Step 6. Pair AC, similarity 0.5: C joins the cluster at level 0.5, completing the hierarchy.

Resulting dendrogram: A and F merge at 0.9, E and B attach at 0.8, D at 0.6, and C at 0.5.
Single-Link Clusters
• At similarity level (threshold) 0.7: one cluster {A, B, E, F} (links AF .9, AE .8, BF .8, BE .7), with C and D unclustered.
• At similarity level 0.5: all six items join a single cluster (adding links AD .6, AC .5, BD .5, CE .5).
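At a fixed similarity level, single-link clusters are exactly the connected components of the graph whose edges are the pairs at or above the threshold; a sketch using union–find on the example matrix:

```python
def single_link_clusters(sims, items, threshold):
    # Single-link clusters at a similarity level = connected components
    # of the graph whose edges are pairs with similarity >= threshold.
    parent = {x: x for x in items}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for (a, b), s in sims.items():
        if s >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for x in items:
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())

sims = {("A","B"): .3, ("A","C"): .5, ("A","D"): .6, ("A","E"): .8,
        ("A","F"): .9, ("B","C"): .4, ("B","D"): .5, ("B","E"): .7,
        ("B","F"): .8, ("C","D"): .3, ("C","E"): .5, ("C","F"): .2,
        ("D","E"): .4, ("D","F"): .1, ("E","F"): .3}
print(single_link_clusters(sims, "ABCDEF", 0.7))
# [['A', 'B', 'E', 'F'], ['C'], ['D']]
```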
Complete-link cluster generation

Similarity matrix:
     A    B    C    D    E    F
A    .    .3   .5   .6   .8   .9
B    .3   .    .4   .5   .7   .8
C    .5   .4   .    .3   .5   .2
D    .6   .5   .3   .    .4   .1
E    .8   .7   .5   .4   .    .3
F    .9   .8   .2   .1   .3   .

Pairs are processed in decreasing similarity order; a merge requires every cross pair between the clusters to be covered.

Step 1. Pair AF, similarity 0.9: new cluster AF at level 0.9.
  sim(AF, X) = min(sim(A, X), sim(F, X))
Step 2. Pair AE, similarity 0.8: check EF — not covered; pairs covered: (A,E) (A,F).
Step 3. Pair BF, similarity 0.8: check AB — not covered; pairs covered: (A,E) (A,F) (B,F).
Complete-link cluster generation (Continued)

Updated matrix (cluster similarity = min over cross pairs):
      AF   B    C    D    E
AF    .    .3   .2   .1   .3
B     .3   .    .4   .5   .7
C     .2   .4   .    .3   .5
D     .1   .5   .3   .    .4
E     .3   .7   .5   .4   .

Step 4. Pair BE, similarity 0.7: new cluster BE at level 0.7.
  Pairs covered: (A,E) (A,F) (B,E) (B,F).
Step 5. Pair AD, similarity 0.6: check DF — not covered.
  Pairs covered: (A,D) (A,E) (A,F) (B,E) (B,F).
Step 6. Pair AC, similarity 0.5: check CF — not covered.
  Pairs covered: (A,C) (A,D) (A,E) (A,F) (B,E) (B,F).
Step 7. Pair BD, similarity 0.5: check DE — not covered.
  Pairs covered: (A,C) (A,D) (A,E) (A,F) (B,D) (B,E) (B,F).
Complete-link cluster generation (Continued)

      AF   BE   C    D
AF    .    .3   .2   .1
BE    .3   .    .4   .4
C     .2   .4   .    .3
D     .1   .4   .3   .

Step 8. Pair CE, similarity 0.5: check BC — not covered.
  Pairs covered: (A,C) (A,D) (A,E) (A,F) (B,D) (B,E) (B,F) (C,E).
Step 9. Pair BC, similarity 0.4: CE (0.5) is already covered, so C joins BE, forming cluster BCE at level 0.4.
Step 10. Pair DE, similarity 0.4: BD (0.5) is covered, but CD is not — no merge.
  Pairs covered: (A,C) (A,D) (A,E) (A,F) (B,C) (B,D) (B,E) (B,F) (C,E) (D,E).
Step 11. Pair AB, similarity 0.3: AE (0.8) and BF (0.8) are covered, but CF and EF are not — no merge.
  Pairs covered: (A,B) (A,C) (A,D) (A,E) (A,F) (B,C) (B,D) (B,E) (B,F) (C,E) (D,E).
Complete-link cluster generation (Continued)

      AF   BCE  D
AF    .    .2   .1
BCE   .2   .    .3
D     .1   .3   .

Step 12. Pair CD, similarity 0.3: BD (0.5) and DE (0.4) are already covered, so D joins BCE, forming cluster BCDE at level 0.3.
Step 13. Pair EF, similarity 0.3: BF (0.8) is covered, but CF and DF are not — no merge.
  Pairs covered: (A,B) (A,C) (A,D) (A,E) (A,F) (B,C) (B,D) (B,E) (B,F) (C,D) (C,E) (D,E) (E,F).
Step 14. Pair CF, similarity 0.2: BF (0.8) and EF (0.3) are covered, but DF is not — no merge.
  Pairs covered: all except (D,F).
Complete-link cluster generation (Continued)

       AF   BCDE
AF     .    .1
BCDE   .1   .

Step 15. Pair DF, similarity 0.1: the last pair; AF and BCDE merge at level 0.1.

Final dendrogram: A and F merge at 0.9; B and E at 0.7; C joins at 0.4; D at 0.3; the two clusters join at 0.1.
Complete-link clusters
• At similarity level 0.7: clusters {A, F} (0.9) and {B, E} (0.7), with C and D unclustered.
• At similarity level 0.4: clusters {A, F} (0.9) and {B, C, E} (0.4), with D unclustered.
• At similarity level 0.3: clusters {A, F} (0.9) and {B, C, D, E} (0.3).
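A sketch of complete-link generation applying the least-similar-pair rule directly. Note that at level 0.4 the cluster {B, E} is equally similar (0.4) to C and to D; the tie is broken here toward the earlier item, which reproduces the merge order of the worked example:

```python
def complete_link_merges(sims, items):
    # Complete link: cluster similarity = similarity of the LEAST
    # similar cross pair; repeatedly merge the best pair of clusters.
    clusters = [frozenset(x) for x in items]
    def csim(a, b):
        return min(sims[tuple(sorted((i, j)))] for i in a for j in b)
    merges = []
    while len(clusters) > 1:
        # min over (-similarity, i, j) picks the highest similarity,
        # breaking ties toward earlier clusters in the list
        neg_s, a, b = min((-csim(clusters[i], clusters[j]), i, j)
                          for i in range(len(clusters))
                          for j in range(i + 1, len(clusters)))
        merged = clusters[a] | clusters[b]
        merges.append((-neg_s, "".join(sorted(merged))))
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (a, b)] + [merged]
    return merges

sims = {("A","B"): .3, ("A","C"): .5, ("A","D"): .6, ("A","E"): .8,
        ("A","F"): .9, ("B","C"): .4, ("B","D"): .5, ("B","E"): .7,
        ("B","F"): .8, ("C","D"): .3, ("C","E"): .5, ("C","F"): .2,
        ("D","E"): .4, ("D","F"): .1, ("E","F"): .3}
print(complete_link_merges(sims, "ABCDEF"))
# [(0.9, 'AF'), (0.7, 'BE'), (0.4, 'BCE'), (0.3, 'BCDE'), (0.1, 'ABCDEF')]
```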
The Behavior of Single-Link Cluster • The single-link process tends to produce a small number of large clusters that are characterized by a chaining effect. • Each element is usually attached to only one other member of the same cluster at each similarity level. • It is sufficient to remember the list of previously clustered single items.
The Behavior of Complete-Link Cluster • Complete-link process produces a much larger number of small, tightly linked groupings. • Each item in a complete-link cluster is guaranteed to resemble all other items in that cluster at the stated similarity level. • It is necessary to remember the list of all item pairs previously considered in the clustering process.
The Behavior of Complete-Link Cluster(Continued) • The complete-link clustering system may be better adapted to retrieval than the single-link clusters. • A complete-link cluster generation is more expensive to perform than a comparable single-link process.
How to Generate Similarities
Di = (di1, di2, ..., dit)    document vector for Di
Lj = (lj1, lj2, ..., ljnj)   inverted list for term Tj
  lji denotes the document identifier of the ith document listed under term Tj
  nj denotes the number of postings for term Tj

for j = 1 to t            (for each of t possible terms)
  for i = 1 to nj - 1     (for all entries on the jth list)
    compute sim(D(lji), D(lj,i+k)) for 1 <= k <= nj - i
  end for
end for
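The same computation in Python: each posting list contributes one product term to every pair of documents appearing on it, so only pairs sharing at least one term are ever touched. The weight layout below is an illustrative choice; the vectors are those of the earlier similarity example:

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarities(inverted, weights):
    # inverted: term -> list of document ids posting that term
    # weights[(doc, term)]: weight of the term in that document vector
    sims = defaultdict(float)
    for term, postings in inverted.items():
        for d1, d2 in combinations(sorted(postings), 2):
            sims[(d1, d2)] += weights[(d1, term)] * weights[(d2, term)]
    return dict(sims)

inverted = {"T1": [1, 2], "T2": [1, 2], "T3": [1, 2, 3]}
weights = {(1, "T1"): 2, (1, "T2"): 3, (1, "T3"): 5,
           (2, "T1"): 3, (2, "T2"): 7, (2, "T3"): 1,
           (3, "T3"): 2}
print(pairwise_similarities(inverted, weights))
# {(1, 2): 32, (1, 3): 10, (2, 3): 2}
```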
Similarity without Recomputation
for j = 1 to N                 (for each document in the collection)
  set S(i) = 0 for 1 <= i <= N
  for each term Tk in document Dj
    take up inverted list Lk
    for each document identifier Di on list Lk
      if i < j or S(i) = 1 then take up the next document
      else compute sim(Dj, Di) and set S(i) = 1
    end for
  end for
end for
The flags S(i) ensure that sim(Dj, Di) is computed only once even when Dj and Di share several terms.
Heuristic Clustering Methods
• Hierarchical clustering strategies
  • use all pairwise similarities between items
  • the cluster-generation process is relatively expensive
  • produce a unique set of well-formed clusters for each set of data, regardless of the order in which the similarity pairs are introduced into the clustering process
• Heuristic clustering methods
  • produce rough cluster arrangements at relatively little expense