
Advanced Information-Retrieval Models



Presentation Transcript

  1. Advanced Information-Retrieval Models Hsin-Hsi Chen

  2. Mathematical Models for IR • Mathematical models for information retrieval • Boolean model: compare Boolean query statements with the term sets used to identify document content. • Probabilistic model: compute the relevance probabilities for the documents of a collection. • Vector-space model: represent both queries and documents by term sets, and compute global similarities between queries and documents.

  3. Basic Vector Space Model • Term-vector representation of documents, $D_i = (a_{i1}, a_{i2}, \ldots, a_{it})$, and queries, $Q_j = (q_{j1}, q_{j2}, \ldots, q_{jt})$ • t distinct terms are used to characterize content. • Each term is identified with a term vector T. • The T vectors are linearly independent. • Any vector is represented as a linear combination of the t term vectors. • The rth document $D_r$ can be represented as a document vector, written as $D_r = \sum_{i=1}^{t} a_{ri} T_i$

  4. Document representation in vector space • [Figure: a document vector in a two-dimensional vector space]

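  5. Similarity Measure • Measure similarity by the inner product of two vectors: $x \cdot y = |x|\,|y| \cos\theta$, where $\theta$ is the angle between $x$ and $y$. • Document-query similarity: with document vector $D_r = \sum_{i=1}^{t} a_{ri} T_i$ and query vector $Q_s = \sum_{j=1}^{t} q_{sj} T_j$, the similarity is $D_r \cdot Q_s = \sum_{i,j=1}^{t} a_{ri}\, q_{sj}\, T_i \cdot T_j$ • How to determine the vector components and term correlations?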

  6. Similarity Measure (Continued) • vector components

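  7. Similarity Measure (Continued) • The term correlations $T_i \cdot T_j$ are not available. Assumption: term vectors are orthogonal, so $T_i \cdot T_j = 0$ for $i \neq j$ and $T_i \cdot T_j = 1$ for $i = j$. • Assume that terms are uncorrelated; the query-document similarity then reduces to $sim(D_r, Q_s) = \sum_{i=1}^{t} a_{ri}\, q_{si}$. • Similarity measurement between documents: $sim(D_r, D_s) = \sum_{i=1}^{t} a_{ri}\, a_{si}$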

  8. Sample query-document similarity computation • D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + 1T3, Q = 0T1 + 0T2 + 2T3 • Similarity computations for uncorrelated terms: sim(D1,Q) = 2·0 + 3·0 + 5·2 = 10; sim(D2,Q) = 3·0 + 7·0 + 1·2 = 2

  9. Sample query-document similarity computation (Continued) • Term correlations:
           T1    T2    T3
      T1   1     0.5   0
      T2   0.5   1    -0.2
      T3   0    -0.2   1
      • Similarity computations for correlated terms: sim(D1,Q) = (2T1 + 3T2 + 5T3) · (0T1 + 0T2 + 2T3) = 4 T1·T3 + 6 T2·T3 + 10 T3·T3 = -6×0.2 + 10×1 = 8.8; sim(D2,Q) = (3T1 + 7T2 + 1T3) · (0T1 + 0T2 + 2T3) = 6 T1·T3 + 14 T2·T3 + 2 T3·T3 = -14×0.2 + 2×1 = -0.8
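The two sample computations can be reproduced with a short sketch. It assumes the similarity is the double sum of document weight times query weight times term correlation; the function name `similarity` and the list layout are illustrative choices, not part of the slides.

```python
# Term-term correlation matrix from the slide (rows/columns: T1, T2, T3).
CORRELATION = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, -0.2],
    [0.0, -0.2, 1.0],
]

def similarity(doc, query, correlation=None):
    """Sum doc[i] * query[j] * (Ti . Tj) over all term pairs.

    With correlation=None the terms are treated as orthogonal
    (Ti . Tj = 1 if i == j else 0), which is the plain inner product.
    """
    total = 0.0
    for i, a in enumerate(doc):
        for j, q in enumerate(query):
            if correlation is None:
                t = 1.0 if i == j else 0.0
            else:
                t = correlation[i][j]
            total += a * q * t
    return total

# Vectors from the slides: coefficients of T1, T2, T3.
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

uncorrelated = (similarity(D1, Q), similarity(D2, Q))
correlated = (similarity(D1, Q, CORRELATION), similarity(D2, Q, CORRELATION))
print(uncorrelated, [round(s, 2) for s in correlated])
```

With correlated terms the ranking gap widens: the negative T2·T3 correlation pulls D2, which is dominated by T2, below zero.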

  10. Advantages of similarity coefficients • The documents can be arranged in decreasing order of similarity with the query. • The size of the retrieved set can be adapted to the user's requirements by retrieving only the top few items. • Items retrieved early in a search may help generate improved query formulations using relevance feedback.

  11. Association Measures • Inner product • Dice coefficient • Jaccard coefficient • Cosine coefficient
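The formulas for these measures did not survive in the transcript. The sketch below uses the standard weighted-vector definitions, which is an assumption about what the slide showed; the function names are illustrative.

```python
import math

def inner(x, y):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(x, y))

def dice(x, y):
    """Dice coefficient: 2 x.y / (x.x + y.y). Undefined if both vectors are zero."""
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def jaccard(x, y):
    """Jaccard coefficient: x.y / (x.x + y.y - x.y). Undefined if both vectors are zero."""
    xy = inner(x, y)
    return xy / (inner(x, x) + inner(y, y) - xy)

def cosine(x, y):
    """Cosine coefficient: x.y / (|x| |y|). Undefined if either vector is zero."""
    return inner(x, y) / math.sqrt(inner(x, x) * inner(y, y))

# D1 and Q from the sample computation earlier in the deck.
D1 = [2, 3, 5]
Q = [0, 0, 2]
for measure in (inner, dice, jaccard, cosine):
    print(measure.__name__, round(measure(D1, Q), 4))
```

Unlike the raw inner product, the other three measures are normalized by vector length, so a long document does not score highly merely because its weights are large.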

  12. Vector Modifications • How to generate a query statement that reflects the information need? • How to generate improved query formulations? Through the relevance-feedback process. • Basic ideas • Documents relevant to a particular query resemble each other. • The reformulated query is expected to retrieve additional relevant items that are similar to the originally identified relevant items.

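  13. Relevance-feedback process • Maximize the average query-document similarity for the relevant documents. • Minimize the average query-document similarity for the nonrelevant documents. • Optimal query: $Q_{opt} = \frac{1}{R} \sum_{D_i \in \text{rel}} \frac{D_i}{|D_i|} - \frac{1}{N-R} \sum_{D_i \in \text{nonrel}} \frac{D_i}{|D_i|}$, where R and N-R are the assumed numbers of relevant and nonrelevant documents w.r.t. the query.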

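  14. Relevance-feedback process (Continued) • Problem: the sets of relevant and nonrelevant documents w.r.t. the queries are not known. • Approximation: $Q^{(i+1)} = Q^{(i)} + \alpha \frac{1}{|R'|} \sum_{D_i \in R'} \frac{D_i}{|D_i|} - \beta \frac{1}{|N'|} \sum_{D_i \in N'} \frac{D_i}{|D_i|}$, where R' and N' are the subsets of the R relevant and N-R nonrelevant documents identified by the user.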

  15. The parameters α and β • Equal weight: α = 0.5 and β = 0.5 • Positive relevance feedback: α = 1 and β = 0

  16. The parameters α and β (Continued) • "Dec hi" method: use all relevant information, but subtract only the highest-ranked nonrelevant document. • Feedback with query splitting addresses two problems: (1) the relevant documents identified do not form a tight cluster; (2) nonrelevant documents are scattered among certain relevant ones. • [Figure: two separate groups of homogeneous relevant items]
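A minimal sketch of the feedback update follows. It assumes the new query is the old query plus α times the mean of the length-normalized relevant vectors, minus β times the mean of the length-normalized nonrelevant vectors. The names `feedback` and `dec_hi` are hypothetical, and negative weights are left unclipped because the slides do not say how to treat them.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

def mean_normalized(docs, size):
    """Average of the length-normalized vectors; a zero vector if docs is empty."""
    total = [0.0] * size
    for doc in docs:
        for k, v in enumerate(normalize(doc)):
            total[k] += v
    return [v / len(docs) for v in total] if docs else total

def feedback(query, relevant, nonrelevant, alpha=0.5, beta=0.5):
    """One relevance-feedback iteration over the judged documents."""
    pos = mean_normalized(relevant, len(query))
    neg = mean_normalized(nonrelevant, len(query))
    return [q + alpha * p - beta * n for q, p, n in zip(query, pos, neg)]

def dec_hi(query, relevant, ranked_nonrelevant, alpha=0.5, beta=0.5):
    """'Dec hi': all relevant documents, but only the top-ranked nonrelevant one."""
    return feedback(query, relevant, ranked_nonrelevant[:1], alpha, beta)

# Illustrative data (not from the slides).
query = [0.0, 0.0, 2.0]
relevant = [[0.0, 0.0, 5.0]]
nonrelevant = [[3.0, 0.0, 0.0], [0.0, 4.0, 0.0]]  # in decreasing rank order

equal_weight = feedback(query, relevant, nonrelevant)         # alpha = beta = 0.5
positive_only = feedback(query, relevant, nonrelevant, 1, 0)  # alpha = 1, beta = 0
highest_only = dec_hi(query, relevant, nonrelevant)
```

Setting β = 0 ignores the nonrelevant judgments entirely, which corresponds to the positive relevance feedback setting on the slide.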

  17. Residual Collection with Partial Rank Freezing • The previously retrieved items identified as relevant are kept "frozen", and the previously retrieved nonrelevant items are simply removed from the collection. • Assume 10 documents are relevant.
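One way to picture partial rank freezing is sketched below. It assumes that frozen relevant items keep their original rank positions and that the slots vacated by removed nonrelevant items are filled, in order, by not-yet-seen documents from the feedback search. That slot-filling detail and the function name `freeze_ranking` are assumptions, not something the slide states.

```python
def freeze_ranking(first_ranking, judged_relevant, second_ranking):
    """Build the ranking used to evaluate a feedback iteration.

    first_ranking:   documents shown to the user, in rank order
    judged_relevant: set of documents from first_ranking judged relevant
    second_ranking:  ranking of the collection produced by the feedback query
    """
    seen = set(first_ranking)
    # Residual collection: documents the user has not seen yet, in new rank order.
    fresh = (doc for doc in second_ranking if doc not in seen)
    final = []
    for doc in first_ranking:
        if doc in judged_relevant:
            final.append(doc)          # frozen at its previous rank
        else:
            replacement = next(fresh, None)
            if replacement is not None:
                final.append(replacement)  # removed nonrelevant item gives up its slot
    final.extend(fresh)                # remaining unseen documents follow
    return final

# Illustrative data (not from the slides).
first = ["d1", "d2", "d3"]
relevant = {"d1", "d3"}
second = ["d3", "d5", "d2", "d1", "d4"]
frozen = freeze_ranking(first, relevant, second)
```

Freezing keeps the evaluation from crediting the feedback query merely for re-ranking relevant items the user has already seen.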

  18. Evaluating relevance feedback: test-and-control collection evaluation • Refer to CLIR

  19. Document-Space Modification • The relevance-feedback process improves query formulations without modifying document vectors. • Document-space modification improves document indexes. • The documents will resemble each other more closely than before, and can be retrieved easily by a similar query.

  20. Relevant documents resemble each other more closely than before. Nonrelevant documents are shifted away from the query.

  21. Document-Space Modification (Continued) • Add the terms from the query vector to the documents previously identified as relevant, and subtract the query terms from the documents previously identified as nonrelevant. • This operation can remove unimportant items from the collection, or shift them to an auxiliary portion. • Problem: relevance assessments by users are subjective.

  22. Document-Space Modification (Continued) • Only small modifications of the term weights are allowed at each iteration. • Document-space-modification methods are difficult to evaluate in the laboratory, where no users are available to dynamically control the space modifications by submitting queries.
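The modification rule described above can be sketched as a small, signed shift of each document vector along the query. The step size of 0.1 and the clipping of weights at zero are assumptions made for illustration; the slides only say that the changes per iteration must be small.

```python
def modify_document(doc, query, is_relevant, step=0.1):
    """Shift a document vector toward the query if it was judged relevant,
    away from it otherwise.

    step is a hypothetical small rate that keeps each change modest;
    weights are clipped at zero (an assumption) so that no term weight
    becomes negative.
    """
    sign = 1.0 if is_relevant else -1.0
    return [max(0.0, d + sign * step * q) for d, q in zip(doc, query)]

# Vectors reused from the sample computation earlier in the deck.
D1 = [2.0, 3.0, 5.0]
Q = [0.0, 0.0, 2.0]

closer = modify_document(D1, Q, is_relevant=True)    # T3 weight grows
farther = modify_document(D1, Q, is_relevant=False)  # T3 weight shrinks
```

Because the changes persist in the index, a later user who submits a similar query benefits from earlier users' judgments, which is also why subjective assessments are a concern.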

  23. Automatic Document Classification • Searching vs. browsing • Disadvantages of using inverted index files: (1) information pertaining to a document is scattered among many different inverted-term lists; (2) information relating to different documents with similar term assignments is not in close proximity in the file system. • Approaches: (1) inverted-index files (for searching) + clustered document collection (for browsing); (2) clustered file organization (for searching and browsing).

  24. Typical Clustered File Organization • [Figure: the complete space is partitioned into superclusters and clusters; the hierarchy runs hypercentroid, supercentroids, centroids, documents]

  25. Search Strategy for Clustered Documents • [Figure: a typical search path descends from the highest-level centroid through supercentroids and centroids to documents]
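A top-down search of this kind can be sketched as follows. The sketch assumes a tree whose internal nodes hold centroids, that the search follows only the single best-matching child at each level, and that matching uses the inner product. The `Node` class and `search` function are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    centroid: list
    children: list = field(default_factory=list)    # lower-level clusters
    documents: dict = field(default_factory=dict)   # doc id -> vector (leaf clusters only)

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def search(root, query):
    """Descend from the top centroid to one leaf cluster, then rank its documents."""
    node = root
    while node.children:
        # Follow the child whose centroid is most similar to the query.
        node = max(node.children, key=lambda child: inner(child.centroid, query))
    return sorted(node.documents,
                  key=lambda doc_id: inner(node.documents[doc_id], query),
                  reverse=True)

# Illustrative two-cluster hierarchy (not from the slides).
cluster_a = Node(centroid=[1.0, 0.0], documents={"a1": [2.0, 0.0], "a2": [1.0, 0.0]})
cluster_b = Node(centroid=[0.0, 1.0], documents={"b1": [0.0, 3.0], "b2": [1.0, 1.0]})
root = Node(centroid=[0.5, 0.5], children=[cluster_a, cluster_b])

hits = search(root, [0.0, 1.0])
```

Only the centroids along one path and the documents of one cluster are compared with the query, which is what makes the clustered organization cheaper to search than a full scan.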

  26. Cluster Generation VS Cluster Search • Cluster structure is generated only once. • Cluster maintenance can be carried out at relatively infrequent intervals. • Cluster generation process may be slower and more expensive. • Cluster search operations may have to be performed continually. • Cluster search operations must be carried out efficiently.

  27. Hierarchical Cluster Generation • Two strategies: pairwise item similarities; heuristic methods • Models • Divisive clustering (top down): the complete collection is assumed to represent one complete cluster; the collection is then broken down into smaller pieces. • Agglomerative clustering (bottom up): individual item similarities are used as a starting point; a gluing operation collects similar items, or groups, into larger groups.

  28. Term clustering: from column viewpoint Document clustering: from row viewpoint

  29. A Naive Program for Hierarchical Agglomerative Clustering • 1. Compute all pairwise document-document similarity coefficients (N(N-1)/2 coefficients). • 2. Place each of the N documents into a class of its own. • 3. Form a new cluster by combining the most similar pair of current clusters i and j; update the similarity matrix by deleting the rows and columns corresponding to i and j; calculate the entries in the row corresponding to the new cluster i+j. • 4. Repeat step 3 if the number of clusters left is greater than 1.
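The four steps can be sketched directly. In this sketch the rule for filling in the new cluster's row is passed in as `combine`, with `max` and `min` used as two possible choices; the function name `agglomerate` and the dictionary layout are assumptions. The demo reuses the six-item similarity matrix from the worked example later in the deck.

```python
from itertools import combinations

def agglomerate(items, sim, combine=max):
    """Naive hierarchical agglomerative clustering.

    items:   list of item labels
    sim:     dict mapping frozenset({a, b}) -> similarity for every item pair
    combine: derives sim(i+j, X) from sim(i, X) and sim(j, X)
    Returns the merges as (cluster_i, cluster_j, similarity), in the order made.
    """
    # Steps 1 and 2: one cluster per item, with all pairwise similarities.
    clusters = [frozenset([x]) for x in items]
    table = {frozenset((frozenset([a]), frozenset([b]))): sim[frozenset((a, b))]
             for a, b in combinations(items, 2)}
    merges = []
    while len(clusters) > 1:                       # step 4
        # Step 3: pick the most similar pair of current clusters.
        i, j = max(combinations(clusters, 2), key=lambda pair: table[frozenset(pair)])
        level = table.pop(frozenset((i, j)))
        merged = i | j
        others = [c for c in clusters if c not in (i, j)]
        for x in others:
            # Delete the old rows for i and j; add the row for the new cluster.
            table[frozenset((merged, x))] = combine(table.pop(frozenset((i, x))),
                                                    table.pop(frozenset((j, x))))
        clusters = others + [merged]
        merges.append((i, j, level))
    return merges

LABELS = "ABCDEF"
ROWS = [
    [0, .3, .5, .6, .8, .9],
    [.3, 0, .4, .5, .7, .8],
    [.5, .4, 0, .3, .5, .2],
    [.6, .5, .3, 0, .4, .1],
    [.8, .7, .5, .4, 0, .3],
    [.9, .8, .2, .1, .3, 0],
]
SIM = {frozenset((LABELS[r], LABELS[c])): ROWS[r][c]
       for r in range(6) for c in range(r + 1, 6)}

levels_max = [level for _, _, level in agglomerate(list(LABELS), SIM, max)]
levels_min = [level for _, _, level in agglomerate(list(LABELS), SIM, min)]
```

Scanning all cluster pairs on every pass makes this cubic in the number of items, which is acceptable for a six-item illustration but not for a real collection. Ties are broken by scan order, so the order of merges at equal similarity may differ from a hand-worked example even though the merge levels agree.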

  30. How to Combine Clusters? • Single-link clustering • Each document must have a similarity exceeding a stated threshold value with at least one other document in the same class. • similarity between a pair of clusters is taken to be the similarity between the most similar pair of items • each cluster member will be more similar to at least one member in that same cluster than to any member of another cluster

  31. How to Combine Clusters? (Continued) • Complete-link clustering • Each document has a similarity to all other documents in the same class that exceeds the threshold value. • The similarity between the least similar pair of items from the two clusters is used as the cluster similarity. • Each cluster member is more similar to the most dissimilar member of that cluster than to the most dissimilar member of any other cluster.

  32. How to Combine Clusters? (Continued) • Group-average clustering • a compromise between the extremes of single-link and complete-link systems • each cluster member has a greater average similarity to the remaining members of that cluster than it does to all members of any other cluster
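The three rules differ only in how the item-to-item similarities between two clusters are summarized. In the sketch below, group average is taken as the mean over all cross-cluster item pairs, which is one common definition and an assumption here; the function names are illustrative. The four similarities come from the six-item example matrix used later in the deck.

```python
from statistics import mean

def pair_sims(c1, c2, sim):
    """Similarities of all item pairs with one item from each cluster."""
    return [sim[frozenset((a, b))] for a in c1 for b in c2]

def single_link(c1, c2, sim):
    return max(pair_sims(c1, c2, sim))    # most similar pair

def complete_link(c1, c2, sim):
    return min(pair_sims(c1, c2, sim))    # least similar pair

def group_average(c1, c2, sim):
    return mean(pair_sims(c1, c2, sim))   # average over all cross pairs

# Item similarities between clusters {A, F} and {B, E}.
SIM = {
    frozenset("AB"): 0.3,
    frozenset("AE"): 0.8,
    frozenset("FB"): 0.8,
    frozenset("FE"): 0.3,
}
AF, BE = {"A", "F"}, {"B", "E"}
scores = (single_link(AF, BE, SIM), complete_link(AF, BE, SIM), group_average(AF, BE, SIM))
```

For these two clusters the single-link score is 0.8 and the complete-link score is 0.3, with the group average between them, which illustrates why single link merges them early and complete link merges them late.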

  33. Example for Agglomerative Clustering • Items A-F (6 items) • 6(6-1)/2 = 15 pairwise similarities, processed in decreasing order

  34. Single-link Clustering • Similarity matrix:
          A    B    C    D    E    F
      A   .    .3   .5   .6   .8   .9
      B   .3   .    .4   .5   .7   .8
      C   .5   .4   .    .3   .5   .2
      D   .6   .5   .3   .    .4   .1
      E   .8   .7   .5   .4   .    .3
      F   .9   .8   .2   .1   .3   .
      • Step 1. AF 0.9: A and F are joined at level 0.9. sim(AF,X) = max(sim(A,X), sim(F,X))
          AF   B    C    D    E
      AF  .    .8   .5   .6   .8
      B   .8   .    .4   .5   .7
      C   .5   .4   .    .3   .5
      D   .6   .5   .3   .    .4
      E   .8   .7   .5   .4   .
      • Step 2. AE 0.8: E joins AF at level 0.8. sim(AEF,X) = max(sim(AF,X), sim(E,X))

  35. Single-link Clustering (Continued)
           AEF  B    C    D
      AEF  .    .8   .5   .6
      B    .8   .    .4   .5
      C    .5   .4   .    .3
      D    .6   .5   .3   .
      • Step 3. BF 0.8: B joins AEF at level 0.8. Note that E and B are on the same level. sim(ABEF,X) = max(sim(AEF,X), sim(B,X))
            ABEF  C    D
      ABEF  .     .5   .6
      C     .5    .    .3
      D     .6    .3   .
      • Step 4. BE 0.7: the structure is unchanged, since B and E are already in the same cluster. sim(ABDEF,X) = max(sim(ABEF,X), sim(D,X))

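  36. Single-link Clustering (Continued) • Step 5. AD 0.6: D joins ABEF at level 0.6.
             ABDEF  C
      ABDEF  .      .5
      C      .5     .
      • Step 6. AC 0.5: C joins ABDEF at level 0.5. • Final structure: A and F joined at 0.9; E and B joined at 0.8; D joined at 0.6; C joined at 0.5.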

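  37. Single-Link Clusters • Similarity level 0.7 (i.e., similarity threshold): one cluster {A, B, E, F} linked by AF .9, AE .8, BF .8, BE .7; C and D remain unclustered. • Similarity level 0.5 (i.e., similarity threshold): all six items form one cluster; D is attached at .6 and C at .5.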
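Single-link clusters at a given threshold are the connected components of the graph that links every pair whose similarity reaches the threshold. A minimal sketch using a union-find structure follows; the function name `threshold_clusters` is illustrative and the matrix is the six-item example from the slides.

```python
def threshold_clusters(items, sim, threshold):
    """Group items connected by a chain of pairs with similarity >= threshold."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair, s in sim.items():
        if s >= threshold:
            a, b = tuple(pair)
            parent[find(a)] = find(b)   # link the two components

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())

LABELS = "ABCDEF"
ROWS = [
    [0, .3, .5, .6, .8, .9],
    [.3, 0, .4, .5, .7, .8],
    [.5, .4, 0, .3, .5, .2],
    [.6, .5, .3, 0, .4, .1],
    [.8, .7, .5, .4, 0, .3],
    [.9, .8, .2, .1, .3, 0],
]
SIM = {frozenset((LABELS[r], LABELS[c])): ROWS[r][c]
       for r in range(6) for c in range(r + 1, 6)}

at_07 = threshold_clusters(LABELS, SIM, 0.7)
at_05 = threshold_clusters(LABELS, SIM, 0.5)
```

An item needs only one sufficiently similar neighbor to join a cluster, which is the source of the chaining behavior of single-link clustering.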

  38. Complete-link cluster generation • Similarity matrix:
          A    B    C    D    E    F
      A   .    .3   .5   .6   .8   .9
      B   .3   .    .4   .5   .7   .8
      C   .5   .4   .    .3   .5   .2
      D   .6   .5   .3   .    .4   .1
      E   .8   .7   .5   .4   .    .3
      F   .9   .8   .2   .1   .3   .
      • Columns: step number and similarity pair | check operations | complete-link structure and pairs covered
      1. AF 0.9 | new | A and F are joined at level 0.9; sim(AF,X) = min(sim(A,X), sim(F,X))
      2. AE 0.8 | check EF | pairs covered: (A,E) (A,F)
      3. BF 0.8 | check AB | pairs covered: (A,E) (A,F) (B,F)

  39. Complete-link cluster generation (Continued) • Similarity matrix after A and F are joined:
          AF   B    C    D    E
      AF  .    .3   .2   .1   .3
      B   .3   .    .4   .5   .7
      C   .2   .4   .    .3   .5
      D   .1   .5   .3   .    .4
      E   .3   .7   .5   .4   .
      • Columns: step number and similarity pair | check operations | complete-link structure and pairs covered
      4. BE 0.7 | new | B and E are joined at level 0.7
      5. AD 0.6 | check DF | pairs covered: (A,D)(A,E)(A,F) (B,E)(B,F)
      6. AC 0.5 | check CF | pairs covered: (A,C)(A,D)(A,E)(A,F) (B,E)(B,F)
      7. BD 0.5 | check DE | pairs covered: (A,C)(A,D)(A,E)(A,F) (B,D)(B,E)(B,F)

  40. Complete-link cluster generation (Continued) • Similarity matrix after B and E are joined:
          AF   BE   C    D
      AF  .    .3   .2   .1
      BE  .3   .    .4   .4
      C   .2   .4   .    .3
      D   .1   .4   .3   .
      • Columns: step number and similarity pair | check operations | complete-link structure and pairs covered (× marks a pair not yet covered)
      8. CE 0.5 | check BC | pairs covered: (A,C)(A,D)(A,E)(A,F) (B,D)(B,E)(B,F)(C,E)
      9. BC 0.4 | check CE 0.5 (in the checklist) | C joins BE at level 0.4
      10. DE 0.4 | check BD 0.5, DE | pairs covered: (A,C)(A,D)(A,E)(A,F) (B,C)(B,D)(B,E)(B,F) (C,E)(D,E)
      11. AB 0.3 | check AC 0.5, AE 0.8, BF 0.8, CF ×, EF × | pairs covered: (A,B)(A,C)(A,D)(A,E)(A,F) (B,C)(B,D)(B,E)(B,F) (C,E)(D,E)

  41. Complete-link cluster generation (Continued) • Similarity matrix after C joins BE:
           AF   BCE  D
      AF   .    .2   .1
      BCE  .2   .    .3
      D    .1   .3   .
      • Columns: step number and similarity pair | check operations | complete-link structure and pairs covered (× marks a pair not yet covered)
      12. CD 0.3 | check BD 0.5, DE 0.4 | D joins BCE at level 0.3
      13. EF 0.3 | check BF 0.8, CF ×, DF × | pairs covered: (A,B)(A,C)(A,D)(A,E)(A,F) (B,C)(B,D)(B,E)(B,F) (C,D)(C,E)(D,E)(E,F)
      14. CF 0.2 | check BF 0.8, EF 0.3, DF × | pairs covered: (A,B)(A,C)(A,D)(A,E)(A,F) (B,C)(B,D)(B,E)(B,F) (C,D)(C,E)(C,F)(D,E)(E,F)

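  42. Complete-link cluster generation (Continued) • Similarity matrix after D joins BCE:
            AF   BCDE
      AF    .    .1
      BCDE  .1   .
      15. DF 0.1 | last pair | AF and BCDE are joined at level 0.1
      • Final structure: A and F joined at 0.9; B and E joined at 0.7; C joins BE at 0.4; D joins BCE at 0.3; AF and BCDE joined at 0.1.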

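  43. Complete-link clusters • Similarity level 0.7: clusters {A, F} (AF 0.9) and {B, E} (BE 0.7); C and D remain single. • Similarity level 0.4: clusters {A, F} (AF 0.9) and {B, C, E} (BE 0.7, CE 0.5, BC 0.4); D remains single. • Similarity level 0.3: clusters {A, F} (AF 0.9) and {B, C, D, E} (BE 0.7, BD 0.5, CE 0.5, BC 0.4, DE 0.4, CD 0.3).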

  44. The Behavior of Single-Link Cluster • The single-link process tends to produce a small number of large clusters that are characterized by a chaining effect. • Each element is usually attached to only one other member of the same cluster at each similarity level. • It is sufficient to remember the list of previously clustered single items.

  45. The Behavior of Complete-Link Cluster • Complete-link process produces a much larger number of small, tightly linked groupings. • Each item in a complete-link cluster is guaranteed to resemble all other items in that cluster at the stated similarity level. • It is necessary to remember the list of all item pairs previously considered in the clustering process.

  46. The Behavior of Complete-Link Cluster(Continued) • The complete-link clustering system may be better adapted to retrieval than the single-link clusters. • A complete-link cluster generation is more expensive to perform than a comparable single-link process.

  47. How to Generate Similarity • D_i = (d_i1, d_i2, ..., d_it): document vector for D_i • L_j = (l_j1, l_j2, ..., l_jnj): inverted list for term T_j, where l_ji denotes the document identifier of the ith document listed under term T_j and n_j denotes the number of postings for term T_j
      for j = 1 to t              (for each of the t possible terms)
        for i = 1 to n_j          (for all n_j entries on the jth list)
          compute sim(D_{l_ji}, D_{l_jk}) for i+1 <= k <= n_j
        end for
      end for

  48. Similarity without Recomputation
      for j = 1 to N                (for each document in the collection)
        set S_ji = 0, 1 <= i <= N
        for k = 1 to n_j            (for each term in the document)
          take up inverted list L_k
          for i = 1 to n_k          (for each document identifier on the list)
            if i < j or S_ji = 1
              take up next document D_i
            else
              compute sim(D_j, D_i)
              set S_ji = 1
          end for
        end for
      end for
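The idea of the two procedures above, computing similarities only for document pairs that share a term and never computing the same pair twice, can be sketched in a few lines. The sketch assumes documents are given as term-to-weight dictionaries and uses the result dictionary itself in place of the S flags; the function names and sample data are illustrative.

```python
from collections import defaultdict

def inner(d1, d2):
    """Inner product over the terms two documents share."""
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

def pairwise_similarities(docs, sim=inner):
    """docs: dict mapping doc id -> {term: weight}.

    Returns {(a, b): similarity} for every pair a < b that shares a term.
    """
    # Build the inverted lists; postings stay in increasing doc-id order.
    inverted = defaultdict(list)
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            inverted[term].append(doc_id)

    result = {}
    for postings in inverted.values():
        for pos, a in enumerate(postings):
            for b in postings[pos + 1:]:
                if (a, b) in result:
                    continue            # pair already handled under another term
                result[(a, b)] = sim(docs[a], docs[b])
    return result

# Illustrative collection (not from the slides).
DOCS = {
    1: {"ir": 1, "model": 2},
    2: {"ir": 3, "model": 1},
    3: {"cluster": 1},
}
similarities = pairwise_similarities(DOCS)
```

Documents 1 and 2 share two terms, yet their similarity is computed once; document 3 shares no term with the others, so no coefficient is computed for it at all.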

  49. Heuristic Clustering Methods • Hierarchical clustering strategies • use all pairwise similarities between items • cluster generation is relatively expensive • produce a unique set of well-formed clusters for each set of data, regardless of the order in which the similarity pairs are introduced into the clustering process • Heuristic clustering methods • produce rough cluster arrangements at relatively little expense
