
CS533 Information Retrieval

This lecture introduces assignment 2, discusses dictionary-based and successive variety stemming, and explores the vector space model for modern information retrieval. It covers topics such as accepting natural language queries, ranking documents based on vocabulary overlap, and the use of term weights for similarity calculations.


Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #4 February 8, 1999

  2. This lecture • Introduce assignment 2 • Dictionary-based and successive-variety stemming • “Modern” information retrieval • The vector space model

  3. Modern information retrieval • Accepts natural language queries • Assumes that useful documents and the query have substantial vocabulary overlap • The degree of overlap enables ranking

  4. Natural language queries • Queries are easier to formulate • A document paragraph can be selected and used as a query • Documents can be clustered by similarity • Hyperlinks can be created automatically

  5. Ranking documents • The top-ranked documents should be the best ones • Enables restricting output to the top n documents • The very top relevant (and non-relevant) documents can be used for query expansion (relevance feedback)

  6. Comparison to Boolean • In conventional Boolean retrieval: • the output size cannot be limited, and • the best documents may be spread anywhere in the output

  7. The Vector Space Model • Queries and documents are represented by vectors • Assumes document terms and query terms are independent

  8. The vector space model • Assume a collection vocabulary of t terms • This allows documents and queries to be represented by vectors of t dimensions

  9. The document vector • The ith document, D_i, is represented by D_i = (d_i1, …, d_it), where d_ik, for k = 1, …, t, is the weight of term k in the document

  10. The vector space model • Similarly, the query, Q, is represented by Q = (q_1, …, q_t), where q_k, for k = 1, …, t, is the weight of term k in the query

  11. Term weights • Binary: • w = 1 if the term is present in the document • w = 0 if not • Real number w representing the “importance” of the term: • w > 0 if the term is present in the document • w = 0 if not

  12. The vector space model • Computes the similarity between vectors X = (x_1, x_2, …, x_t) and Y = (y_1, y_2, …, y_t) • x_i is the weight of term i in the document and • y_i is the weight of term i in the query

  13. The vector space model • For binary weights: • Let |X| = the number of 1s in the document vector and • |Y| = the number of 1s in the query vector

  14. Similarity measures

  15. Similarity measures
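The formulas on slides 14 and 15 were images and did not survive in the transcript. Given the binary-weight setup on slide 13, the coefficients typically presented at this point in the IR literature (a reconstruction, not recovered from the slides) are, writing |X ∩ Y| for the number of terms shared by document and query:

```latex
% Reconstruction (not from the slides): standard binary similarity coefficients
\mathrm{sim}_{\text{inner}}(X,Y)   = |X \cap Y| \\
\mathrm{sim}_{\text{Dice}}(X,Y)    = \frac{2\,|X \cap Y|}{|X| + |Y|} \\
\mathrm{sim}_{\text{Jaccard}}(X,Y) = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|} \\
\mathrm{sim}_{\text{cosine}}(X,Y)  = \frac{|X \cap Y|}{\sqrt{|X|\,|Y|}}
```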

  16. Retrieval examples • Given: • The number of terms t • The number of documents N • The document/term (N × t) weight matrix • A query • Use the inner product to produce a ranked list of retrieved documents

  17. Example 1 - Boolean weights, t = 5072

      Term    1   2  ...  17  ... 456  ... 693  ... 5072
      Doc-1   0   1        0        1        0        0
      Doc-2   1   1        1        0        1        1
      ...
      Doc-N   0   1        0        1        1        0
      Query   1   1        0        0        1        0

  18. Retrieval example 1 • Sim(Q, Doc-1) = 1, • Sim(Q, Doc-2) = 3, and • ... • Sim(Q, Doc-N) = 2 • The ranked list is Doc-2 (3), Doc-N (2), Doc-1 (1)
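A minimal Python sketch of the inner-product ranking described on slide 16, using only the six term columns shown explicitly in Example 1 (the elided columns are assumed to be all zero); the function names are illustrative, not from the lecture. It reproduces the scores 1, 3, and 2 above.

```python
# Inner-product ranking over a document/term weight matrix (slide 16).
# Only the six explicitly shown columns of Example 1 are used; the
# elided "..." columns are assumed to be zero, so they contribute nothing.

def inner_product(x, y):
    """Inner-product similarity between two equal-length weight vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def rank_by_inner_product(docs, query):
    """Return (doc, score) pairs sorted by descending similarity."""
    scores = {name: inner_product(vec, query) for name, vec in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Columns correspond to terms 1, 2, 17, 456, 693, 5072.
docs = {
    "Doc-1": [0, 1, 0, 1, 0, 0],
    "Doc-2": [1, 1, 1, 0, 1, 1],
    "Doc-N": [0, 1, 0, 1, 1, 0],
}
query = [1, 1, 0, 0, 1, 0]

print(rank_by_inner_product(docs, query))
# [('Doc-2', 3), ('Doc-N', 2), ('Doc-1', 1)]
```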

  19. Example 2

      Term    1     2    ...  17   ... 456  ... 693  ... 5072
      Doc-1   0     0.3       0        0.5      0        0
      Doc-2   0.2   0.6       0.3      0        0.8      0.3
      ...
      Doc-N   0     0.2       0        0        0.6      0
      Query   0.3   0.7       0        0        0.7      0

  20. Retrieval Example 2 • Using the inner product (only query terms 1, 2, and 693 have nonzero weight): • For Doc-1: 0.3*0 + 0.7*0.3 + 0.7*0 = 0.21 • For Doc-2: 0.3*0.2 + 0.7*0.6 + 0.7*0.8 = 1.04 • For Doc-N: 0.3*0 + 0.7*0.2 + 0.7*0.6 = 0.56
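Reusing the rank_by_inner_product sketch from Example 1 with these real-valued weights reproduces the three scores (up to floating-point rounding):

```python
# Same ranking function as in the Example 1 sketch, now with the
# real-valued weights of Example 2 (same six term columns).
docs2 = {
    "Doc-1": [0, 0.3, 0, 0.5, 0, 0],
    "Doc-2": [0.2, 0.6, 0.3, 0, 0.8, 0.3],
    "Doc-N": [0, 0.2, 0, 0, 0.6, 0],
}
query2 = [0.3, 0.7, 0, 0, 0.7, 0]

print(rank_by_inner_product(docs2, query2))
# [('Doc-2', 1.04), ('Doc-N', 0.56), ('Doc-1', 0.21)]  (approximately)
```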

  21. Retrieval example 2 • The important query terms are 2 and 693 (weight 0.7 each) • Query term 1 is less important (weight 0.3) • Terms 2 and 693 also have high weights in Doc-2 • Doc-2 is therefore retrieved with high similarity and rank

  22. Retrieval Example 2 • To calculate the similarity for the other documents, we need their term weights

  23. Vector space model • Now assume terms are dependent • Each term i is represented by a vector T_i • Let d_ri be the weight of term i in document D_r, and • let q_si be the weight of term i in query Q_s

  24. Calculating the similarity

  25. Calculating the similarity
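The formulas on slides 24 and 25 were images and are missing from the transcript. From the definitions on slide 23 and the worked Example 3 below, the computation is presumably the following expansion (a reconstruction, not recovered from the slides):

```latex
% Reconstruction from slide 23's definitions and Example 3:
D_r = \sum_{i=1}^{t} d_{ri}\,T_i, \qquad Q_s = \sum_{j=1}^{t} q_{sj}\,T_j \\
\mathrm{sim}(D_r, Q_s) = D_r \cdot Q_s
  = \sum_{i=1}^{t}\sum_{j=1}^{t} d_{ri}\,q_{sj}\,(T_i \cdot T_j)
```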

  26. Calculating the similarity • To compute the similarity we need the term correlation T_i · T_j for every pair of terms T_i and T_j • These correlations are not easy to compute

  27. Example 3 • The doc/term weights are given • The correlations between terms are also given • Using the inner product, produce a ranked list of documents

  28. Example 3

      D1 = 2T1 + 3T2 + 5T3
      D2 = 3T1 + 7T2 + 1T3
      Q  = 0T1 + 0T2 + 2T3

      Term correlations:
            T1    T2    T3
      T1    1     0.5   0
      T2    0.5   1    -0.2
      T3    0    -0.2   1

  29. Example 3 • sim(D1, Q) = (2T1 + 3T2 + 5T3) · (0T1 + 0T2 + 2T3) = 4(T1·T3) + 6(T2·T3) + 10(T3·T3) = 4*0 - 6*0.2 + 10*1 = 8.8

  30. Example 3 • sim(D2, Q) = (3T1 + 7T2 + 1T3) · (0T1 + 0T2 + 2T3) = 6(T1·T3) + 14(T2·T3) + 2(T3·T3) = 6*0 - 14*0.2 + 2*1 = -0.8
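The same computation in code: with C the term-correlation matrix, the similarity is the double sum d^T C q. A minimal sketch (function name illustrative) that reproduces 8.8 and -0.8:

```python
# Dependent-term similarity (Example 3):
# sim(D, Q) = sum over i, j of d_i * q_j * (T_i . T_j), i.e. d^T C q.

def correlated_similarity(d, q, corr):
    """Similarity of two term-weight vectors under a term-correlation matrix."""
    return sum(
        d[i] * q[j] * corr[i][j]
        for i in range(len(d))
        for j in range(len(q))
    )

# Term correlations from slide 28.
corr = [
    [1.0,  0.5,  0.0],
    [0.5,  1.0, -0.2],
    [0.0, -0.2,  1.0],
]
d1 = [2, 3, 5]   # D1 = 2T1 + 3T2 + 5T3
d2 = [3, 7, 1]   # D2 = 3T1 + 7T2 + 1T3
q  = [0, 0, 2]   # Q  = 0T1 + 0T2 + 2T3

print(correlated_similarity(d1, q, corr))  # 8.8
print(correlated_similarity(d2, q, corr))  # -0.8 (up to rounding)
```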

  31. Term correlations • Approximating term correlations • We use the term/document weight matrix • Each term T_i is represented as a linear combination of the N document vectors • The problem now becomes one of document correlations

  32. Approximating the correlations
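Slide 32's formula was an image and is missing from the transcript. Given slide 31's construction (each term T_i represented by its column of weights across the N documents) and the simplifying assumption that the document vectors are pairwise orthonormal, the standard approximation is (a reconstruction, not recovered from the slide):

```latex
% Reconstruction, assuming the N document vectors are orthonormal:
T_i \cdot T_j \approx \sum_{r=1}^{N} d_{ri}\,d_{rj}
```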

  33. Term vectors • Clearly, documents are not semantically independent • Nevertheless, the vector space model achieves good retrieval results while assuming term independence

  34. Term weights • Good similarity values depend on good term weights • Different retrieval models use different assumptions and different weighting formulas

  35. Term weights • In the vector space model the weight of a term depends on: • A measure of recurrence • A measure of term discrimination • A normalization factor
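The slide does not name a specific formula, but these three factors match classic tf-idf weighting with length normalization. One common instantiation (an assumption, not stated in the transcript): tf_ik measures recurrence, log(N/n_k) measures discrimination (n_k = number of documents containing term k), and the denominator normalizes for document length:

```latex
% One common instantiation of the three factors (assumed, not from the slide):
w_{ik} = \frac{\mathrm{tf}_{ik}\,\log(N/n_k)}
              {\sqrt{\sum_{j=1}^{t}\bigl(\mathrm{tf}_{ij}\,\log(N/n_j)\bigr)^{2}}}
```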
