
Vector Space Methods for Document Similarity Measurement

Learn how to measure document similarity using vector space methods without assuming exact matches. Understand n-dimensional space concepts and compute similarity using vectors. Explore cosine similarity and basic vector methods.

rfoote


Presentation Transcript


  1. CS 430: Information Discovery Lecture 4 Vector Methods

  2. Course Administration • Assignment 1 should be posted tomorrow. Submission instructions will be added on Monday.

  3. Vector Space Methods
     Problem: Given two text documents, how similar are they?
     Vector space methods that measure similarity do not assume exact matches.
     Example: Here are three documents. How similar are they?
       d1: ant ant bee
       d2: dog bee dog hog dog ant dog
       d3: cat gnu dog eel fox
     Documents can be any length from one word to thousands. One document may be a query.

  4. Vector Space Methods: Concept
     The space is n-dimensional, where n is the total number of different terms used to index a set of documents.
     Each document is represented by a vector whose component in each dimension equals the (weighted) number of times that the corresponding term appears in the document.
     Similarity between two documents is measured by the angle between their vectors: the smaller the angle, the more similar the documents.
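The representation described on this slide can be sketched in a few lines of Python. This is only an illustration, assuming unweighted raw counts and whitespace tokenization; the names `docs`, `vocabulary`, and `to_vector` are hypothetical and do not come from the lecture.

```python
from collections import Counter

# The three example documents from slide 3.
docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

# One dimension per distinct term in the collection, in alphabetical order.
vocabulary = sorted({term for text in docs.values() for term in text.split()})

def to_vector(text, vocabulary):
    """Return the term-count vector of `text` over `vocabulary`."""
    counts = Counter(text.split())
    # Counter returns 0 for terms that do not occur in the document.
    return [counts[term] for term in vocabulary]

vectors = {name: to_vector(text, vocabulary) for name, text in docs.items()}

print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```

Here n is 8, the number of distinct terms, so each document becomes a vector with 8 components.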

  5. Three Terms Represented in 3 Dimensions
     [Figure: document vectors d1 and d2 plotted against the term axes t1, t2, and t3, with the angle θ between them.]

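  6. Vector Space Revision
     $\mathbf{x} = (x_1, x_2, x_3, \ldots, x_n)$ is a vector in an n-dimensional vector space.
     The length of $\mathbf{x}$ is given by (extension of Pythagoras's theorem):
       $|\mathbf{x}|^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$
     If $\mathbf{x}_1$ and $\mathbf{x}_2$ are vectors, the inner product (or dot product) is given by:
       $\mathbf{x}_1 \cdot \mathbf{x}_2 = x_{11}x_{21} + x_{12}x_{22} + x_{13}x_{23} + \cdots + x_{1n}x_{2n}$
     The cosine of the angle $\theta$ between the vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ is:
       $\cos(\theta) = \dfrac{\mathbf{x}_1 \cdot \mathbf{x}_2}{|\mathbf{x}_1|\,|\mathbf{x}_2|}$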
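The three definitions on this slide translate directly into code. The following is a minimal sketch using plain Python lists; the function names `dot`, `length`, and `cosine` and the sample vectors are illustrative choices, not part of the lecture.

```python
import math

def dot(x, y):
    """Inner (dot) product: sum of the products of matching components."""
    return sum(a * b for a, b in zip(x, y))

def length(x):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(dot(x, x))

def cosine(x, y):
    """Cosine of the angle between x and y (undefined for a zero vector)."""
    return dot(x, y) / (length(x) * length(y))

# Hypothetical two-dimensional vectors chosen so the arithmetic is easy to check.
x1 = [3.0, 4.0]
x2 = [4.0, 3.0]

print(length(x1))      # 5.0
print(dot(x1, x2))     # 24.0
print(cosine(x1, x2))  # 0.96
```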

  7. Basic Method: Incidence Array (No Weighting)
     terms in d1 -> ant ant bee
     terms in d2 -> dog bee dog hog dog ant dog
     terms in d3 -> cat gnu dog eel fox

     terms  ant  bee  cat  dog  eel  fox  gnu  hog  length
     d1      1    1                                  √2
     d2      1    1         1                   1    √4
     d3                1    1    1    1    1         √5

     Weights: tij = 1 if document i contains term j and zero otherwise

  8. Example 1 (continued)
     Similarity of documents in example:

           d1    d2    d3
     d1    1     0.71  0
     d2    0.71  1     0.22
     d3    0     0.22  1
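The similarities in Example 1 can be reproduced with a short script. This is a sketch under the slide's assumption of incidence weights (1 if the term occurs, 0 otherwise); the variable and function names are illustrative.

```python
import math

docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}
terms = sorted({term for text in docs.values() for term in text.split()})

# Incidence weights: 1 if the document contains the term, 0 otherwise.
incidence = {
    name: [1 if term in text.split() else 0 for term in terms]
    for name, text in docs.items()
}

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

names = list(docs)
similarity = [[cosine(incidence[a], incidence[b]) for b in names] for a in names]

for name, row in zip(names, similarity):
    print(name, [round(value, 2) for value in row])
```

For instance, d1 and d2 share two terms (ant, bee), and their lengths are √2 and √4, so the cosine is 2 / (√2 × 2) ≈ 0.71.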

  9. Vector Similarity Computation: Summary
     Documents in a collection are assigned terms from a set of n terms.
     The term assignment array T is defined as follows:
       if term j does not occur in document i, $t_{ij} = 0$
       if term j occurs in document i, $t_{ij}$ is greater than zero
       (the value of $t_{ij}$ is called the weight of term j in document i)
     Similarity between $d_i$ and $d_j$ is defined as:
       $\cos(d_i, d_j) = \dfrac{\sum_{k=1}^{n} t_{ik}\, t_{jk}}{|d_i|\,|d_j|}$

  10. Simple Uses of Vector Similarity in Information Retrieval
      Threshold: For query q, retrieve all documents with similarity more than a threshold, t, e.g., t = 0.50.
      Ranking: For query q, return the n most similar documents ranked in order of similarity. [This is the standard practice.]
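Both uses can be sketched on the three example documents. The code below assumes raw term counts as weights and treats the query as a short document; the query "ant dog" and the helper names `retrieve_above` and `rank_top` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors (Counters)."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)

def retrieve_above(query, documents, threshold):
    """Threshold: names of all documents whose similarity exceeds `threshold`."""
    q = Counter(query.split())
    return sorted(
        name for name, text in documents.items()
        if cosine(q, Counter(text.split())) > threshold
    )

def rank_top(query, documents, n):
    """Ranking: names of the n most similar documents, most similar first."""
    q = Counter(query.split())
    scored = [(cosine(q, Counter(text.split())), name)
              for name, text in documents.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:n]]

docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

print(retrieve_above("ant dog", docs, 0.50))  # ['d1', 'd2']
print(rank_top("ant dog", docs, 2))           # ['d2', 'd1']
```

With this query, d3 shares only one term and scores about 0.32, so it falls below the 0.50 threshold.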

  11. Contrast with Boolean Searching
      With Boolean retrieval, a document either matches a query exactly or not at all
      • Encourages short queries
      • Requires precise choice of index terms
      • Requires precise formulation of queries (professional training)
      With retrieval using similarity measures, similarities range from 0 to 1 for all documents
      • Encourages long queries to have as many dimensions as possible
      • Benefits from large numbers of index terms
      • Benefits from queries with many terms, not all of which need match the document

  12. Document Vectors as Points on a Surface
      • Normalize all document vectors to be of length 1
      • Then the ends of the vectors all lie on a surface with unit radius
      • For similar documents, we can represent parts of this surface as a flat region
      • Similar documents are represented as points that are close together on this surface
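The normalization step can be illustrated as follows. This is a minimal sketch; the `normalize` helper is a hypothetical name, and the two vectors are the term-count vectors of d1 and d2 over the terms ant, bee, cat, dog, eel, fox, gnu, hog.

```python
import math

def normalize(x):
    """Scale x to unit length; a zero vector cannot be normalized."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in x]

d1 = [2, 1, 0, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 4, 0, 0, 0, 1]

u1 = normalize(d1)
u2 = normalize(d2)

# Both normalized vectors have length 1, so they end on the unit surface.
print(math.sqrt(sum(v * v for v in u1)))
print(math.sqrt(sum(v * v for v in u2)))

# For unit vectors the dot product alone gives the cosine of the angle.
cosine = sum(a * b for a, b in zip(u1, u2))
print(round(cosine, 2))
```

One practical consequence is that, once vectors are normalized, similarity needs only a dot product, with no further division by the lengths.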

  13. Results of a Search
      [Figure: a query point surrounded by the hits from the search; x marks documents found by the search.]

  14. Relevance Feedback (Concept)
      [Figure: hits from the original search, with x marking documents identified as non-relevant and o marking documents identified as relevant; the original query and the reformulated query are shown as separate points.]

  15. Document Clustering (Concept)
      [Figure: a scatter of documents, each marked x.]
      Document clusters are a form of automatic classification. A document may be in several clusters.

  16. Term Weighting
      Zipf's Law: If the words, w, in a collection are ranked, r(w), by their frequency, f(w), they roughly fit the relation:
        r(w) * f(w) = c
      where c is a constant for the collection.
      This suggests that some terms are more effective than others in retrieval.
      In particular, relative frequency is a useful measure that identifies terms that occur with substantial frequency in some documents, but with relatively low overall collection frequency.
      Term weights are functions that are used to quantify these concepts.
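The quantities in Zipf's Law are easy to compute for any token list. The sketch below ranks words by frequency and prints r(w) * f(w); the helper `rank_frequency` is a hypothetical name, ties are broken alphabetically as an arbitrary assumption, and the 15-word toy collection is far too small for the product to look constant.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (word, rank, frequency, rank * frequency), most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, rank, freq, rank * freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# All words from the three example documents, concatenated.
tokens = "ant ant bee dog bee dog hog dog ant dog cat gnu dog eel fox".split()

for word, rank, freq, product in rank_frequency(tokens):
    print(f"{word:4} r={rank} f={freq} r*f={product}")
```

On a large text collection, the same function would be expected to show the product staying roughly within the same order of magnitude across ranks.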

  17. Categories of Weighting
      Term Frequency: A term that appears many times within a document is likely to be a better discriminator than a term that appears only once.
      Document Frequency: A term that appears in many documents is likely to be a less good discriminator than one that appears in few documents.
      Document Length: A term that appears in a short document is likely to be a better discriminator than one that appears in a long document.

  18. Term Weighting: Term Frequency
      Similarity calculated from an incidence matrix, without weighting, measures the occurrences of terms, but no other characteristics of the documents.
      Definition: The term frequency is the number of times that a term occurs in a document.
      Notation: tf
      A frequency matrix weights each term by the number of times that it occurs in a document.
      Similarity calculated from a frequency matrix is likely to provide more information about a document than similarity calculated without weights.

  19. Frequency Matrix (Weighting by Term Frequency)
      terms in d1 -> ant ant bee
      terms in d2 -> dog bee dog hog dog ant dog
      terms in d3 -> cat gnu dog eel fox

      terms  ant  bee  cat  dog  eel  fox  gnu  hog  length
      d1      2    1                                  √5
      d2      1    1         4                   1    √19
      d3                1    1    1    1    1         √5

      Weights: tij = frequency that term j occurs in document i

  20. Example 2 (continued)
      Similarity of documents in example:

            d1    d2    d3
      d1    1     0.31  0
      d2    0.31  1     0.41
      d3    0     0.41  1

      Similarity depends upon the weights given to the terms.
      [Note differences in results from Example 1 with no weighting.]
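The figures in Example 2 can be checked the same way, this time with term-frequency weights. The following is a sketch with illustrative names, again assuming whitespace tokenization.

```python
import math
from collections import Counter

docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}
terms = sorted({term for text in docs.values() for term in text.split()})

# Term-frequency weights: the number of times each term occurs in the document.
frequency = {}
for name, text in docs.items():
    counts = Counter(text.split())
    frequency[name] = [counts[term] for term in terms]

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

names = list(docs)
similarity = [[cosine(frequency[a], frequency[b]) for b in names] for a in names]

for name, row in zip(names, similarity):
    print(name, [round(value, 2) for value in row])
```

Compared with the unweighted case, d2 moves closer to d3 (0.22 to 0.41) because its four occurrences of dog now dominate its vector, and further from d1 (0.71 to 0.31).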
