
Term weighting and vector representation of text


Presentation Transcript


  1. Term weighting and vector representation of text Lecture 3

  2. Last lecture • Tokenization • Token, type, term distinctions • Case normalization • Stemming and lemmatization • Stopwords • 30 most common words account for 30% of the tokens in written text

  3. Last lecture: empirical laws • Heaps’ law: estimating the size of the vocabulary • Vocabulary size grows linearly with the number of tokens in log-log space • Zipf’s law: the frequency distribution of terms
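As a quick illustration of Heaps’ law from last lecture, a minimal sketch in Python; the parameter values k = 44 and b = 0.49 are collection-dependent constants assumed here for illustration, not values given in the lecture:

    # Heaps' law: vocabulary size V = k * T^b, where T is the number of tokens.
    # In log-log space this is a straight line: log V = log k + b * log T.
    def heaps_vocabulary_size(num_tokens, k=44.0, b=0.49):
        return k * num_tokens ** b

    print(round(heaps_vocabulary_size(1_000_000)))  # roughly 38,000 distinct terms for a million tokens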

  4. This class • Term weighting • Tf.idf • Vector representations of text • Computing similarity • Relevance • Redundancy identification

  5. Term frequency and weighting • A word that appears often in a document is probably very descriptive of what the document is about • Assign to each term in a document a weight that depends on the number of occurrences of that term in the document • Term frequency (tf) • Assign the weight to be equal to the number of occurrences of term t in document d
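A minimal sketch of raw term-frequency weighting as described above; the lowercase whitespace tokenizer is a simplifying assumption for illustration:

    from collections import Counter

    def term_frequencies(document):
        # tf(t, d) = number of occurrences of term t in document d
        tokens = document.lower().split()
        return Counter(tokens)

    print(term_frequencies("the car dealer sold the car"))
    # Counter({'the': 2, 'car': 2, 'dealer': 1, 'sold': 1})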

  6. Bag of words model • A document can now be viewed as the collection of terms in it and their associated weight • Mary is smarter than John • John is smarter than Mary • Equivalent in the bag of words model
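To make the equivalence concrete, a small check (using the same toy whitespace tokenizer as above) that the two example sentences receive identical bag-of-words representations:

    from collections import Counter

    bag1 = Counter("mary is smarter than john".split())
    bag2 = Counter("john is smarter than mary".split())
    print(bag1 == bag2)  # True: word order is discarded in the bag of words model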

  7. Problems with term frequency • Stop words are semantically vacuous • Example: in a collection about the auto industry, “auto” or “car” would not be at all indicative of what a particular document/sentence is about • We need a mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance/meaning determination

  8. Scale down the term weight of terms with high collection frequency • Reduce the tf weight of a term by a factor that grows with its collection frequency • For this purpose it is more common to use document frequency (the number of documents in the collection that contain the term)

  9. Inverse document frequency • N = number of documents in the collection • df[t] = number of documents that contain term t • idf[t] = log(N / df[t]) (natural logarithm in the examples below) • N = 1000; df[the] = 1000; idf[the] = 0 • N = 1000; df[some] = 100; idf[some] = 2.3 • N = 1000; df[car] = 10; idf[car] = 4.6 • N = 1000; df[merger] = 1; idf[merger] = 6.9
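The idf values on this slide can be reproduced with a natural logarithm; a short sketch using the document frequencies given above:

    import math

    def idf(N, df):
        # idf(t) = log(N / df(t)), natural logarithm
        return math.log(N / df)

    N = 1000
    for term, df in [("the", 1000), ("some", 100), ("car", 10), ("merger", 1)]:
        print(term, round(idf(N, df), 1))
    # the 0.0, some 2.3, car 4.6, merger 6.9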

  10. tf.idf weighting • tf.idf[t,d] = tf[t,d] × idf[t] • Highest when t occurs many times within a small number of documents • Thus lending high discriminating power to those documents • Lower when the term occurs fewer times in a document, or occurs in many documents • Thus offering a less pronounced relevance signal • Lowest when the term occurs in virtually all documents
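A minimal sketch of the combined tf.idf weight, assuming the raw-count tf and the log-based idf shown earlier:

    import math

    def tf_idf(tf, df, N):
        # tf.idf(t, d) = tf(t, d) * log(N / df(t))
        return tf * math.log(N / df)

    # A term occurring 3 times in a document and appearing in 10 of 1000 documents:
    print(round(tf_idf(3, 10, 1000), 1))  # 13.8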

  11. Document vector space representation • Each document is viewed as a vector with one component corresponding to each term in the dictionary • The value of each component is the tf-idf score for that word • For dictionary terms that do not occur in the document, the weights are 0
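A sketch of the vector representation described above; the dictionary, document frequencies, and tokenizer are toy assumptions for illustration:

    import math
    from collections import Counter

    dictionary = ["auto", "car", "insurance", "best"]
    df = {"auto": 10, "car": 20, "insurance": 5, "best": 500}  # toy document frequencies
    N = 1000

    def doc_vector(document):
        tf = Counter(document.lower().split())
        # One component per dictionary term; terms absent from the document get weight 0
        return [tf[t] * math.log(N / df[t]) for t in dictionary]

    print(doc_vector("car insurance best car insurance"))  # the 'auto' component is 0.0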

  12. How do we quantify the similarity between documents in vector space? • Magnitude of the vector difference • Problematic when the lengths of the documents differ

  13. Cosine similarity • sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|) • The numerator is the dot product of the two document vectors; the denominator is the product of their Euclidean lengths

  14. The effect of the denominator is thus to length-normalize the initial document representation vectors to unit vectors • So cosine similarity is the dot product of the normalized versions of the two documents
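A sketch of cosine similarity as the dot product of length-normalized vectors, matching the description above:

    import math

    def cosine_similarity(v1, v2):
        # Length-normalize both vectors, then take the dot product
        norm1 = math.sqrt(sum(x * x for x in v1))
        norm2 = math.sqrt(sum(x * x for x in v2))
        if norm1 == 0 or norm2 == 0:
            return 0.0
        return sum(a * b for a, b in zip(v1, v2)) / (norm1 * norm2)

    print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0: same direction, different lengths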

  15. Tf modifications • It is unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence

  16. Maximum tf normalization • ntf[t,d] = a + (1 - a) × tf[t,d] / tf_max(d), where tf_max(d) is the largest term frequency in d and a is a smoothing constant (commonly around 0.4)
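A sketch of maximum tf normalization under the formulation above; the smoothing value a = 0.4 is a conventional choice assumed here, and the toy document and tokenizer are for illustration only:

    from collections import Counter

    def max_tf_normalized(document, a=0.4):
        # ntf(t, d) = a + (1 - a) * tf(t, d) / tf_max(d)
        tf = Counter(document.lower().split())
        tf_max = max(tf.values())
        return {t: a + (1 - a) * count / tf_max for t, count in tf.items()}

    print(max_tf_normalized("the car the car the dealer"))
    # 'the' has the maximum tf, so its normalized weight is 1.0; 'car' is about 0.8, 'dealer' about 0.6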

  17. Problems with the normalization • A change in the stop word list can dramatically alter term weightings, since removing or adding stop words can change the maximum term frequency • A document may contain an outlier term, one with an unusually large number of occurrences that is not representative of the document's content
