Term weighting and vector representation of text • Lecture 3
Last lecture • Tokenization • Token, type, term distinctions • Case normalization • Stemming and lemmatization • Stopwords • 30 most common words account for 30% of the tokens in written text
Last lecture: empirical laws • Heaps' law: estimating the size of the vocabulary • Linear in the number of tokens in log-log space • Zipf's law: frequency distribution of terms
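As a refresher, both laws can be written compactly. A minimal sketch in LaTeX, using the usual symbols from the IR literature (M for vocabulary size, T for number of tokens, cf_i for the collection frequency of the i-th most frequent term; the constants k and b are corpus-dependent, and the quoted ranges are typical reported values, not figures from this lecture):

```latex
% Heaps' law: vocabulary size grows sublinearly with collection size;
% taking logs gives \log M = \log k + b \log T, a straight line in log-log space
M = k\,T^{b}, \qquad \text{typically } 30 \le k \le 100,\; b \approx 0.5

% Zipf's law: the collection frequency of the i-th most frequent term
% is roughly proportional to 1/i
\mathrm{cf}_i \propto \frac{1}{i}
```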
This class • Term weighting • Tf.idf • Vector representations of text • Computing similarity • Relevance • Redundancy identification
Term frequency and weighting • A word that appears often in a document is probably very descriptive of what the document is about • Assign to each term in a document a weight that depends on the number of occurrences of that term in the document • Term frequency (tf) • Assign the weight to be equal to the number of occurrences of term t in document d
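A minimal sketch of raw term frequency in Python, assuming tokenization has already been done as in last lecture (the example document is made up for illustration):

```python
from collections import Counter

# Raw term frequency: tf[t] = number of occurrences of term t in the document
tokens = "the car dealer sold the car to the auto collector".split()
tf = Counter(tokens)

print(tf["the"], tf["car"], tf["auto"])  # 3 2 1
```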
Bag of words model • A document can now be viewed as the collection of terms in it and their associated weight • Mary is smarter than John • John is smarter than Mary • Equivalent in the bag of words model
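Continuing the sketch above, the two example sentences produce identical bags of words (assuming lowercasing as part of case normalization), so the model cannot distinguish them:

```python
from collections import Counter

# Word order is discarded; only the terms and their counts remain
bag1 = Counter("mary is smarter than john".split())
bag2 = Counter("john is smarter than mary".split())

print(bag1 == bag2)  # True
```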
Problems with term frequency • Stop words • Semantically vacuous • Example: in a collection of documents about the auto industry, "auto" or "car" would not be at all indicative of what a particular document/sentence is about • Need a mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance/meaning determination
Scale down the term weight of terms with high collection frequency • Reduce the tf weight of a term by a factor that grows with the term's collection frequency • More commonly used for this purpose is document frequency (the number of documents in the collection that contain the term)
Inverse document frequency • N = number of documents in the collection • idf_t = log(N / df_t) • N = 1000; df[the] = 1000; idf[the] = 0 • N = 1000; df[some] = 100; idf[some] = 2.3 • N = 1000; df[car] = 10; idf[car] = 4.6 • N = 1000; df[merger] = 1; idf[merger] = 6.9
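A small sketch reproducing the numbers above; it assumes the natural logarithm, which is what makes idf[some] come out to roughly 2.3:

```python
import math

N = 1000  # number of documents in the collection

# Document frequencies taken from the examples above
df = {"the": 1000, "some": 100, "car": 10, "merger": 1}

# idf_t = log(N / df_t): ubiquitous terms get ~0, rare terms get high weights
idf = {t: math.log(N / d) for t, d in df.items()}

for t, w in idf.items():
    print(f"idf[{t}] = {w:.1f}")
# idf[the] = 0.0, idf[some] = 2.3, idf[car] = 4.6, idf[merger] = 6.9
```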
tf.idf weighting • tf-idf_{t,d} = tf_{t,d} × idf_t • Highest when t occurs many times within a small number of documents • Thus lending high discriminating power to those documents • Lower when the term occurs fewer times in a document, or occurs in many documents • Thus offering a less pronounced relevance signal • Lowest when the term occurs in virtually all documents
Document vector space representation • Each document is viewed as a vector with one component corresponding to each term in the dictionary • The value of each component is the tf-idf score for that word • For dictionary terms that do not occur in the document, the weights are 0
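A sketch tying the pieces together: tf-idf vectors over a shared dictionary, with zeros for dictionary terms absent from a document. The toy collection is made up for illustration:

```python
import math
from collections import Counter

docs = ["the car is fast",
        "the merger was announced",
        "the car dealer announced a merger"]
tokenized = [d.split() for d in docs]

# Dictionary: every term that occurs anywhere in the collection
dictionary = sorted({t for doc in tokenized for t in doc})

N = len(docs)
df = {t: sum(1 for doc in tokenized if t in doc) for t in dictionary}
idf = {t: math.log(N / df[t]) for t in dictionary}  # "the" occurs in every doc, so idf = 0

def tfidf_vector(doc_tokens):
    """One component per dictionary term: tf * idf, 0 for terms not in the document."""
    tf = Counter(doc_tokens)
    return [tf[t] * idf[t] for t in dictionary]

vectors = [tfidf_vector(doc) for doc in tokenized]
```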
How do we quantify the similarity between documents in vector space? • A first idea: the magnitude of the vector difference between the two document vectors • Problematic when the lengths of the documents differ: two documents with very similar content can end up far apart simply because one is much longer than the other
Cosine similarity • sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|): the dot product of the two tf-idf vectors divided by the product of their Euclidean lengths • The effect of the denominator is thus to length-normalize the initial document representation vectors to unit vectors • So cosine similarity is the dot product of the normalized versions of the two document vectors
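A sketch of cosine similarity in plain Python, usable on the tf-idf vectors built in the earlier sketch:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths,
    i.e. the dot product of their unit-normalized versions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

On the toy collection above, cosine_similarity(vectors[0], vectors[2]) is positive because documents 1 and 3 share the term "car", while cosine_similarity(vectors[0], vectors[1]) is 0: the only shared term is "the", whose idf (and hence tf-idf weight) is 0.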
Tf modifications • It is unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence
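One common modification discussed in the IR literature (a sketch, not the only option) is sublinear tf scaling, which dampens the effect of repeated occurrences:

```python
import math

def sublinear_tf(tf):
    """wf = 1 + log(tf) for tf > 0, else 0: twenty occurrences weigh about
    4x a single occurrence rather than 20x."""
    return 1 + math.log(tf) if tf > 0 else 0
```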
Problems with the normalization • A change in the stop word list can dramatically alter term weightings • A document may contain an outlier term: a term with an unusually large number of occurrences that is not representative of the document's content