Term weighting and vector representation of text • Lecture 3
Last lecture • Tokenization • Token, type, term distinctions • Case normalization • Stemming and lemmatization • Stopwords • 30 most common words account for 30% of the tokens in written text
Last lecture: empirical laws • Heaps' law: estimating the size of the vocabulary • Linear in the number of tokens in log-log space • Zipf's law: frequency distribution of terms
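As a refresher, both laws can be written compactly. A minimal sketch in LaTeX, using the usual symbols from the IR literature (M for vocabulary size, T for number of tokens, cf_i for the collection frequency of the i-th most frequent term; the constants k and b are corpus-dependent, and the quoted ranges are typical reported values, not figures from this lecture):

```latex
% Heaps' law: vocabulary size grows sublinearly with collection size;
% taking logs gives \log M = \log k + b \log T, a straight line in log-log space
M = k\,T^{b}, \qquad \text{typically } 30 \le k \le 100,\; b \approx 0.5

% Zipf's law: the collection frequency of the i-th most frequent term
% is roughly proportional to 1/i
\mathrm{cf}_i \propto \frac{1}{i}
```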
This class • Term weighting • Tf.idf • Vector representations of text • Computing similarity • Relevance • Redundancy identification
Term frequency and weighting • A word that appears often in a document is probably very descriptive of what the document is about • Assign to each term in a document a weight that depends on the number of occurrences of that term in the document • Term frequency (tf) • Assign the weight to be equal to the number of occurrences of term t in document d
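A minimal sketch of raw term frequency in Python, assuming tokenization has already been done as in last lecture (the example document is made up for illustration):

```python
from collections import Counter

# Raw term frequency: tf[t] = number of occurrences of term t in the document
tokens = "the car dealer sold the car to the auto collector".split()
tf = Counter(tokens)

print(tf["the"], tf["car"], tf["auto"])  # 3 2 1
```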
Bag of words model • A document can now be viewed as the collection of terms in it and their associated weight • Mary is smarter than John • John is smarter than Mary • Equivalent in the bag of words model
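Continuing the sketch above, the two example sentences produce identical bags of words (assuming lowercasing as part of case normalization), so the model cannot distinguish them:

```python
from collections import Counter

# Word order is discarded; only the terms and their counts remain
bag1 = Counter("mary is smarter than john".split())
bag2 = Counter("john is smarter than mary".split())

print(bag1 == bag2)  # True
```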
Problems with term frequency • Stop words • Semantically vacuous • Example: in a collection of documents about the auto industry, "auto" or "car" would not be at all indicative of what a particular document/sentence is about • Need a mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance/meaning determination
Scale down the term weight of terms with high collection frequency • Reduce the tf weight of a term by a factor that grows with the term's collection frequency • More commonly used for this purpose is document frequency (the number of documents in the collection that contain the term)
Inverse document frequency • N = number of documents in the collection • idf_t = log(N / df_t) • N = 1000; df[the] = 1000; idf[the] = 0 • N = 1000; df[some] = 100; idf[some] = 2.3 • N = 1000; df[car] = 10; idf[car] = 4.6 • N = 1000; df[merger] = 1; idf[merger] = 6.9
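A small sketch reproducing the numbers above; it assumes the natural logarithm, which is what makes idf[some] come out to roughly 2.3:

```python
import math

N = 1000  # number of documents in the collection

# Document frequencies taken from the examples above
df = {"the": 1000, "some": 100, "car": 10, "merger": 1}

# idf_t = log(N / df_t): ubiquitous terms get ~0, rare terms get high weights
idf = {t: math.log(N / d) for t, d in df.items()}

for t, w in idf.items():
    print(f"idf[{t}] = {w:.1f}")
# idf[the] = 0.0, idf[some] = 2.3, idf[car] = 4.6, idf[merger] = 6.9
```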
tf.idf weighting • tf-idf_{t,d} = tf_{t,d} × idf_t • Highest when t occurs many times within a small number of documents • Thus lending high discriminating power to those documents • Lower when the term occurs fewer times in a document, or occurs in many documents • Thus offering a less pronounced relevance signal • Lowest when the term occurs in virtually all documents
Document vector space representation • Each document is viewed as a vector with one component corresponding to each term in the dictionary • The value of each component is the tf-idf score for that word • For dictionary terms that do not occur in the document, the weights are 0
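A sketch tying the pieces together: tf-idf vectors over a shared dictionary, with zeros for dictionary terms absent from a document. The toy collection is made up for illustration:

```python
import math
from collections import Counter

docs = ["the car is fast",
        "the merger was announced",
        "the car dealer announced a merger"]
tokenized = [d.split() for d in docs]

# Dictionary: every term that occurs anywhere in the collection
dictionary = sorted({t for doc in tokenized for t in doc})

N = len(docs)
df = {t: sum(1 for doc in tokenized if t in doc) for t in dictionary}
idf = {t: math.log(N / df[t]) for t in dictionary}  # "the" occurs in every doc, so idf = 0

def tfidf_vector(doc_tokens):
    """One component per dictionary term: tf * idf, 0 for terms not in the document."""
    tf = Counter(doc_tokens)
    return [tf[t] * idf[t] for t in dictionary]

vectors = [tfidf_vector(doc) for doc in tokenized]
```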
How do we quantify the similarity between documents in vector space? • A first idea: the magnitude of the vector difference between the two document vectors • Problematic when the lengths of the documents differ: two documents with very similar content can end up far apart simply because one is much longer than the other
Cosine similarity • sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|): the dot product of the two tf-idf vectors divided by the product of their Euclidean lengths • The effect of the denominator is thus to length-normalize the initial document representation vectors to unit vectors • So cosine similarity is the dot product of the normalized versions of the two document vectors
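A sketch of cosine similarity in plain Python, usable on the tf-idf vectors built in the earlier sketch:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths,
    i.e. the dot product of their unit-normalized versions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

On the toy collection above, cosine_similarity(vectors[0], vectors[2]) is positive because documents 1 and 3 share the term "car", while cosine_similarity(vectors[0], vectors[1]) is 0: the only shared term is "the", whose idf (and hence tf-idf weight) is 0.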
Tf modifications • It is unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence
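One common modification discussed in the IR literature (a sketch, not the only option) is sublinear tf scaling, which dampens the effect of repeated occurrences:

```python
import math

def sublinear_tf(tf):
    """wf = 1 + log(tf) for tf > 0, else 0: twenty occurrences weigh about
    4x a single occurrence rather than 20x."""
    return 1 + math.log(tf) if tf > 0 else 0
```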
Problems with the normalization • A change in the stop word list can dramatically alter term weightings • A document may contain an outlier term: a term with an unusually large number of occurrences that is not representative of the document's content