Learn how to compare and rank documents by relevance, utilizing keyword analysis, term frequency, and more. Explore methods such as TF*IDF and Vector Space for effective document retrieval and ranking strategies.
Comparing and Ranking Documents • Once our search engine has retrieved a set of documents, we may want to • Rank them by relevance • Which are the best fit to my query? • This involves determining what the query is about and how well the document answers it • Compare them • Show me more like this. • This involves determining what the document is about.
Determining Relevance by Keyword • The typical web query consists entirely of keywords. • Retrieval can be binary: present or absent • A more sophisticated approach is to look for degree of relatedness: how much does this document reflect what the query is about? • Simple strategies: • How many times does the word occur in the document? • How close is it to the head of the document? • If there are multiple keywords, how close together are they?
Keywords for Relevance Ranking • Count: repetition is an indication of emphasis • Very fast (usually in the index) • Reasonable heuristic • Unduly influenced by document length • Can be "stuffed" by web designers • Position: lead paragraphs summarize content • Requires more computation • Also a reasonable heuristic • Less influenced by document length • Harder to "stuff": can only place a few keywords near the beginning
Keywords for Relevance Ranking • Proximity for multiple keywords • Requires even more computation • Obviously relevant only if the query has multiple keywords • Effectiveness of the heuristic varies with the information need; typically either excellent or not very helpful at all • Very hard to "stuff" • All keyword methods • Are computationally simple and adequately fast • Are effective heuristics • Typically perform as well as in-depth natural language methods for standard search
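The three keyword heuristics above (count, position, proximity) can be combined into a single toy scoring function. This is a minimal sketch; the function name, weights, and combining formula are illustrative assumptions, not taken from the slides.

```python
import re

def keyword_score(query_terms, text):
    """Toy relevance score combining count, position, and proximity.

    The weighting scheme here is illustrative only.
    """
    words = re.findall(r"\w+", text.lower())
    positions = {}                      # term -> list of word positions
    for i, w in enumerate(words):
        if w in query_terms:
            positions.setdefault(w, []).append(i)
    if not positions:
        return 0.0

    count = sum(len(p) for p in positions.values())   # repetition as emphasis
    # Position heuristic: an earlier first occurrence scores higher.
    first = min(p[0] for p in positions.values())
    position_bonus = 1.0 / (1 + first)
    # Proximity heuristic: only meaningful with multiple matched keywords.
    if len(positions) > 1:
        firsts = [p[0] for p in positions.values()]
        proximity_bonus = 1.0 / (1 + max(firsts) - min(firsts))
    else:
        proximity_bonus = 0.0
    return count + position_bonus + proximity_bonus

doc = "Dog bites man. The dog was later found."
print(keyword_score({"dog", "bites"}, doc))   # matching terms score > 0
print(keyword_score({"cat"}, doc))            # no match scores 0
```

Note how count dominates the score while position and proximity act as tie-breakers, mirroring the slides' point that count is the cheapest but most stuffable signal.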
Comparing Documents • "Find me more like this one" really means that we are using the document as a query. • This requires that we have some conception of what a document is about overall. • Depends on context of query. We need to • Characterize the entire content of this document • Discriminate between this document and others in the corpus
Comparing Documents (cont.) • Two very general approaches: • Statistical • Semantic • We will discuss semantic approaches more in text mining • The statistical approach still focuses on keywords: • To what extent does each term characterize this document? • To what extent does each term discriminate this document from other documents?
Characterizing a Document: Term Frequency • A document can be treated as a sequence of words. • Each word characterizes that document to some extent. • Once we have eliminated stop words, the most frequent words tend to be what the document is about • Therefore: f_kd (the number of occurrences of word k in document d) will be an important measure. • Also called the term frequency
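Computing f_kd is a simple counting exercise once stop words are removed. A minimal sketch, where the tiny stop-word list is an illustrative assumption:

```python
from collections import Counter

# Illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was"}

def term_frequencies(text):
    """f_kd: number of occurrences of each word k in document d,
    after stop-word elimination."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)

tf = term_frequencies("the cat sat in the hat and the cat slept")
print(tf.most_common(2))   # the most frequent content word comes first
```

With the stop words gone, "cat" surfaces as the most frequent term, matching the slides' claim that frequent content words indicate what the document is about.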
Characterizing a Document: Document Frequency • What makes this document distinct from others in the corpus? • The terms which discriminate best are not those which occur in a large number of documents! • Therefore: D_k (the number of documents in which word k occurs) will also be an important measure. • Also called the document frequency
TF*IDF • This can all be summarized as: words are the best discriminators when they • occur often in this document (high term frequency) • don't occur in a lot of documents (low document frequency) • One very common measure of the importance of a word to a document is TF*IDF: term frequency * inverse document frequency • There are multiple formulas for actually computing it; the book gives Robertson and Jones's. The underlying concept is the same in all of them.
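As the slide says, there are several TF*IDF formulas. The sketch below uses one common variant, f_kd * log(N / D_k), which is an assumption on my part and not necessarily the Robertson and Jones weighting the book gives:

```python
import math

def tf_idf(term, doc_words, corpus):
    """One common TF*IDF variant: f_kd * log(N / D_k).

    doc_words: list of words in document d; corpus: list of such lists.
    (Illustrative formula; the book's Robertson-Jones weighting differs in detail.)
    """
    tf = doc_words.count(term)                    # f_kd: term frequency
    df = sum(1 for d in corpus if term in d)      # D_k: document frequency
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [
    "the dog chased the cat".split(),
    "the cat sat on the mat".split(),
    "stock prices fell sharply".split(),
]
print(tf_idf("cat", corpus[0], corpus))   # in 2 of 3 docs: lower weight
print(tf_idf("dog", corpus[0], corpus))   # in 1 of 3 docs: higher weight
```

"dog" outweighs "cat" for the first document even though both occur once, because "cat" also appears in a second document and so discriminates less.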
Describing an Entire Document • So what is a document about? • TF*IDF: can simply list keywords in order of their TF*IDF values • Document is about all of them to some degree: it is at some point in some vector space of meaning
Vector Space • Any corpus has a defined set of terms (the index) • These terms define a knowledge space • Every document is somewhere in that knowledge space: it is or is not about each of those terms. • Consider each term as a vector. Then • We have an n-dimensional vector space • Where n is the number of terms (very large!) • Each document is a point in that vector space • The document's position in this vector space can be treated as what the document is about.
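Placing documents in the term vector space amounts to building one count per vocabulary term. A minimal sketch with a toy corpus (real spaces have a very large n, as the slide notes):

```python
from collections import Counter

def doc_vector(doc_words, vocabulary):
    """Represent a document as a point in the n-dimensional term space:
    one coordinate (here, a raw count) per term in the index."""
    counts = Counter(doc_words)
    return [counts[t] for t in vocabulary]

docs = [
    "dog chased cat".split(),
    "cat sat mat".split(),
    "stocks fell".split(),
]
vocab = sorted({t for d in docs for t in d})   # the index defines the axes
vectors = [doc_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each vector has one coordinate per index term; in practice the coordinates would be TF*IDF weights rather than raw counts, and the vectors would be stored sparsely since most entries are zero.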
Similarity Between Documents • How similar are two documents? • Measures of association • How much do the feature sets overlap? • Simple matching coefficient: size of the overlap; can also take exclusions (terms absent from both) into account • Dice coefficient: overlap modified for length, comparing the intersection to the total number of terms in the two sets • Cosine similarity • Similarity of the angle between the two document vectors • Not sensitive to vector length
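Both families of measures are short formulas. A sketch of the Dice coefficient over term sets and cosine similarity over document vectors (function names are my own):

```python
import math

def dice_coefficient(terms_a, terms_b):
    """2 * |A intersect B| / (|A| + |B|): overlap normalized by the
    sizes of both term sets, so longer documents are not favored."""
    if not terms_a and not terms_b:
        return 0.0
    return 2 * len(terms_a & terms_b) / (len(terms_a) + len(terms_b))

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors; because it
    normalizes by vector magnitude, it is insensitive to length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

a = {"dog", "bites", "man"}
b = {"man", "bites", "dog", "often"}
print(dice_coefficient(a, b))                 # 2*3 / (3+4) ~ 0.857
print(cosine_similarity([1, 0, 1], [1, 1, 1]))
```

Doubling every count in one vector leaves the cosine unchanged, which is exactly the length-insensitivity the slide claims.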
Bag of Words • All of these techniques are what are known as bag-of-words approaches. • Keywords are treated in isolation • The difference between "man bites dog" and "dog bites man" is non-existent • In text mining we will discuss linguistic approaches which pay attention to semantics
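The "man bites dog" point can be demonstrated in a couple of lines: once word order is discarded, the two sentences produce identical bags.

```python
from collections import Counter

def bag_of_words(text):
    """Order is discarded: only which words occur, and how often."""
    return Counter(text.lower().split())

# The two sentences are indistinguishable as bags of words.
print(bag_of_words("man bites dog") == bag_of_words("dog bites man"))  # True
```

Any measure built on these bags (term frequency, TF*IDF, cosine similarity) therefore judges the two sentences identical, which is why semantics-aware approaches are needed.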