Learn how to compare and rank documents by relevance, utilizing keyword analysis, term frequency, and more. Explore methods such as TF*IDF and Vector Space for effective document retrieval and ranking strategies.
Comparing and Ranking Documents • Once our search engine has retrieved a set of documents, we may want to • Rank them by relevance • Which are the best fit to my query? • This involves determining what the query is about and how well the document answers it • Compare them • Show me more like this. • This involves determining what the document is about.
Determining Relevance by Keyword • The typical web query consists entirely of keywords. • Retrieval can be binary: present or absent • A more sophisticated approach is to look for degree of relatedness: how much does this document reflect what the query is about? • Simple strategies: • How many times does the word occur in the document? • How close is it to the head of the document? • If there are multiple keywords, how close together are they?
Keywords for Relevance Ranking • Count: repetition is an indication of emphasis • Very fast (usually in the index) • Reasonable heuristic • Unduly influenced by document length • Can be "stuffed" by web designers • Position: lead paragraphs summarize content • Requires more computation • Also a reasonable heuristic • Less influenced by document length • Harder to "stuff": can only place a few keywords near the beginning
Keywords for Relevance Ranking • Proximity for multiple keywords • Requires even more computation • Obviously relevant only if the query has multiple keywords • Effectiveness of the heuristic varies with the information need; typically either excellent or not very helpful at all • Very hard to "stuff" • All keyword methods • Are computationally simple and adequately fast • Are effective heuristics • Typically perform as well as in-depth natural language methods for standard search
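The three keyword heuristics above (count, position, proximity) can be combined into a single toy scoring function. This is a minimal sketch; the function name, weights, and combining formula are illustrative assumptions, not taken from the slides.

```python
import re

def keyword_score(query_terms, text):
    """Toy relevance score combining count, position, and proximity.

    The weighting scheme here is illustrative only.
    """
    words = re.findall(r"\w+", text.lower())
    positions = {}                      # term -> list of word positions
    for i, w in enumerate(words):
        if w in query_terms:
            positions.setdefault(w, []).append(i)
    if not positions:
        return 0.0

    count = sum(len(p) for p in positions.values())   # repetition as emphasis
    # Position heuristic: an earlier first occurrence scores higher.
    first = min(p[0] for p in positions.values())
    position_bonus = 1.0 / (1 + first)
    # Proximity heuristic: only meaningful with multiple matched keywords.
    if len(positions) > 1:
        firsts = [p[0] for p in positions.values()]
        proximity_bonus = 1.0 / (1 + max(firsts) - min(firsts))
    else:
        proximity_bonus = 0.0
    return count + position_bonus + proximity_bonus

doc = "Dog bites man. The dog was later found."
print(keyword_score({"dog", "bites"}, doc))   # matching terms score > 0
print(keyword_score({"cat"}, doc))            # no match scores 0
```

Note how count dominates the score while position and proximity act as tie-breakers, mirroring the slides' point that count is the cheapest but most stuffable signal.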
Comparing Documents • "Find me more like this one" really means that we are using the document as a query. • This requires that we have some conception of what a document is about overall. • Depends on context of query. We need to • Characterize the entire content of this document • Discriminate between this document and others in the corpus
Comparing Documents (cont.) • Two very general approaches: • Statistical • Semantic • We will discuss semantic approaches more in text mining • The statistical approach still focuses on keywords: • To what extent does each term characterize this document? • To what extent does each term discriminate this document from other documents?
Characterizing a Document: Term Frequency • A document can be treated as a sequence of words. • Each word characterizes that document to some extent. • Once we have eliminated stop words, the most frequent words tend to be what the document is about • Therefore: f_kd (the number of occurrences of word k in document d) will be an important measure. • Also called the term frequency
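Computing f_kd is a simple counting exercise once stop words are removed. A minimal sketch, where the tiny stop-word list is an illustrative assumption:

```python
from collections import Counter

# Illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was"}

def term_frequencies(text):
    """f_kd: number of occurrences of each word k in document d,
    after stop-word elimination."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)

tf = term_frequencies("the cat sat in the hat and the cat slept")
print(tf.most_common(2))   # the most frequent content word comes first
```

With the stop words gone, "cat" surfaces as the most frequent term, matching the slides' claim that frequent content words indicate what the document is about.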
Characterizing a Document: Document Frequency • What makes this document distinct from others in the corpus? • The terms which discriminate best are not those which occur in a large number of documents! • Therefore: D_k (the number of documents in which word k occurs) will also be an important measure. • Also called the document frequency
TF*IDF • This can all be summarized as: words are the best discriminators when they • occur often in this document (high term frequency) • don't occur in a lot of documents (low document frequency) • One very common measure of the importance of a word to a document is TF*IDF: term frequency * inverse document frequency • There are multiple formulas for actually computing it; the book gives Robertson and Jones's. The underlying concept is the same in all of them.
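As the slide says, there are several TF*IDF formulas. The sketch below uses one common variant, f_kd * log(N / D_k), which is an assumption on my part and not necessarily the Robertson and Jones weighting the book gives:

```python
import math

def tf_idf(term, doc_words, corpus):
    """One common TF*IDF variant: f_kd * log(N / D_k).

    doc_words: list of words in document d; corpus: list of such lists.
    (Illustrative formula; the book's Robertson-Jones weighting differs in detail.)
    """
    tf = doc_words.count(term)                    # f_kd: term frequency
    df = sum(1 for d in corpus if term in d)      # D_k: document frequency
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [
    "the dog chased the cat".split(),
    "the cat sat on the mat".split(),
    "stock prices fell sharply".split(),
]
print(tf_idf("cat", corpus[0], corpus))   # in 2 of 3 docs: lower weight
print(tf_idf("dog", corpus[0], corpus))   # in 1 of 3 docs: higher weight
```

"dog" outweighs "cat" for the first document even though both occur once, because "cat" also appears in a second document and so discriminates less.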
Describing an Entire Document • So what is a document about? • TF*IDF: can simply list keywords in order of their TF*IDF values • Document is about all of them to some degree: it is at some point in some vector space of meaning
Vector Space • Any corpus has a defined set of terms (the index) • These terms define a knowledge space • Every document is somewhere in that knowledge space: it is or is not about each of those terms. • Consider each term as a vector. Then • We have an n-dimensional vector space • Where n is the number of terms (very large!) • Each document is a point in that vector space • The document's position in this vector space can be treated as what the document is about.
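Placing documents in the term vector space amounts to building one count per vocabulary term. A minimal sketch with a toy corpus (real spaces have a very large n, as the slide notes):

```python
from collections import Counter

def doc_vector(doc_words, vocabulary):
    """Represent a document as a point in the n-dimensional term space:
    one coordinate (here, a raw count) per term in the index."""
    counts = Counter(doc_words)
    return [counts[t] for t in vocabulary]

docs = [
    "dog chased cat".split(),
    "cat sat mat".split(),
    "stocks fell".split(),
]
vocab = sorted({t for d in docs for t in d})   # the index defines the axes
vectors = [doc_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each vector has one coordinate per index term; in practice the coordinates would be TF*IDF weights rather than raw counts, and the vectors would be stored sparsely since most entries are zero.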
Similarity Between Documents • How similar are two documents? • Measures of association • How much do the feature sets overlap? • Simple matching coefficient: size of the overlap; can also take exclusions (terms absent from both) into account • Dice coefficient: overlap modified for length, comparing the intersection to the total number of terms in the two sets • Cosine similarity • Similarity of the angle between the two document vectors • Not sensitive to vector length
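Both families of measures are short formulas. A sketch of the Dice coefficient over term sets and cosine similarity over document vectors (function names are my own):

```python
import math

def dice_coefficient(terms_a, terms_b):
    """2 * |A intersect B| / (|A| + |B|): overlap normalized by the
    sizes of both term sets, so longer documents are not favored."""
    if not terms_a and not terms_b:
        return 0.0
    return 2 * len(terms_a & terms_b) / (len(terms_a) + len(terms_b))

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors; because it
    normalizes by vector magnitude, it is insensitive to length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

a = {"dog", "bites", "man"}
b = {"man", "bites", "dog", "often"}
print(dice_coefficient(a, b))                 # 2*3 / (3+4) ~ 0.857
print(cosine_similarity([1, 0, 1], [1, 1, 1]))
```

Doubling every count in one vector leaves the cosine unchanged, which is exactly the length-insensitivity the slide claims.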
Bag of Words • All of these techniques are what are known as bag-of-words approaches. • Keywords are treated in isolation • The difference between "man bites dog" and "dog bites man" is non-existent • In text mining we will discuss linguistic approaches which pay attention to semantics
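The "man bites dog" point can be demonstrated in a couple of lines: once word order is discarded, the two sentences produce identical bags.

```python
from collections import Counter

def bag_of_words(text):
    """Order is discarded: only which words occur, and how often."""
    return Counter(text.lower().split())

# The two sentences are indistinguishable as bags of words.
print(bag_of_words("man bites dog") == bag_of_words("dog bites man"))  # True
```

Any measure built on these bags (term frequency, TF*IDF, cosine similarity) therefore judges the two sentences identical, which is why semantics-aware approaches are needed.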