Information Retrieval

Information Retrieval Yan Huang - CSCI5330 Database Implementation –Information Retrieval

Information Retrieval Systems
[Figure: diagram with the labels "keyword query", "Document", "IR System", and "document"]

Keyword Search
• In full-text retrieval, all the words in each document are considered to be keywords.
• We use the word "term" to refer to the words in a document.
• Ranking documents by their estimated relevance to a query is critical.

Similarity Based Retrieval
• Similarity based retrieval: retrieve documents similar to a given document
• Similarity can be used to refine the answer set of a keyword query
• User selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these

Similarity Measures
• A similarity measure is a function that computes the degree of similarity between two vectors.
• Using a similarity measure between the query and each document:
  • It is possible to rank the retrieved documents in the order of presumed relevance.
  • It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

Relevance Ranking
• Relevance ranking is based on factors such as
  • Term frequency
    • Frequency of occurrence of the query keyword in the document
  • Inverse document frequency
    • How many documents the query keyword occurs in
    • Fewer documents → more importance is given to the keyword
  • Hyperlinks to documents
    • More links to a document → the document is more important

Relevance Ranking Using Terms (Cont.)
• Most systems add to the above model
  • Words that occur in title, author list, section headings, etc. are given greater importance
  • Words whose first occurrence is late in the document are given lower importance
  • Very common words such as “a”, “an”, “the”, “it”, etc. are eliminated
    • Called stop words
  • Proximity: if keywords in the query occur close together in the document, the document has higher importance than if they occur far apart
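Stop-word elimination can be sketched in a few lines. This is only an illustration: the stop-word list is the four examples from the slide (real lists are much longer), and the function name `index_terms` and the whitespace tokenizer are assumptions made for this sketch.

```python
# Hypothetical stop-word list: only the four examples named on the slide.
STOP_WORDS = {"a", "an", "the", "it"}

def index_terms(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(index_terms("The repair of a motorcycle"))
```

Note that "of" survives here only because it is not in the toy list; which words count as stop words is a choice made by each system.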

Vector Space Model
• Assume $t$ distinct terms remain after preprocessing; call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space. Dimension $= t = |\text{vocabulary}|$
• Each term $i$ in a document or query $j$ is given a real-valued weight $w_{ij}$.
• Both documents and queries are expressed as $t$-dimensional vectors: $d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$

Term Weights
• More frequent terms in a document are more important, i.e. more indicative of the topic.
  $f_{ij}$ = frequency of term $i$ in document $j$
• May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
  $tf_{ij} = f_{ij} / \max_i \{f_{ij}\}$
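As a small illustration of the normalization above, the following sketch computes $tf_{ij} = f_{ij} / \max_i \{f_{ij}\}$ for one document. The function name `normalized_tf` and the sample document are made up for this example.

```python
from collections import Counter

def normalized_tf(terms):
    """Return {term: f / max_f} for one document given as a list of terms."""
    counts = Counter(terms)
    if not counts:
        return {}
    max_f = max(counts.values())  # frequency of the most common term
    return {term: f / max_f for term, f in counts.items()}

doc = ["database", "retrieval", "database", "text", "database"]
tf = normalized_tf(doc)
print(tf)
```

The most common term always gets weight 1.0, so documents of different lengths become comparable.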

Reverse Term Weights
• Terms that appear in many different documents are less indicative of overall topic.
  $df_i$ = document frequency of term $i$ = number of documents containing term $i$
  $idf_i$ = inverse document frequency of term $i$ $= \log_2(N / df_i)$ ($N$: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
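The definition $idf_i = \log_2(N / df_i)$ can be checked on a toy collection. The sketch below is illustrative only; the function name `idf` and the four sample documents are assumptions.

```python
import math

def idf(docs):
    """Return {term: log2(N / df)} for a collection given as lists of terms."""
    n = len(docs)
    df = {}
    for terms in docs:
        for term in set(terms):  # count each document at most once per term
            df[term] = df.get(term, 0) + 1
    return {term: math.log2(n / count) for term, count in df.items()}

docs = [["database", "text"], ["database", "retrieval"],
        ["database", "index"], ["database", "text"]]
weights = idf(docs)
print(weights)
```

A term that appears in every document ("database" here) gets idf 0, so it contributes nothing to discriminating between documents.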

TF-IDF Weighting
• A typical combined term importance indicator is tf-idf weighting:
  $w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \log_2(N / df_i)$
• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
• Many other ways of determining term weights have been proposed.
• Experimentally, tf-idf has been found to work well.
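A minimal sketch of the combined weight, assuming the max-normalized tf and the base-2 idf defined on these slides. The function name `tf_idf` and the sample collection are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: w} dict per document, with w = tf * log2(N / df)."""
    n = len(docs)
    # df: number of documents that contain each term
    df = Counter(term for terms in docs for term in set(terms))
    vectors = []
    for terms in docs:
        counts = Counter(terms)
        max_f = max(counts.values(), default=1)
        vectors.append({term: (f / max_f) * math.log2(n / df[term])
                        for term, f in counts.items()})
    return vectors

docs = [["pagerank", "pagerank", "web"], ["web", "crawler"],
        ["web", "index"], ["web", "database"]]
vectors = tf_idf(docs)
print(vectors[0])
```

In the first document "pagerank" is frequent locally and rare in the collection, so it gets a high weight, while "web" occurs in every document and gets weight 0.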

Inner Product Measure
• Similarity between vectors for the document $d_j$ and query $q$ can be computed as the vector inner product:
  $sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} \, w_{iq}$
  where $w_{ij}$ is the weight of term $i$ in document $j$ and $w_{iq}$ is the weight of term $i$ in the query
• For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.

Inner Product -- Examples
Binary:
• D = 1, 1, 1, 0, 1, 1, 0
• Q = 1, 0, 1, 0, 0, 1, 1
  sim(D, Q) = 3
• Vocabulary terms: architecture, management, information, computer, text, retrieval, database
• Size of vector = size of vocabulary = 7
• 0 means corresponding term not found in document or query
Weighted:
• D1 = 2T1 + 3T2 + 5T3
• D2 = 3T1 + 7T2 + 1T3
• Q = 0T1 + 0T2 + 2T3
  sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
  sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
Problems?
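The arithmetic in these examples can be reproduced with a short sketch. The function name `inner_product` is an assumption; the vectors are the ones given on the slide.

```python
def inner_product(d, q):
    """Sum of pairwise products of two equal-length weight vectors."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))  # 3 matched terms

# Weighted example
D1, D2, Q_weighted = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(inner_product(D1, Q_weighted), inner_product(D2, Q_weighted))  # 10 2
```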

t3 1 D1 Q 2 t1 t2 D2 Cosine Similarity Measure • Cosine similarity measures the cosine of the angle between two vectors. • Inner product normalized by the vector lengths. CosSim(dj, q) = Yan Huang - CSCI5330 Database Implementation –Information Retrieval

Relevance Using Hyperlinks
• Problem with keyword search?
• Problem with search based on the most frequently visited Web sites?
• Idea: use popularity of Web site (e.g. how many people visit it) to rank site pages that match given keywords
• Problem: hard to find actual popularity of site

Different Ranking Factors
• Keyword and anchor-text based search finds all the related pages first
• PageRank ranks the search result set
• A highly ranked page is of no interest to you if it is not related to your query

Link Counts
[Figure: Taher’s Home Page and Sep’s Home Page, one linked by 2 unimportant pages and the other linked by 2 important pages; the linking pages shown are CS361, DB Pub Server, CNN, and Yahoo!]

Definition of PageRank
Let us calculate.

Definition of PageRank
[Figure: link graph over Sep, Taher, DB Pub Server, CNN, and Yahoo!, labeled with the values 1/2, 1/2, 1, 1, 0.05, 0.25, 0.1, 0.1, 0.1]

PageRank Diagram
[Figure sequence; the numbers shown on each slide are listed in order]
• Initialize all nodes to rank: 0.333, 0.333, 0.333
• Propagate ranks across links (multiplying by link weights): 0.167, 0.333, 0.333, 0.167
• 0.5, 0.333, 0.167
• 0.167, 0.5, 0.167, 0.167
• 0.333, 0.5, 0.167
• After a while…: 0.4, 0.4, 0.2

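Computing PageRank
$x_i = \sum_{j \in B_i} \frac{x_j}{N_j}$
where $x_i$ = importance of page $i$, $x_j$ = importance of page $j$, $N_j$ = number of outlinks from page $j$, and $B_i$ = pages $j$ that link to page $i$
• Initialize: $x_i^{(0)} = 1/n$ for each of the $n$ pages
• Repeat until convergence: $x_i^{(k+1)} = \sum_{j \in B_i} \frac{x_j^{(k)}}{N_j}$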
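The initialize-and-repeat loop can be sketched as follows. This is an illustration under stated assumptions, not the slides' own code: the function name `pagerank_simple`, the tolerance, and the three-page graph (A links to B and C, B links to A, C links to B) are hypothetical. The graph was chosen because it is consistent with the numbers in the PageRank Diagram slides, which start at 0.333 each and settle at 0.4, 0.4, 0.2. No damping is applied, and every page is assumed to have at least one outlink.

```python
def pagerank_simple(out_links, tol=1e-10, max_iter=1000):
    """Iterate rank[i] = sum(rank[j] / outdegree(j)) over pages j linking to i."""
    n = len(out_links)
    rank = {page: 1.0 / n for page in out_links}  # initialize uniformly
    for _ in range(max_iter):
        new_rank = {page: 0.0 for page in out_links}
        for j, targets in out_links.items():
            share = rank[j] / len(targets)  # assumes no page has zero outlinks
            for i in targets:
                new_rank[i] += share
        if max(abs(new_rank[p] - rank[p]) for p in out_links) < tol:
            return new_rank
        rank = new_rank
    return rank

# Hypothetical three-page graph: page -> pages it links to
graph = {"A": ["B", "C"], "B": ["A"], "C": ["B"]}
ranks = pagerank_simple(graph)
print({page: round(value, 3) for page, value in ranks.items()})
```

On this graph the first iteration gives A = 0.333, B = 0.5, C = 0.167, and the values converge to A = 0.4, B = 0.4, C = 0.2.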

Definition of PageRank
[Formula with labeled parts: importance of page i; importance of page j; number of outlinks from page j; pages j that link to page i]
• The importance of a page is given by the importance of the pages that link to it
• d is a damping factor, usually 0.85
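The slide's exact damped formula is not recoverable here, so the sketch below assumes one common form: rank(i) = (1 - d)/n + d · Σ rank(j)/outlinks(j), summed over the pages j that link to page i. Some presentations use (1 - d) without the division by n. The function name `pagerank_damped` and the three-page graph are hypothetical, and pages without outlinks are not handled.

```python
def pagerank_damped(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Power iteration with damping factor d (assumed (1 - d)/n form)."""
    n = len(out_links)
    rank = {page: 1.0 / n for page in out_links}
    for _ in range(max_iter):
        new_rank = {page: (1.0 - d) / n for page in out_links}
        for j, targets in out_links.items():
            share = d * rank[j] / len(targets)  # assumes at least one outlink
            for i in targets:
                new_rank[i] += share
        if max(abs(new_rank[p] - rank[p]) for p in out_links) < tol:
            return new_rank
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["A"], "C": ["B"]}  # hypothetical link graph
damped = pagerank_damped(graph)
print({page: round(value, 3) for page, value in damped.items()})
```

With d = 1 the damping term vanishes and the loop reduces to the plain propagation of rank along links; smaller d pulls all ranks toward the uniform value 1/n.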

Synonyms and Homonyms
• Synonyms
  • E.g. document: “motorcycle repair”, query: “motorcycle maintenance”
  • Need to realize that “maintenance” and “repair” are synonyms
  • System can extend query as “motorcycle and (repair or maintenance)”
• Homonyms
  • E.g. “object” has different meanings as noun/verb
  • Can disambiguate meanings (to some extent) from the context
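Query extension with synonyms can be sketched as turning each keyword into an OR-group and requiring a match from every group. The synonym table, the function names `expand_query` and `matches`, and the set-based document representation are all assumptions for this example.

```python
# Hypothetical synonym table; a real system would use a thesaurus.
SYNONYMS = {"repair": {"maintenance"}, "maintenance": {"repair"}}

def expand_query(keywords):
    """Turn each keyword into a set containing it and its synonyms."""
    return [{word} | SYNONYMS.get(word, set()) for word in keywords]

def matches(doc_terms, groups):
    """True if the document contains at least one term from every group."""
    return all(group & doc_terms for group in groups)

groups = expand_query(["motorcycle", "maintenance"])
print(groups)
print(matches({"motorcycle", "repair"}, groups))
```

The expanded query corresponds to “motorcycle and (repair or maintenance)”, so a document about “motorcycle repair” now matches a query for “motorcycle maintenance”.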

Indexing of Documents
• An inverted index maps each keyword $K_i$ to a set of documents $S_i$ that contain the keyword
• Documents identified by identifiers
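A minimal in-memory sketch of an inverted index, assuming whitespace tokenization and integer document identifiers. The function names `build_inverted_index` and `and_query` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> text; returns keyword -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def and_query(index, keywords):
    """Return the ids of documents that contain every keyword."""
    sets = [index.get(word, set()) for word in keywords]
    return set.intersection(*sets) if sets else set()

docs = {1: "motorcycle repair", 2: "motorcycle maintenance", 3: "database repair"}
index = build_inverted_index(docs)
print(and_query(index, ["motorcycle", "repair"]))
```

A conjunctive keyword query then reduces to intersecting the sets $S_i$ of the query keywords.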

Measuring Retrieval Effectiveness
Relevant performance metrics:
• Precision: what percentage of the retrieved documents are relevant to the query.
• Recall: what percentage of the documents relevant to the query were retrieved.

Precision and Recall
[Table: predicted vs. actual, with a = predicted positive and actually positive, b = actually positive but not predicted, c = predicted positive but actually negative]
• Precision: $a / (a + c)$
  • Among all the retrieved, how many are actually positive?
• Recall: $a / (a + b)$
  • Percentage of actually positive data that is retrieved
• F measure: $2pr / (p + r)$
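The three measures can be computed directly from the retrieved and relevant sets. In this sketch the function name `precision_recall_f`, the convention of returning 0.0 when a denominator is zero, and the sample document ids are assumptions.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F) for two sets of document ids."""
    a = len(retrieved & relevant)  # retrieved and relevant
    c = len(retrieved - relevant)  # retrieved but not relevant
    b = len(relevant - retrieved)  # relevant but not retrieved
    p = a / (a + c) if a + c else 0.0
    r = a / (a + b) if a + b else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = precision_recall_f({1, 2, 3, 4}, {1, 2, 5})
print(round(p, 3), round(r, 3), round(f, 3))
```

Here 2 of the 4 retrieved documents are relevant (precision 0.5) and 2 of the 3 relevant documents were retrieved (recall 2/3), giving F = 4/7.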

Training Data
• Problem: which documents are actually relevant, and which are not
• Usual solution: human judges
  • Create a corpus of documents and queries, with humans deciding which documents are relevant to which queries
  • TREC (Text REtrieval Conference) Benchmark

Web Crawling
• Web crawlers are programs that locate and gather information on the Web
  • Recursively follow hyperlinks present in known documents, to find other documents
  • Starting from a seed set of documents
• Fetched documents
  • Handed over to an indexing system
  • Can be discarded after indexing, or stored as a cached copy
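The link-following process can be sketched without any network access by crawling a toy in-memory "Web". Everything here is hypothetical: the function name `crawl`, the `fetch_links` callback standing in for downloading and parsing a page, and the `TOY_WEB` graph. The sketch uses an explicit queue (breadth-first order) rather than literal recursion.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages reachable from the seeds, following links breadth-first."""
    seen = set(seeds)
    queue = deque(seeds)
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)  # a real crawler would hand the page to the indexer here
        for link in fetch_links(url):
            if link not in seen:  # never enqueue the same page twice
                seen.add(link)
                queue.append(link)
    return fetched

# Hypothetical link structure: page -> pages it links to
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
print(order)
```

The `seen` set is what keeps the crawl from looping forever on cycles such as a → c → a.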

Browsing
• Storing related documents together in a library facilitates browsing: users can see not only the requested document but also related ones.
• Browsing is facilitated by a classification system that organizes logically related documents together.
• Organization is hierarchical: classification hierarchy

[Figure: A Classification Hierarchy For A Library System]

Classification DAG
• Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important.
• The classification hierarchy is thus a Directed Acyclic Graph (DAG)

[Figure: A Classification DAG For A Library Information Retrieval System]

Web Directories
• A Web directory is just a classification directory on Web pages
  • E.g. Yahoo! Directory, Open Directory project
• Issues:
  • What should the directory hierarchy be?
  • Given a document, which nodes of the directory are categories relevant to the document?
    • Often done manually
    • Classification of documents into a hierarchy may be done based on term similarity

Some slides of this slide set adapted from the following slides:
• Prof. James Allan’s course slides
• Extrapolation Methods for Accelerating PageRank Computations by Sepandar D. Kamvar et al.
