CS 430: Information Discovery Lecture 5 Ranking
Course Administration

• Optional course readings are optional. Read them if you wish. Some may require a visit to a library!
• Teaching assistants do not have office hours. If your query cannot be addressed by email, ask to meet with them or come to my office hours.
• Assignment 1 is an individual assignment. Discuss the concepts and the choice of methods with your colleagues, but the actual programs and report must be individual work.
Course Administration: Hints on Assignment 1

• You are not building a production system!!!
• The volume of test data is quite small. Therefore:
  - Choose data structures, etc. that illustrate the concepts but are straightforward to implement (e.g., do not implement B-trees).
  - Consider batch loading of data (e.g., no need to provide for incremental update).
  - The user interface can be minimal (e.g., single-letter commands).
• To save typing, we will provide the arrays char_class and convert_class from Frakes, Chapter 7.
Term Frequency

Concept: A term that appears many times within a document is likely to be more important than a term that appears only once.
Term Frequency

Suppose term j appears f_ij times in document i.

Simple method (as illustrated in Lecture 4): use f_ij as the term frequency.

Standard method: scale f_ij relative to the other terms in the document. This partially corrects for variations in the length of the documents.

Let m_i = max_j (f_ij), i.e., m_i is the maximum frequency of any term in document i.

Term frequency (tf):   tf_ij = f_ij / m_i
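As a concrete illustration, here is a minimal sketch of the scaled term frequency in Python. The tokenized document and the helper name term_frequencies are illustrative, not part of the lecture.

    from collections import Counter

    def term_frequencies(doc_terms):
        """Return tf_ij = f_ij / m_i for each term j in one document i."""
        counts = Counter(doc_terms)    # f_ij: raw count of each term
        m_i = max(counts.values())     # m_i: frequency of the most common term
        return {term: f / m_i for term, f in counts.items()}

    print(term_frequencies(["cat", "cat", "dog", "bird"]))
    # {'cat': 1.0, 'dog': 0.5, 'bird': 0.5}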
Inverse Document Frequency

Concept: A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.
Inverse Document Frequency

Suppose there are n documents and that the number of documents in which term j occurs is d_j.

Simple method: use n/d_j as the inverse document frequency.

Standard method: the simple method over-emphasizes small differences. Therefore use a logarithm.

Inverse document frequency (idf):   idf_j = log2(n/d_j) + 1,   for d_j > 0
Example of Inverse Document Frequency

Example: n = 1,000 documents

term j    d_j      idf_j
A         100      4.32
B         500      2.00
C         900      1.15
D         1,000    1.00

From: Salton and McGill
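The table values can be checked with a few lines of Python; this is just a verification sketch of the formula above.

    from math import log2

    def idf(n, d_j):
        """idf_j = log2(n / d_j) + 1, defined for d_j > 0."""
        return log2(n / d_j) + 1

    n = 1000
    for term, d_j in [("A", 100), ("B", 500), ("C", 900), ("D", 1000)]:
        print(term, round(idf(n, d_j), 2))
    # A 4.32, B 2.0, C 1.15, D 1.0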
Standard Version of tf.idf Weighting

Combining tf and idf:

(a) Weight is proportional to the number of times that the term appears in the document.
(b) Weight is proportional to the logarithm of n/d_j, the reciprocal of the fraction of documents that contain the term.

Notation:
w_ij   the weight given to term j in document i
f_ij   the frequency with which term j appears in document i
d_j    the number of documents that contain term j
m_i    the maximum frequency of any term in document i
n      the total number of documents
Standard Form of tf.idf

Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:

(Weight of term j in document i) = (Term frequency) * (Inverse document frequency)

The standard tf.idf weighting scheme is:

w_ij = tf_ij * idf_j = (f_ij / m_i) * (log2(n/d_j) + 1)

Frakes, Chapter 14 discusses many variations on this basic scheme.
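A minimal end-to-end sketch of the standard scheme, assuming a toy collection of tokenized documents (the data is illustrative, not from the lecture):

    from collections import Counter
    from math import log2

    docs = [["cat", "cat", "dog"], ["dog", "bird"], ["cat", "bird", "bird"]]
    n = len(docs)

    # d_j: number of documents that contain term j
    doc_freq = Counter(term for doc in docs for term in set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)          # f_ij for this document
        m_i = max(counts.values())     # m_i: maximum frequency in this document
        weights.append({term: (f / m_i) * (log2(n / doc_freq[term]) + 1)
                        for term, f in counts.items()})
    print(weights)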
Ranking Based on Reference Patterns

With term weighting (e.g., tf.idf), documents are ranked depending on how well they match a specific query. With ranking by reference patterns, documents are ranked based on the references among them. The ranking of a set of documents is independent of any specific query.

In journal literature, references are called citations. On the web, references are called links or hyperlinks.
Citation Graph

[Figure: citation graph. An arrow from one paper to another is labeled "cites"; the reverse relationship is "is cited by".]

Note that journal citations always refer to earlier work.
Bibliometrics

Techniques that use citation analysis to measure the similarity of journal articles or their importance:

• Bibliographic coupling: two papers that cite many of the same papers
• Co-citation: two papers that were cited by many of the same papers
• Impact factor (of a journal): frequency with which the average article in a journal has been cited in a particular year or period
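Bibliographic coupling and co-citation counts both fall out of simple matrix products. Below is a sketch assuming a 0/1 citation matrix C with C[i][j] = 1 when paper i cites paper j; the three-paper data is made up for illustration.

    import numpy as np

    # C[i][j] = 1 if paper i cites paper j (hypothetical three-paper example)
    C = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [0, 1, 0]])

    coupling = C @ C.T      # entry (a, b): papers cited by both a and b
    cocitation = C.T @ C    # entry (a, b): papers that cite both a and b
    print(coupling)
    print(cocitation)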
Graphical Analysis of Hyperlinks on the Web

[Figure: a graph of six web pages, numbered 1 to 6. One page links to many other pages; many pages link to another page.]
Matrix Representation

                          Citing page (from)
                     P1   P2   P3   P4   P5   P6   Number
Cited page (to)  P1   .    .    .    .    1    .      1
                 P2   1    .    1    .    .    .      2
                 P3   1    1    .    1    .    .      3
                 P4   1    1    .    .    1    1      4
                 P5   1    .    .    .    .    .      1
                 P6   .    .    .    .    1    .      1

             Number   4    2    1    1    3    1
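The table can be generated from the underlying link structure. The sketch below assumes the reading of the table above (e.g., P1 links to P2, P3, P4, and P5) and checks the row and column totals.

    import numpy as np

    # Link structure read off the table: page -> pages it links to
    links = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}

    A = np.zeros((6, 6), dtype=int)
    for source, targets in links.items():
        for target in targets:
            A[target - 1, source - 1] = 1   # row = cited page, column = citing page

    print(A.sum(axis=1))   # citations to each page:  [1 2 3 4 1 1]
    print(A.sum(axis=0))   # links out of each page:  [4 2 1 1 3 1]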
PageRank Algorithm (Google)

Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.
Intuitive Model

A user:
1. Starts at a random page on the web
2. Selects a random hyperlink from the current page and jumps to the corresponding page
3. Repeats Step 2 a very large number of times

Pages are ranked according to the relative frequency with which they are visited. A simulation of this model is sketched below.
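A direct simulation of the model, assuming the six-page link structure from the matrix above. With enough steps the visit frequencies approach the ranks computed by the iteration on the following slides; on this graph the surfer ends up circulating among pages 2, 3, and 4.

    import random
    from collections import Counter

    links = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}

    def surf(links, steps=100_000):
        page = random.choice(list(links))      # 1. start at a random page
        visits = Counter()
        for _ in range(steps):
            page = random.choice(links[page])  # 2. follow a random hyperlink
            visits[page] += 1
        return {p: count / steps for p, count in sorted(visits.items())}

    print(surf(links))   # roughly {2: 0.4, 3: 0.4, 4: 0.2} plus a few early visits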
Basic Algorithm: Normalize by Number of Links from Page

                       Citing page
                 P1    P2    P3    P4    P5    P6
Cited page  P1    .     .     .     .   0.33    .
            P2  0.25    .     1     .     .     .
            P3  0.25   0.5    .     1     .     .
            P4  0.25   0.5    .     .   0.33    1     =  B
            P5  0.25    .     .     .     .     .
            P6    .     .     .     .   0.33    .

B is the normalized link matrix: each column of the link matrix is divided by the number of links from that page (4, 2, 1, 1, 3, 1).
Basic Algorithm: Weighting of Pages

Initially all pages have weight 1:

w1 = (1, 1, 1, 1, 1, 1)^T

Recalculate weights:

w2 = B w1 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)^T
Basic Algorithm: Iterate

Iterate: wk = B wk-1

w1 = (1, 1, 1, 1, 1, 1)^T
w2 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)^T
w3 = (0.08, 1.83, 2.79, 1.12, 0.08, 0.08)^T
w4 = (0.03, 2.80, 2.06, 1.05, 0.02, 0.03)^T
...
converges to w = (0.00, 2.39, 2.39, 1.19, 0.00, 0.00)^T
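In code, the iteration is a repeated matrix-vector product. This sketch rebuilds B by dividing each column by its out-degree and iterates; with exact thirds instead of 0.33 the limit is (0, 2.4, 2.4, 1.2, 0, 0), matching the slide's values up to rounding.

    import numpy as np

    links = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}

    B = np.zeros((6, 6))
    for source, targets in links.items():
        for target in targets:
            B[target - 1, source - 1] = 1 / len(targets)  # normalize by out-degree

    w = np.ones(6)      # w1: every page starts with weight 1
    for _ in range(50):
        w = B @ w       # wk = B wk-1
    print(w.round(2))   # [0.  2.4 2.4 1.2 0.  0. ]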
Google PageRank with Damping

A user:
1. Starts at a random page on the web
2a. With probability p, selects any random page and jumps to it
2b. With probability 1 - p, selects a random hyperlink from the current page and jumps to the corresponding page
3. Repeats Steps 2a and 2b a very large number of times

Pages are ranked according to the relative frequency with which they are visited.
The PageRank Iteration

The basic method iterates using the normalized link matrix, B:

wk = B wk-1

This w is the principal eigenvector of B (the eigenvector corresponding to the largest eigenvalue).

Google iterates using a damping factor. The method iterates using a matrix B', where:

B' = pN + (1 - p)B

N is the matrix with every element equal to 1/n. p is a constant found by experiment.
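The damped iteration is a one-line change, sketched below. It reuses the matrix B from the previous sketch; p = 0.15 is an assumed illustrative value, not one given in the lecture.

    import numpy as np

    def damped_rank(B, p=0.15, iterations=100):
        """Iterate wk = B' wk-1 with B' = pN + (1 - p)B."""
        n = B.shape[0]
        N = np.full((n, n), 1.0 / n)   # N: every element equal to 1/n
        B_prime = p * N + (1 - p) * B
        w = np.ones(n)
        for _ in range(iterations):
            w = B_prime @ w
        return w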
Google: PageRank

The Google PageRank algorithm is usually written with the following notation.

If page A has pages T1, ..., Tn pointing to it:
• d: damping factor (playing the role of 1 - p on the previous slide)
• C(A): number of links out of A

Iterate until the ranks converge:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
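The same update in per-page form, assuming an adjacency list links mapping each page to the pages it links to (the dictionary layout is illustrative; d = 0.85 is the commonly cited value).

    def pagerank(links, d=0.85, iterations=50):
        pr = {page: 1.0 for page in links}
        out = {page: len(targets) for page, targets in links.items()}  # C(A)
        for _ in range(iterations):
            # for each page, sum PR(Ti)/C(Ti) over the pages Ti pointing to it
            pr = {page: (1 - d) + d * sum(pr[t] / out[t]
                                          for t in links if page in links[t])
                  for page in links}
        return pr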
Information Retrieval Using PageRank

Simple Method: Consider all hits (i.e., all document vectors that share at least one term with the query vector) as equal. Display the hits ranked by PageRank.

The disadvantage of this method is that it pays no attention to how closely a document matches the query.
Reference Pattern Ranking using Dynamic Document Sets

PageRank calculates document ranks for the entire (fixed) set of documents. The calculations are made periodically (e.g., monthly) and the document ranks are the same for all queries.

Concept: Reference patterns among documents that are related to a specific query convey more information than patterns calculated across entire document collections. With dynamic document sets, reference patterns are calculated for a set of documents that are selected based on each individual query.
Reference Pattern Ranking using Dynamic Document Sets

Teoma Dynamic Ranking Algorithm (used in Ask Jeeves):
1. Search using conventional term weighting. Rank the hits using similarity between query and documents.
2. Select the highest-ranking hits (e.g., top 5,000 hits).
3. Carry out PageRank or a similar algorithm on this set of hits. This creates a set of document ranks that are specific to this query.
4. Display the results ranked in the order calculated from the reference patterns.
Combining Term Weighting with Reference Pattern Ranking

Combined Method:
1. Find all documents that share a term with the query vector.
2. The similarity, using conventional term weighting, between the query and document j is s_j.
3. The rank of document j using PageRank or other reference pattern ranking is p_j.
4. Calculate a combined rank c_j = λs_j + (1 - λ)p_j, where λ is a constant.
5. Display the hits ranked by c_j.

This method is used in several commercial systems, but the details have not been published. A sketch of the combination step follows.
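A minimal sketch of the combination step; the scores and λ = 0.7 below are made-up illustrative numbers, since the commercial details are unpublished.

    def combined_rank(similarity, pagerank, lam=0.7):
        """c_j = lam * s_j + (1 - lam) * p_j for each document j."""
        return {doc: lam * similarity[doc] + (1 - lam) * pagerank[doc]
                for doc in similarity}

    s = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.7}   # term-weighting similarities s_j
    p = {"doc1": 0.2, "doc2": 0.8, "doc3": 0.5}   # reference-pattern ranks p_j
    hits = sorted(combined_rank(s, p).items(), key=lambda kv: -kv[1])
    print(hits)   # doc1 first: its strong query match outweighs its low page rank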
Cornell Note

Jon Kleinberg of Cornell Computer Science has carried out extensive research in this area, both theoretical work and the practical development of new algorithms. In particular, he has studied hubs (documents that refer to many others) and authorities (documents that are referenced by many others).