Lecture 22: SVD, Eigenvector, and Web Search (Shang-Hua Teng)
Earlier Search Engines • Hotbot, Yahoo, Alta Vista, Northern Light, Excite, Infoseek, Lycos … • Main technique: “inverted index” • Conceptually: use a matrix to represent how many times a term appears in one page • # of columns = # of pages (huge!) • # of rows = # of terms (also huge!)

            Page1  Page2  Page3  Page4  …
  ‘car’       1      0      1      0
  ‘toyota’    0      2      0      1     ← page 2 mentions ‘toyota’ twice
  ‘honda’     2      1      0      0
  …
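A minimal sketch of this idea in Python (the page contents here are made up for illustration): rather than a dense term-page matrix, an inverted index keeps, for each term, only the pages where it occurs and the counts.

```python
from collections import defaultdict

# Toy corpus: page id -> text (page contents are hypothetical)
pages = {
    "page1": "car honda honda",
    "page2": "toyota toyota honda",
    "page3": "car",
    "page4": "toyota",
}

# Inverted index: term -> {page id: term frequency}.
# This stores only the nonzero entries of the huge term-page matrix.
index = defaultdict(dict)
for page_id, text in pages.items():
    for term in text.split():
        index[term][page_id] = index[term].get(page_id, 0) + 1

print(index["toyota"])  # {'page2': 2, 'page4': 1}
```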
Search by Keywords • If the query has one keyword, just return all the pages that contain the word • E.g., “toyota” → all pages containing “toyota”: page2, page4, … • There could be many, many pages! • Solution: return the pages where the word occurs most frequently first
Multi-keyword Search • For each keyword W, find the set of pages mentioning W • Intersect all the sets of pages • Assuming an “AND” semantics for the keywords • Example: a search for “toyota honda” returns all the pages that mention both “toyota” and “honda” (see the sketch below)
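A hedged sketch of both query types against the toy inverted index built above (function names are illustrative):

```python
def search_one(term):
    """Single-keyword search: pages containing the term,
    highest term frequency first."""
    postings = index.get(term, {})
    return sorted(postings, key=postings.get, reverse=True)

def search_all(*terms):
    """Multi-keyword AND search: intersect the page sets."""
    page_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*page_sets) if page_sets else set()

print(search_one("toyota"))           # ['page2', 'page4']
print(search_all("toyota", "honda"))  # {'page2'}
```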
Observations • The “matrix” can be huge: • Now the Web has more than 10 billion pages! • There are many “terms” on the Web. Many of them are typos. • It’s not easy to do the computation efficiently: • Given a word, find all the pages… • Intersect many sets of pages… • For these reasons, search engines never store this “matrix” so naively.
Problems • Spamming: • People want their pages ranked at the very top for a word search (e.g., “toyota”), so they repeat the word many, many times • Yet these pages may be unimportant compared to www.toyota.com, even if the latter mentions “toyota” only once (or not at all) • Search engines can be easily “fooled”
Closer look at the problems • What’s lacking is a concept of the “importance” of each page on each topic • E.g.: a random page may not be as “important” as Yahoo’s main page • A link from Yahoo is hence most likely more important than a link from that random page • But how to capture the importance of a page? • A guess: # of hits? (but where to get that info?) • # of inlinks to a page → Google’s main idea
PageRank • Intuition: • The importance of each page should be decided by what other pages “say” about this page • One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks) • Problem: • We can easily fool this technique by generating many dummy pages that point to a page
Link Analysis • The goal is to rank pages • We want to take advantage of the link structure to do this • Two main approaches • Static: we use the links to calculate a ranking of the pages offline (Google) • Dynamic: we use the links in the results of a search to dynamically determine a ranking (IBM Clever – Hubs and Authorities)
The Link Graph • View documents as graph nodes and the hyperlinks between documents as directed edges • Can give weights on edges (links) based on • Position in the document • Weight of anchor term • Number of occurrences of link • Our “MiniWeb” has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS). [Figure: MiniWeb graph with nodes Ne, Am, MS]
Hyperlink analysis • Idea: mine the structure of the Web graph • Related work: • Classic IR work (citations = links), a.k.a. “Bibliometrics” • Sociometrics • Many Web-related papers use this approach
Google’s approach • Assumption: a link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A) → quality of a page is related to its in-degree • Recursion: quality of a page is related to • its in-degree, and to • the quality of pages linking to it → PageRank [Brin and Page]
Intuition of PageRank • Consider the following infinite random walk (surf): • Initially the surfer is at a random page • At each step, the surfer proceeds • to a randomly chosen web page with probability a • to a randomly chosen successor of the current page with probability 1-a • The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.
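A small Monte Carlo sketch of the random surfer; the MiniWeb link structure and the parameter values are assumptions for illustration:

```python
import random

# MiniWeb link structure (assumed): page -> list of successors
links = {"Ne": ["Ne", "Am"], "Am": ["Ne", "MS"], "MS": ["Am"]}
a = 0.2          # probability of jumping to a random page
steps = 100_000  # length of the walk

visits = {p: 0 for p in links}
page = random.choice(list(links))
for _ in range(steps):
    if random.random() < a:
        page = random.choice(list(links))  # random jump
    else:
        page = random.choice(links[page])  # follow a random out-link
    visits[page] += 1

# The fraction of steps spent at each page approximates its PageRank
print({p: round(v / steps, 3) for p, v in visits.items()})
```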
PageRank: Formulation • PageRank = stationary probability for this random process (Markov chain), i.e.

  $p(j) = \frac{a}{n} + (1-a)\sum_{i \to j} \frac{p(i)}{\mathrm{outdeg}(i)}$

where n is the total number of nodes in the graph
PageRank: Matrix Formulation • Transition matrix M: $M_{ij} = 1/\mathrm{outdeg}(j)$ if page j links to page i, and 0 otherwise • The PageRank vector p is an eigenvector of the transition matrix: $p = Mp$ (for a = 0)
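A sketch of building this transition matrix from the (assumed) MiniWeb link structure:

```python
import numpy as np

pages = ["Ne", "Am", "MS"]
links = {"Ne": ["Ne", "Am"], "Am": ["Ne", "MS"], "MS": ["Am"]}

n = len(pages)
M = np.zeros((n, n))
for j, src in enumerate(pages):
    for dst in links[src]:
        # column j splits src's weight evenly among its successors
        M[pages.index(dst), j] = 1.0 / len(links[src])

print(M)  # each column sums to 1: M is column-stochastic
```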
Example: MiniWeb (a = 0) • Our “MiniWeb” has only three web sites: Netscape, Amazon, and Microsoft • Their PageRanks are represented as a vector p = (ne, am, ms) • The transition matrix (rows and columns ordered Ne, Am, MS) is

  $M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$

For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS.
Iterative computation • Start from p = (1/3, 1/3, 1/3) and repeatedly apply M • Final result: p = (2/5, 2/5, 1/5) • Netscape and Amazon have the same importance, and twice the importance of Microsoft • Does it capture the intuition?
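Continuing from the matrix M built above, a minimal power-iteration sketch; it converges to (2/5, 2/5, 1/5):

```python
p = np.full(n, 1.0 / n)  # start from uniform weights (1/3, 1/3, 1/3)
for _ in range(50):
    p = M @ p            # one step of p <- M p

print(dict(zip(pages, p.round(3))))  # {'Ne': 0.4, 'Am': 0.4, 'MS': 0.2}
```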
Observations • The matrix is stochastic (the sum of each column is 1). So the iterations converge, and compute the principal eigenvector of the following matrix equation: $p = Mp$
Problem 1 of algorithm: dead ends • MS does not point to anybody • Result: weights of the Web “leak out” [Figure: MiniWeb graph with MS’s out-link removed]
Problem 2 of algorithm: spider traps • MS only points to itself • Result: all weights go to MS! [Figure: MiniWeb graph with MS linking only to itself]
Google’s Hack: setting a > 0 (“tax each page”) • Like people paying taxes, each page pays some of its weight into a public pool, which is then distributed to all pages • Example: assume a 20% tax rate in the “spider trap” example (see the sketch below)
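A sketch of the taxed iteration on the spider-trap variant (the link structure is assumed; 20% tax means a = 0.2):

```python
trap_links = {"Ne": ["Ne", "Am"], "Am": ["Ne", "MS"], "MS": ["MS"]}

M_trap = np.zeros((n, n))
for j, src in enumerate(pages):
    for dst in trap_links[src]:
        M_trap[pages.index(dst), j] = 1.0 / len(trap_links[src])

a = 0.2                  # 20% tax rate
p = np.full(n, 1.0 / n)
for _ in range(100):
    # 80% of each page's weight flows along its links,
    # 20% goes into the pool and is split evenly among all pages
    p = (1 - a) * (M_trap @ p) + a / n

print(dict(zip(pages, p.round(3))))  # MS no longer absorbs all the weight
```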
Dynamic Ranking, Hubs and Authorities, IBM Clever • Goal: to get a ranking for a particular query (instead of the whole web). • Assume: We have a (set of) search engine(s) that can give a set of pages P that match a query.
Hubs and Authorities • Motivation: find web pages related to a topic • E.g.: “find all web sites about automobiles” • “Authority”: a page that offers info about a topic • E.g.: BMW, Toyota, Ford, … • “Hub”: a page that doesn’t provide much info itself, but tells us where to find pages about a topic • E.g.: auto sale sites, ebay, www.ConsumerReports.org
Kleinberg • Goal: Given a query, find: • Good sources of content (authorities) • Good sources of links (hubs)
Two values of a page • Each page has a hub value and an authority value. • In PageRank, each page has one value: “weight” • Two vectors: • h: hub values • a: authority values
HITS algorithm: find hubs and authorities • First step: find pages related to the topic (e.g., “automobile”), and construct the corresponding “focused subgraph” • Find the set S of pages containing the keyword (“automobile”), the root set • Find all pages these S pages point to, i.e., their forward neighbors • Find all pages that point to S pages, i.e., their backward neighbors • Compute the subgraph on these pages [Figure: root set expanded to the focused subgraph]
Neighborhood graph • Subgraph associated with each query: the query results Result1 … Resultn form the start set; the back set b1 … bm consists of pages pointing into the start set; the forward set f1 … fs consists of pages the start set points to • An edge for each hyperlink, but no edges within the same host [Figure: back set → start set → forward set]
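A sketch of building the focused subgraph from a root set, assuming hypothetical helpers `forward(p)` and `backward(p)` that return a page's forward and backward neighbors:

```python
def focused_subgraph(root_set, forward, backward):
    """Expand the root set into the focused subgraph.

    forward(p) / backward(p) are assumed helpers returning the pages
    p links to / the pages linking to p (e.g., from a crawl index).
    """
    nodes = set(root_set)
    for p in root_set:
        nodes.update(forward(p))   # forward neighbors of the root set
        nodes.update(backward(p))  # backward neighbors of the root set
    # keep only the hyperlinks among the collected pages
    edges = {(u, v) for u in nodes for v in forward(u) if v in nodes}
    return nodes, edges
```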
Step 2: computing h and a • Initially: set every hub and authority value to 1 • In each iteration, the hub score of a page becomes the total authority value of its forward neighbors (after normalization) • The authority value of each page becomes the total hub value of its backward neighbors (after normalization) • Iterate until convergence
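A minimal sketch of this iteration on an adjacency matrix A (A[i, j] = 1 if page i links to page j), using the MiniWeb links assumed earlier:

```python
import numpy as np

A = np.array([[1, 1, 0],   # Ne links to Ne, Am
              [1, 0, 1],   # Am links to Ne, MS
              [0, 1, 0]])  # MS links to Am

h = np.ones(3)             # initial hub values
a = np.ones(3)             # initial authority values
for _ in range(50):
    a = A.T @ h                # authority = total hub value of backward neighbors
    h = A @ a                  # hub = total authority value of forward neighbors
    a = a / np.linalg.norm(a)  # normalization
    h = h / np.linalg.norm(h)

print("authorities:", a.round(3))
print("hubs:       ", h.round(3))
```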
Computing Hubs and Authorities (1) • For each page p, we associate a non-negative authority weight $a_p$ and a non-negative hub weight $h_p$, updated as

  $a_p = \sum_{q \to p} h_q$  (1)
  $h_p = \sum_{p \to q} a_q$  (2)

• Number the pages {1, 2, …, n} and define their adjacency matrix A to be the n×n matrix whose (i, j)-th entry is equal to 1 if page i links to page j, and 0 otherwise. Define a = (a1, a2, …, an) and h = (h1, h2, …, hn). In matrix form:

  $a = A^T h$  (3)
  $h = A a$  (4)
Computing Hubs and Authorities (2) • Let $B = A^T A$. Substituting (4) into (3), each iteration computes

  $a \leftarrow A^T h = A^T A\, a = Ba$  (5)

so after k iterations a is proportional to $B^k a^{(0)}$  (6), which converges to the principal eigenvector of B  (7) • In other words, a is an eigenvector of B • B is the co-citation matrix: B(i, j) is the number of sites that jointly point to both i and j • B is symmetric and has n orthogonal unit eigenvectors
Hubs and Authorities • The hub and authority scores are the first (left and right) singular vectors of the adjacency matrix A
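Continuing with the same matrix A, a quick check with NumPy's SVD: h matches the first left singular vector and a the first right singular vector, up to sign.

```python
U, S, Vt = np.linalg.svd(A)
print("first left singular vector  (hubs):       ", np.abs(U[:, 0]).round(3))
print("first right singular vector (authorities):", np.abs(Vt[0]).round(3))
```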
Example: MiniWeb • Applying the iteration (with normalization at each step) to the MiniWeb adjacency matrix yields the hub and authority scores of Ne, Am, and MS [Figures: MiniWeb graph with the computed hub and authority values]