This lecture discusses techniques used in web search engines, including the concept of inverted index and the challenges in processing large matrices. It also explores problems such as spamming and the importance of pages based on link analysis. The lecture concludes with an explanation of PageRank and how it captures the importance of web pages.
Lecture 22: SVD, Eigenvector, and Web Search Shang-Hua Teng
Earlier Search Engines • Hotbot, Yahoo, Alta Vista, Northern Light, Excite, Infoseek, Lycos … • Main technique: “inverted index” • Conceptually: use a matrix to represent how many times a term appears in one page • # of columns = # of pages (huge!) • # of rows = # of terms (also huge!)
            Page1  Page2  Page3  Page4  …
‘car’         1      0      1      0
‘toyota’      0      2      0      1     (page 2 mentions ‘toyota’ twice)
‘honda’       2      1      0      0
…
Search by Keywords • If the query has one keyword, just return all the pages that have the word • E.g., “toyota” → all pages containing “toyota”: page2, page4, … • There could be many, many pages! • Solution: return the pages in which the word occurs most often first
Multi-keyword Search • For each keyword W, find the set of pages mentioning W • Intersect all the sets of pages • This assumes an “AND” operation on the keywords • Example: • A search “toyota honda” will return all the pages that mention both “toyota” and “honda”
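The single-keyword and multi-keyword searches above can be sketched with a small inverted index. This is only an illustration, not how any of the named engines worked: the toy corpus is made up to match the term counts in the matrix on the earlier slide, and the names `build_inverted_index` and `search` are hypothetical.

```python
from collections import defaultdict

# ASSUMPTION: toy corpus chosen to reproduce the counts in the slide's matrix
# ('car', 'toyota', 'honda' over Page1..Page4). The page texts are made up.
pages = {
    "page1": "car honda honda",
    "page2": "toyota toyota honda",
    "page3": "car",
    "page4": "toyota",
}

def build_inverted_index(pages):
    """Map each term to {page: number of occurrences of the term in that page}."""
    index = defaultdict(dict)
    for page, text in pages.items():
        for term in text.split():
            index[term][page] = index[term].get(page, 0) + 1
    return index

def search(index, query):
    """AND-search: pages containing every keyword, most occurrences first."""
    terms = query.split()
    if not terms:
        return []
    # One set of pages per keyword, then intersect the sets.
    postings = [set(index.get(term, {})) for term in terms]
    hits = set.intersection(*postings)

    def score(page):
        # Total number of occurrences of the query terms in the page.
        return sum(index[term][page] for term in terms)

    # Highest score first; ties are broken by page name for determinism.
    return sorted(hits, key=lambda page: (-score(page), page))

index = build_inverted_index(pages)
print(search(index, "toyota"))        # ['page2', 'page4']
print(search(index, "toyota honda"))  # ['page2']
```

Storing only the non-zero entries per term, as the dictionaries do here, is one simple way to avoid materializing the full term-by-page matrix.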
Observations • The “matrix” can be huge: • Now the Web has more than 10 billion pages! • There are many “terms” on the Web. Many of them are typos. • It’s not easy to do the computation efficiently: • Given a word, find all the pages… • Intersect many sets of pages… • For these reasons, search engines never store this “matrix” so naively.
Problems • Spamming: • People try to push their pages to the very top of a word search (e.g., “toyota”) by repeating the word many, many times • Yet these pages may be unimportant compared to www.toyota.com, even if the latter mentions “toyota” only once (or not at all). • Search engines can be easily “fooled”
Closer look at the problems • Lacking the concept of “importance” of each page on each topic • E.g.: a random page may not be as “important” as Yahoo’s main page. • A link from Yahoo is hence most likely more important than a link from that random page • But how to capture the importance of a page? • A guess: # of hits? Where to get that info? • # of inlinks to a page → Google’s main idea.
PageRank • Intuition: • The importance of each page should be decided by what other pages “say” about this page • One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks) • Problem: • We can easily fool this technique by generating many dummy pages that point to a page
Link Analysis • The goal is to rank pages • We want to take advantage of the link structure to do this • Two main approaches • Static: we will use the links to calculate a ranking of the pages offline (Google) • Dynamic: we will use the links in the results of a search to dynamically determine a ranking (IBM Clever – Hubs and Authorities)
The Link Graph • View documents as graph nodes and the hyperlinks between documents as directed edges • Can give weights on edges (links) based on • Position in the document • Weight of anchor term • Number of occurrences of link • Our “MiniWeb” has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS).
Hyperlink analysis • Idea: Mine structure of the web graph • Related work: • Classic IR work (citations = links) a.k.a. “Bibliometrics” • Socio-metrics • Many Web related papers use this approach
Google’s approach • Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A) → Quality of a page is related to its in-degree • Recursion: Quality of a page is related to • its in-degree, and to • the quality of pages linking to it → PageRank [Brin and Page]
Intuition of PageRank • Consider the following infinite random walk (surf): • Initially the surfer is at a random page • At each step, the surfer proceeds • to a randomly chosen web page with probability a • to a randomly chosen successor of the current page with probability 1-a • The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.
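PageRank: Formulation • PageRank = stationary probability for this random process (Markov chain), i.e. $$\mathrm{PageRank}(p) = \frac{a}{n} + (1-a) \sum_{q:\, q \to p} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$ where n is the total number of nodes in the graph and the sum runs over the pages q that link to p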
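The random-surfer definition can be turned into a simple iteration: each page receives a/n from the random jumps plus a share (1-a)/outdegree(q) of the rank of every page q that links to it. The sketch below is one possible way to write this, not the production algorithm; the function name `pagerank`, the default a = 0.15, the fixed iteration count, and the four-link example graph are all assumptions made for illustration.

```python
def pagerank(links, a=0.15, iterations=100):
    """Iterate the random-surfer update on a small graph.

    links: dict mapping each page to the list of its successors.
    ASSUMPTION: every page has at least one successor and every
    successor is also a key of `links` (no dead ends).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # surfer starts at a random page
    for _ in range(iterations):
        # Random jump: every page gets a/n.
        new_rank = {p: a / n for p in pages}
        # Link following: q passes (1-a) * rank[q] / outdegree(q) to each successor.
        for q, successors in links.items():
            share = (1 - a) * rank[q] / len(successors)
            for p in successors:
                new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical graph, not from the lecture.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Because no page in this example is a dead end, the ranks keep summing to 1 from one iteration to the next.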
PageRank: Matrix Formulation • Transition Matrix • Eigenvector of the Transition matrix
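One way to write out the two bullets above is given below as a hedged sketch. The symbols M (transition matrix), r (vector of PageRank values) and 1 (all-ones vector) are notation assumed here, chosen to agree with the random-surfer definition with jump probability a and n pages.

```latex
M_{pq} =
\begin{cases}
  \dfrac{1}{\mathrm{outdegree}(q)} & \text{if page } q \text{ links to page } p,\\[6pt]
  0 & \text{otherwise,}
\end{cases}
\qquad
\mathbf{r} = \frac{a}{n}\,\mathbf{1} + (1-a)\,M\,\mathbf{r}.
```

With the entries of r summing to 1, this says that r is an eigenvector, with eigenvalue 1, of the matrix (a/n) 1 1^T + (1-a) M; for a = 0 it reduces to r = M r.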
Example: MiniWeb a=0 • Our “MiniWeb” has only three web sites: Netscape, Amazon, and Microsoft. • Their PageRanks are represented as a vector (Ne, MS, Am). • For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS.
Iterative computation • Final result: Netscape and Amazon have the same importance, and twice the importance of Microsoft. • Does it capture the intuition?
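The iteration for the MiniWeb with a = 0 can be sketched as repeated multiplication by the transition matrix. The text only states that Amazon splits its weight between Netscape and Microsoft; the remaining links used below (Netscape links to itself and to Amazon, Microsoft links to Amazon) are an assumption, chosen because they reproduce the stated final result. The starting weights of 1 per page are also an assumption.

```python
# Column-stochastic transition matrix, rows and columns ordered (Ne, MS, Am).
# ASSUMPTION: Ne -> {Ne, Am}, MS -> {Am}, Am -> {Ne, MS}. Only Am's links are
# stated in the text; the other two are assumed for this sketch.
M = [
    [0.5, 0.0, 0.5],  # weight flowing into Ne: half of Ne, half of Am
    [0.0, 0.0, 0.5],  # weight flowing into MS: half of Am
    [0.5, 1.0, 0.0],  # weight flowing into Am: half of Ne, all of MS
]

def step(M, r):
    """One iteration: r_new = M r."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]

r = [1.0, 1.0, 1.0]  # ASSUMPTION: every page starts with weight 1
for _ in range(100):
    r = step(M, r)

print([round(x, 3) for x in r])  # [1.2, 0.6, 1.2]
```

The total weight stays at 3 throughout because every column of M sums to 1.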
Observations • The matrix is stochastic (the sum of each column is 1). So the iterations converge, and compute the principal eigenvector of the transition matrix M, i.e. the vector r of page weights that satisfies the matrix equation $\mathbf{r} = M\mathbf{r}$.
Problem 1 of algorithm: dead ends • MS does not point to anybody • Result: weights of the Web “leak out”
Problem 2 of algorithm: spider traps • MS only points to itself • Result: all weights go to MS!
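Both problems can be observed numerically with a = 0. In the sketch below only the behaviour of MS comes from the text; the links of Netscape and Amazon (Ne links to itself and Am, Am links to Ne and MS) and the starting weights of 1 are assumptions made for illustration.

```python
def iterate(M, r, iterations=100):
    """Repeatedly apply r_new = M r (no random jumps, a = 0)."""
    for _ in range(iterations):
        r = [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]
    return r

# Rows and columns ordered (Ne, MS, Am).
# ASSUMPTION: Ne -> {Ne, Am} and Am -> {Ne, MS} in both variants.

# Dead end: MS links to nobody, so its column is all zeros.
dead_end = [
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0],
]

# Spider trap: MS links only to itself.
spider_trap = [
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0],
]

leaked = iterate(dead_end, [1.0, 1.0, 1.0])
trapped = iterate(spider_trap, [1.0, 1.0, 1.0])

print([round(x, 3) for x in leaked])   # [0.0, 0.0, 0.0]  all weight leaks out
print([round(x, 3) for x in trapped])  # [0.0, 3.0, 0.0]  all weight ends up at MS
```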
Google’s Hack: setting a > 0 (“tax each page”) • Like people paying taxes, each page pays some weight into a public pool, which will be distributed to all pages. • Example: assume 20% tax rate in the “spider trap” example.
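The 20% tax on the spider-trap example can be sketched as follows: each page keeps 80% of the weight it would pass along its links, and the pooled 20% is split evenly among the three pages. As before, the links of Netscape and Amazon and the starting weights of 1 are assumptions; only "MS points to itself" and the 20% rate come from the text.

```python
# Spider-trap MiniWeb, rows and columns ordered (Ne, MS, Am).
# ASSUMPTION: Ne -> {Ne, Am}, Am -> {Ne, MS}; MS -> {MS} is from the text.
M = [
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0],
]

TAX = 0.2  # the parameter a: fraction of each page's weight paid into the pool

def taxed_step(M, r, tax):
    n = len(r)
    pool_share = tax * sum(r) / n  # the pool is distributed evenly to all pages
    return [
        (1 - tax) * sum(M[i][j] * r[j] for j in range(n)) + pool_share
        for i in range(n)
    ]

r = [1.0, 1.0, 1.0]  # ASSUMPTION: every page starts with weight 1
for _ in range(100):
    r = taxed_step(M, r, TAX)

print([round(x, 3) for x in r])  # [0.636, 1.909, 0.455], i.e. (7/11, 21/11, 5/11)
```

MS still collects the largest share, but Netscape and Amazon no longer drop to zero.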
Dynamic Ranking, Hubs and Authorities, IBM Clever • Goal: to get a ranking for a particular query (instead of the whole web). • Assume: We have a (set of) search engine(s) that can give a set of pages P that match a query.
Hubs and Authorities • Motivation: find web pages on a topic • E.g.: “find all web sites about automobiles” • “Authority”: a page that offers info about a topic • E.g.: BMW, Toyota, Ford, … • “Hub”: a page that doesn’t provide much info, but tells us where to find pages about a topic • E.g.: Auto sale, ebay, www.ConsumerReports.org
Kleinberg • Goal: Given a query find: • Good sources of content (authorities) • Good sources of links (hubs)
Two values of a page • Each page has a hub value and an authority value. • In PageRank, each page has one value: “weight” • Two vectors: • h: hub values • a: authority values
HITS algorithm: find hubs and authorities • First step: find pages related to the topic (e.g., “automobile”), and construct the corresponding “focused subgraph” • Find the set S of pages containing the keyword (“automobile”), the root set • Find all pages these S pages point to, i.e., their forward neighbors. • Find all pages that point to S pages, i.e., their backward neighbors • Compute the subgraph of these pages
Neighborhood graph • Subgraph associated to each query • [Figure: Back Set (b1, b2, …, bm) → Query Results = Start Set (Result1, Result2, …, Resultn) → Forward Set (f1, f2, …, fs)] • An edge for each hyperlink, but no edges within the same host
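The construction of the focused subgraph can be sketched in a few lines. The function name `focused_subgraph`, the link dictionary, and the example pages are hypothetical; a real system would obtain the root set from a search engine and the backward neighbors from a link index, and any host-based filtering of edges is not modeled here.

```python
def focused_subgraph(links, root):
    """Return (nodes, edges) of the subgraph around the root set.

    links: dict mapping each page to the set of pages it links to.
    root:  set of pages that match the query (the set S).
    """
    # Forward neighbors: pages that some root page points to.
    forward = {q for p in root for q in links.get(p, ())}
    # Backward neighbors: pages that point to some root page.
    backward = {p for p, targets in links.items() if targets & root}
    nodes = set(root) | forward | backward
    # Keep every hyperlink whose two endpoints are both in the subgraph.
    edges = {(p, q) for p in nodes for q in links.get(p, ()) if q in nodes}
    return nodes, edges

# Hypothetical link structure, not from the lecture.
links = {
    "hub1": {"bmw", "toyota"},
    "bmw": {"dealer"},
    "toyota": set(),
    "dealer": set(),
    "unrelated": {"news"},
    "news": set(),
}
nodes, edges = focused_subgraph(links, root={"bmw", "toyota"})
print(sorted(nodes))  # ['bmw', 'dealer', 'hub1', 'toyota']
print(sorted(edges))  # [('bmw', 'dealer'), ('hub1', 'bmw'), ('hub1', 'toyota')]
```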
Step 2: computing h and a • Initially: set hub and authority to 1 • In each iteration, the hub score of a page is the total authority value of its forward neighbors (after normalization) • The authority value of each page is the total hub value of its backward neighbors (after normalization) • Iterate until convergence
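The iteration described above can be sketched as follows. The slide does not say which normalization is used; scaling each vector to unit Euclidean length is an assumption here, as are the function name `hits`, the fixed iteration count, and the four-page example graph.

```python
import math

def hits(links, iterations=50):
    """Alternate authority and hub updates on a small graph.

    links: dict mapping each page to the list of pages it links to.
    ASSUMPTION: every link target is also a key of `links`, the graph has
    at least one link, and vectors are normalized to unit Euclidean length.
    """
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority value: total hub value of the backward neighbors.
        auth = {p: 0.0 for p in pages}
        for q in pages:
            for p in links[q]:
                auth[p] += hub[q]
        norm = math.sqrt(sum(v * v for v in auth.values()))
        auth = {p: v / norm for p, v in auth.items()}
        # Hub value: total authority value of the forward neighbors.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values()))
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: two link pages pointing at two content pages.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
print({p: round(v, 3) for p, v in hub.items()})
print({p: round(v, 3) for p, v in auth.items()})
```

In this example a1 ends up with the highest authority value because both link pages point to it, and h1 gets the highest hub value because it points to both content pages.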
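Computing Hubs and Authorities (1) • For each page p, we associate a non-negative authority weight $a_p$ and a non-negative hub weight $h_p$: $$a_p = \sum_{q:\, q \to p} h_q \quad (1)$$ $$h_p = \sum_{q:\, p \to q} a_q \quad (2)$$ • Number the pages {1, 2, …, n} and define their adjacency matrix A to be the n×n matrix whose (i,j)th entry is equal to 1 if page i links to page j, and is 0 otherwise. • Define $\mathbf{a} = (a_1, a_2, \dots, a_n)$ and $\mathbf{h} = (h_1, h_2, \dots, h_n)$. Then: $$\mathbf{a} = A^{T}\mathbf{h} \quad (3)$$ $$\mathbf{h} = A\,\mathbf{a} \quad (4)$$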
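Computing Hubs and Authorities (2) • Substituting one update into the other (up to normalization): $$\mathbf{a} = A^{T}\mathbf{h} = A^{T}A\,\mathbf{a} \quad (5)$$ $$\mathbf{h} = A\,\mathbf{a} = AA^{T}\mathbf{h} \quad (6)$$ • Let $B = A^{T}A$. In other words, a is an eigenvector of B: $$\mathbf{a} = B\,\mathbf{a} \quad (7)$$ • B is the co-citation matrix: B(i,j) is the number of sites that jointly point to both i and j. • B is symmetric and has n orthogonal unit eigenvectors.
Hubs and Authorities • Hub and authority scores are the first singular vectors of the matrix A: h is the first left singular vector and a is the first right singular vector.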
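The link between the iteration and the eigenvectors of B = A^T A can be checked numerically on a small example. The 3-page adjacency matrix below is hypothetical (it is not claimed to be the lecture's MiniWeb), and unit-length normalization is again an assumption. The sketch runs the hub/authority updates and then verifies that the resulting authority vector satisfies B a = λ a.

```python
import math

# ASSUMPTION: hypothetical adjacency matrix, A[i][j] = 1 if page i links to page j.
A = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]
n = len(A)

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

At = [list(col) for col in zip(*A)]  # transpose of A

# Hub/authority iteration: a = A^T h, h = A a, each followed by normalization.
h = [1.0] * n
a = [1.0] * n
for _ in range(100):
    a = normalize(matvec(At, h))
    h = normalize(matvec(A, a))

# Co-citation matrix B = A^T A: B[i][j] counts pages linking to both i and j.
B = [[sum(At[i][k] * A[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

Ba = matvec(B, a)
lam = sum(x * y for x, y in zip(Ba, a))  # Rayleigh quotient (a has unit length)
residual = max(abs(Ba[i] - lam * a[i]) for i in range(n))

print(B)                         # [[2, 2, 1], [2, 2, 1], [1, 1, 2]]
print(round(lam, 4))             # largest eigenvalue of B
print(round(math.sqrt(lam), 4))  # corresponding (largest) singular value of A
print(residual < 1e-9)           # True: a is an eigenvector of B
```

The same check with A A^T in place of B applies to the hub vector h.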
Example: MiniWeb • Normalization! [Figure: MiniWeb link graph with nodes Ne, MS, Am]
Example: MiniWeb [Figure: MiniWeb link graph with nodes Ne, MS, Am]