CS 277: Data Mining Mining Web Link Structure. CIS 455/555: Internet and Web Systems. HITS and PageRank; Google March 27, 2013. Web search before 1998. Based on information retrieval Boolean / vector model, etc. Based purely on 'on-page' factors, i.e., the text of the page
CIS 455/555: Internet and Web Systems HITS and PageRank; Google March 27, 2013
Web search before 1998 • Based on information retrieval • Boolean / vector model, etc. • Based purely on 'on-page' factors, i.e., the text of the page • Results were not very good • Web doesn't have an editor to control quality • Web contains deliberately misleading information (SEO) • Great variety in types of information: Phone books, catalogs, technical reports, slide shows, ... • Many languages, partial descriptions, jargon, ... • How to improve the results?
Plan for today • HITS • Hubs and authorities • PageRank • Iterative computation • Random-surfer model • Refinements: Sinks and Hogs • Google • How Google worked in 1998 • Google over the years • SEOs
Goal: Find authoritative pages • Many queries are relatively broad • "cats", "harvard", "iphone", ... • Consequence: Abundance of results • There may be thousands or even millions of pages that contain the search term, incl. personal homepages, rants, ... • IR-type ranking isn't enough; still way too much for a human user to digest • Need to further refine the ranking! • Idea: Look for the most authoritative pages • But how do we tell which pages these are? • Problem: No endogenous measure of authoritativeness; it is hard to tell just by looking at the page • Need some 'off-page' factors
Idea: Use the link structure • Hyperlinks encode a considerable amount of human judgment • What does it mean when a web page links to another web page? • Intra-domain links: Often created primarily for navigation • Inter-domain links: Confer some measure of authority • So, can we simply boost the rank of pages with lots of inbound links?
Relevance ≠ Popularity! • [Figure: link graph with the nodes Team Sports, “A-Team” page, Yahoo Directory, Wikipedia, Cheesy TV Shows page, Mr. T’s page, and Hollywood “Series to Recycle” page]
Hubs and authorities • Idea: Give more weight to links from hub pages that point to lots of other authorities • Mutually reinforcing relationship: • A good hub is one that points to many good authorities • A good authority is one that is pointed to by many good hubs • [Figure: pages A and B, labeled Hub and Authority]
HITS • Algorithm for a query Q: 1. Start with a root set R, e.g., the t highest-ranked pages from the IR-style ranking for Q 2. For each p ∈ R, add all the pages p points to, and up to d pages that point to p. Call the resulting set S. 3. Assign each page p ∈ S an authority weight x_p and a hub weight y_p; initially, set all weights to be equal and sum to 1 4. For each p ∈ S, compute new weights x_p and y_p as follows: • New x_p := sum of all y_q such that q→p is an interdomain link • New y_p := sum of all x_q such that p→q is an interdomain link 5. Normalize the new weights such that both the sum of all the x_p and the sum of all the y_p are 1 6. Repeat from step 4 until a fixpoint is reached • If A is the adjacency matrix, the fixpoints are the principal eigenvectors of A^T A and A A^T, respectively • [Figure: root set R contained in the larger set S]
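The iteration in steps 3 to 6 can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name `hits`, the dictionary-based graph, the stopping tolerance, and the tiny example graph are all assumptions, and the input is assumed to contain only the interdomain links inside S.

```python
def hits(links, max_iter=100, tol=1e-10):
    """Run the HITS iteration on a directed graph.

    links: dict mapping each page to the list of pages it points to
           (assumed to hold only the interdomain links within S).
    Returns (authority, hub) dicts; each set of weights sums to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)

    # Step 3: equal initial weights that sum to 1.
    x = {p: 1.0 / len(pages) for p in pages}  # authority weights
    y = {p: 1.0 / len(pages) for p in pages}  # hub weights

    for _ in range(max_iter):
        # Step 4: new x_p = sum of y_q over all links q -> p.
        new_x = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_x[p] += y[q]
        # New y_p = sum of (old) x_q over all links p -> q.
        new_y = {p: sum(x[q] for q in links.get(p, [])) for p in pages}

        # Step 5: normalize both weight sets to sum to 1.
        sx = sum(new_x.values()) or 1.0
        sy = sum(new_y.values()) or 1.0
        new_x = {p: v / sx for p, v in new_x.items()}
        new_y = {p: v / sy for p, v in new_y.items()}

        # Step 6: stop once the weights no longer change noticeably.
        change = max(
            max(abs(new_x[p] - x[p]) for p in pages),
            max(abs(new_y[p] - y[p]) for p in pages),
        )
        x, y = new_x, new_y
        if change < tol:
            break
    return x, y


# Hypothetical example: two hub-like pages pointing at two candidate authorities.
example_links = {"h1": ["a1", "a2"], "h2": ["a1"]}
authority, hub = hits(example_links)
```

In this toy graph `a1` ends up with the larger authority weight because both hubs point to it, and `h1` ends up with the larger hub weight because it points to both authorities.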
HITS: Hub and Authority Rankings • J. Kleinberg, Authoritative sources in a hyperlinked environment, Proceedings of ACM SODA Conference, 1998. • HITS – Hypertext Induced Topic Selection • Every page u has two distinct measures of merit, its hub score h[u] and its authority score a[u] • Recursive quantitative definitions of hub and authority scores • Relies on query-time processing • To select a base set V_q of links for query q, constructed by • selecting a sub-graph R from the Web (root set) relevant to the query • selecting any node u which neighbors any r ∈ R via an inbound or outbound edge (expanded set) • To deduce hubs and authorities that exist in a sub-graph of the Web
Authority and Hubness • [Figure: nodes 2, 3, and 4 point to node 1; node 1 points to nodes 5, 6, and 7] • h(1) = a(5) + a(6) + a(7) • a(1) = h(2) + h(3) + h(4)
Authority and Hubness Convergence • Recursive dependency: • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$ • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$ • Using Linear Algebra, we can prove: a(v) and h(v) converge
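A brief sketch of why the recursion settles, written in matrix form. The adjacency matrix A (with A_{vw} = 1 if page v links to page w) and the iteration superscripts are notation introduced here for illustration; normalization after each step is assumed.

```latex
% A_{vw} = 1 if page v links to page w, and 0 otherwise (notation assumed here)
\begin{align*}
a &\leftarrow A^{\mathsf{T}} h, & h &\leftarrow A\,a \\
\Rightarrow\quad a^{(k+2)} &\propto A^{\mathsf{T}} A \, a^{(k)}, &
h^{(k+2)} &\propto A A^{\mathsf{T}} \, h^{(k)}
\end{align*}
```

Repeated multiplication by a fixed symmetric matrix followed by normalization is power iteration, so a and h approach principal eigenvectors of A^T A and A A^T, provided the largest eigenvalue is strictly larger than the others and the starting vector is not orthogonal to the corresponding eigenvector.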
HITS Example • Find a base subgraph: • Start with a root set R = {1, 2, 3, 4} • {1, 2, 3, 4}: nodes relevant to the topic • Expand the root set R to include all the children and a fixed number of parents of nodes in R • Result: a new set S (the base subgraph)
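The expansion step can be sketched as follows. The function name, the link dictionaries, and the node numbers beyond the root set {1, 2, 3, 4} are hypothetical, and taking the first d parents in list order is an arbitrary choice made for the sketch.

```python
def build_base_set(root, out_links, in_links, d):
    """Expand a root set R into a base set S.

    out_links[p]: pages that p points to (children)
    in_links[p]:  pages that point to p (parents)
    d:            maximum number of parents added per root page
    """
    base = set(root)
    for p in root:
        base.update(out_links.get(p, []))     # all children of p
        base.update(in_links.get(p, [])[:d])  # at most d parents of p
    return base


# Hypothetical link data around the root set {1, 2, 3, 4}.
root_set = {1, 2, 3, 4}
children = {1: [5], 2: [5, 6], 3: [7], 4: []}
parents = {1: [8, 9, 10], 2: [11], 3: [], 4: [12, 13, 14]}
base_set = build_base_set(root_set, children, parents, d=2)
```

Capping the number of parents keeps a very popular root page (one with a huge number of inbound links) from blowing up the size of S.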
HITS Example Results • [Figure: authority and hubness weights for nodes 1 to 15]
Recap: HITS • Improves the ranking based on link structure • Intuition: Links confer some measure of authority • Overall ranking is a combination of IR ranking and this • Based on concept of hubs and authorities • Hub: Points to many good authorities • Authority: Is pointed to by many good hubs • Iterative algorithm to assign hub/authority scores • Query-specific • No notion of 'absolute quality' of a page; ranking needs to be computed for each new query
Google's PageRank (Brin/Page 98) • A technique for estimating page quality • Based on web link graph, just like HITS • Like HITS, relies on a fixpoint computation • Important differences from HITS: • No hubs/authorities distinction; just a single value per page • Query-independent • Results are combined with IR score • Think of it as: TotalScore = IR score * PageRank • In practice, search engines use many other factors (for example, Google says it uses more than 200)
PageRank: Intuition • Imagine a contest for The Web's Best Page • Initially, each page has one vote • Each page votes for all the pages it has a link to • To ensure fairness, pages voting for more than one page must split their vote equally between them • Voting proceeds in rounds; in each round, each page has the number of votes it received in the previous round • In practice, it's a little more complicated - but not much! • [Figure: link graph over pages A to J, annotated with the questions "Shouldn't E's vote be worth more than F's?" and "How many levels should we consider?"]
PageRank • Each page i is given a rank x_i • Goal: Assign the x_i such that the rank of each page is governed by the ranks of the pages linking to it: $x_i = \sum_{j \in B_i} \frac{1}{N_j} x_j$, where x_i is the rank of page i, x_j is the rank of page j, N_j is the number of links out from page j, and B_i is the set of every page j that links to i • How do we compute the rank values?
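Iterative PageRank (simplified) • Initialize all ranks to be equal, e.g.: $x_i^{(0)} = \frac{1}{n}$ for n pages • Iterate until convergence: $x_i^{(k+1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(k)}$ (B_i: set of pages linking to i; N_j: number of links out from page j)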
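A minimal sketch of this simplified iteration, assuming every page has at least one outgoing link. The function name, the convergence tolerance, and the three-page graph are illustrative assumptions rather than material from the slides.

```python
def simple_pagerank(links, max_iter=1000, tol=1e-12):
    """Simplified PageRank without damping.

    links: dict mapping each page j to the list of pages it links to.
    Each page splits its current rank equally among its out-links.
    """
    pages = sorted(set(links) | {i for targets in links.values() for i in targets})
    n = len(pages)
    x = {p: 1.0 / n for p in pages}  # all ranks start out equal

    for _ in range(max_iter):
        new_x = {p: 0.0 for p in pages}
        for j, targets in links.items():
            for i in targets:
                new_x[i] += x[j] / len(targets)  # page j gives x_j / N_j to page i
        if max(abs(new_x[p] - x[p]) for p in pages) < tol:
            return new_x
        x = new_x
    return x


# Hypothetical three-page graph: A -> B, A -> C, B -> C, C -> A.
ranks = simple_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
# ranks approaches A: 0.4, B: 0.2, C: 0.4
```

Page C receives votes from both A and B, and A receives all of C's vote, so they end up tied; B only receives half of A's vote.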
Simple Example • [Figure: directed graph on nodes 1, 2, 3, 4 with edge weights (one edge of weight 1 and six edges of weight 0.5), shown together with its weight matrix W]
Matrix-Vector form • Recall r_j = importance of node j • $r_j = \sum_i w_{ij} r_i$, for i, j = 1, …, n • e.g., r_2 = 1 r_1 + 0 r_2 + 0.5 r_3 + 0.5 r_4 = dot product of the r vector with column 2 of W • Let r = n x 1 vector of importance values for the n nodes • Let W = n x n matrix of link weights • => we can rewrite the importance equations as $r = W^T r$
Eigenvector Formulation • Need to solve the importance equations for unknown r, with known W: $r = W^T r$ • We recognize this as a standard eigenvalue problem, i.e., $A r = \lambda r$ (where $A = W^T$), with λ = an eigenvalue = 1 and r = the eigenvector corresponding to λ = 1
Eigenvector Formulation • Need to solve for r in $(W^T - \lambda I) r = 0$ • Note: W is a stochastic matrix, i.e., its entries are non-negative and each row sums to 1 • Results from linear algebra tell us that: (a) W and W^T have the same eigenvalues (their eigenvectors generally differ) (b) Since W is a stochastic matrix, the largest of these eigenvalues λ is always 1 (c) The vector r is the eigenvector of W^T corresponding to this largest eigenvalue, λ = 1
Solution for the Simple Example • Solving for the eigenvector of W^T with eigenvalue 1, we get r = [0.2 0.4 0.133 0.2667] • Results are quite intuitive, e.g., node 2 is “most important” • [Figure: the four-node example graph and its weight matrix W]
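The stated solution can be checked numerically with power iteration. The matrix below is an assumption: the slide's W did not survive extraction, so it was reconstructed to be consistent with the equation for r_2, the edge weights (one 1 and six 0.5s), and the solution vector given above.

```python
# Weight matrix W, reconstructed from the numbers in the example (an assumption).
# W[i][j] is the weight of the link from node i+1 to node j+1; rows sum to 1.
W = [
    [0.0, 1.0, 0.0, 0.0],  # node 1 -> 2
    [0.5, 0.0, 0.0, 0.5],  # node 2 -> 1, 4
    [0.0, 0.5, 0.0, 0.5],  # node 3 -> 2, 4
    [0.0, 0.5, 0.5, 0.0],  # node 4 -> 2, 3
]


def power_iteration(W, iterations=200):
    """Repeatedly apply r <- W^T r, starting from a uniform vector."""
    n = len(W)
    r = [1.0 / n] * n
    for _ in range(iterations):
        # r_j = sum_i w_ij * r_i (dot product of r with column j of W)
        r = [sum(W[i][j] * r[i] for i in range(n)) for j in range(n)]
    return r


r = power_iteration(W)
# r approaches [0.2, 0.4, 0.1333..., 0.2666...]
```

Because each row of W sums to 1, the total importance stays at 1 throughout, so no separate normalization step is needed in this sketch.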
Naïve PageRank Algorithm Restated • Let • N(p) = number of outgoing links from page p • B(p) = set of pages that link to page p (its back-links) • Each page b distributes its importance to all of the pages it points to (so we scale by 1/N(b)) • Page p’s importance is increased by the importance of its back set: $\mathrm{PageRank}(p) = \sum_{b \in B(p)} \frac{1}{N(b)} \mathrm{PageRank}(b)$
In Linear Algebra formulation • Create an m x m matrix M to capture links: • M(i, j) = 1 / n_j if page i is pointed to by page j and page j has n_j outgoing links • M(i, j) = 0 otherwise • Initialize all PageRanks to 1, multiply by M repeatedly until all values converge: $x^{(k+1)} = M x^{(k)}$, where x is the vector of PageRanks • Computes principal eigenvector via power iteration
A Brief Example • [Equation and figure: the rank vector for Google, Amazon, and Yahoo equals M times the previous rank vector, run for multiple iterations] • Total rank sums to number of pages
Oops #1 – PageRank Sinks • 'Dead end': PageRank is lost after each round • [Equation and figure: the Google/Amazon/Yahoo example with a dead-end page, run for multiple iterations]
Oops #2 – PageRank hogs • PageRank cannot flow out and accumulates • [Equation and figure: the Google/Amazon/Yahoo example with a page that hoards rank, run for multiple iterations]
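Both failure modes are easy to reproduce with the matrix M from the linear algebra formulation. The two link structures below are hypothetical stand-ins, since the slide's actual matrices were not preserved: in one, Yahoo has no outgoing links (a sink); in the other, Yahoo links only to itself (a hog).

```python
def build_matrix(links, pages):
    """M[i][j] = 1/n_j if page j links to page i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    M = [[0.0] * len(pages) for _ in pages]
    for j, targets in links.items():
        for i in targets:
            M[index[i]][index[j]] = 1.0 / len(targets)
    return M


def iterate(M, rounds):
    """Start with all PageRanks at 1 and multiply by M repeatedly."""
    n = len(M)
    x = [1.0] * n
    for _ in range(rounds):
        x = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


pages = ["Google", "Amazon", "Yahoo"]

# Hypothetical sink: Yahoo has no outgoing links, so its rank is lost each round.
sink = {"Google": ["Amazon", "Yahoo"], "Amazon": ["Google", "Yahoo"], "Yahoo": []}
# Hypothetical hog: Yahoo links only to itself, so rank flows in but never out.
hog = {"Google": ["Amazon", "Yahoo"], "Amazon": ["Google", "Yahoo"], "Yahoo": ["Yahoo"]}

sink_ranks = iterate(build_matrix(sink, pages), 50)
hog_ranks = iterate(build_matrix(hog, pages), 50)
```

With the sink, the total rank shrinks toward zero; with the hog, the total stays at 3 (the number of pages) but nearly all of it ends up on Yahoo.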
Improved PageRank • Remove out-degree 0 nodes (or consider them to refer back to referrer) • Add decay factor d to deal with sinks • Typical value: d=0.85
Random Surfer Model • PageRank has an intuitive basis in random walks on graphs • Imagine a random surfer, who starts on a random page and, in each step, • with probability d, clicks on a random link on the page • with probability 1-d, jumps to a random page (bored?) • The PageRank of a page can be interpreted as the fraction of steps the surfer spends on the corresponding page • Transition matrix can be interpreted as a Markov Chain
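The random-surfer interpretation can be checked empirically by comparing a damped fixpoint iteration with a simulated walk. This is a sketch under stated assumptions: the update uses the probability-normalized form x_i = (1-d)/n + d * sum_j x_j / N_j, which may differ by a constant factor from the form on the slides; every page is assumed to have at least one outgoing link; and the graph, function names, step count, and seed are hypothetical.

```python
import random


def damped_pagerank(links, d=0.85, iterations=100):
    """Fixpoint iteration with damping; ranks sum to 1.

    links: dict mapping every page to a non-empty list of pages it links to
           (all link targets must also appear as keys).
    """
    pages = sorted(links)
    n = len(pages)
    x = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_x = {p: (1.0 - d) / n for p in pages}  # share from random jumps
        for j in pages:
            for i in links[j]:
                new_x[i] += d * x[j] / len(links[j])  # share from followed links
        x = new_x
    return x


def random_surfer(links, d=0.85, steps=200_000, seed=0):
    """Simulate the surfer and return the fraction of steps spent on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)  # start on a random page
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < d:
            page = rng.choice(links[page])  # click a random link on the page
        else:
            page = rng.choice(pages)        # jump to a random page
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph in which Yahoo links only to itself (a rank hog).
graph = {"Google": ["Amazon", "Yahoo"], "Amazon": ["Google", "Yahoo"], "Yahoo": ["Yahoo"]}
exact = damped_pagerank(graph)
estimate = random_surfer(graph)
```

The random jump keeps the hog from absorbing everything: Yahoo still gets the largest share, but Google and Amazon retain a small, nonzero rank, and the simulated visit fractions track the iterated values closely.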
Stopping the Hog • [Equation and figure: the Google/Amazon/Yahoo update with damping factor 0.85, run for multiple iterations] • ... though does this seem right?
Search Engine Optimization (SEO) • Has become a big business • White-hat techniques • Google webmaster tools • Add meta tags to documents, etc. • Black-hat techniques • Link farms • Keyword stuffing, hidden text, meta-tag stuffing, ... • Spamdexing • Initial solution: <a rel="nofollow" href="...">...</a> • Some people started to abuse this to improve their own rankings • Doorway pages / cloaking • Special pages just for search engines • BMW Germany and Ricoh Germany banned in February 2006 • Link buying
Recap: PageRank • Estimates absolute 'quality' or 'importance' of a given page based on inbound links • Query-independent • Can be computed via fixpoint iteration • Can be interpreted as the fraction of time a 'random surfer' would spend on the page • Several refinements, e.g., to deal with sinks • Considered relatively stable • But vulnerable to black-hat SEO • An important factor, but not the only one • Overall ranking is based on many factors (Google: >200)
What could be the other 200 factors? • Note: This is entirely speculative! • Links to 'bad neighborhood'; keyword stuffing; over-optimization; hidden content (text has same color as background); automatic redirect/refresh; ... • Keyword in title? URL?; keyword in domain name?; quality of HTML code; page freshness; rate of change; ... • Fast increase in number of inbound links (link buying?); link farming; different pages for user/spider; content duplication; ... • High PageRank; anchor text of inbound links; links from authority sites; links from well-known sites; domain expiration date; ... • Source: Web Information Systems, Prof. Beat Signer, VU Brussels
Beyond PageRank • PageRank assumes a “random surfer” who starts at any node and estimates likelihood that the surfer will end up at a particular page • A more general notion: label propagation • Take a set of start nodes each with a different label • Estimate, for every node, the distribution of arrivals from each label • In essence, captures the relatedness or influence of nodes • Used in YouTube video matching, schema matching, …
Google Architecture [Brin/Page 98] • Focus was on scalability to the size of the Web • First to really exploit link analysis • Started as an academic project @ Stanford; became a startup • Our discussion will be on early Google – today they keep things secret!
The Heart of Google Storage • “BigFile” system for storing indices, tables • Support for 2^64 bytes across multiple drives, filesystems • Manages its own file descriptors, resources • This was the predecessor to GFS • First use: Repository • Basically, a warehouse of every HTML page (this is the 'cached page' entry), compressed in zlib (faster than bzip) • Useful for doing additional processing, any necessary rebuilds • Repository entry format: [DocID][ECode][UrlLen][PageLen][Url][Page] • The repository is indexed (not inverted here)
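To make the entry format concrete, here is a sketch of writing and reading one repository record. The field order follows the format above, but the field widths, the byte order, and the choice to compress only the page body (with PageLen holding the compressed length) are assumptions made for this illustration.

```python
import struct
import zlib

# Hypothetical field widths (the slides name the fields but not their sizes):
# DocID: 4 bytes, ECode: 1 byte, UrlLen: 2 bytes, PageLen: 4 bytes, little-endian.
HEADER = struct.Struct("<IBHI")


def pack_entry(doc_id, ecode, url, page_html):
    """Serialize one entry as [DocID][ECode][UrlLen][PageLen][Url][Page]."""
    url_bytes = url.encode("utf-8")
    page_bytes = zlib.compress(page_html.encode("utf-8"))  # page stored compressed
    header = HEADER.pack(doc_id, ecode, len(url_bytes), len(page_bytes))
    return header + url_bytes + page_bytes


def unpack_entry(buf, offset=0):
    """Read the entry starting at offset; also return the offset of the next entry."""
    doc_id, ecode, url_len, page_len = HEADER.unpack_from(buf, offset)
    url_start = offset + HEADER.size
    page_start = url_start + url_len
    url = buf[url_start:page_start].decode("utf-8")
    page = zlib.decompress(buf[page_start:page_start + page_len]).decode("utf-8")
    return doc_id, ecode, url, page, page_start + page_len


# Two entries stored back to back, as they would be in one large file.
repository = (
    pack_entry(1, 0, "http://example.com/", "<html>hello</html>")
    + pack_entry(2, 0, "http://example.org/a", "<html>world</html>")
)
first = unpack_entry(repository)
second = unpack_entry(repository, first[4])
```

Storing the lengths up front lets a reader skip from one entry to the next without decompressing any page, which is what makes a full sequential rebuild cheap.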
Repository Index • One index for looking up documents by DocID • Done in ISAM (think of this as a B+ Tree without smart re-balancing) • Index points to repository entries (or to URL entry if not crawled) • One index for mapping URL to DocID • Sorted by checksum of URL • Compute checksum of URL, then perform binary search by checksum • Allows update by merge with another similar file • Why is this done?
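The URL-to-DocID lookup described above can be sketched with a sorted list and binary search. CRC-32 stands in for the checksum, which the slides do not specify, and the function names are hypothetical; checksum collisions are ignored to keep the sketch short.

```python
import bisect
import heapq
import zlib


def checksum(url):
    # CRC-32 is an assumption; the actual checksum function is not specified.
    return zlib.crc32(url.encode("utf-8"))


def build_index(url_to_docid):
    """Return a list of (checksum, DocID) pairs sorted by checksum."""
    return sorted((checksum(url), doc_id) for url, doc_id in url_to_docid.items())


def lookup(index, url):
    """Compute the URL's checksum, then binary-search the sorted index for it."""
    key = checksum(url)
    pos = bisect.bisect_left(index, (key,))
    if pos < len(index) and index[pos][0] == key:
        return index[pos][1]
    return None


def merge_indices(index_a, index_b):
    """Combine two sorted index files in a single sequential pass."""
    return list(heapq.merge(index_a, index_b))


index = build_index({"http://a.example/": 1, "http://b.example/": 2})
merged = merge_indices(index, build_index({"http://c.example/": 3}))
doc_id = lookup(merged, "http://c.example/")
```

One plausible answer to the question above: because both files are sorted by checksum, a batch of newly discovered URLs can be folded in with a sequential merge instead of many random-access inserts, which suits disk-resident data.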
Lexicon • The list of searchable words • (Presumably, today it’s used to suggest alternative words as well) • The “root” of the inverted index • As of 1998, 14 million “words” • Kept in memory (was 256MB) • Two parts: • Hash table of pointers to words and the “barrels” (partitions) they fall into • List of words (null-separated)
Indices – Inverted and “Forward” • Inverted index divided into “barrels” (partitions by range) • Indexed by the lexicon; for each DocID, consists of a Hit List of entries in the document • Two barrels: short (anchor and title); full (all text) • Forward index uses the same barrels • Indexed by DocID, then a list of WordIDs in this barrel and this document, then Hit Lists corresponding to the WordIDs • Original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm
Hit Lists (Not Mafia-Related) • Used in inverted and forward indices • Goal was to minimize the size – the bulk of data is in hit entries • For 1998 version, made it down to 2 bytes per hit (though that’s likely climbed since then); field widths in bits: • Plain: cap: 1, font: 3, position: 12 • Fancy: cap: 1, font: 7 (the 3-bit font field set to the value 7 flags a fancy hit), type: 4, position: 8 • Anchor (special case of fancy): cap: 1, font: 7, type: 4, hash: 4, pos: 4
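The 2-byte layout can be illustrated with a little bit packing. The field widths follow the plain and fancy layouts above, with "font: 7" read as the 3-bit font field holding the reserved value 7; the ordering of the fields within the 16 bits, the clamping of oversized positions, and the function names are assumptions for this sketch.

```python
def pack_plain_hit(cap, font, position):
    """Pack a plain hit into 16 bits: 1 cap bit, 3 font bits, 12 position bits."""
    if not 0 <= font < 7:
        raise ValueError("font value 7 is reserved to flag fancy hits")
    position = min(position, 0xFFF)  # clamp positions that do not fit (assumption)
    return (int(bool(cap)) << 15) | (font << 12) | position


def pack_fancy_hit(cap, hit_type, position):
    """Pack a fancy hit: 1 cap bit, font field = 7, 4 type bits, 8 position bits."""
    position = min(position, 0xFF)
    return (int(bool(cap)) << 15) | (7 << 12) | ((hit_type & 0xF) << 8) | position


def unpack_hit(hit):
    """Decode a 16-bit hit; the font field tells plain and fancy hits apart."""
    cap = hit >> 15
    font = (hit >> 12) & 0x7
    if font == 7:
        return {"kind": "fancy", "cap": cap, "type": (hit >> 8) & 0xF, "position": hit & 0xFF}
    return {"kind": "plain", "cap": cap, "font": font, "position": hit & 0xFFF}


plain = pack_plain_hit(cap=True, font=3, position=100)
fancy = pack_fancy_hit(cap=False, hit_type=2, position=9)
```

Reserving one font value as a flag means no separate bit is spent on distinguishing the two hit kinds, which is how both layouts fit in the same 16 bits.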
Google’s Distributed Crawler • Single URL Server – the coordinator • A queue that farms out URLs to crawler nodes • Implemented in Python! • Crawlers had 300 open connections apiece • Each needs own DNS cache – DNS lookup is major bottleneck, as we have seen • Based on asynchronous I/O • Many caveats in building a “friendly” crawler (remember robot exclusion protocol?)
Theory vs. practice • Expect the unexpected • They accidentally crawled an online game • Huge array of possible errors: Typos in HTML tags, non-ASCII characters, kBs of zeroes in the middle of a tag, HTML tags nested hundreds deep, ... • Social issues • Lots of email and phone calls, since most people had not seen a crawler before: • "Wow, you looked at a lot of pages from my web site. How did you like it?" • "This page is copy-righted and should not be indexed" • ... • Typical of new services deployed "in the wild" • We had similar experiences with our ePOST system and our measurement study of broadband networks