Estimating the Global PageRank of Web Communities. Paper by Jason V. Davis & Inderjit S. Dhillon, Dept. of Computer Sciences, University of Texas at Austin. Presentation given by Scott J. McCallen, Dept. of Computer Science, Kent State University, December 4th, 2006.
Localized Search Engines • What are they? • Focus on a particular community • Examples: www.cs.kent.edu (site specific) or all computer science related websites (topic specific) • Advantages • Better results for search terms with several meanings, since the community narrows the context • Relatively inexpensive to build and use • Use less bandwidth, space and time • Local domains are orders of magnitude smaller than the global domain
Localized Search Engines (cont’d) • Disadvantages • Lack of global information • i.e. only local PageRanks are available • Why is this a problem? • Only pages that are highly regarded within that community will have high PageRanks • What is needed is the global PageRank of the pages in the local domain • Traditionally, this can only be obtained by crawling the entire global domain
Some Global Facts • 2003 study by Lyman on the global domain • 8.9 billion static pages on the internet • Approximately 18.7 kilobytes each • 167 terabytes needed to download and crawl the entire web • Resources on this scale are only available to major corporations • Local Domains • May contain only a couple hundred thousand pages • May already be stored on a local web server (www.cs.kent.edu) • Access to the entire dataset is far less restricted • The advantages of localized search engines become clear
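The storage figure on this slide can be sanity-checked with a quick multiplication. The sketch below assumes decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes), which the slide does not specify; the function name is illustrative only.

```python
# Back-of-the-envelope check of the crawl size quoted on the slide.
PAGES = 8.9e9             # static pages, figure taken from the slide
BYTES_PER_PAGE = 18.7e3   # 18.7 kilobytes, assuming 1 KB = 1000 bytes


def crawl_size_terabytes(pages: float, bytes_per_page: float) -> float:
    """Total download size in decimal terabytes (1 TB = 10**12 bytes)."""
    return pages * bytes_per_page / 1e12


total_tb = crawl_size_terabytes(PAGES, BYTES_PER_PAGE)
print(f"{total_tb:.1f} TB")
```

Under these unit assumptions the product lands within about one terabyte of the quoted 167 terabytes.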
Global (N) vs. Local (n) Each local domain isn’t aware of the rest of the global domain. Some parts overlap, but others don’t. Overlap represents links to other domains. How is it possible to extract global information when only the local domain is available? Excluding overlap from other domains gives a very poor estimate of global rank.
Proposed Solution • Find a good approximation to the global PageRank values without crawling the entire global domain • Find a superdomain of the local domain that approximates the PageRank well • Find this superdomain by crawling as few as n or 2n additional pages, given a local domain of n pages • Essentially, add as few pages to the local domain as possible until we find a very good approximation of the PageRanks in the local domain
PageRank - Description • Defines importance of pages based on the hyperlinks from one page to another (the web graph) • Computes the stationary distribution of a Markov chain created from the web graph • Uses the “random surfer” model to create a “random walk” over the chain
PageRank Matrix • Given the m x m adjacency matrix U for the web graph, define the PageRank matrix as $P_U = \alpha U D_U^{-1} + (1 - \alpha) v e^T$ • $D_U$ is the diagonal matrix such that $U D_U^{-1}$ is column stochastic • 0 ≤ α ≤ 1 • e is the vector of all 1’s • v is the random surfer vector
PageRank Vector • The PageRank vector r represents the PageRank of every node in the web graph • It is defined as the dominant eigenvector of the PageRank matrix • Computed with the power method from a random starting vector • Computation can take as much as O(m^2) time for a dense graph, but in practice is normally O(km), k being the average number of links per page
Algorithm 1 • Computing the PageRank vector based on the adjacency matrix U of the given web graph
Algorithm 1 (Explanation) • Input: adjacency matrix U • Output: PageRank vector r • Method • Choose a random initial value for r(0) • Iterate, applying the random surfer probability α and vector v, until the change between iterations falls below the convergence threshold • Return the last iterate as the dominant eigenvector of the PageRank matrix built from U
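The method above can be sketched as a short power iteration. This is a minimal illustration, not the paper's exact Algorithm 1. It assumes a uniform random surfer vector, a uniform (rather than random) starting vector, and that any probability mass not passed along links (teleportation plus pages without outlinks) is redistributed along the surfer vector. The graph format `{page: [outlinked pages]}` and the function name are hypothetical.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], alpha: float = 0.85,
             tol: float = 1e-10, max_iter: int = 1000) -> Dict[str, float]:
    """Power iteration on a graph given as {page: [outlinked pages]}.

    Every link target must also appear as a key of `links`.
    """
    nodes = sorted(links)
    v = 1.0 / len(nodes)                 # uniform surfer vector (assumption)
    r = dict.fromkeys(nodes, v)          # uniform start; the slide uses a random one
    for _ in range(max_iter):
        nxt = dict.fromkeys(nodes, 0.0)
        for u in nodes:
            out = links[u]
            for w in out:                # follow each outlink with probability alpha
                nxt[w] += alpha * r[u] / len(out)
        lost = 1.0 - sum(nxt.values())   # teleportation plus dangling-page mass
        nxt = {u: nxt[u] + lost * v for u in nodes}
        delta = sum(abs(nxt[u] - r[u]) for u in nodes)
        r = nxt
        if delta < tol:                  # convergence threshold reached
            break
    return r


ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

Storing the graph as adjacency lists keeps each iteration proportional to the number of links, which is the O(km) behaviour mentioned on the previous slide.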
Defining the Problem (G vs. L) • For a local domain L, let G be the entire global domain, with an N x N adjacency matrix • Define G as follows • i.e. we partition G into blocks, one of which is L • Assume that L has already been crawled and Lout is known
Defining the Problem (p* in g) • If we partition G this way, we can denote the actual PageRank vector of L, taken with respect to g (the global PageRank vector), as p* • Note: EL selects only the entries of g that correspond to nodes in L
Defining the Problem (n << N) • We define p as the PageRank vector computed by crawling only the local domain L • Note that p will generally be very different from p* • Crawling more nodes of the global domain shrinks the difference, but crawling all of it is not feasible • Instead, find the supergraph F of L that minimizes the difference between p and p*
Defining the Problem (finding F) • We need to find the F that gives the best approximation of p* • i.e. minimize the following objective (the difference between the actual global PageRank and the estimated PageRank) • F is found with a greedy strategy, using Algorithm 2 • Essentially, start with L, add the nodes in Fout that most reduce the objective, and repeat for a total of T iterations
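The objective can be illustrated with a few lines of code. The sketch assumes two things that the transcript does not spell out: that a PageRank vector is restricted to L by selecting L's entries and rescaling them to sum to one, and that the "difference" is measured as an L1 distance. The names `restrict` and `global_diff` are hypothetical.

```python
from typing import Dict, Iterable


def restrict(vec: Dict[str, float], local: Iterable[str]) -> Dict[str, float]:
    """Keep only the entries for pages in L and rescale them to sum to 1."""
    local = list(local)
    total = sum(vec[u] for u in local)
    return {u: vec[u] / total for u in local}


def global_diff(f: Dict[str, float], p_star: Dict[str, float],
                local: Iterable[str]) -> float:
    """L1 distance between the estimate (f restricted to L) and p* (assumed form)."""
    local = list(local)
    p = restrict(f, local)
    return sum(abs(p[u] - p_star[u]) for u in local)


# Toy numbers: g plays the role of the global PageRank vector, f of an estimate.
g = {"a": 0.4, "b": 0.2, "x": 0.4}
p_star = restrict(g, ["a", "b"])
print(global_diff({"a": 0.5, "b": 0.5}, p_star, ["a", "b"]))
```

In practice p* is unknown, which is why the later slides replace this objective with an influence score that can be computed from crawled data alone.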
Algorithm 2 (Explanation) • Input: L (local domain), Lout (outlinks from L), T (number of iterations), k (pages to crawl per iteration) • Output: p (an improved estimated PageRank vector) • Method • First set F (supergraph) and Fout equal to L and Lout • Compute the PageRank vector of F • While T has not been exceeded • Select k new nodes to crawl based on F, Fout, f • Expand F to include those new nodes and modify Fout • Compute the new PageRank vector for F • Select the elements from f that correspond to L and return p
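The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 2. Crawling is simulated by looking pages up in an in-memory dictionary `web`, the PageRank helper uses a uniform surfer vector and a fixed iteration count, and the node selection step is passed in as a callable so that any policy can be plugged in. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set

Graph = Dict[str, List[str]]
Selector = Callable[[Set[str], Dict[str, int], Dict[str, float], int], Sequence[str]]


def pagerank(links: Graph, alpha: float = 0.85, iters: int = 200) -> Dict[str, float]:
    """Power iteration with a uniform surfer vector and a fixed iteration count."""
    nodes = sorted(links)
    v = 1.0 / len(nodes)
    r = dict.fromkeys(nodes, v)
    for _ in range(iters):
        nxt = dict.fromkeys(nodes, 0.0)
        for u in nodes:
            for w in links[u]:
                nxt[w] += alpha * r[u] / len(links[u])
        lost = 1.0 - sum(nxt.values())   # teleportation plus dangling-page mass
        r = {u: nxt[u] + lost * v for u in nodes}
    return r


def induced(web: Graph, pages: Set[str]) -> Graph:
    """Subgraph F: crawled pages, keeping only links that stay inside F."""
    return {u: [w for w in web[u] if w in pages] for u in pages}


def frontier(web: Graph, pages: Set[str]) -> Dict[str, int]:
    """Fout: uncrawled pages linked from F, with their outlink counts from F."""
    counts: Dict[str, int] = {}
    for u in pages:
        for w in web[u]:
            if w not in pages:
                counts[w] = counts.get(w, 0) + 1
    return counts


def expand_and_rank(web: Graph, local: Sequence[str], T: int, k: int,
                    select: Selector) -> Dict[str, float]:
    """Grow F from L for T rounds of k pages, then return f restricted to L."""
    F = set(local)
    f = pagerank(induced(web, F))
    for _ in range(T):
        f_out = frontier(web, F)
        if not f_out:                    # nothing left to crawl
            break
        F |= set(select(F, f_out, f, k)) # "crawl" the selected pages
        f = pagerank(induced(web, F))
    total = sum(f[u] for u in local)
    return {u: f[u] / total for u in local}
```

A call such as `expand_and_rank(web, ["a", "b"], T=2, k=1, select=lambda F, f_out, f, k: sorted(f_out)[:k])` uses a placeholder policy that simply takes the first k frontier pages alphabetically; the slides that follow are about choosing a better `select`.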
Global (N) vs. Local (n) (Again) We know how to create the PageRank vector using the power method. Using it on only the local domain gives very inaccurate estimates of the PageRank. How can we select nodes from other domains (i.e. expanding the current domain) to improve accuracy? How far can selecting more nodes be allowed to proceed without crawling the entire global domain?
Selecting Nodes • Select nodes to expand L to F • Selected nodes must bring us closer to the actual PageRank vector • Some nodes will greatly influence the current PageRank • Only want to select at most O(n) more pages than those already in L
Finding the Best Nodes • For a page j in the global domain and on the frontier of F (Fout), the addition of page j to F is as follows • uj is the vector of outlinks from F to j • s is the estimated vector of inlinks from j into F (j has not yet been crawled) • s is estimated from the expected inlink counts of the pages already crawled, as follows
Finding the Best Nodes (cont’d) • We defined the PageRank of F to be f • The PageRank of Fj is fj+ • xj is the PageRank of node j (added to the current PageRank vector) • Directly optimizing the objective requires us to know the global PageRank p* • How can we minimize the objective without knowing p*?
Node Influence • Find the nodes in Fout that will have the greatest influence on the local domain L • Done by attaching an influence score to each node j • The score sums, over all pages in L, the change that adding page j makes to the PageRank vector • The influence score correlates strongly with the minimization of the GlobalDiff(fj) function (compared with a baseline such as the total outlink count from F to node j)
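A brute-force version of the influence score makes the definition concrete. The sketch below recomputes PageRank with and without page j and sums the absolute changes over the pages in L. It assumes j's own outlinks are already known (read from the simulated `web` dictionary), whereas the slides estimate them because j has not been crawled yet. Function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set

Graph = Dict[str, List[str]]


def pagerank(links: Graph, alpha: float = 0.85, iters: int = 200) -> Dict[str, float]:
    """Power iteration with a uniform surfer vector and a fixed iteration count."""
    nodes = sorted(links)
    v = 1.0 / len(nodes)
    r = dict.fromkeys(nodes, v)
    for _ in range(iters):
        nxt = dict.fromkeys(nodes, 0.0)
        for u in nodes:
            for w in links[u]:
                nxt[w] += alpha * r[u] / len(links[u])
        lost = 1.0 - sum(nxt.values())   # teleportation plus dangling-page mass
        r = {u: nxt[u] + lost * v for u in nodes}
    return r


def induced(web: Graph, pages: Set[str]) -> Graph:
    """Subgraph on `pages`, dropping links that leave it."""
    return {u: [w for w in web[u] if w in pages] for u in pages}


def influence(web: Graph, F: Set[str], local: Iterable[str], j: str) -> float:
    """Sum over pages in L of |PageRank with j added - PageRank without j|."""
    f = pagerank(induced(web, F))
    f_j = pagerank(induced(web, F | {j}))
    return sum(abs(f_j[k] - f[k]) for k in local)


web = {"a": ["b", "x", "y"], "b": ["a"], "x": ["a"], "y": []}
for candidate in ("x", "y"):
    print(candidate, round(influence(web, {"a", "b"}, ["a", "b"], candidate), 4))
```

Each score costs a full PageRank computation here, which is exactly the expense that motivates the approximation discussed on the following slides.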
Node Influence Results • Node Influence vs. Outlink Count on a crawl of conservative web sites
Finding the Influence • Influence must be calculated for each node j in the frontier of F that is considered • We are considering O(n) pages and each calculation is O(n), so we are left with an O(n^2) computation • To reduce this complexity, approximating the influence of j may be acceptable, but how? • The power method for computing PageRank may lead us to a good approximation • However, using it (Algorithm 1) efficiently requires a good starting vector
PageRank Vector (again) • The PageRank algorithm will converge at a rate equal to the random surfer probability α • With a starting vector x(0), the complexity of the algorithm is • That is, the more accurate the vector becomes, the more complex the process is • Saving Grace: Find a very good starting vector for x(0), in which case we only need to perform one iteration of Algorithm 1
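The value of a good starting vector can be illustrated with a rough iteration count. The sketch assumes the error shrinks by a factor of α per iteration, as the first bullet states, so that k iterations reduce an initial error e0 to about α^k · e0. The function name and the example error values are illustrative, not taken from the paper.

```python
import math


def iterations_needed(alpha: float, initial_error: float, tol: float) -> int:
    """Smallest k with alpha**k * initial_error <= tol (geometric convergence assumed)."""
    if initial_error <= tol:
        return 0
    return math.ceil(math.log(tol / initial_error) / math.log(alpha))


# A poor start (error 1.0) versus a start that is already close (error 1e-7).
print(iterations_needed(0.85, 1.0, 1e-8))
print(iterations_needed(0.85, 1e-7, 1e-8))
```

The closer x(0) is to the answer, the fewer iterations remain, which is why a sufficiently good starting vector can reduce the work to a single pass.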
Finding the Best x(0) • Partition the PageRank matrix for Fj
Finding the Best x(0) • Simple approach • Use as the starting vector (the current PageRank vector) • Perform one PageRank iteration • Remove the element that corresponds to added node • Issues • The estimate of fj+ will have an error of at least 2αxj • So if the PageRank of j is very high, very bad estimate
Stochastic Complement • In an expanded form, the PageRank fj+ is • Which can be solved as • Observation: • This is the stochastic complement of PageRank matrix of Fj
Stochastic Complement (Observations) • The stochastic complement of an irreducible matrix is unique • The stochastic complement is also irreducible and therefore has unique stationary distribution • With regards to the matrix S • The subdominant eigenvalue is at most which means that for large l, it is very close to α
The New PageRank Approximation • Estimate the vector fj of length l by performing one PageRank iteration over S, starting at f • Advantages • Starting and ending with a vector of length l • Creates a lower bound for error of zero • Example: Considering adding a node k to F that has no influence over the PageRank of F • Using the stochastic complement yields the exact solution
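The stochastic complement idea can be tried on a tiny example. The sketch below builds the PageRank matrix of Fj with the added node j as the last row and column, forms S = P11 + P12 (1 - P22)^(-1) P21 (the standard stochastic complement for this two-block partition, where P22 is a single number), and applies one iteration of S to the current vector f. The uniform surfer vector, the helper names, and the three-page graph are assumptions for illustration.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def pagerank_matrix(links: Dict[str, List[str]], order: Sequence[str],
                    alpha: float = 0.85) -> Matrix:
    """Column-stochastic matrix; P[i][k] is the probability of moving order[k] -> order[i]."""
    m = len(order)
    index = {u: i for i, u in enumerate(order)}
    P = [[(1.0 - alpha) / m] * m for _ in range(m)]
    for u in order:
        k = index[u]
        out = [w for w in links[u] if w in index]
        if not out:                       # dangling page: jump uniformly
            for i in range(m):
                P[i][k] = 1.0 / m
            continue
        for w in out:
            P[index[w]][k] += alpha / len(out)
    return P


def stochastic_complement(P: Matrix) -> Matrix:
    """S = P11 + P12 (1 - P22)^(-1) P21, eliminating the last row and column."""
    l = len(P) - 1
    scale = 1.0 / (1.0 - P[l][l])
    return [[P[i][k] + P[i][l] * scale * P[l][k] for k in range(l)] for i in range(l)]


def matvec(M: Matrix, x: Sequence[float]) -> List[float]:
    return [sum(m_ik * x_k for m_ik, x_k in zip(row, x)) for row in M]


def stationary(M: Matrix, iters: int = 500) -> List[float]:
    """Stationary distribution by repeated multiplication, renormalized each step."""
    x = [1.0 / len(M)] * len(M)
    for _ in range(iters):
        x = matvec(M, x)
        total = sum(x)
        x = [xi / total for xi in x]
    return x


links = {"a": ["b", "j"], "b": ["a"], "j": ["a"]}
f = stationary(pagerank_matrix(links, ["a", "b"]))    # current PageRank of F
S = stochastic_complement(pagerank_matrix(links, ["a", "b", "j"]))
estimate = matvec(S, f)                               # one iteration over S, starting at f
print([round(x, 4) for x in estimate])
```

Because the stationary distribution of S is the rescaled F-part of the stationary distribution of the full matrix, iterating on S works with vectors of length l throughout and never has to drop an entry afterwards.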
The Details • Begin by expanding the difference between two PageRank vectors • with
The Details • Substitute PF into the equation • Summarizing into vectors
Algorithm 3 (Explanation) • Input: F (the current local subgraph), Fout (outlinks of F), f (current PageRank of F), k (number of pages to return) • Output: k new pages to crawl • Method • Compute the outlink sums for each page in F • Compute a scalar for every known global page j (how many pages link to j) • Compute y and z as formulated • For each of the pages in Fout • Compute x as formulated • Compute the score of each page using x, y and z • Return the k pages with the highest scores
PageRank Leaks and Flows • The change in PageRank caused by adding a node j to F can be described in terms of leaks and flows • A flow is an increase in local PageRanks • Represented by • Scalar is the total amount j has to distribute • Vector determines how it will be distributed • A leak is a decrease in local PageRanks • Leaks come from the non-positive vectors x and y • x is proportional to the weighted sum of sibling PageRanks • y is an artifact of the random surfer vector
Leaks and Flows [Figure: diagram of node j with the labels Leaks, Random Surfer, Siblings, Local Pages, Flows]
Experiments • Methodology • Resources are limited, so the global graph is approximated • Baseline Algorithms • Random • Nodes chosen uniformly at random from the known global nodes • Outlink Count • Nodes chosen have the highest outlink counts from the current local domain
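The two baselines are simple enough to sketch directly. The code assumes the frontier is available as a dictionary mapping each known but uncrawled page to its outlink count from the current domain; that representation, the tie-breaking by page name, and the function names are illustrative choices rather than details from the paper.

```python
import random
from typing import Dict, List


def select_random(frontier_counts: Dict[str, int], k: int,
                  rng: random.Random) -> List[str]:
    """Random baseline: k pages drawn uniformly without replacement."""
    candidates = sorted(frontier_counts)
    return rng.sample(candidates, min(k, len(candidates)))


def select_by_outlink_count(frontier_counts: Dict[str, int], k: int) -> List[str]:
    """Outlink-count baseline: the k pages with the most links from the current domain."""
    ranked = sorted(frontier_counts, key=lambda j: (-frontier_counts[j], j))
    return ranked[:k]


counts = {"x": 3, "y": 1, "z": 2}
print(select_by_outlink_count(counts, 2))
print(select_random(counts, 2, random.Random(0)))
```

Passing the random generator in explicitly keeps experiments repeatable when several trials are compared.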
Results (Data Sets) • Data Set • Restricted to http pages that do not contain the characters ?, *, @, or = • EDU Data Set • Crawl of the top 100 computer science universities • Yielded 4.7 million pages, 22.9 million links • Politics Data Set • Crawl of the pages under politics in dmoz directory • Yielded 4.4 million pages, 17.2 million links
Results (EDU Data Set) • The norms measure difference, Kendall’s tau measures similarity
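The two kinds of measure can be sketched in a few lines. The code computes an L1 distance between two score vectors and a Kendall rank correlation between the orderings they induce. It uses the simple tau-a variant, which counts tied pairs as neither concordant nor discordant; which variant and which norms the experiments used is not stated in this transcript, so treat both choices as assumptions.

```python
from itertools import combinations
from typing import Dict


def l1_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Sum of absolute score differences; 0 means identical vectors."""
    return sum(abs(p[u] - q[u]) for u in p)


def kendall_tau(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs, in [-1, 1]."""
    concordant = discordant = 0
    for u, w in combinations(sorted(p), 2):
        sign = (p[u] - p[w]) * (q[u] - q[w])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(p) * (len(p) - 1) / 2
    return (concordant - discordant) / pairs


exact = {"a": 0.5, "b": 0.3, "c": 0.2}
estimate = {"a": 0.45, "b": 0.2, "c": 0.35}
print(l1_distance(exact, estimate), kendall_tau(exact, estimate))
```

The pairwise loop is quadratic in the number of pages, which is fine for a sketch but would need a faster algorithm on data sets with millions of pages.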
Result Summary • Stochastic Complement outperformed other methods in nearly every trial • The results are significantly better than the random walk approach with minimal computation
Conclusion • Accurate estimates of the PageRank can be obtained by using local results • Expand the local graph based on influence • Crawl at most O(n) more pages • Use stochastic complement to accurately estimate the new PageRank vector • Not computationally or storage intensive
Estimating the Global PageRank of Web Communities The End Thank You