Adaptive On-Line Page Importance Computation
Serge, Mihai, Gregory
Presented by Liang Tian
7/13/2010

Overview
• What is OPIC?
• Why should we care?
• Advantages vs. off-line algorithms
• How does it work?
• Scenario of OPIC
• Challenge
• Mathematical model
• Algorithm
• Pros and Cons

What is OPIC?
• OPIC stands for On-line Page Importance Computation.

Why should we care?
• OPIC computes page importance with fewer resources than earlier off-line algorithms.

Advantages vs. off-line algorithms
• Works on-line on a large, dynamic graph
• Uses far fewer resources; e.g., it does not require storing the link matrix
• Can focus crawling on the most interesting pages
• Is fully integrated into the crawling process

How does it work?
• It is on-line in that it continuously refines its estimate of page importance while the web graph is visited.

Scenario of OPIC
• Initially, distribute some cash to each page.
• When a page is crawled, it distributes its current cash equally among all the pages it points to.
• Record the credit history of each page: when a page is crawled, its current cash is sent to its children, and the amount it held is added to its credit history.
• Importance of a page = (credit history + current cash) / (total history of all pages + total current cash of all pages)

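The cash bookkeeping in this scenario can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the three-page graph and the names `links`, `cash`, `history`, `crawl_page`, and `importance` are assumptions made for the example, and every page is assumed to have at least one outgoing link.

```python
# Hypothetical three-page graph: page -> pages it points to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

# Initially every page receives the same cash; histories start empty.
cash = {page: 1.0 / len(links) for page in links}
history = {page: 0.0 for page in links}


def crawl_page(page):
    """Record the page's cash in its history, then hand it to its children."""
    amount = cash[page]
    history[page] += amount
    share = amount / len(links[page])
    cash[page] = 0.0  # reset first so that a self-link would still be credited
    for child in links[page]:
        cash[child] += share


def importance(page):
    """(credit history + current cash) / (total history + total cash)."""
    total = sum(history.values()) + sum(cash.values())
    return (history[page] + cash[page]) / total


crawl_page("a")
```

Crawling a page moves cash around but never creates or destroys it, so the total cash stays constant while the total history grows with every crawl.
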
Challenge
• How do we find the values of current cash and history?
• Intuitively, cash flows from parent nodes to child nodes, so the values are defined inductively.

Mathematical model
• Let G be any directed graph with n vertices. Fix an arbitrary ordering of the vertices. G can be represented as a matrix L[i,j] such that L[i,j] ≥ 0, and L[i,j] > 0 iff there is an edge from i to j.
• The basic idea is to define the importance of a page in an inductive way and then compute it using a fixpoint.
• If the graph contains n nodes, the importance is represented as a vector $\bar{x}$ in an n-dimensional space.

Mathematical model (cont.)
• Importance is defined inductively by the equation $\bar{x}_{k+1} = L\,\bar{x}_k$.
• Given a linear transformation A, a non-zero vector x is an eigenvector of A if it satisfies the eigenvalue equation Ax = λx for some scalar λ.

Find a fixpoint
• By definition, such a fixpoint is an eigenvector of L with a real positive eigenvalue: $L\bar{x} = \lambda\bar{x}$

Problems
• Multiple solutions
• Iteration may not converge

Solution
• Google defines L[i,j] = 1/d[i] iff there is an edge from i to j, where d[i] is the out-degree of i.
• L'[i,j] = L[i,j] + ε, where ε is a small positive real.
• This gives a new graph G', which is G plus a small edge for every pair i, j.
• Convergence of the iteration is guaranteed because these small edges make G' strongly connected and aperiodic.

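The effect of the ε edges can be seen on a tiny example. The sketch below is an illustration under stated assumptions, not the method used by any particular search engine: the function names `power_iteration` and `add_epsilon` are hypothetical, the iteration uses the row-vector form (importance flows along each edge from i to j), and the vector is renormalized to sum to 1 after every step.

```python
def power_iteration(matrix, steps=200):
    """Iterate x <- x * matrix (row-vector form), renormalizing to sum 1."""
    n = len(matrix)
    x = [1.0] + [0.0] * (n - 1)  # start with all weight on page 0
    for _ in range(steps):
        nxt = [sum(x[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        total = sum(nxt)
        x = [v / total for v in nxt]
    return x


def add_epsilon(matrix, eps):
    """L'[i][j] = L[i][j] + eps: a small edge between every pair of pages."""
    return [[v + eps for v in row] for row in matrix]


# Two pages linking only to each other: strongly connected but periodic.
L = [[0.0, 1.0],
     [1.0, 0.0]]

after_even = power_iteration(L, steps=200)  # weight is back on page 0
after_odd = power_iteration(L, steps=201)   # weight has moved to page 1
smoothed = power_iteration(add_epsilon(L, 0.1), steps=200)
```

On the original matrix the weight bounces between the two pages forever, so the iteration never settles. With the ε edges added, the same iteration settles on equal importance for both pages.
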
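Algorithm for static graphs
• At each step, the estimate of the importance of any page k is (H[k] + C[k]) / (G + 1).
• Here H[k] is the history of page k, C[k] is its current cash, and G is the sum of all histories (not the graph); the total cash is normalized to 1.
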
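A minimal Python sketch of this loop follows. It is an illustration rather than the authors' implementation: the function names `opic` and `estimate`, the example graph, and the round-robin crawl order are assumptions, and every page is assumed to have at least one outgoing link.

```python
from itertools import cycle


def opic(links, steps):
    """Run OPIC on a static graph, crawling pages in round-robin order.

    links[i] is the list of pages that page i points to.
    Returns (H, C, G): histories, current cash, and total history.
    """
    n = len(links)
    C = [1.0 / n] * n        # total cash is normalized to 1
    H = [0.0] * n
    G = 0.0
    order = cycle(range(n))  # assumed selection policy for this sketch
    for _ in range(steps):
        i = next(order)
        H[i] += C[i]         # record the cash in the page's history
        G += C[i]
        share = C[i] / len(links[i])
        C[i] = 0.0
        for j in links[i]:   # hand the cash to the children
            C[j] += share
    return H, C, G


def estimate(H, C, G, k):
    """Importance estimate of page k: (H[k] + C[k]) / (G + 1)."""
    return (H[k] + C[k]) / (G + 1.0)


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = [[1, 2], [2], [0]]
H, C, G = opic(links, steps=3000)
scores = [estimate(H, C, G, k) for k in range(len(links))]
```

The estimates always sum to 1, because the histories sum to G and the cash sums to 1. On this small graph they approach roughly 0.4, 0.2, and 0.4, matching the share of visits a random surfer would pay to each page.
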
Crawling strategies
There are two main strategies:
• Random: choose the next page to crawl at random, with equal probability for every page.
• Greedy: crawl next the page with the highest cash. This is a greedy way to decrease the value of the error factor.
The choice of strategy affects convergence speed.

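The two strategies differ only in how the next page is picked. The sketch below shows one way to write the two choices in Python; the function names `pick_random` and `pick_greedy` and the sample cash vector are hypothetical.

```python
import random


def pick_random(cash, rng=random):
    """Random strategy: every page is equally likely to be crawled next."""
    return rng.randrange(len(cash))


def pick_greedy(cash):
    """Greedy strategy: crawl the page currently holding the most cash."""
    return max(range(len(cash)), key=lambda i: cash[i])


cash = [0.1, 0.6, 0.3]  # hypothetical cash vector for three pages
greedy_choice = pick_greedy(cash)
random_choice = pick_random(cash, random.Random(0))
```

Either function can supply the index of the next page to crawl in an OPIC loop. Greedy needs the current cash of every page, while Random ignores it.
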
The Adaptive OPIC algorithm (for changing graphs)
• Based on a time window
• Window policies: Fixed Window, Variable Window, Interpolation
• Two main dimensions:
  • The page selection strategy that is used (e.g., Greedy or Random)
  • The window policy that is considered (e.g., Fixed Window or Interpolation)

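The slide does not give the update rule for the Interpolation policy, so the formula below is an assumption: one plausible linear interpolation that keeps a single history value per page instead of the full list of visits inside the window. The function name `interpolate_history` and its parameters are hypothetical.

```python
def interpolate_history(old_history, cash, elapsed, window):
    """Estimate the cash a page accumulated over the last `window` time units.

    old_history: estimate stored at the previous visit
    cash:        cash collected since the previous visit
    elapsed:     time since the previous visit
    window:      length of the time window
    """
    if elapsed < window:
        # Keep the share of the old estimate that still falls in the window.
        return old_history * (window - elapsed) / window + cash
    # The whole window lies inside the latest interval: scale the cash down.
    return cash * window / elapsed


recent = interpolate_history(1.0, 0.2, elapsed=5, window=10)
stale = interpolate_history(1.0, 0.4, elapsed=20, window=10)
```

A page visited recently keeps part of its old history, while a page not visited for longer than the window is judged only on the cash it collected since the last visit.
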
Pros
• It may start even when a (large) part of the matrix is still unknown.
• It is integrated in the crawling process.
• It works on-line even while the graph is being updated.
• It requires less storage than standard algorithms.
• It requires less CPU, memory, and disk access than standard algorithms.

Cons
• It is strictly tailored to the computational cost model of crawling the Web.
• It converges more slowly than other algorithms for the same number of pages read.

Q&A