Adaptive On-Line Page Importance Computation Serge, Mihai, Gregory

Presented by Liang Tian, 7/13/2010.

Presentation Transcript


  1. Adaptive On-Line Page Importance Computation
     Serge, Mihai, Gregory
     Presented by Liang Tian, 7/13/2010

  2. Overview
     • What is OPIC?
     • Why should we care?
     • Advantages vs off-line algorithms
     • How does it work?
     • Scenario of OPIC
     • Challenge
     • Mathematical model
     • Algorithm
     • Pros and Cons

  3. What is OPIC?
     • OPIC stands for On-line Page Importance Computation.
     Why should we care?
     • OPIC provides a more effective way of computing page importance than older off-line algorithms.

  4. Advantages vs off-line algorithms
     • Works on-line on large, dynamic graphs
     • Uses far fewer resources; e.g., it does not require storing the link matrix
     • Can focus crawling on the most interesting pages
     • Is fully integrated into the crawling process

  5. How does it work?
     • It is on-line in that it continuously refines its estimate of page importance while the web graph is visited.

  6. Scenario of OPIC
     • Initially, distribute some cash to each page.
     • Each page, when it is crawled, distributes its current cash equally to all pages it points to.
     • Record the credit history of each page: when a page is crawled, its current cash is sent to its children, and the amount is added to its credit history, which records all the cash the page has ever held.
     • The importance of a page = (credit history + current cash) / (total history amount + total current cash)
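
As a worked illustration of the scenario above, consider a hypothetical three-page graph (an assumption chosen for this example, not taken from the slides) with links a→b, a→c, b→c and c→a, where each page starts with cash 1/3. Writing H for credit history and C for current cash, in the order (a, b, c), crawling page a first would give:

```latex
\begin{align*}
H &= \left(\tfrac{1}{3},\ 0,\ 0\right)
  && \text{a's cash is added to its credit history} \\
C &= \left(0,\ \tfrac{1}{3}+\tfrac{1}{6},\ \tfrac{1}{3}+\tfrac{1}{6}\right)
   = \left(0,\ \tfrac{1}{2},\ \tfrac{1}{2}\right)
  && \text{a's cash is split equally between b and c} \\
\text{importance}(a) &= \frac{\tfrac{1}{3} + 0}{\tfrac{1}{3} + 1} = \tfrac{1}{4} \\
\text{importance}(b) = \text{importance}(c) &= \frac{0 + \tfrac{1}{2}}{\tfrac{1}{3} + 1} = \tfrac{3}{8}
\end{align*}
```

The denominator is the total history (1/3) plus the total current cash (1), and the three estimates sum to 1.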

  7. Challenge
     How do we find the values of current cash and history? Intuitively, cash flows from parent nodes to child nodes, in an inductive way.

  8. Mathematical model
     • Let G be any directed graph with n vertices. Fix an arbitrary ordering of the vertices. G can be represented as a matrix L[i,j] such that L[i,j] >= 0, and L[i,j] > 0 iff there is an edge from i to j.
     • The basic idea is to define the importance of a page in an inductive way and then compute it using a fixpoint.
     • If the graph contains n nodes, the importance is represented as a vector x̄ in an n-dimensional space.

  9. Mathematical model (cont.)
     Importance is defined inductively by the equation x̄_{k+1} = L x̄_k.
     Given a linear transformation A, a non-zero vector x is defined to be an eigenvector of the transformation if it satisfies the eigenvalue equation Ax = λx for some scalar λ.

  10. Find a fixpoint
     • By definition, such a fixpoint is an eigenvector of L with a real positive eigenvalue: L x̄ = λ x̄
     • Problems
       • Multiple solutions
       • Iteration may not converge
     • Solution
       • Google defines L[i,j] = 1/d[i] if there is an edge from i to j (and 0 otherwise), where d[i] is the out-degree of i.
       • L'[i,j] = L[i,j] + ϵ, where ϵ is a small real.
       • This gives a new graph G', which is G plus a small edge for every pair i, j.
       • Convergence of the iteration is guaranteed because these small edges make G' strongly connected and aperiodic.
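
The off-line fixpoint computation on L' can be sketched as a plain power iteration. This is a minimal illustration, not the presenters' code: the function name `importance_by_iteration`, the adjacency-list input `out_links`, and the defaults for `eps` and `iterations` are all assumptions made for the example. It also assumes that importance flows along edges from i to j, and it renormalises the vector at every step so that the entries sum to 1.

```python
def importance_by_iteration(out_links, eps=0.01, iterations=100):
    """Power iteration on L'[i][j] = L[i][j] + eps (hypothetical helper).

    out_links[i] lists the pages that page i points to, so
    L[i][j] = 1/d[i] when j is in out_links[i] and 0 otherwise.
    """
    n = len(out_links)
    x = [1.0 / n] * n  # start from a uniform importance vector
    for _ in range(iterations):
        nxt = [0.0] * n
        for i, children in enumerate(out_links):
            # The small eps edge from i to every page j of G'.
            for j in range(n):
                nxt[j] += eps * x[i]
            # The real edges of G, each with weight 1/d[i].
            for j in children:
                nxt[j] += x[i] / len(children)
        total = sum(nxt)
        x = [v / total for v in nxt]  # keep the estimates summing to 1
    return x


if __name__ == "__main__":
    # Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
    print(importance_by_iteration([[1, 2], [2], [0]]))
```

Note that this approach needs the whole link structure in memory for every pass, which is the cost the on-line algorithm is meant to avoid.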

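  11. Algorithm for static graphs
     At each step, an estimate of any page k's importance is (H[k]+C[k])/(G+1), where H[k] is the credit history of page k, C[k] is its current cash, and G is the total of all credit histories. The 1 in the denominator is the total current cash, which always sums to 1.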
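
The estimate above can be sketched in code as follows. This is a hedged illustration under stated assumptions rather than the presenters' implementation: the function name `opic`, the adjacency-list input `out_links`, and the round-robin page order are hypothetical, and the graph is assumed to be strongly connected, so every page has at least one outgoing link. Exact fractions are used only to make the arithmetic easy to check.

```python
from fractions import Fraction


def opic(out_links, steps):
    """Run `steps` crawl steps and return the importance estimates.

    out_links[i] lists the pages that page i points to (assumed non-empty).
    """
    n = len(out_links)
    cash = [Fraction(1, n)] * n      # C: total cash always sums to 1
    history = [Fraction(0)] * n      # H: credit history per page
    total_history = Fraction(0)      # G: sum of all credit histories
    for step in range(steps):
        i = step % n                 # assumed round-robin page selection
        history[i] += cash[i]        # record the cash in i's history
        for j in out_links[i]:       # split i's cash equally among children
            cash[j] += cash[i] / len(out_links[i])
        total_history += cash[i]
        cash[i] = Fraction(0)        # i has given all its cash away
    return [(history[k] + cash[k]) / (total_history + 1) for k in range(n)]


if __name__ == "__main__":
    # Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
    graph = [[1, 2], [2], [0]]
    print(opic(graph, 1))    # estimates after a single crawl step
    print([float(v) for v in opic(graph, 300)])
```

Only the two vectors C and H and the scalar G are stored; the link matrix itself is never materialised, since each page's out-links are only needed at the moment the page is crawled.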

  12. Crawling strategies
     There are two main strategies:
     • Random: the next page to crawl is chosen at random, with equal probability for every page.
     • Greedy: the page with the highest cash is read next. This is a greedy way to decrease the value of the error factor.
     The choice of strategy affects convergence speed.
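
The two selection strategies can be sketched as small functions over the current cash vector. The names `choose_random` and `choose_greedy` and the list-based `cash` argument are assumptions made for this illustration.

```python
import random


def choose_random(cash, rng=random):
    """Random strategy: every page is equally likely to be crawled next."""
    return rng.randrange(len(cash))


def choose_greedy(cash):
    """Greedy strategy: crawl the page that currently holds the most cash.

    Ties are broken in favour of the lowest page index.
    """
    return max(range(len(cash)), key=lambda k: cash[k])


if __name__ == "__main__":
    cash = [0.1, 0.6, 0.3]  # hypothetical current cash per page
    print(choose_greedy(cash))
    print(choose_random(cash, random.Random(0)))
```

Either function could replace a fixed page order in a crawl loop, provided every page is still selected again and again.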

  13. The Adaptive OPIC algorithm (for changing graphs)
     • Based on a time window
       • Fixed Window
       • Variable Window
       • Interpolation
     • Two main dimensions
       • The page selection strategy that is used (e.g., Greedy or Random)
       • The window policy that is considered (e.g., Fixed Window or Interpolation)
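
One way to picture a fixed-window policy is to keep only the cash recorded during the last `window` time units and to forget anything older. The sketch below is an assumption-heavy illustration, not the algorithm from the slides: the class `FixedWindowHistory`, its method names, and the use of caller-supplied timestamps are all hypothetical, and the estimate uses only the windowed history, ignoring the current-cash term.

```python
from collections import deque


class FixedWindowHistory:
    """Credit history restricted to a sliding time window (hypothetical)."""

    def __init__(self, window):
        self.window = window
        self.events = deque()  # (time, page, cash), oldest first
        self.per_page = {}     # page -> cash recorded inside the window
        self.total = 0.0       # total cash recorded inside the window

    def record(self, now, page, cash):
        """Record cash collected by `page` at time `now` (non-decreasing)."""
        self.events.append((now, page, cash))
        self.per_page[page] = self.per_page.get(page, 0.0) + cash
        self.total += cash
        self._expire(now)

    def _expire(self, now):
        # Drop every event that has fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, page, cash = self.events.popleft()
            self.per_page[page] -= cash
            self.total -= cash

    def importance(self, page):
        """Share of the windowed history that belongs to `page`."""
        if not self.total:
            return 0.0
        return self.per_page.get(page, 0.0) / self.total


if __name__ == "__main__":
    h = FixedWindowHistory(window=10)
    h.record(0, "a", 0.5)
    h.record(5, "b", 0.25)
    h.record(12, "a", 0.25)  # the event at time 0 is now outside the window
    print(h.importance("a"), h.importance("b"))
```

Because old measurements are discarded, the estimate can follow a graph whose link structure changes over time, at the cost of storing every measurement inside the window.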

  15. Pros
     • It may start even when a (large) part of the matrix is still unknown
     • It is integrated in the crawling process
     • It works on-line even while the graph is being updated
     • It requires less storage than standard algorithms
     • It requires less CPU, memory and disk access than standard algorithms

  16. Cons
     • It is strictly tailored to the computational cost model of crawling the Web
     • It converges more slowly than other algorithms after reading the same pages

  18. Q&A
