Random Walking on the World Wide Web: Project Presentation. Team members: Levin Boris, Laserson Itamar. Instructor: Gurevich Maxim
Introduction • Statistics about web pages are very important • Use a random sample of web pages to approximate: • search engine coverage • domain name distribution (.com, .org, .edu) • average number of links in a page • average page length • The goal: develop a cheap method to sample uniformly from the Web
Random Walker • A random walk on a graph provides a sample of its nodes • If the graph is undirected and regular, the sample is uniform • Problem: the Web is neither undirected nor regular • Solution: incrementally create an undirected regular graph with the same nodes as the Web • Perform the walk on this graph
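The idea on this slide can be sketched in a few lines. This is a minimal illustration (not the project's code), assuming a hypothetical toy link graph: symmetrize the directed links into undirected edges, then pad every node with self loops until its degree reaches degmax.

```python
# Sketch: regularize a small directed graph by symmetrizing its edges
# and padding each node with self loops up to deg_max.
# The graph below is a hypothetical toy web, not real data.
directed = {
    "a": ["b", "c", "d"],  # a links to b, c, d
    "b": ["a"],            # b links back to a
    "c": [],
    "d": [],
}

# Symmetrize: treat every out-link also as an in-link (undirected edge).
undirected = {v: set() for v in directed}
for v, outs in directed.items():
    for u in outs:
        undirected[v].add(u)
        undirected[u].add(v)

deg = {v: len(nbrs) for v, nbrs in undirected.items()}
deg_max = max(deg.values())

# Self-loop padding so every node's total degree equals deg_max.
self_loops = {v: deg_max - d for v, d in deg.items()}
```

The padded graph is regular by construction, so a simple random walk on it has a uniform stationary distribution.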
WebWalker • Follow a random out-link or a random in-link at each step • Use weighted self loops to even out pages' degrees • Self-loop weight: w(v) = degmax - deg(v) [Figure: example link graph including amazon.com and netscape.com, with each node's degree padded to degmax by weighted self loops]
WebWalker • A random walk on a connected undirected regular graph converges to a uniform stationary distribution. • Pseudo code: Webwalker(v): - Spend expected degmax/deg(v) steps at v - Pick a random link incident to v (either v→u or u→v) - Webwalker(u)
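The pseudocode above can be turned into a short runnable sketch. This is an illustration on a hypothetical toy graph, not the project's implementation: staying at v with probability 1 - deg(v)/degmax realizes the weighted self loop, so the expected time spent at v per visit is degmax/deg(v).

```python
import random
from collections import Counter

# Hypothetical toy undirected graph standing in for the Web link graph.
graph = {
    "amazon.com": ["netscape.com", "example.org"],
    "netscape.com": ["amazon.com"],
    "example.org": ["amazon.com"],
}
deg_max = max(len(nbrs) for nbrs in graph.values())

def webwalker(start, steps, rng=random):
    """Maximum-degree walk: at v, stay put with prob 1 - deg(v)/deg_max
    (the weighted self loop), else move to a uniformly random neighbour."""
    v = start
    visits = Counter()
    for _ in range(steps):
        visits[v] += 1
        if rng.random() < len(graph[v]) / deg_max:
            v = rng.choice(graph[v])
    return visits

counts = webwalker("amazon.com", 100_000, random.Random(0))
```

Because the self-loop-padded graph is regular, the visit counts approach the uniform distribution: each of the three nodes receives roughly a third of the steps.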
MD and MH Algorithms Maximum-Degree • The algorithm adds weighted self loops to nodes, causing the random walk to linger at those web pages (nodes). • This corrects the bias in the trial distribution. Metropolis-Hastings • The algorithm gives preference to smaller documents by reducing the probability of stepping into large documents. • This corrects the bias toward large documents, which match a large number of phrases.
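The Metropolis-Hastings correction can be sketched as follows. The weights here are hypothetical stand-ins: assume the trial distribution proposes document u with probability proportional to weight(u) (for example, how many search phrases match it). To target a uniform distribution, a proposed move v → u is then accepted with probability min(1, weight(v)/weight(u)), so large documents are entered less often, cancelling the size bias.

```python
import random

# Hypothetical per-document weights of the biased trial distribution.
weight = {"small.html": 2.0, "medium.html": 5.0, "big.html": 20.0}

def mh_walk(start, steps, rng=random):
    """Independence-style MH chain: size-biased proposals, with the
    acceptance rule min(1, weight(v)/weight(u)) targeting uniformity."""
    docs, w = list(weight), list(weight.values())
    v, visits = start, {d: 0 for d in weight}
    for _ in range(steps):
        visits[v] += 1
        u = rng.choices(docs, weights=w)[0]           # biased proposal
        if rng.random() < min(1.0, weight[v] / weight[u]):
            v = u                                     # accept the move
    return visits

visits = mh_walk("small.html", 100_000, random.Random(1))
```

Despite big.html being proposed ten times as often as small.html, the acceptance rule equalizes the visit counts to roughly a third each.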
Project description • Implement the WebWalker algorithm • Design a simulation framework • Compare the results to the search-based random walks from our previous project • Analyze and display the results
Designing the Simulation Framework • Planning a series of simulations testing different parameters of the algorithms • Considering "bottlenecks" such as the Yahoo daily query limit and hard-disk space • Measuring the effect of each parameter on the algorithm • Running the simulations at the software lab on several computers at a time
Analysis Criteria • Similarity • Unique Hosts Visited • Final Similarity • Convergence
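As one concrete reading of the "Similarity" criterion (an assumption; the slides do not specify the project's exact metric), the sampled distribution can be compared to a reference such as the uniform distribution via total variation distance:

```python
# Sketch of a possible similarity measure: total variation distance
# between two distributions given as {outcome: probability} dicts.
def total_variation(p, q):
    """0.5 * sum of |p(k) - q(k)| over the union of supports."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

A distance of 0 means the sample matches the reference exactly; 1 means disjoint support. Tracking this value over the walk's length also gives a convergence curve.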