Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State
Studying the Web • To study the characteristics of the Web • Statistics • Topology • Behavior • … • Why? • Scientific curiosity • Practical value • E.g., search engine coverage [Lawrence & Giles, Nature 1999]
Web as Platform • The Web becomes a new computation platform • Poses new challenges • Scale • Efficiency • Heterogeneity • Impact on people's lives
E.g., How Big is the Web? • Q1: How many web sites? • Q2: How many web pages? • Q3: How many surface/deep web pages? • Research Method • Mostly uses the experimental method to validate novel solutions
Q1: How Many Web Sites? • DNS Registrars • List of domain names • Issues • Not every domain is a web site • A domain may contain more than one web site • Registrars are under no obligation to keep their records correct • So many of them …
How Many Web Sites? • Brute-force: Polling every IP • IPv4: 256 × 256 × 256 × 256 = 2^32 ≈ 4 billion addresses • IPv6: 2^128 • At 10 sec/IP with 1,000 simultaneous connections: • 4×10^9 × 10 / (1000 × 24 × 60 × 60) ≈ 460 days • Not going to work!!
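As a quick sanity check of that figure, a minimal sketch (the 10 sec/IP and 1,000-connection numbers are the slide's own assumptions):

```python
# Time to probe the whole IPv4 space, under the slide's assumptions:
# 10 seconds per probe, 1,000 simultaneous connections.
TOTAL_IPS = 4 * 10**9   # round figure for 2^32 addresses
SECONDS_PER_IP = 10
CONCURRENCY = 1000

days = TOTAL_IPS * SECONDS_PER_IP / (CONCURRENCY * 24 * 60 * 60)
print(f"{days:.0f} days")  # ~463 days: over a year of continuous scanning
```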
How Many Web Sites? • 2nd attempt: Sampling • T: all 4 billion IPs • S: sampled IPs • V: sampled IPs giving a valid reply
How Many Web Sites? • Select |S| random IPs • Send HTTP requests to port 80 at the selected IPs • Count valid replies: “HTTP 200 OK” = |V| • Estimate: # of web sites ≈ |T| × |V| / |S|, with |T| = 2^32 • Q: What are the issues here? (see the sketch below)
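A minimal sketch of this procedure in Python (the sample size is illustrative; a real survey would need a far larger |S|, and the issues on the next slide still apply):

```python
import random
import http.client

SAMPLE_SIZE = 1000   # |S|: illustrative only; real surveys sampled far more
TOTAL_IPS = 2**32    # |T|: the whole IPv4 address space

def random_ip():
    """Draw one IPv4 address uniformly at random."""
    return ".".join(str(random.randrange(256)) for _ in range(4))

valid = 0  # |V|: sampled IPs that answer "HTTP 200 OK" on port 80
for _ in range(SAMPLE_SIZE):
    try:
        conn = http.client.HTTPConnection(random_ip(), 80, timeout=5)
        conn.request("HEAD", "/")
        if conn.getresponse().status == 200:
            valid += 1
        conn.close()
    except (OSError, http.client.HTTPException):
        pass  # no web server there, or unreachable: not a valid reply

# Estimate: |T| * |V| / |S|
print(f"Estimated # of web sites: {TOTAL_IPS * valid / SAMPLE_SIZE:,.0f}")
```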
Issues • Virtual hosting • Ports other than 80 • Temporarily unavailable sites • …
OCLC Survey (2002) • OCLC (Online Computer Library Center) results http://wcp.oclc.org/ • Still room for growth (at least for web sites)?
NetCraft Web Server Survey (2010) • Goal is to measure web server market share • Also records the # of sites its crawlers visited • August 2010: 213,458,815 distinct sites http://news.netcraft.com/archives/category/web-server-survey/
NetCraft Web Server Survey (2013) • Goal is to measure web server market share • Also records the # of sites its crawlers visited • August 2013: 716,822,317 distinct sites http://news.netcraft.com/archives/category/web-server-survey/
NetCraft Web Server Survey (2014) • Goal is to measure web server market share • Also records the # of sites its crawlers visited • August 2014: 992,177,228 distinct sites http://news.netcraft.com/archives/category/web-server-survey/
Q2: How Many Web Pages? • Sampling based? • T: all URLs • S: sampled URLs • V: valid replies • Issue here?
How Many Web Pages? • Method #1: • For each site with a valid reply, download all pages • Measure the average # of pages per site • (Avg # of pages) × (total # of sites) • Result [Lawrence & Giles, 1999] • 289 pages per site, 2.8M sites • 289 × 2.8M ≈ 800M web pages
Further Issues • A small # of sites with TONS of pages • Sampling could miss these sites • The majority (99.99%) of sites have a small # of pages • Lots of samples necessary
How Many Web Pages? • Method #2: Random sampling • T: all pages; B: base set, a collection of known size (e.g., a search engine's index); S: random samples from T • Assume: S is uniform over T, so |B∩S| / |S| ≈ |B| / |T|, giving |T| ≈ |B| × |S| / |B∩S|
Random Page? • Idea: Random walk • Start from a portal home page (e.g., Yahoo) • Estimate the size of the portal: B • Follow random links, say 10,000 times • Select pages along the walk • At the end, a set of random web pages S has been gathered
Straightforward Random Walk • Follow a random out-link at each step (figure: example graph over amazon.com, google.com, pike.psu.edu) • Issues? • Gets stuck in sinks and in dense Web communities • Biased towards popular pages • Converges slowly, if at all
Going to Converge? • Random walk on a regular, undirected graph → uniformly distributed sample • Theorem [Markov chain folklore]: after O((log N)/ε) steps, a random walk reaches the stationary distribution • ε: eigenvalue gap; depends on the graph structure • N: number of nodes • Idea: • Transform the Web graph into a regular, undirected graph • Perform a random walk on it • Problem • The Web is neither regular nor undirected
Intuition • Random walk on the undirected Web graph (not regular) • High chance of being at a “popular” node at any given time • Increase the chance of being at an “unpopular” node by staying there longer through self-loops (figure: one popular node linked to several unpopular nodes)
WebWalker: Undirected Regular Random Walk on the Web • Follow a random out-link or a random in-link at each step • Use weighted self-loops to even out pages’ degrees: w(v) = deg_max − deg(v) (figure: example graph over amazon.com, google.com, pike.psu.edu with per-node self-loop weights) • Fact: A random walk on a connected, undirected, regular graph converges to a uniform stationary distribution after a certain # of steps.
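A minimal sketch of WebWalker's step rule on a toy graph (the graph, its node names, and the maximum degree are illustrative assumptions, not real crawl data):

```python
import random

# Toy undirected graph: node -> list of neighbors (illustrative only).
graph = {
    "amazon.com":   ["google.com", "pike.psu.edu"],
    "google.com":   ["amazon.com"],
    "pike.psu.edu": ["amazon.com", "google.com"],
}
DEG_MAX = 3  # assumed maximum degree d of the regularized graph

def webwalker_step(v: str) -> str:
    """One step: take the weighted self-loop with probability
    w(v)/d = (d - deg(v))/d, else move to a uniformly random neighbor.
    This makes every node's effective degree equal to d."""
    deg = len(graph[v])
    if random.random() < (DEG_MAX - deg) / DEG_MAX:
        return v  # self-loop: linger longer at low-degree nodes
    return random.choice(graph[v])

def sample_page(start: str, steps: int = 10_000) -> str:
    """Walk long enough to approach the uniform stationary
    distribution, then report where the walk ended up."""
    v = start
    for _ in range(steps):
        v = webwalker_step(v)
    return v

print(sample_page("google.com"))
```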
Ideal Random Walk • Generate the regular, undirected graph: • Make edges undirected • Decide d, the maximum degree per page: say, d = 300,000 • If deg(n) < d, add self-loops to node n until its degree reaches d • Perform random walks on the graph • ε ≈ 10^-5 for the 1996 Web, N ≈ 10^9
WebWalker Results (2000) • Size of the Web: • AltaVista: |B| = 250M • |B∩S| / |S| = 35% • Estimated |T| = |B| × |S| / |B∩S| ≈ 250M / 0.35 ≈ 720M pages • Avg page size: 12K • Avg # of out-links: 10 • Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz. Approximating Aggregate Queries about Web Pages via Random Walks. VLDB, 2000
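The estimate above is a one-line calculation; a sketch using the numbers quoted on the slide:

```python
# Estimating |T| from the WebWalker sample (values quoted above).
B = 250e6                # |B|: pages in the AltaVista base set
overlap_fraction = 0.35  # |B ∩ S| / |S| measured on the random sample

# Since the sample is (approximately) uniform, |B|/|T| ≈ |B∩S|/|S|.
T = B / overlap_fraction
print(f"Estimated Web size: {T / 1e6:.0f}M pages")  # ≈ 714M, i.e., ~720M
```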
How Large is a Search Engine’s Index? • Prepare a representative corpus (e.g., DMOZ) • Draw a word W with known frequency percentage F • E.g., “the” is present in 60% of all documents within the corpus • Submit W to a search engine E • If E reports X documents containing W, extrapolate the total size of E’s index as ≈ X / F • Repeat multiple times and average
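A minimal sketch of this extrapolation (the 60% figure is the slide's own example; the hit count is a hypothetical number for illustration):

```python
def estimate_index_size(hits_reported: float, corpus_frequency: float) -> float:
    """Word-frequency extrapolation: if word W appears in a fraction F of
    all documents (measured on a representative corpus) and engine E
    reports X documents containing W, E's index holds roughly X / F docs."""
    return hits_reported / corpus_frequency

# E.g., "the" occurs in 60% of corpus documents; suppose the engine
# reports 24 billion hits for it (hypothetical number):
print(f"{estimate_index_size(24e9, 0.60):,.0f} documents")  # ~40 billion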
http://www.worldwidewebsize.com/ estimates over time: • 2010: ~28 billion pages • 2011: ~46 billion pages • 2013: ~46 billion pages • 2013 (a later reading): ~10 billion pages
Google Reveals Itself (2008) • 1998: 26 million URLs • 2000: 1 billion URLs • 2008: 1 trillion URLs • Not all of them are indexed • Duplicates • Auto-generated pages (e.g., calendars) • Spam • Experts’ suspicion (2010): Google indexes at least 40 billion pages
Deep Web (aka Hidden Web) (figure: a query submitted through an HTML FORM interface returns answers from a back-end database)
Q3: Size of Deep Web? • Deep Web: information reachable only through a query interface (e.g., an HTML FORM) • Often backed by a DBMS • Estimation by sampling: • Size ≈ (Avg size of a record) × (Avg # of records per site) × (Total # of Deep Web sites)
Size of Deep Web? • Total # of Deep Web sites: • Estimated by sampling, as before: |B∩S| / |S| • Avg size of a record: • Issue random queries • Estimate the reply size • Avg # of records per site: • Enumerate all possible queries for the FORM • Issue all queries and count the valid returns
Size of Deep Web (2005) • BrightPlanet report estimates: • Avg size of a record: 14KB • Avg # of records per site: 5M • Total # of Deep Web sites: 200,000 • Size of the Deep Web: ≈ 10^16 bytes (10 petabytes) • 1,000 times larger than the “Surface Web” • How to access it? • Wrapper/Mediator (aka Web scraping) http://brightplanet.com/the-deep-web/deep-web-faqs/ (link now obsolete)
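A sanity check of that multiplication, using the values quoted above:

```python
# Back-of-the-envelope total for the Deep Web (BrightPlanet figures as
# quoted on the slide).
avg_record_bytes = 14 * 1024       # avg size of a record: ~14 KB
avg_records_per_site = 5 * 10**6   # avg # of records per site: ~5M
num_deep_web_sites = 200_000       # total # of Deep Web sites

total_bytes = avg_record_bytes * avg_records_per_site * num_deep_web_sites
print(f"{total_bytes:.1e} bytes")  # ~1.4e16, on the order of 10 petabytes
```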