Characterizing the Web Using Sampling Methods Ed O’Neill Brian Lavoie OCLC Online Computer Library Center, Inc. Web Measurement, Metrics, and Mathematical Models Workshop WWW9 Conference
Nonprofit, membership, library computer service and research organization … • 9,000 member libraries world-wide • W3C Member • Cataloging, reference, resource sharing, and preservation services • Maintain and distribute the Dewey Decimal Classification
Roadmap • Web Characterization Project • Sampling the Web • Data Collection and Storage • Data Analysis
Ongoing project: 1997 to present • Answer basic questions about the Web: • How big is it? • What’s out there? • How is it evolving? • Focus on content, not network infrastructure • Help libraries cope with integrating Web content into their collections
Definitions: Web Objects • Sampling the Web requires clear and unambiguous definition of units • The organization of Web-accessible information suggests three object types: Web resource, Web page, Web site • Based on W3C Working Draft: http://www.w3.org/1999/05/WCA-terms/
Web Resource • An information object that: • is accessible from the Web (via HTTP) • is irreducible (finest level of meaningful granularity) • has an unambiguous identity (URI) • In practice, a Web resource is a file accessible from the Internet via HTTP
Web Page • An aggregate object, consisting of one or more Web resources that are: • Collectively identified by a single URI • Rendered simultaneously as a single object • Example resources (from the slide figure): http://www.oclc.org/info.htm, http://www.oclc.org/images/logo.gif, http://www.oclc.org/applet.class
Web Site A collection of Web pages that … • reside at a single network location (IP address) • are interlinked: any of site’s Web pages can be accessed by: • following a sequence of hyperlink references • beginning at the site’s home page • spanning only Web pages residing at the same network location.
Sampling the Web • Objective: Collect representative Web sample • Methodology: Identify and collect random sample of Web sites; every Web site should have an equal probability of being included in the sample • Result: Random sample of Web sites; cluster sample of Web pages
Sampling Approach • [Figure: IP address space (4,294,967,296 addresses), showing sampled addresses, allocated addresses, and HTTP hosts]
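A minimal sketch of the sampling step, assuming a uniform draw without replacement over the 32-bit address space. The slides do not say how the random addresses were generated, so the function name `sample_ipv4`, the seed, and the demonstration fraction are illustrative assumptions.

```python
import ipaddress
import random

ADDRESS_SPACE = 2 ** 32  # 4,294,967,296 possible IPv4 addresses


def sample_ipv4(fraction, seed=None):
    """Draw a uniform random sample of IPv4 addresses without replacement."""
    rng = random.Random(seed)
    k = int(ADDRESS_SPACE * fraction)
    # range() is lazy, so sampling from it does not build a 2**32-item list.
    return [ipaddress.IPv4Address(n) for n in rng.sample(range(ADDRESS_SPACE), k)]


# A tiny fraction keeps the demonstration fast; the 1999 sample used 0.1%.
demo = sample_ipv4(0.000001, seed=1)
print(len(demo), demo[:3])
```

Because every address has the same chance of being drawn, every Web site reachable at exactly one address has the same chance of entering the sample.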
Data Collection • [Figure: the harvester asks each random IP “Hello … Do you speak HTTP?”] • IP #1: no response • IP #2: “Yes … Welcome” (HTTP code = 200) • IP #3: “Yes … Go away” (HTTP code = 403)
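The probe can be sketched roughly as follows. This is not Polychrest's code: sending `GET /` and the five-second timeout are assumptions, since the slides only say that the harvester connects to port 80 and inspects the HTTP response code.

```python
import http.client


def classify_response(status):
    """Map a probe outcome to the three cases shown on the slide.

    `status` is the HTTP status code, or None when nothing answered.
    """
    if status is None:
        return "no response"
    if status == 200:
        return "web site"  # "Yes ... Welcome": hand off to the harvester
    return "http host, not harvested"  # e.g. 403, "Yes ... Go away"


def probe(ip, timeout=5.0):
    """Send one HTTP request to port 80; return the status code or None."""
    conn = http.client.HTTPConnection(ip, 80, timeout=timeout)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None  # refused, timed out, or not speaking HTTP
    finally:
        conn.close()
```

Calling `classify_response(probe(ip))` for each sampled address gives the counts behind the hit rate.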
Polychrest Harvester • Java-based Web harvesting agent • Analyzes URI references in HTML markup to determine object type and extent • Currently analyzes following elements: <A> <FRAME> <INPUT> <AREA> <HEAD> <LINK> <BASE> <IFRAME> <SCRIPT> <BODY> <IMG>
URI Analysis • Two stages: (1) determine object type (2) filter on network location (if applicable) • Examples (sample IP: 132.174.1.5): • <A HREF="http://www.oclc.org/page.htm">: YES • <A HREF="http://www.microsoft.com">: NO • <IMG SRC="oclc.gif">: YES • <IMG SRC="http://www.w3.org/w3.gif">: YES
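One possible reading of the two stages, sketched under stated assumptions: page links (`<A>`) are followed only when they resolve to the sampled IP, while embedded resources (`<IMG>`) are kept wherever they live. The lookup table `HOST_TO_IP`, the base URI, and the mapping of www.oclc.org to 132.174.1.5 are hypothetical stand-ins for DNS resolution, and only the two elements from the slide's examples are handled.

```python
from urllib.parse import urljoin, urlparse

SAMPLE_IP = "132.174.1.5"
BASE_URI = "http://132.174.1.5/"  # assumed location of the page being parsed

# Hypothetical lookup table standing in for DNS resolution.
HOST_TO_IP = {
    "132.174.1.5": "132.174.1.5",
    "www.oclc.org": "132.174.1.5",
}

PAGE_LINK_TAGS = {"a"}   # references to other Web pages
EMBEDDED_TAGS = {"img"}  # resources rendered as part of the current page


def keep_reference(tag, uri):
    """Return True if the harvester should fetch the referenced object."""
    tag = tag.lower()
    if tag in EMBEDDED_TAGS:
        # Stage 1 only: an embedded resource belongs to the page itself.
        return True
    if tag in PAGE_LINK_TAGS:
        # Stage 2: follow a page link only if it stays at the sampled IP.
        host = urlparse(urljoin(BASE_URI, uri)).hostname
        return HOST_TO_IP.get(host) == SAMPLE_IP
    return False
```

With these assumptions the function reproduces the YES/NO outcomes of the four examples on the slide.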
Harvesting • Harvesting of a Web site is initiated immediately after it is identified • Polychrest understands Web object definitions for resources, pages, and sites • Web site extent determined by: • breadth-first search, using home page as root • follow internal Web page links only
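The breadth-first traversal that determines site extent can be illustrated on an in-memory link graph. The dictionary `links` and the page names are made up for the example; a real harvester would fetch and parse each page instead.

```python
from collections import deque


def site_extent(home_page, internal_links):
    """Breadth-first traversal from the home page over internal links only.

    `internal_links` maps each page URI to the internal pages it references.
    """
    seen = {home_page}
    queue = deque([home_page])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in internal_links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


links = {
    "/": ["/a.htm", "/b.htm"],
    "/a.htm": ["/c.htm", "/"],
    "/b.htm": ["/c.htm"],
    "/orphan.htm": ["/"],
}
print(site_extent("/", links))  # ['/', '/a.htm', '/b.htm', '/c.htm']
```

The page `/orphan.htm` is never reached from the home page, so under the Web site definition it falls outside the harvested site.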
Unique Web Sites • Not uncommon for a single Web site to be accessible from multiple IP addresses * • Sites at different IPs, but with identical content, are considered to be one logical site (often identified with a single domain name) • Creates bias in sample: greater probability of these sites being selected than sites associated with a single address
Filtering Rule • A harvested IP is only considered a “hit” if the sample IP is the “lowest” among all IPs associated with a given collection of Web pages • Example: 132.174.1.6, 132.174.1.5, 132.174.1.4 • How can we identify sites with multiple IPs?
De-Duping Tests • Domain-name-to-IP-address mapping (for sites with domain names): resolve domain name to IP address; if sampled IP is lowest among returned IP(s), OK • Example: • Sample IP: 207.46.130.149 • Resolves to www.microsoft.com • www.microsoft.com resolves to: 207.46.131.137, 207.46.131.30, 207.46.130.149, 207.46.130.45, 207.46.130.14
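The "lowest IP" comparison has to be numeric, because comparing dotted strings would order the addresses incorrectly. A small sketch using the addresses from the example; the helper name `is_lowest` is illustrative, and the slide itself does not state the outcome for this example.

```python
import ipaddress


def is_lowest(sampled_ip, associated_ips):
    """True if the sampled IP is numerically the lowest associated IP."""
    sampled = ipaddress.IPv4Address(sampled_ip)
    others = [ipaddress.IPv4Address(ip) for ip in associated_ips]
    return sampled == min(others + [sampled])


resolved = [
    "207.46.131.137", "207.46.131.30", "207.46.130.149",
    "207.46.130.45", "207.46.130.14",
]
print(is_lowest("207.46.130.149", resolved))  # False
print(is_lowest("207.46.130.14", resolved))   # True
```

Under the rule as stated, the sampled address 207.46.130.149 would not count as a hit, since 207.46.130.14 is lower.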
De-Duping Tests … Continued “Same-Octet” Test: • Harvest home page from IP addresses with same first three octets as sampled IP, but lower 4th octet • Example: 132.174.1.5 → 132.174.1.4, 132.174.1.3, 132.174.1.2, 132.174.1.1, 132.174.1.0 • If any home page harvested from a lower 4th octet matches the home page from the sampled IP, the sampled IP fails the filtering rule
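A sketch of the same-octet test, assuming that "matches" means byte-identical home pages (the slides do not define the comparison). `fetch_home_page` is a hypothetical callable that returns the page bytes for an IP, or None if nothing is served there.

```python
import ipaddress


def lower_same_octet(sampled_ip):
    """Addresses sharing the first three octets, with a lower 4th octet."""
    sampled = int(ipaddress.IPv4Address(sampled_ip))
    base = sampled & 0xFFFFFF00  # zero the 4th octet
    return [str(ipaddress.IPv4Address(n)) for n in range(sampled - 1, base - 1, -1)]


def passes_same_octet_test(sampled_ip, fetch_home_page):
    """False if any lower neighbour serves the same home page."""
    own = fetch_home_page(sampled_ip)
    for ip in lower_same_octet(sampled_ip):
        page = fetch_home_page(ip)
        if page is not None and page == own:
            return False
    return True


print(lower_same_octet("132.174.1.5"))
```

For 132.174.1.5 the candidate list runs from 132.174.1.4 down to 132.174.1.0, as in the example.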
De-Duping Tests … Continued • Intra-Sample Duplicate Detection: • Identify sites within sample with identical content • Retain only site with lowest IP address • Unique Web Site: • Defined as any site identified in the sample that passes all three of the duplicate detection tests
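Intra-sample duplicate detection can be sketched by grouping sites on a content digest and keeping the lowest address in each group. Using SHA-256 over the harvested bytes is an assumption; the slides do not say how identical content was detected. The sample data are invented for illustration.

```python
import hashlib
import ipaddress


def unique_sites(sites):
    """Keep one IP per distinct content, choosing the numerically lowest.

    `sites` maps IP address strings to harvested content (bytes).
    """
    lowest = {}
    for ip, content in sites.items():
        digest = hashlib.sha256(content).hexdigest()
        addr = ipaddress.IPv4Address(ip)
        if digest not in lowest or addr < lowest[digest]:
            lowest[digest] = addr
    return [str(addr) for addr in sorted(lowest.values())]


sample = {
    "194.66.99.88": b"same site",
    "194.66.97.202": b"same site",
    "132.174.1.5": b"another site",
}
print(unique_sites(sample))  # ['132.174.1.5', '194.66.97.202']
```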
Synopsis: 1999 Sample • IP addresses: 4,294,967,296 • Sampled IPs (0.1%): 4,294,967 • Connect to port 80 for each sampled IP address; Web site identified if HTTP response code = 200 • Sampled Web sites: 4,882 (hit rate of about 1 out of a thousand) • Apply de-duping tests • Sampled unique sites: 3,649
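The arithmetic behind these figures can be checked directly. Scaling the unique-site count by the inverse of the sampling fraction is an assumption about how the Web-wide estimate was obtained; it yields roughly 3.6 million, which agrees with the 1999 figure on the Site Growth slide.

```python
SAMPLING_FRACTION = 0.001  # 0.1% of the IPv4 address space
sampled_ips = 4_294_967
sampled_sites = 4_882      # responses with HTTP code 200
unique_site_count = 3_649  # after the de-duping tests

hit_rate = sampled_sites / sampled_ips
estimated_unique_sites = unique_site_count / SAMPLING_FRACTION

print(f"hit rate: about 1 in {1 / hit_rate:.0f}")
print(f"estimated unique sites: {estimated_unique_sites:,.0f}")
```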
Network Security • Attempts to connect to random IP addresses have been viewed suspiciously by network administrators • like calling unlisted telephone numbers • Inquiries have been made about our activity (mostly cordial) * • For June 2000 Web sample: • assign separate IP and domain name to machine running harvester • run Web server with page explaining our project and supplying contact information
Data Storage • Polychrest stores data collected from a single Web site in one SGML-format archive file • Software splits archive file into separate files for manual viewing; links are localized • Harvested Site Example: 192.48.117.67.dmp * • For long-term storage, converting SGML into Internet Archive format
Site Growth • [Chart: number of Web sites, in thousands] • 1997: 1.2 million • 1998: 2 million • 1999: 3.6 million
Web Site Types • Provisional site: serves only temporary or transitional pages (server templates, “under construction” pages, “site has moved” pages) • Private site: prohibits access explicitly (password, IP filter, firewall) or implicitly (site intended to be used by specific users) • Public site: provides unrestricted access to some portion of the site containing meaningful content
Types of Sites: 1999 • [Chart: number of Web sites, in thousands] • Provisional: 1 million • Private: 400,000 • Public: 2.2 million
Accomplishments • Well-tested sampling methodology • Data collection and analysis tools • Innovative data analysis • Only consistent time-series (1998 - present) • Data available on request for scholarly use
Further Information... OCLC Online Computer Library Center, Inc.: http://www.oclc.org/ Web Characterization Project: http://www.oclc.org/oclc/research/projects/webstats/ E-mail: wcp-research@oclc.org
Web Publishing Patterns • Self-publishing: Web publishing patterns do not follow the print model. The vast majority of Web sites exist to promote and disseminate information about the site’s publisher. Unlike traditional print publishers, only a minority of Web publishers ‘sell’ information • Volatility: The Web is very volatile: fewer than half of the Web sites in the 1998 sample still existed when the 1999 sample was collected. Pages are even more volatile • Inaccessible: Fewer than half of the Web sites have been indexed by the major search engines, and an even lower proportion of the pages have been indexed
Emergence of Dark Matter • Dynamically generated information, usually in response to a query • Inaccessible to harvesters • Cannot be indexed * • Dark information appears to be more common in the latest sample
Site with Multiple IP Addresses 194.66.97.202, 194.66.99.88, 194.66.102.59, 194.66.110.112, 194.66.122.251, 194.66.123.63 These six IP addresses from the sample produced:
Example Responses (Edited)
• For the past two weeks or so, a host registered to you has been sending network-scanning-like activity to port 80 of seemingly random IP addresses in our address space. I’m not sure of the purpose of this activity but it appears to be in error. It appears innocuous enough; figured it would’ve stopped on its own by now.
• [Our] server has no restrictions on access, but as far as I know, there are no links to it on any other web sites or search engines, and we have told no one but our development partners about it. Therefore, I was surprised when I found [oclc] in the server's log files many times over the last several months. So...can you identify this user, how they found out about our server, and what their intentions are? If it is a user, we'd appreciate knowing who it is.
• We have noticed that an oclc server has been regularly checking a machine in our domain. Can you tell us why this server is interested in our little purple SGI?