
Web Crawling



Presentation Transcript


  1. Web Crawling

  2. Next week • I am attending a meeting, Monday into Wednesday. I said I could go only if I can get back for class. • My flight is due in PHL at 5:22 pm. • That is really tight for getting here by 6:15. • May we have a delayed start to class: 7:00? • If something goes wrong and I will be later than that, I will let you know by e-mail or a post on Blackboard.

  3. Web crawling – Why? • One form of gathering information. • We all know about information overload • Numbers are staggering • More is coming • The challenge of dealing with information, and data, will be with us for a long time. • There is more out there than we might immediately expect

  4. How much information is there? • The byte scale runs Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta (and, on the small end: milli 10^-3, micro 10^-6, nano 10^-9, pico 10^-12, femto 10^-15, atto 10^-18, zepto 10^-21, yocto 10^-24). [Chart places a book, a photo, a movie, all books (words), all books as multimedia, and "everything recorded" along this scale.] • Soon most everything will be recorded and indexed. • Most bytes will never be seen by humans. • Data summarization, trend detection, and anomaly detection are key technologies; these require algorithms, data and knowledge representation, and knowledge of the domain. • See also Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html and Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/ • Slide source: Jim Gray, Microsoft Research (modified)

  5. Astronomy and Computing • The Large Synoptic Survey Telescope (LSST) Over 30 thousand gigabytes (30TB) of images will be generated every night during the decade-long LSST sky survey. LSST and Google share many of the same goals: organizing massive quantities of data and making it useful. http://lsst.org/lsst/google

  6. Google and Information • From the New York Times, April 16, 2012: The Federal Communications Commission fined and censured Google for obstructing an inquiry into its Street View project, which had collected Internet communications from potentially millions of unknowing households as specially equipped cars drove slowly by. The data was a snapshot of what people were doing online at the moment the cars rolled by — e-mailing a lover, texting jokes to a buddy, balancing a checkbook, looking up an ailment. Google spent more than two years scooping up that information, from January 2008 to April 2010. J. Trevor Hughes, president of the International Association of Privacy Professionals, said the Google case represented what happened when technical employees of technology companies made "innocent" decisions about collecting data that could infuriate consumers and in turn invite regulatory inquiry. "This is one of the most significant risks we see in the information age today," he said. "Project managers and software developers don't understand the sensitivity associated with data." Source: http://bits.blogs.nytimes.com/2012/04/16/daily-report-unanswered-questions-about-google/

  7. Ocean Observatories • NEPTUNE Canada: The NEPTUNE Canada ocean network is part of the Ocean Networks Canada (ONC) Observatory. Our network extends the Internet from the rocky coast to the deep abyss. We gather live data and video from instruments on the seafloor, making them freely available to the world, 24/7. http://www.neptunecanada.ca/ Live video from the seafloor, more than 2 km deep. • OOI Data Policy: All OOI data, including data from OOI core sensors and all proposed sensors added by Principal Investigators, will be rapidly disseminated, open, and freely available (within constraints of national security). Rapidly disseminated implies that data will be made available as soon as technically feasible, but generally in near real-time, with latencies as small as seconds for the cabled components. In limited cases, individual PIs who have developed a data source that becomes part of the OOI network may request exclusive rights to the data for a period of no more than one year from the onset of the data stream. http://www.oceanobservatories.org/about/frequently-asked-questions/

  8. Crawling – the how • Agenda for tonight • The web environment • An architecture for crawling • Issues of politeness • Some technical assistance

  9. First, what is crawling? A web crawler (aka a spider or a robot) is a program that • Starts with one or more URLs – the seed. Other URLs will be found in the pages pointed to by the seed URLs; they will be the starting point for further crawling • Uses the standard protocols for requesting a resource from a server, respecting server policies and politeness requirements • Parses the resource obtained and extracts additional URLs from the fetched page • Implements policies about duplicate content • Recognizes and eliminates duplicate or unwanted URLs • Adds the URLs found to the queue and continues from the request-to-server step

  10. An exercise • Go to any URL you frequently use • If you used that as a starting point for a crawl, how many pages could you get to if your crawl depth is 3? • That is, you go to each link on the original page, each link pointed to by those first links, and then each link pointed to by the next set. • As always, work in groups of 2 or 3 • Report just the number of links found

  11. The Web Environment: Depth of the Web • A URL gives access to a web page. • That page may have links to other pages. • Some pages are generated only when information is provided through a form. • These pages cannot be discovered just by crawling. • The surface web is huge. • The deeper web is unfathomable.

  12. Anatomy of a URL • http://www.csc.villanova.edu/~cassel • That is a pointer to a web page. • Three parts • http – the protocol to use for retrieving the page • other protocols, such as ftp, can be used instead • www.csc.villanova.edu – the name of the domain • csc is a subdomain of the villanova domain • ~cassel – an abbreviation for the subdirectory html in the home directory of user cassel on the machine associated with www.csc.villanova.edu • index.html is the default page to return if no other file is specified
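To see those parts programmatically, here is a minimal sketch using the Python 2 urlparse module, matching the deck's own Python 2 examples (the example URL and the explicit port are just for illustration):

  from urlparse import urlparse   # in Python 3: from urllib.parse import urlparse

  # Break an example URL into the parts described above
  parts = urlparse("http://www.csc.villanova.edu:80/~cassel/index.html")
  print parts.scheme     # 'http'  -- the protocol
  print parts.hostname   # 'www.csc.villanova.edu'  -- the domain name
  print parts.port       # 80  -- the port, if one is given explicitly
  print parts.path       # '/~cassel/index.html'  -- the file being requested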

  13. The major domain categories • Generic categories: • .net – Originally restricted to major participants in maintaining the Internet. Now open. • .org – Generally non-profit organizations, including professional organizations such as acm.org • .com – Commercial organizations such as amazon.com, etc. • .edu – Restricted to higher education (post-secondary) institutions. High schools and elementary schools are not allowed to use it. • .gov – Government organizations, such as nsf.gov • .mil – Military sites • Country codes • .us Example: http://www.dot.state.pa.us/ PA Dept of Transportation • .it • .uk – uses second-level domains such as ac.uk or co.uk • And other country designations. Who is .tv? The islands of Tuvalu. • Newer ones: .biz, .name, etc. • All regulated by the Internet Assigned Numbers Authority (IANA)

  14. If not http:// then what? • Other protocols can be specified in the request to a server: • file:// local file on the current host • ftp:// use the ftp protocol to fetch the file • Etc.

  15. Domain categories • The domain categories serve to partition the universe of domain names. • Domain Name Servers (DNS) do lookup to translate a domain name to an IP address. • An IP address locates a particular machine and makes a communication path known. • Most common still: 32 bit IPv4 addresses • Newer: 128 bit IPv6 (note next slide)

  16. IPv6 note • [Chart: survey of IPv6 deployment] • Last updated: Tue Apr 17 00:45:18 2012 UTC • Source: http://www.mrp.net/IPv6_Survey.html

  17. Web servers • A server will typically have many programs running, several listening for network connections. • A port number (16 bits) identifies the specific process for the desired connection. • Default port for web connections: 80 • If other than 80, it must be specified in the URL

  18. Exercise: What is where? • Your project is running on a specific server at a specific port. • Can you find the exact “address” of your project? • Use nslookup from a unix prompt (also available from a Windows command prompt) • Example: nslookup monet.csc.villanova.edu returns the reply of a local domain name server: Server: ns1.villanova.edu Address: 153.104.1.2 Name: monet.csc.villanova.edu Address: 153.104.202.173 • So the “phone number” of the apache server on monet is 153.104.202.173:80

  19. Crawler features • A crawler must be • Robust: survive spider traps – websites that fool a spider into fetching large or limitless numbers of pages within the domain. Some are deliberate; some are errors in site design. • Polite: Crawlers can interfere with the normal operation of a web site. Servers have policies, both implicit and explicit, about the allowed frequency of visits by crawlers. Responsible crawlers obey these; others become recognized and rejected outright. Ref: Manning Introduction to Information Retrieval

  20. Crawler features • A crawler should be • Distributed: able to execute on multiple systems • Scalable: The architecture should allow additional machines to be added as needed • Efficient: Performance is a significant issue if crawling a large web • Useful: Quality standards should determine which pages to fetch • Fresh: Keep the results up to date by crawling pages repeatedly on some organized schedule • Extensible: A modular, well-crafted architecture allows the crawler to expand to handle new formats, protocols, etc. Ref: Manning Introduction to Information Retrieval

  21. Scale • A one month crawl of a billion pages requires fetching several hundred pages per second • It is easy to lose sight of the numbers when dealing with data sources on the scale of the Web. • 30 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute = 2,592,000 seconds • 1,000,000,000 pages/2,592,000 seconds = 385.8 pages/second • Note that those numbers assume that the crawling is continuous Ref: Manning Introduction to Information Retrieval

  22. Google Search • See http://video.google.com/videoplay?docid=-1243280683715323550&hl=en# • Marissa Mayer of Google on how a search happens at Google.

  23. Web Operation • Basic Client Server model • The http protocol • HyperText Transfer Protocol • Few simple commands that allow communication between the server and an application requesting something from the server – usually a browser, but not always. • Server • The site where the content resides. • Most of the web is served up by Apache and its byproducts. • Client • The program requesting something from the server. • Browsers most often, but also web crawlers and other applications.

  24. HTTP: GET and POST • GET <path> HTTP/<version> • Requests that the server send the specific page at <path> back to the requestor. • The version number allows compatible communication • Server sends header and the requested file (page). • Additional requests can follow. • POST • Similar to a GET but allows additional information to be sent to the server. • Useful for purchases or page edits.
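As a concrete illustration, here is a minimal sketch of issuing a GET at this level with Python 2's httplib module (the host name and path are placeholders; Python 3 calls the module http.client):

  import httplib   # in Python 3: import http.client

  # Open a connection to the server on the default web port (80)
  conn = httplib.HTTPConnection("www.example.com", 80)
  conn.request("GET", "/index.html")   # sends: GET /index.html HTTP/1.1
  resp = conn.getresponse()

  print resp.status, resp.reason   # e.g. 200 OK
  print resp.getheaders()          # the header lines sent back by the server
  body = resp.read()               # the requested page itself
  conn.close()

A POST is made the same way, with the additional data passed as a body argument: conn.request("POST", path, encoded_data, headers).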

  25. HEAD • HEAD <path> HTTP/<version> • Useful for checking whether a previously fetched web page has changed. • The request results in header information, but not the page itself. • Response: • Confirm http version compatibility • Date: • Server: • Last-Modified:
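A HEAD request can be sent the same way. A short sketch (again with a placeholder host) that retrieves only the headers and reads Last-Modified to check for changes:

  import httplib

  conn = httplib.HTTPConnection("www.example.com")
  conn.request("HEAD", "/index.html")   # headers only; no page body is returned
  resp = conn.getresponse()
  print resp.status                        # e.g. 200
  print resp.getheader("Last-Modified")    # compare with a stored value to detect changes
  conn.close()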

  26. Full set of HTTP commands • The request methods defined by HTTP/1.1 are: • OPTIONS • GET • HEAD • POST • PUT • DELETE • TRACE • CONNECT • A crawler normally needs only GET and HEAD.

  27. Search • Search engines, whether general engines like Google or Yahoo, or special purpose search engines in an application, do not crawl the web looking for results after receiving a query. • That would take much too long and provide unacceptable performance • Search engines actually search a carefully constructed database with indices created for efficiently locating content

  28. Architecture of a Search Engine • [Diagram of search engine architecture] Ref: Manning Introduction to Information Retrieval

  29. Crawling in Context • So, we see that crawling is just one step in a complex process of acquiring information from the Web to use in any application. • Usually, we will want to sort through the information we found to get the most relevant part for our use. So, the example of a search engine is relevant.

  30. Making a request of a server • Browsers display pages by sending a request to a web server and receiving the coded page as a response. • Protocol: HTTP • http://abc.com/filea.html … means use the http protocol to communicate with the server at the location abc.com and fetch the file named filea.html • the html extension tells the browser to interpret the file contents as html code and display it.

  31. Programming Language Help • Programming languages influence the kinds of problems that can be addressed easily. • Most languages can be used to solve a broad category of problems • but are more closely attuned to some kinds of problems • An example, • Python is very well suited to text analysis and has features useful in web crawling

  32. Python module for web access urllib2 • Note – this is for Python 2.x, not Python 3 • Python 3 splits the urllib2 materials over several modules • import urllib2 • urllib2.urlopen(url [,data][, timeout]) • Establish a link with the server identified in the url and send either a GET or POST request to retrieve the page. • The optional data field provides data to send to the server as part of the request. If the data field is present, the HTTP request used is POST instead of GET • Use to fetch content that is behind a form, perhaps a login page • If used, the data must be encoded properly for including in an HTTP request. See http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1 • timeout defines time in seconds to be used for blocking operations such as the connection attempt. If it is not provided, the system wide default value is used. http://docs.python.org/library/urllib2.html
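For reference, a minimal sketch of the same fetch in Python 3, where the urllib2 functionality lives in urllib.request and urllib.error (the URL is a placeholder):

  from urllib.request import urlopen
  from urllib.error import URLError

  try:
      page = urlopen("http://www.example.com/", timeout=10)  # GET; pass a data argument for POST
      print(page.getcode())   # HTTP status code, e.g. 200
      print(page.geturl())    # final URL, after any redirects
      body = page.read()      # the page content, as bytes
  except URLError as e:
      print("Fetch failed:", e)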

  33. URL fetch and use • urlopen returns a file-like object with methods: • Same as for files: read(), readline(), fileno(), close() • New for this class: • info() – returns meta information about the document at the URL • getcode() – returns the HTTP status code sent with the response (ex: 200, 404) • geturl() – returns the URL of the page, which may be different from the URL requested if the server redirected the request

  34. URL info • info() provides the header information that http returns when the HEAD request is used. • ex:
  >>> print mypage.info()
  Date: Mon, 12 Sep 2011 14:23:44 GMT
  Server: Apache/1.3.27 (Unix)
  Last-Modified: Tue, 02 Sep 2008 21:12:03 GMT
  ETag: "2f0d4-215f-48bdac23"
  Accept-Ranges: bytes
  Content-Length: 8543
  Connection: close
  Content-Type: text/html

  35. URL status and code
  >>> print mypage.getcode()
  200
  >>> print mypage.geturl()
  http://www.csc.villanova.edu/~cassel/

  36. Python crawl example

  import urllib2

  url = raw_input("Enter the URL of the page to fetch: ")
  try:
      linecount = 0
      page = urllib2.urlopen(url)
      result = page.getcode()
      if result == 200:
          for line in page:
              print line
              linecount += 1
          print "Page Information \n ", page.info()
          print "Result code = ", page.getcode()
          print "Page contains ", linecount, " lines."
  except:
      print "\nBad URL: ", url, "Did you include http:// ?"

  You almost certainly have a Python interpreter on your machine. Copy and paste this and run it. Give it any URL you want. Look at the results. File: url-fetch-try.py in pythonwork/classexamples

  37. Basic Crawl Architecture • [Diagram: the URL frontier feeds a fetch module (supported by DNS resolution to reach the WWW); fetched pages are parsed; a "content seen?" test checks document fingerprints (Doc FPs); a URL filter (consulting the robots filters) and duplicate URL elimination (against the URL set) pass surviving links back into the URL frontier] Ref: Manning Introduction to Information Retrieval

  38. Crawler Architecture • Modules: • The URL frontier (the queue of URLs still to be fetched, or fetched again) • A DNS resolution module (The translation from a URL to a web server to talk to) • A fetch module (use http to retrieve the page) • A parsing module to extract text and links from the page • A duplicate elimination module to recognize links already seen Ref: Manning Introduction to Information Retrieval

  39. Crawling threads • With so much space to explore and so many pages to process, a crawler will often consist of many threads, each of which cycles through the same set of steps we just saw. There may be multiple threads on one processor, or threads may be distributed over many nodes in a distributed system.

  40. Politeness • Not optional. • Explicit • Specified by the web site owner • What portions of the site may be crawled and what portions may not be crawled • robots.txt file • Implicit • If no restrictions are specified, still restrict how often you hit a single site. • You may have many URLs from the same site. Too much traffic can interfere with the site’s operation. Crawler hits are much faster than ordinary traffic and could overtax the server (in effect, a denial-of-service attack). Good web crawlers do not fetch multiple pages from the same server at one time.
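One simple way to honor the implicit rule is to remember when each host was last contacted and wait before hitting it again. A minimal sketch in Python 2 (the two-second minimum delay is an assumed value, not a standard):

  import time
  import urlparse

  last_hit = {}      # hostname -> time of our most recent request to it
  MIN_DELAY = 2.0    # assumed minimum gap, in seconds, between requests to one host

  def polite_wait(url):
      # Sleep if we contacted this host too recently, then record the new access time
      host = urlparse.urlparse(url).hostname
      elapsed = time.time() - last_hit.get(host, 0)
      if elapsed < MIN_DELAY:
          time.sleep(MIN_DELAY - elapsed)
      last_hit[host] = time.time()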

  41. Robots.txt • A protocol nearly as old as the web; see www.robotstxt.org/robotstxt.html • File: URL/robots.txt • Contains the access restrictions • Example:
  User-agent: *
  Disallow: /yoursite/temp/
  User-agent: searchengine
  Disallow:
  The first record applies to all robots (spiders/crawlers) and disallows /yoursite/temp/; the second applies only to the robot named searchengine and disallows nothing. • Source: www.robotstxt.org/wc/norobots.html

  42. Another example
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /tmp/
  Disallow: /~joe/

  43. Processing robots.txt • First line: • User-agent – identifies to whom the instruction applies. * = everyone; otherwise, specific crawler name • Disallow: or Allow: provides path to exclude or include in robot access. • Once the robots.txt file is fetched from a site, it does not have to be fetched every time you return to the site. • Just takes time, and uses up hits on the server • Cache the robots.txt file for repeated reference
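Python's standard library can do this processing for you. A minimal sketch using the Python 2 robotparser module (urllib.robotparser in Python 3); the site and crawler name are placeholders, and keeping the parser object around is the caching the slide describes:

  import robotparser   # in Python 3: import urllib.robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("http://www.example.com/robots.txt")
  rp.read()            # fetch and parse robots.txt once; reuse (cache) the object afterwards

  # Ask before every fetch: may this crawler retrieve this path?
  if rp.can_fetch("MyCrawler", "http://www.example.com/tmp/page.html"):
      print "allowed to fetch"
  else:
      print "disallowed by robots.txt"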

  44. Robots <META> tag • robots.txt provides information about access to a directory. • A given file may have an html meta tag that directs robot behavior • A responsible crawler will check for that tag and obey its direction. • Ex: • <META NAME=“ROBOTS” CONTENT = “INDEX, NOFOLLOW”> • OPTIONS: INDEX, NOINDEX, FOLLOW, NOFOLLOW • See http://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2 and http://www.robotstxt.org/meta.html
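One way to check for that tag is with the HTML parser in the standard library. A minimal sketch in Python 2 (the class name and the sample page are mine, for illustration; Python 3 calls the module html.parser):

  from HTMLParser import HTMLParser   # in Python 3: from html.parser import HTMLParser

  class RobotsMetaParser(HTMLParser):
      # Collect the CONTENT value of any <META NAME="ROBOTS" ...> tag
      def __init__(self):
          HTMLParser.__init__(self)
          self.directives = []
      def handle_starttag(self, tag, attrs):
          attrs = dict(attrs)
          if tag == "meta" and attrs.get("name", "").lower() == "robots":
              self.directives.append(attrs.get("content", ""))

  parser = RobotsMetaParser()
  parser.feed('<html><head><META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW"></head></html>')
  print parser.directives   # ['INDEX, NOFOLLOW']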

  45. Crawling • Pick a URL from the frontier (which one to pick is a scheduling decision) • Fetch the document at the URL • Parse the fetched document • Extract links from it to other docs (URLs) • Check if the URL has content already seen • If not, add to indices • For each extracted URL • Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.) • Check if it is already in the frontier (duplicate URL elimination) Ref: Manning Introduction to Information Retrieval
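Putting those steps together, here is a minimal, illustrative sketch of the loop in Python 2 (the seed URL is a placeholder, the page cap is arbitrary, and the regex-based link extraction and startswith test stand in for a real parser, URL filter, and robots check):

  import re
  import urllib2
  from urlparse import urljoin

  frontier = ["http://www.example.com/"]   # seed URL(s)
  seen_urls = set(frontier)
  MAX_PAGES = 50                           # small cap so the sketch terminates

  while frontier and len(seen_urls) < MAX_PAGES:
      url = frontier.pop(0)                    # pick a URL from the frontier
      try:
          page = urllib2.urlopen(url).read()   # fetch the document
      except Exception:
          continue                             # skip URLs that fail to fetch
      # crude link extraction, for illustration only; a real crawler parses the HTML
      for href in re.findall(r'href="([^"]+)"', page):
          link = urljoin(url, href)            # normalize relative links
          if link.startswith("http") and link not in seen_urls:   # URL filter + dup elimination
              seen_urls.add(link)
              frontier.append(link)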

  46. Recall: Basic Crawl Architecture • [Same diagram as slide 37: URL frontier, DNS, fetch from the WWW, parse, "content seen?" test against Doc FPs, robots filters, URL filter, duplicate URL elimination against the URL set] Ref: Manning Introduction to Information Retrieval

  47. DNS – Domain Name Server • Internet service to resolve URLs into IP addresses • Distributed servers, some significant latency possible • OS implementations – DNS lookup is blocking – only one outstanding request at a time. • Solutions • DNS caching • Batch DNS resolver – collects requests and sends them out together Ref: Manning Introduction to Information Retrieval
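A minimal sketch of the caching idea in Python 2, using the standard socket lookup (the host name is only an example; batching requests would need more machinery than shown here):

  import socket

  dns_cache = {}   # hostname -> IP address already resolved

  def resolve(host):
      # Answer from the cache when possible; otherwise do a (blocking) DNS lookup
      if host not in dns_cache:
          dns_cache[host] = socket.gethostbyname(host)
      return dns_cache[host]

  print resolve("www.villanova.edu")   # first call goes out to DNS
  print resolve("www.villanova.edu")   # second call is answered from the cache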

  48. Parsing • Fetched page contains • Embedded links to more pages • Actual content for use in the application • Extract the links • Relative link? Expand (normalize) • Seen before? Discard • New? • Meet criteria? Append to URL frontier • Does not meet criteria? Discard • Examine content
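A minimal sketch of the extract-and-normalize step in Python 2, using the standard HTML parser and urljoin (the class name, base URL, and sample snippet are mine, for illustration):

  from HTMLParser import HTMLParser
  from urlparse import urljoin

  class LinkExtractor(HTMLParser):
      # Collect href values from anchor tags, expanded against the page's own URL
      def __init__(self, base_url):
          HTMLParser.__init__(self)
          self.base_url = base_url
          self.links = []
      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(urljoin(self.base_url, value))  # relative -> absolute

  extractor = LinkExtractor("http://www.example.com/dir/page.html")
  extractor.feed('<a href="other.html">next</a> <a href="/top.html">top</a>')
  print extractor.links
  # ['http://www.example.com/dir/other.html', 'http://www.example.com/top.html']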

  49. Content • Seen before? • How to tell? Fingerprints, shingles • Documents may be identical or merely similar • If already in the index, do not process it again Ref: Manning Introduction to Information Retrieval
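A whole-document fingerprint catches exact duplicates; shingling (hashing overlapping word sequences) is what catches near-duplicates. A minimal sketch of the exact-duplicate case in Python 2, using an MD5 digest as the fingerprint (the sample documents are made up):

  import hashlib

  def fingerprint(text):
      # Hash the whole document; identical content gives an identical digest
      return hashlib.md5(text).hexdigest()

  seen_fingerprints = set()
  for doc in ["some page content", "different content", "some page content"]:
      fp = fingerprint(doc)
      if fp in seen_fingerprints:
          print "duplicate content -- skip it"
      else:
          seen_fingerprints.add(fp)
          print "new content -- index it"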

  50. Distributed crawler • For big crawls, • Many processes, each doing part of the job • Possibly on different nodes • Geographically distributed • How to distribute • Give each node a set of hosts to crawl • Use a hashing function to partition the set of hosts • How do these nodes communicate? • Need to have a common index Ref: Manning Introduction to Information Retrieval
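A minimal sketch of partitioning by host in Python 2 (the node count is an assumed value; hashing the host rather than the full URL keeps all of a host's pages on one node, so politeness can be enforced locally):

  from urlparse import urlparse

  NUM_NODES = 4   # assumed number of crawler nodes

  def node_for_url(url):
      # All URLs from the same host map to the same crawler node
      host = urlparse(url).hostname or ""
      return hash(host) % NUM_NODES

  print node_for_url("http://www.example.com/a.html")
  print node_for_url("http://www.example.com/b.html")   # same host, so same node

In a real system a stable hash (for example, a digest of the host name) is preferable, so the assignment does not change between runs or machines.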
