Web Spiders
Dan Reeves, Bill Walsh
HDIW, EECS 547
16 February 2000
What is a Web Spider?
• A program that uses HTTP to automatically
  • download documents from a web server
  • analyze documents retrieved from a web server
  • send data back to a web server
Spider Usage
• Search engines
  • Lycos analyzes 10,000,000 Web pages a day
• Comparison shopping
  • ShopBot
• Data analysis
  • bidding behavior at online auctions
• Automated Web interactions
  • daily comics delivery
  • stock trading agent
• Other (mirroring, HTML/link validation, …)
How Humans Typically Access The Web
• Web browser
  • human-friendly interface
  • hides details of HTTP
• Web browser is just a program written in some language
• Whatever it does, you (a programmer) can do too!
What Components We Need to Use the Web
socket connection + HTTP + page knowledge
Setting Up a Socket Connection
• Programmatically (C, Perl, Java, Lisp, etc.)
• Unix command prompt:
  > telnet address port_number
  • address is the web site's host name (no http:// prefix)
  • default port_number for most web sites: 80
  > telnet www.netscape.com 80
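The programmatic route does the same thing telnet does: open a TCP connection to the host on port 80 and write the request text. A minimal Python sketch follows, assuming plain HTTP without SSL; the function names build_get_request and fetch_raw and the User-Agent string are illustrative, not part of the talk's materials.

```python
import socket


def build_get_request(host, path="/", user_agent="example-spider/0.1"):
    """Return the bytes of a minimal HTTP/1.0 GET request."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        f"User-Agent: {user_agent}",  # hypothetical spider name
        "",  # the empty line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host, port=80, path="/", timeout=10.0):
    """Open a TCP connection, send one GET, and return the raw reply bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Calling fetch_raw("www.netscape.com") would return the status line, headers, and body as one byte string, much as telnet prints them.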
HTTP
• A well-defined specification for message formats
• Orthogonal to:
  • TCP/IP
  • HTML
  • XML
• W3C – World Wide Web Consortium
  • www.w3.org
Page Knowledge
• Markup language: HTML, XML, free text
• Data formatting
  • regular expressions
  • domain-specific conventions
  • freeform text
• How to get the knowledge:
  • coded by humans
  • learning
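As an illustration of human-coded page knowledge expressed as regular expressions, the Python sketch below pulls the title and link targets out of an HTML string. The patterns and the names extract_title and extract_links are assumptions made for this example; regular expressions like these are brittle against unusual markup, so treat it as a starting point only.

```python
import re

# Assumed patterns: tolerant of upper- or lower-case tags and of unquoted
# attribute values, but not a full HTML parser.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r"""<a\s[^>]*?href\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def extract_title(html):
    """Return the text of the first <TITLE> element, or None if absent."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def extract_links(html):
    """Return the href value of every <A> tag, in document order."""
    return HREF_RE.findall(html)
```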
telnet www.netscape.com 80

% telnet www.netscape.com 80
Trying 207.200.75.204...
Connected to www-ld2.netscape.com.
Escape character is '^]'.
GET /index.html HTTP/1.0
User-Agent: An Evil Spider
Accept: image/gif, */*
Accept-Language: en, de
Purpose-of-Request: Denial of Service Attack

HTTP/1.1 200 OK
Server: Netscape-Enterprise/3.6
Date: Thu, 10 Feb 2000 21:22:39 GMT
Set-Cookie: UIDC=141.213.12.186:0950217760:031129;domain=.netscape.com;path=/; expires=31-Dec-2010 23:59:59 GMT
Content-type: text/html
Connection: close

<HTML><HEAD><SCRIPT LANGUAGE=javascript><!-- Hide from old browsers
if (parseFloat(navigator.appVersion) < 3) {document.write('<FRAMESET>'); location.href= "http://home.netscape.com/computing/download/upgrade_index.html";}
// Stop Hiding From Old Browsers --></SCRIPT><!--REPLACE_START_QWEST5-->
<TITLE>Netcenter</TITLE><!--REPLACE_END_QWEST5-->
[[[ about 40k snipped ]]]
<SCRIPT LANGUAGE="JavaScript1.1">window.pup=Pup();</SCRIPT></BODY></HTML>
Connection closed by foreign host.
%
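A spider has to take a reply like the one in this session apart: the status line, then the headers, then an empty line, then the document. A minimal Python sketch of that split is shown here; parse_response is a hypothetical helper, and it keeps only the last value when a header such as Set-Cookie is repeated.

```python
def parse_response(raw):
    """Split raw HTTP response bytes into (version, status, reason, headers, body)."""
    # The first empty line separates the header block from the document.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line, e.g. "HTTP/1.1 200 OK"; the reason phrase may be missing.
    parts = lines[0].split(" ", 2)
    version, status = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lower-cased.
        headers[name.strip().lower()] = value.strip()
    return version, status, reason, headers, body
```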
Example of GET

GET / HTTP/1.0^M
Proxy-Connection: Keep-Alive^M
User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)^M
Host: www.netscape.com^M
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*^M
Accept-Encoding: gzip^M
Accept-Language: de, en^M
Accept-Charset: iso-8859-1,*,utf-8^M
Cookie: UIDC=141.213.12.157:0937196127:933529; HITO_VISITS=A151853E2+1368C1*E00B2*1^M
Example of GET-Based Form

GET /lookup/Lookup.tibco?search=sunw&st_symbol=on HTTP/1.0
Referer: http://www.netscape.com/
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
Host: lookup.netscape.com
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: de, en
Accept-Charset: iso-8859-1,*,utf-8
Cookie: UIDC=141.213.12.157:0937196127:933529; HITO_VISITS=A151853E2+1368C1*E00B2*1; NSPOP=|myn12
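In a GET-based form the field names and values travel in the URL's query string. The Python sketch below rebuilds the lookup URL from this example with the standard urllib.parse module; build_get_form_url is an illustrative name, and the http scheme is assumed.

```python
from urllib.parse import urlencode, urlunsplit


def build_get_form_url(host, path, fields):
    """Encode form fields into a query string and attach it to the URL."""
    query = urlencode(fields)  # escapes special characters, joins pairs with "&"
    return urlunsplit(("http", host, path, query, ""))


url = build_get_form_url(
    "lookup.netscape.com",
    "/lookup/Lookup.tibco",
    {"search": "sunw", "st_symbol": "on"},
)
```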
Example of POST-Based Form

POST /~dreeves/bin/quote-submit.cgi HTTP/1.0
Referer: http://www.eecs.umich.edu/~dreeves/add-quote.html
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
Host: www.eecs.umich.edu
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: de, en
Accept-Charset: iso-8859-1,*,utf-8
Content-type: application/x-www-form-urlencoded
Content-length: 234

recipient=daniel&subject=QUOTE+DATABASE+SUBMISSION&name=eecs547+student&email=dreeves%40umich.edu&body=%22Beware+of+bugs+in+the+above+code%3B+I+have+only+proved+it+correct%2C+not+tried+it.%22%0D%0A++++++++++++++++--+Donald+Knuth%0D%0A
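A POST-based form carries the same URL-encoded fields in the message body instead of the URL, and the Content-length header must equal the body's size in bytes. Here is a minimal Python sketch of assembling such a message; build_post_request is a hypothetical helper, and it sends only the headers a server strictly needs rather than the full browser set shown in this example.

```python
from urllib.parse import urlencode


def build_post_request(host, path, fields):
    """Return the bytes of an HTTP/1.0 POST carrying URL-encoded form fields."""
    # Spaces become "+", and characters such as "@" become %XX escapes.
    body = urlencode(fields).encode("ascii")
    head = "\r\n".join([
        f"POST {path} HTTP/1.0",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body)}",  # must match the body length exactly
        "",  # the empty line separates headers from body
        "",
    ]).encode("ascii")
    return head + body
```

Computing the length from the encoded body, rather than typing it by hand, avoids a request that the server truncates or waits on.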
Basic Perl Web Library (web.pl)
• getURLAsString
  • Given a URL, returns the contents as a string.
• submitForm
  • Given a URL and a Perl hash of HTML form fields and contents, submits the form and returns the response.
• html2text
  • Uses lynx to parse HTML into a reasonable text approximation.
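The source of web.pl is not reproduced on these slides, so the following is only a rough Python analogue of two of its functions, built on the standard library. The names get_url_as_string and html_to_text are made up for this sketch, and the text conversion uses Python's html.parser in place of lynx, so its output will differ from what html2text produces.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def get_url_as_string(url, timeout=10.0):
    """Fetch a URL and return its contents decoded to a string."""
    with urlopen(url, timeout=timeout) as resp:
        # Fall back to iso-8859-1 when the server declares no charset.
        charset = resp.headers.get_content_charset() or "iso-8859-1"
        return resp.read().decode(charset, errors="replace")


class _TextExtractor(HTMLParser):
    """Collect text outside <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return a whitespace-normalized text approximation of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```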
Other Issues and Gotchas
• SSL
  • Perl SSLeay library
• Cookies
  • Perl libraries exist
• robots.txt file
  • Sometimes for the spider’s benefit
• Politeness
  • Don’t get your domain blocked!
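The robots.txt and politeness points can be combined in a few lines: filter candidate URLs through the site's rules, then pause between requests. The Python sketch below uses the standard urllib.robotparser module; the helper names, the example rules, and the one-second default delay are assumptions chosen for illustration.

```python
import time
from urllib.robotparser import RobotFileParser


def make_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_targets(rules, user_agent, urls):
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def crawl(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index:  # no need to wait before the first request
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

In a real spider the robots.txt text would first be downloaded from the site's root, and fetch would be whatever download function the spider already uses.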
For More Information...
All examples and links at: http://www.eecs.umich.edu/~dreeves/hdiw/main.html