Learn how web spiders operate, their applications, and components needed. Explore spider usage like search engines, data analysis, and automated interactions. Discover how to access the web and set up a socket connection.
Web Spiders Dan Reeves Bill Walsh HDIW EECS 547 16 February 2000
What is a Web Spider? • A program that uses HTTP to automatically • download documents from a web server • analyze documents retrieved from a web server • send data back to a web server
Spider Usage • Search engines • Lycos analyzes 10,000,000 Web pages a day • Comparison shopping • ShopBot • Data analysis • bidding behavior at online auctions • Automated Web interactions • daily comics delivery • stock trading agent • Other (Mirroring, HTML/link validation, …)
How Humans Typically Access The Web • Web browser • human friendly interface • hides details of HTTP • Web browser is just a program written in some language • Whatever it does, you (a programmer) can do too!
What Components We Need to Use the Web • socket connection + HTTP + page knowledge
Setting Up a Socket Connection • Programmatically (C, Perl, Java, Lisp, etc.) • Unix command prompt: > telnet address port_number • address is the web site's host name (no http:// prefix) • default port_number for most web sites: 80 • Example: telnet www.netscape.com 80
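The programmatic route can be sketched in Python with the standard `socket` module. This is a minimal sketch, not code from the slides: the helper names `build_get_request` and `fetch` and the `example-spider/0.1` user agent are placeholders. It does by hand what the telnet command does: open a TCP connection to port 80 and write an HTTP request.

```python
import socket


def build_get_request(host, path="/"):
    """Build a bare HTTP/1.0 GET request as bytes (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "User-Agent: example-spider/0.1",  # placeholder name, not from the slides
        "",  # the two empty strings produce the blank line
        "",  # that terminates the header block
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, path="/", port=80, timeout=10.0):
    """Open a TCP socket, send the request, and read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read means the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling `fetch("www.example.com")` would return the raw response, status line and headers included, much like the telnet session prints it.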
HTTP • A well-defined specification for message formats • Orthogonal to: • TCP/IP • HTML • XML • W3C – World Wide Web Consortium • www.w3.org
Page Knowledge • Markup language: HTML, XML, free text • Data formatting • regular expressions • domain-specific conventions • freeform text • How to get the knowledge: • coded by humans • learning
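As one illustration of human-coded page knowledge, a regular expression can pull link targets out of HTML. The sketch below is an assumption about how that might look in Python; the pattern and the name `extract_links` are illustrative, and a regex this simple will miss unquoted or oddly formatted attributes.

```python
import re

# Matches href="..." or href='...' inside an <a ...> tag.
# Deliberately simple: unquoted attribute values are not handled.
LINK_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html):
    """Return the href values of anchor tags found in an HTML string."""
    return LINK_RE.findall(html)
```

For pages that follow a domain-specific convention (for example, a price always appearing after a fixed label), the same approach applies with a pattern written for that convention.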
telnet www.netscape.com 80
% telnet www.netscape.com 80
Trying 207.200.75.204...
Connected to www-ld2.netscape.com.
Escape character is '^]'.
GET /index.html HTTP/1.0
User-Agent: An Evil Spider
Accept: image/gif, */*
Accept-Language: en, de
Purpose-of-Request: Denial of Service Attack

HTTP/1.1 200 OK
Server: Netscape-Enterprise/3.6
Date: Thu, 10 Feb 2000 21:22:39 GMT
Set-Cookie: UIDC=141.213.12.186:0950217760:031129;domain=.netscape.com;path=/; expires=31-Dec-2010 23:59:59 GMT
Content-type: text/html
Connection: close

<HTML><HEAD><SCRIPT LANGUAGE=javascript><!-- Hide from old browsers
if (parseFloat(navigator.appVersion) < 3) {document.write('<FRAMESET>'); location.href= "http://home.netscape.com/computing/download/upgrade_index.html";}
// Stop Hiding From Old Browsers --></SCRIPT><!--REPLACE_START_QWEST5-->
<TITLE>Netcenter</TITLE><!--REPLACE_END_QWEST5-->
[[[ about 40k snipped ]]]
<SCRIPT LANGUAGE="JavaScript1.1">window.pup=Pup();</SCRIPT></BODY></HTML>
Connection closed by foreign host.
%
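A spider has to split a raw response like this one into a status line, headers, and body. The following is a minimal Python sketch under simplifying assumptions: the function name `parse_response` is hypothetical, the whole response is already in memory, and a repeated header name simply overwrites the earlier value.

```python
def parse_response(raw):
    """Split a raw HTTP response (bytes) into status, reason, headers, body."""
    # Headers end at the first blank line; everything after it is the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line looks like: HTTP/1.1 200 OK
    _version, status, reason = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return int(status), reason, headers, body
```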
Example of GET
GET / HTTP/1.0^M
Proxy-Connection: Keep-Alive^M
User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)^M
Host: www.netscape.com^M
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*^M
Accept-Encoding: gzip^M
Accept-Language: de, en^M
Accept-Charset: iso-8859-1,*,utf-8^M
Cookie: UIDC=141.213.12.157:0937196127:933529; HITO_VISITS=A151853E2+1368C1*E00B2*1^M
Example of GET-Based Form
GET /lookup/Lookup.tibco?search=sunw&st_symbol=on HTTP/1.0
Referer: http://www.netscape.com/
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
Host: lookup.netscape.com
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: de, en
Accept-Charset: iso-8859-1,*,utf-8
Cookie: UIDC=141.213.12.157:0937196127:933529; HITO_VISITS=A151853E2+1368C1*E00B2*1; NSPOP=|myn12
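In a GET-based form, the field names and values travel in the URL after the `?`. A minimal Python sketch of building such a request follows; `build_form_get` is a hypothetical helper, and only the request line and `Host` header are emitted, leaving out the browser-specific headers shown on the slide.

```python
from urllib.parse import urlencode


def build_form_get(host, path, fields):
    """Return an HTTP/1.0 GET request string with form fields in the query."""
    query = urlencode(fields)  # e.g. {"a": "1", "b": "2"} -> "a=1&b=2"
    return (
        f"GET {path}?{query} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "\r\n"  # blank line ends the headers; a GET has no body
    )
```

With the slide's values, `build_form_get("lookup.netscape.com", "/lookup/Lookup.tibco", {"search": "sunw", "st_symbol": "on"})` produces the same request line as the captured browser request.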
Example of POST-Based Form
POST /~dreeves/bin/quote-submit.cgi HTTP/1.0
Referer: http://www.eecs.umich.edu/~dreeves/add-quote.html
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
Host: www.eecs.umich.edu
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: de, en
Accept-Charset: iso-8859-1,*,utf-8
Content-type: application/x-www-form-urlencoded
Content-length: 234

recipient=daniel&subject=QUOTE+DATABASE+SUBMISSION&name=eecs547+student&email=dreeves%40umich.edu&body=%22Beware+of+bugs+in+the+above+code%3B+I+have+only+proved+it+correct%2C+not+tried+it.%22%0D%0A++++++++++++++++--+Donald+Knuth%0D%0A
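In a POST-based form, the same URL-encoded fields move into the request body, and `Content-Length` must match the body's byte count. The sketch below assumes Python's standard `urllib.parse.urlencode`; `build_form_post` is a hypothetical helper name, and the browser-specific headers from the slide are omitted.

```python
from urllib.parse import urlencode


def build_form_post(host, path, fields):
    """Return an HTTP/1.0 POST request (bytes) carrying URL-encoded fields."""
    # urlencode turns spaces into '+' and escapes characters such as '@'.
    body = urlencode(fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"  # must equal the body size in bytes
        "\r\n"
    ).encode("ascii")
    return head + body
```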
Basic Perl Web Library (web.pl) • getURLAsString • Given a URL, returns contents as string. • submitForm • Given a URL and a perl hash of HTML form fields and contents, submits the form and returns response. • html2text • Uses lynx to parse html into a reasonable text approximation.
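The source of web.pl is not shown here, so the following is only a rough Python analogue of the three functions, built on the standard library. The names `get_url_as_string`, `submit_form`, and `html_to_text` are assumptions, and the text extraction uses `html.parser` in place of lynx, so its output will differ from what lynx produces.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen


def _read_text(resp):
    # Fall back to ISO-8859-1 when the server declares no charset.
    charset = resp.headers.get_content_charset() or "iso-8859-1"
    return resp.read().decode(charset, errors="replace")


def get_url_as_string(url, timeout=10.0):
    """Given a URL, return its contents as a string."""
    with urlopen(url, timeout=timeout) as resp:
        return _read_text(resp)


def submit_form(url, fields, timeout=10.0):
    """POST a dict of form fields to a URL and return the response text."""
    data = urlencode(fields).encode("ascii")
    with urlopen(url, data=data, timeout=timeout) as resp:
        return _read_text(resp)


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return a crude plain-text approximation of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```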
Other Issues and Gotchas • SSL • Perl SSLeay library • Cookies • Perl libraries exist • robots.txt file • Sometimes for spider’s benefit • Politeness • Don’t get your domain blocked!
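The robots.txt and politeness points can be sketched with Python's standard `urllib.robotparser`. This is one possible arrangement, not something prescribed by the slides: the class name `PoliteFetcher`, the one-second default delay, and the `example-spider` user agent are all assumptions.

```python
import time
from urllib.robotparser import RobotFileParser


def make_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class PoliteFetcher:
    """Checks robots.txt rules and spaces out requests (illustrative only)."""

    def __init__(self, rules, user_agent, delay_seconds=1.0):
        self.rules = rules
        self.user_agent = user_agent
        self.delay_seconds = delay_seconds
        self._last_request = None  # monotonic time of the previous request

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.rules.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep until at least delay_seconds have passed since the last call."""
        now = time.monotonic()
        if self._last_request is not None:
            remaining = self.delay_seconds - (now - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()
```

A spider would call `allowed(url)` before each request and `wait_turn()` just before sending it, skipping any URL the site has disallowed.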
For More Information... All examples and links at: http://www.eecs.umich.edu/~dreeves/hdiw/main.html