Web Spiders

  1. Web Spiders
     Dan Reeves, Bill Walsh
     HDIW EECS 547
     16 February 2000

  2. What is a Web Spider?
     • A program that uses HTTP to automatically
        • download documents from a web server
        • analyze documents retrieved from a web server
        • send data back to a web server

  3. Spider Usage
     • Search engines
        • Lycos analyzes 10,000,000 Web pages a day
     • Comparison shopping
        • ShopBot
     • Data analysis
        • bidding behavior at online auctions
     • Automated Web interactions
        • daily comics delivery
        • stock trading agent
     • Other (Mirroring, HTML/link validation, …)

  4. How Humans Typically Access The Web
     • Web browser
        • human-friendly interface
        • hides details of HTTP
     • Web browser is just a program written in some language
     • Whatever it does, you (a programmer) can do too!

  5. What Components We Need to Use the Web
     socket connection + HTTP + page knowledge

  6. Setting Up a Socket Connection
     • Programmatically (C, Perl, Java, Lisp, etc.)
     • Unix command prompt:
          > telnet address port_number
        • address is the web site's host name (no http:// prefix)
        • default port_number for most web sites: 80
          > telnet www.netscape.com 80
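
The "programmatically" route can be sketched in Python with the standard socket module. This is a minimal sketch, not code from the slides: the names build_get_request and fetch and the default User-Agent string are assumptions made for illustration.

```python
import socket


def build_get_request(host, path="/", user_agent="example-spider/0.1"):
    """Return the raw bytes of a minimal HTTP/1.0 GET request."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        f"User-Agent: {user_agent}",
        "",  # the empty line that ends the header block
        "",
    ]
    # HTTP header lines are terminated by CRLF.
    return "\r\n".join(lines).encode("ascii")


def fetch(host, path="/", port=80, timeout=10.0):
    """Open a TCP socket, send the request, and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server closed the connection: response complete
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Calling fetch(host, path) against a live server does what the telnet command does by hand: connect to port 80, type the request, and read until the server hangs up.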

  7. HTTP
     • A well-defined specification for message formats
     • Orthogonal to:
        • TCP/IP
        • HTML
        • XML
     • W3C – World Wide Web Consortium
        • www.w3.org

  8. Page Knowledge
     • Markup language: HTML, XML, free text
     • Data formatting
        • regular expressions
        • domain-specific conventions
        • freeform text
     • How to get the knowledge:
        • coded by humans
        • learning
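
Human-coded page knowledge based on regular expressions might look like the sketch below. The two patterns and the function names are assumptions chosen for illustration; regular expressions like these are brittle on real-world HTML and only stand in for whatever conventions a given site follows.

```python
import re

# Hand-coded "page knowledge": where the title lives and what a link looks like.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def extract_links(html):
    """Return the quoted href targets of all <a> tags, in document order."""
    return HREF_RE.findall(html)
```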

  9. telnet www.netscape.com 80

     % telnet www.netscape.com 80
     Trying 207.200.75.204...
     Connected to www-ld2.netscape.com.
     Escape character is '^]'.
     GET /index.html HTTP/1.0
     User-Agent: An Evil Spider
     Accept: image/gif, */*
     Accept-Language: en, de
     Purpose-of-Request: Denial of Service Attack

     HTTP/1.1 200 OK
     Server: Netscape-Enterprise/3.6
     Date: Thu, 10 Feb 2000 21:22:39 GMT
     Set-Cookie: UIDC=141.213.12.186:0950217760:031129;domain=.netscape.com;path=/; expires=31-Dec-2010 23:59:59 GMT
     Content-type: text/html
     Connection: close

     <HTML><HEAD><SCRIPT LANGUAGE=javascript><!-- Hide from old browsers
     if (parseFloat(navigator.appVersion) < 3) {document.write('<FRAMESET>'); location.href= "http://home.netscape.com/computing/download/upgrade_index.html";}
     // Stop Hiding From Old Browsers --></SCRIPT><!--REPLACE_START_QWEST5-->
     <TITLE>Netcenter</TITLE><!--REPLACE_END_QWEST5-->
     [[[ about 40k snipped ]]]
     <SCRIPT LANGUAGE="JavaScript1.1">window.pup=Pup();</SCRIPT></BODY></HTML>
     Connection closed by foreign host.
     %
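
A spider has to take a response like the one in this session apart: status line, then headers, then an empty line, then the document. The following is a rough sketch under simplifying assumptions; parse_response is a hypothetical helper, and a repeated header such as Set-Cookie keeps only its last value here.

```python
def parse_response(raw):
    """Split raw HTTP response bytes into (version, status, reason, headers, body)."""
    # The first empty line (CRLF CRLF) separates the header block from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")

    # Status line, e.g. "HTTP/1.1 200 OK"; the reason phrase may be absent.
    parts = lines[0].split(" ", 2)
    version, status = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return version, status, reason, headers, body
```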

  10. Setting Up a Proxy in Netscape

  11. Example of GET

      GET / HTTP/1.0^M
      Proxy-Connection: Keep-Alive^M
      User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)^M
      Host: www.netscape.com^M
      Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*^M
      Accept-Encoding: gzip^M
      Accept-Language: de, en^M
      Accept-Charset: iso-8859-1,*,utf-8^M
      Cookie: UIDC=141.213.12.157:0937196127:933529; HITO_VISITS=A151853E2+1368C1*E00B2*1^M
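
The Cookie line in a request like this one echoes name=value pairs that the server handed out earlier in Set-Cookie response headers. A minimal sketch of that round trip, with hypothetical helper names and no handling of expiry, domain, or path matching:

```python
def cookie_pair(set_cookie_value):
    """Return the (name, value) pair from the value of a Set-Cookie header.

    Attributes after the first ';' (domain, path, expires) tell the client
    when to send the cookie back; they are not echoed to the server.
    """
    first = set_cookie_value.split(";", 1)[0]
    name, _, value = first.partition("=")
    return name.strip(), value.strip()


def cookie_header(pairs):
    """Build the value of a Cookie request header from (name, value) pairs."""
    return "; ".join(f"{name}={value}" for name, value in pairs)
```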

  12. Example of GET-Based Form

      GET /lookup/Lookup.tibco?search=sunw&st_symbol=on HTTP/1.0
      Referer: http://www.netscape.com/
      Proxy-Connection: Keep-Alive
      User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
      Host: lookup.netscape.com
      Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
      Accept-Encoding: gzip
      Accept-Language: de, en
      Accept-Charset: iso-8859-1,*,utf-8
      Cookie: UIDC=141.213.12.157:0937196127:933529; HITO_VISITS=A151853E2+1368C1*E00B2*1; NSPOP=|myn12
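
In a GET-based form the field names and values travel in the request path itself, after a "?". A small sketch of how a spider could build that path with Python's urllib.parse; form_get_path is a hypothetical helper name, while the path and fields are the ones from the request above.

```python
from urllib.parse import urlencode


def form_get_path(action_path, fields):
    """Append URL-encoded form fields to the form's action path."""
    return f"{action_path}?{urlencode(fields)}"


path = form_get_path("/lookup/Lookup.tibco", {"search": "sunw", "st_symbol": "on"})
print(path)  # /lookup/Lookup.tibco?search=sunw&st_symbol=on
```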

  13. Example of POST-Based Form

      POST /~dreeves/bin/quote-submit.cgi HTTP/1.0
      Referer: http://www.eecs.umich.edu/~dreeves/add-quote.html
      Proxy-Connection: Keep-Alive
      User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
      Host: www.eecs.umich.edu
      Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
      Accept-Encoding: gzip
      Accept-Language: de, en
      Accept-Charset: iso-8859-1,*,utf-8
      Content-type: application/x-www-form-urlencoded
      Content-length: 234

      recipient=daniel&subject=QUOTE+DATABASE+SUBMISSION&name=eecs547+student&email=dreeves%40umich.edu&body=%22Beware+of+bugs+in+the+above+code%3B+I+have+only+proved+it+correct%2C+not+tried+it.%22%0D%0A++++++++++++++++--+Donald+Knuth%0D%0A
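
In a POST-based form the same URL-encoded fields move into the request body, and Content-length must give the body's size in bytes. A minimal sketch, assuming a hypothetical helper build_post_request and sending only the headers the body depends on:

```python
from urllib.parse import urlencode


def build_post_request(host, path, fields):
    """Return the raw bytes of an HTTP/1.0 form POST for the given fields."""
    # urlencode turns spaces into '+' and escapes characters such as '@' as %40.
    body = urlencode(fields).encode("ascii")
    head = "\r\n".join([
        f"POST {path} HTTP/1.0",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body)}",  # byte count of the body only
        "",  # the empty line that separates headers from body
        "",
    ]).encode("ascii")
    return head + body
```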

  14. Basic Perl Web Library (web.pl)
      • getURLAsString
         • Given a URL, returns its contents as a string.
      • submitForm
         • Given a URL and a Perl hash of HTML form fields and contents, submits the form and returns the response.
      • html2text
         • Uses lynx to parse HTML into a reasonable text approximation.
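
For readers without web.pl, the three functions can be approximated in Python's standard library. This is not the authors' code: the function names are Python-style stand-ins, and html_to_text swaps lynx for the built-in html.parser module, so its output will differ from what lynx produces.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen


def get_url_as_string(url, timeout=10.0):
    """Given a URL, return its contents as a string."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "latin-1"
        return response.read().decode(charset, errors="replace")


def submit_form(url, fields, timeout=10.0):
    """POST a dict of form fields to url and return the response text."""
    data = urlencode(fields).encode("ascii")  # passing data makes urlopen POST
    with urlopen(url, data=data, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "latin-1"
        return response.read().decode(charset, errors="replace")


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```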

  15. Example: Get Today’s Dilbert and Package it for Email

  16. Other Issues and Gotchas
      • SSL
         • Perl SSLeay library
      • Cookies
         • Perl libraries exist
      • robots.txt file
         • Sometimes for spider’s benefit
      • Politeness
         • Don’t get your domain blocked!
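
The robots.txt and politeness points can be combined in a few lines. The sketch below uses Python's urllib.robotparser; the user-agent name, the helper names, and the one-second default delay are assumptions for illustration, not recommendations from the slides.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-spider"  # hypothetical name for this sketch


def make_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_urls(urls, rules, delay_seconds=1.0):
    """Yield only the URLs robots.txt allows, pausing between them."""
    first = True
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # the site asked spiders to stay out of this path
        if not first:
            time.sleep(delay_seconds)  # spread requests out over time
        first = False
        yield url
```

A crawl loop would fetch each URL that polite_urls yields; the pause happens between allowed URLs, so skipped ones cost nothing.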

  17. For More Information...
      All examples and links at: http://www.eecs.umich.edu/~dreeves/hdiw/main.html