
Introduction to Web Robots, Crawlers & Spiders


Presentation Transcript


  1. Introduction to Web Robots, Crawlers & Spiders Instructor: Joseph DiVerdi, Ph.D., MBA

  2. Web Robot Defined • A Web Robot Is a Program • That Automatically Traverses the Web • Using Hypertext Links • Retrieving a Particular Document • Then Retrieving All Documents That Are Referenced • Recursively • "Recursive" Doesn't Limit the Definition • To Any Specific Traversal Algorithm • Even If a Robot Applies Some Heuristic to the Selection & Order of Documents to Visit & Spaces Out Requests Over a Long Time Period • It Is Still a Robot
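
A minimal sketch of this recursive traversal in Python; the start URL and page limit are assumptions, and a real robot would add politeness delays and robots.txt checks:

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag -- the document's hypertext links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(url, seen, limit=10):
    # Retrieve a particular document, then retrieve all documents
    # that are referenced, recursively.
    if url in seen or len(seen) >= limit:
        return
    seen.add(url)
    try:
        page = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
    except OSError:
        return
    parser = LinkParser()
    parser.feed(page)
    for link in parser.links:
        crawl(urljoin(url, link), seen, limit)

crawl("http://www.example.com/", set())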

  3. Web Robot Defined • Normal Web Browsers Are Not Robots • Because They Are Operated by a Human • Don't Automatically Retrieve Referenced Documents • Other Than Inline Images

  4. Web Robot Defined • Sometimes Referred to As • Web Wanderers • Web Crawlers • Spiders • These Names Are a Bit Misleading • They Give the Impression the Software Itself Moves Between Sites • Like a Virus • This Is Not the Case • A Robot Visits Sites by Requesting Documents From Them

  5. Agent Defined • The Term Agent Is (Over) Used These Days • Specific Agents Include: • Autonomous Agent • Intelligent Agent • User-Agent

  6. Autonomous Agent Defined • An Autonomous Agent Is a Program • That Automatically Travels Between Sites • Makes Its Own Decisions • When To Move, When To Stay • They Are Limited to Travel Between Selected Sites • Currently Not Widespread on the Web

  7. Intelligent Agent Defined • An Intelligent Agent Is a Program • That Helps Users With Certain Activities • Choosing a Product • Filling Out a Form • Finding Particular Items • Generally Have Little to Do With Networking • Usually Created & Maintained by an Organization • To Assist Its Own Viewers

  8. User-Agent Defined • A User-Agent Is a Program • That Performs Networking Tasks for a User • Web User-Agent • Navigator • Internet Explorer • Opera • Email User-Agent • Eudora • FTP User-Agent • HTML-Kit • Fetch • CuteFTP

  9. Search Engine Defined • A Search Engine Is a Program • That Examines a Database • Upon Request or Automatically • Delivers Results or Creates a Digest • In the Context of the Web, a Search Engine Is • A Program That Examines Databases of HTML Documents • Databases Gathered by a Robot • Upon Request • Delivers Results Via an HTML Document
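
A toy Python sketch of the request path, with a small hand-built index standing in for the database gathered by a robot; the words and URLs are hypothetical:

# Inverted index: word -> URLs of the HTML documents containing it.
index = {
    "robot":  ["http://example.com/a.html", "http://example.com/b.html"],
    "spider": ["http://example.com/b.html"],
}

def search(query):
    # Upon request, examine the database and deliver matching documents.
    return index.get(query.lower(), [])

print(search("robot"))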

  10. Robot Purposes • Robots Are Used for a Number of Tasks • Indexing • Just Like a Book Index • HTML Validation • Link Validation • Searching for Broken Links (See the Sketch Below) • What's New Monitoring • Mirroring • Making a Copy of a Primary Web Site • On a Separate Server • More Local to Some Users • Shares the Workload With the Primary Server
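
One of these tasks, link validation, fits in a few lines of Python. A minimal sketch, assuming hypothetical URLs; a production checker would also honor robots.txt and space out its requests:

import urllib.request
import urllib.error

def check_links(urls):
    for url in urls:
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=5) as response:
                status = response.status
        except urllib.error.HTTPError as err:
            status = err.code          # e.g., 404 for a broken link
        except urllib.error.URLError:
            status = None              # unreachable host
        if status != 200:
            print("broken:", url, status)

check_links(["http://www.example.com/", "http://www.example.com/missing.html"])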

  11. Other Popular Names • All Names for the Same Sort of Program • With Slightly Different Connotations • Web Spiders • Sounds Cooler in the Media • Web Crawlers • WebCrawler Is a Specific Robot • Web Worms • A Worm Is a Replicating Program • Web Ants • Distributed Cooperating Robots

  12. Robot Ethics • Robots Have Enjoyed a Checkered History • Certain Robot Programs Can • And Have in the Past • Overload Networks & Servers • With Numerous Requests • This Happens Especially With Programmers • Just Starting to Write a Robot Program • These Days There Is Sufficient Information on Robots to Prevent Many of These Mistakes • But Does Everyone Read It?

  13. Robot Ethics • Robots Have Enjoyed a Checkered History • Robots Are Operated by Humans • Can Make Mistakes in Configuration • Don't Consider the Implications of Actions • This Means • Robot Operators Need to Be Careful • Robot Authors Need to Make It Difficult for Operators to Make Mistakes • With Bad Effects

  14. Robot Ethics • Robots Have Enjoyed a Checkered History • Indexing Robots Build Central Database of Documents • Which Doesn't Always Scale Well • To Millions of Documents • On Millions of Sites • Many Different Problems Occur • Missing Sites & Links • High Server Loads • Broken Links

  15. Robot Ethics • Robots Have Enjoyed a Checkered History • Majority of Robots Are • Well Designed • Professionally Operated • Cause No Problems • Provide a Valuable Service • Robots Aren't Inherently Bad • Nor Are They Inherently Brilliant • They Just Need Careful Attention

  16. Robot Visitation Strategies • Generally Start From a Historical URL List • Especially Documents With Many or Certain Links • Server Lists • What's New Pages • Most Popular Sites on the Web • Other Sources for URLs Are Used • Scans Through USENET Postings • Published Mailing List Archives • Robot Selects URLs to Visit, Index, & Parse • And Uses Them As a Source for New URLs

  17. Robot Indexing Strategies • If an Indexing Robot Is Aware of a Document • Robot May Decide to Parse Document • Insert Document Content Into Robot's Database • Decision Depends on the Robot • Some Robots Index • HTML Titles • The First Few Paragraphs • Parse the Entire HTML & Index All Words • With Weightings Depending on HTML Constructs • Parse the META Tag • Or Other Special Internal Tags
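
A minimal Python sketch of the lighter-weight indexing choices above, extracting just the HTML title and a META field; the input document is hypothetical:

from html.parser import HTMLParser

class TitleMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleMetaParser()
parser.feed('<html><head><title>Example</title>'
            '<meta name="description" content="A sample page"></head></html>')
print(parser.title, "|", parser.description)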

  18. Robot Visitation Strategies • Many Indexing Services Also Allow Web Developers to Submit URLs Manually • Which Are Queued • And Visited by the Robot • Exact Process Depends on the Robot Service • Many Services Have a Link to a URL Submission Form on Their Search Page • Certain Aggregators Exist • Which Purport to Submit to Many Robots at Once • http://www.submit-it.com/

  19. Determining Robot Activity • Examine Server Logs • Examine User-Agent, If Available • Examine Host Name or IP Address • Check for Many Accesses in Short Time Period • Check for Robot Exclusion Document Access • Found at: /robots.txt

  20. Apache Access Log Snippet
"GET /robots.txt HTTP/1.0" 200 0 "-" "Scooter-3.2.EX"
"GET / HTTP/1.0" 200 4591 "-" "Scooter-3.2.EX"
"GET /robots.txt HTTP/1.0" 200 64 "-" "ia_archiver"
"GET / HTTP/1.1" 200 4205 "-" "libwww-perl/5.63"
"GET /robots.txt HTTP/1.0" 200 64 "-" "FAST-WebCrawler/3.5 (atw-crawler at fast dot no; http://fast.no/support.php?c=faqs/crawler)"
"GET /robots.txt HTTP/1.0" 200 64 "-" "Mozilla/3.0 (Slurp/si; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"
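
A Python sketch that applies the checks from slide 19 to a log like the one above; the log filename is an assumption, and the signature list is drawn from the snippet, not a complete catalogue:

# Substrings seen in known robot User-Agent headers (from the snippet above).
ROBOT_SIGNATURES = ("Scooter", "ia_archiver", "libwww-perl", "FAST-WebCrawler", "Slurp")

def looks_like_robot(log_line):
    # A /robots.txt fetch or a known User-Agent both suggest a robot.
    return "/robots.txt" in log_line or any(sig in log_line for sig in ROBOT_SIGNATURES)

with open("access_log") as log:        # hypothetical log file path
    for line in log:
        if looks_like_robot(line):
            print(line.rstrip())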

  21. After Robot Visitation • Some Webmasters Panic After Being Visited • Generally Not a Problem • Generally a Benefit • No Relation to Viruses • Little Relation to Hackers • Close Relation to Lots of Visits

  22. Controlling Robot Access • Excluding Robots Is Feasible Using Server Authentication Techniques • .htaccess File & Directives • Deny From 0.0.0.0 (IP Address) • SetEnvIf User-Agent Robot is_a_robot • Can Increase Server Load • Seldom Required • More Often (Mis) Desired
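
A minimal .htaccess sketch along these lines, using Apache's mod_setenvif with 2.2-style host access directives; "BadBot" is a hypothetical User-Agent substring, not a real robot name:

# Flag requests whose User-Agent contains "BadBot" (case-insensitive)
SetEnvIfNoCase User-Agent "BadBot" is_a_robot
# Allow everyone else; deny the flagged requests
Order Allow,Deny
Allow from all
Deny from env=is_a_robot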

  23. Robot Exclusion Standard • Robot Exclusion Standard Exists • Consists of Single Site-wide File • /robots.txt • Contains Directives, Comment Lines, & Blank Lines • Not a Locked Door • More of a "No Entry" Sign • Represents a Declaration of Owner's Wishes • May Be Ignored by Incoming Traffic • Much Like a Red Traffic Light • If Everyone Follows The Rules, The World's a Better Place

  24. Sample robots.txt File
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

  25. Exclusion Standard Syntax
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism
• Lines Beginning With '#' Are Comments • Comment Lines Are Ignored • Comments May Not Appear Mid-Line

  26. Exclusion Standard Syntax
User-agent: webcrawler
Disallow:
• Specify That the Robot Named 'webcrawler' • Has Nothing Disallowed • It May Go Anywhere on This Site

  27. Exclusion Standard Syntax
User-agent: lycra
Disallow: /
• Specify That the Robot Named 'lycra' • Has All URLs Starting With '/' Disallowed • It May Go Nowhere on This Site • Because All URLs on This Server • Begin With a Slash

  28. Exclusion Standard Syntax
User-agent: *
Disallow: /tmp
Disallow: /logs
• Specify That All Robots • Have URLs Starting With '/tmp' & '/logs' Disallowed • They May Not Access Any URLs Beginning With Those Strings • Note: The '*' Is a Special Token • Meaning "Any Other User-agent" • Regular Expressions Cannot Be Used

  29. Exclusion Standard Syntax • Two Common Configuration Errors • Wildcards Are Not Supported • Do Not Use 'Disallow: /tmp/*' • Use 'Disallow: /tmp' • Put Only One Path on Each Disallow Line • This May Change in a Future Version of the Standard
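
Python's standard library includes a parser for this format; a short sketch checking the sample file from slide 24 (the robot names and URLs come from that example):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
""".splitlines())

print(rp.can_fetch("webcrawler", "http://webcrawler.com/logs/x"))  # True: nothing disallowed
print(rp.can_fetch("lycra", "http://webcrawler.com/"))             # False: everything disallowed
print(rp.can_fetch("somebot", "http://webcrawler.com/tmp/x"))      # False: '*' disallows /tmp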

  30. robots.txt File Location
• The Robot Exclusion File Must Be Placed at the Server's Document Root
• For Example:
Site URL                     Corresponding robots.txt URL
http://www.w3.org/       ->  http://www.w3.org/robots.txt
http://www.w3.org:80/    ->  http://www.w3.org:80/robots.txt
http://www.w3.org:1234/  ->  http://www.w3.org:1234/robots.txt
http://w3.org/           ->  http://w3.org/robots.txt
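
The mapping is mechanical, as a short Python sketch shows (the input is one of the example URLs above):

from urllib.parse import urlsplit, urlunsplit

def robots_url(site_url):
    # Keep the scheme, host, & port; replace the path with /robots.txt.
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.w3.org:1234/some/page.html"))
# -> http://www.w3.org:1234/robots.txt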

  31. Common Mistakes
• URLs Are Case-Sensitive • "/robots.txt" Must Be All Lower-Case
• Pointless robots.txt URLs:
http://www.w3.org/admin/robots.txt
http://www.w3.org/~timbl/robots.txt
• On a Server With Multiple Users • Like linus.ulltra.com • robots.txt Cannot Be Placed in Individual Users' Directories • It Must Be Placed in the Server Root • By the Server Administrator

  32. For Non-System Administrators
• Sometimes Users Have Insufficient Authority to Install a /robots.txt File • Because They Don't Administer the Entire Server
• Use a META Tag in Individual HTML Documents to Exclude Robots
<META NAME="ROBOTS" CONTENT="NOINDEX">
• Prevents the Document From Being Indexed
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
• Prevents the Document's Links From Being Followed
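
For instance, both directives can be combined in one tag placed in the document's head; a minimal HTML sketch (the page title is hypothetical):

<HTML>
<HEAD>
<TITLE>Private Page</TITLE>
<!-- Keep this document out of indexes & don't follow its links -->
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</HEAD>
<BODY>...</BODY>
</HTML>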

  33. Bottom Line • Use Robots Exclusion to Prevent Time-Variant Content From Being Improperly Indexed • Don't Use It to Exclude Visitors • Don't Use It to Secure Sensitive Content • Use Authentication If It's Important • Use SSL If It's Really Important
