Introduction to Web Robots, Crawlers & Spiders Instructor: Joseph DiVerdi, Ph.D., MBA
Web Robot Defined • A Web Robot Is a Program • That Automatically Traverses the Web • Using Hypertext Links • Retrieving a Particular Document • Then Retrieving All Documents That Are Referenced • Recursively • Recursion Doesn't Limit the Definition • To Any Specific Traversal Algorithm • Even If a Robot Applies Some Heuristic to the Selection & Order of Documents to Visit & Spaces Out Requests Over a Long Time Period • It Is Still a Robot
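A minimal Python sketch of such a recursive traversal, not taken from the original slides; the seed URL and the depth limit are illustrative placeholders.

# Minimal sketch of a recursive robot: fetch a page, collect its links,
# then fetch each linked page in turn, up to a fixed depth.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(url, depth, visited):
    if depth < 0 or url in visited:
        return
    visited.add(url)
    try:
        with urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
    except OSError:
        return                                   # unreachable document; skip it
    parser = LinkCollector()
    parser.feed(html)
    for link in parser.links:
        target = urljoin(url, link)
        if target.startswith(("http://", "https://")):   # skip mailto:, javascript:, ...
            crawl(target, depth - 1, visited)


# Example: traverse two levels starting from a placeholder seed URL.
crawl("http://www.example.com/", depth=2, visited=set())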
Web Robot Defined • Normal Web Browsers Are Not Robots • Because They Are Operated by a Human • And Don't Automatically Retrieve Referenced Documents • Other Than Inline Images
Web Robot Defined • Sometimes Referred to As • Web Wanderers • Web Crawlers • Spiders • These Names Are a Bit Misleading • They Give the Impression the Software Itself Moves Between Sites • Like a Virus • This Is Not the Case • A Robot Visits Sites by Requesting Documents From Them
Agent Defined • The Term Agent Is (Over) Used These Days • Specific Agents Include: • Autonomous Agent • Intelligent Agent • User-Agent
Autonomous Agent Defined • An Autonomous Agent Is a Program • That Automatically Travels Between Sites • Makes Its Own Decisions • When To Move, When To Stay • Is Limited to Travel Between Selected Sites • Currently Not Widespread on the Web
Intelligent Agent Defined • An Intelligent Agent Is a Program • That Helps Users With Certain Activities • Choosing a Product • Filling Out a Form • Finding Particular Items • Generally Has Little to Do With Networking • Usually Created & Maintained by an Organization • To Assist Its Own Viewers
User-Agent Defined • A User-Agent Is a Program • That Performs Networking Tasks for a User • Web User-Agent • Navigator • Internet Explorer • Opera • Email User-Agent • Eudora • FTP User-Agent • HTML-Kit • Fetch • CuteFTP
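A minimal Python sketch, not from the original slides, of how any user-agent (including a robot) announces itself to a server through the HTTP User-Agent header; the robot name and contact URL are made-up placeholders.

# Identify the client with an explicit User-Agent header so server
# operators can recognize it in their access logs.
from urllib.request import Request, urlopen

request = Request(
    "http://www.example.com/",
    headers={"User-Agent": "ExampleBot/1.0 (+http://www.example.com/bot.html)"},
)
with urlopen(request) as response:
    print(response.status, response.headers.get("Server"))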
Search Engine Defined • A Search Engine Is a Program • That Examines A Database • Upon Request or Automatically • Delivers Results or Creates Digest • In the Context of the Web A Search Engine Is • A Program That Examines Databases of HTML Documents • Databases Gathered by a Robot • Upon Request • Delivers Results Via HTML Document
Robot Purposes • Robots Are Used for a Number of Tasks • Indexing • Just Like a Book Index • HTML Validation • Link Validation • Searching for Broken Links • What's New Monitoring • Mirroring • Making a Copy of a Primary Web Site • On a Separate Server • More Local to Some Users • Shares the Work Load With the Primary Server
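A minimal Python sketch of the link-validation task, assuming a hypothetical list of URLs to check; it issues a HEAD request for each URL and reports the outcome.

# Report the HTTP status (or the failure reason) for each URL in a list.
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

urls_to_check = [
    "http://www.example.com/",
    "http://www.example.com/missing-page.html",
]

for url in urls_to_check:
    try:
        with urlopen(Request(url, method="HEAD")) as response:
            status = response.status
    except HTTPError as error:
        status = error.code          # e.g. 404 for a broken link
    except URLError as error:
        status = error.reason        # host unreachable, DNS failure, ...
    print(status, url)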
Other Popular Names • All Names for the Same Sort of Program • With Slightly Different Connotations • Web Spiders • Sounds Cooler in the Media • Web Crawlers • Webcrawler Is a Specific Robot • Web Worms • A Worm Is a Replicating Program • Web Ants • Distributed Cooperating Robots
Robot Ethics • Robots Have Enjoyed a Checkered History • Certain Robot Programs Can Overload Networks & Servers • And Have Done So in the Past • With Numerous Requests • This Happens Especially With Programmers • Just Starting to Write a Robot Program • These Days There Is Sufficient Information on Robots to Prevent Many of These Mistakes • But Does Everyone Read It?
Robot Ethics • Robots Have Enjoyed a Checkered History • Robots Are Operated by Humans • Can Make Mistakes in Configuration • Don't Consider the Implications of Actions • This Means • Robot Operators Need to Be Careful • Robot Authors Need to Make It Difficult for Operators to Make Mistakes • With Bad Effects
Robot Ethics • Robots Have Enjoyed a Checkered History • Indexing Robots Build Central Database of Documents • Which Doesn't Always Scale Well • To Millions of Documents • On Millions of Sites • Many Different Problems Occur • Missing Sites & Links • High Server Loads • Broken Links
Robot Ethics • Robots Have Enjoyed a Checkered History • Majority of Robots Are • Well Designed • Professionally Operated • Cause No Problems • Provide a Valuable Service • Robots Aren't Inherently Bad • Nor Are They Inherently Brilliant • They Just Need Careful Attention
Robot Visitation Strategies • Generally Start From a Historical URL List • Especially Documents With Many or Certain Links • Server Lists • What's New Pages • Most Popular Sites on the Web • Other Sources for URLs Are Used • Scans Through USENET Postings • Published Mailing List Archives • Robot Selects URLs to Visit, Index, & Parse • And Uses Them As a Source for New URLs
Robot Indexing Strategies • If an Indexing Robot Is Aware of a Document • Robot May Decide to Parse Document • Insert Document Content Into Robot's Database • Decision Depends on the Robot • Some Robots Index • HTML Titles • The First Few Paragraphs • Parse the Entire HTML & Index All Words • With Weightings Depending on HTML Constructs • Parse the META Tag • Or Other Special Internal Tags
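A minimal Python sketch of one possible indexing pass, not any particular robot's method; the sample document and the weights are illustrative. It extracts the title, the META description, and the body words, weighting title words more heavily.

# Parse one HTML document and build a small weighted word index.
from collections import Counter
from html.parser import HTMLParser


class IndexingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_words = []
        self.body_words = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") and attrs.get("content"):
                self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        words = data.lower().split()
        (self.title_words if self.in_title else self.body_words).extend(words)


document = """<html><head><title>Pet Store</title>
<meta name="description" content="Food and toys for pets"></head>
<body><p>We sell pet food and pet toys.</p></body></html>"""

parser = IndexingParser()
parser.feed(document)

# Weight title words more heavily than body words (weights are arbitrary).
index = Counter()
for word in parser.title_words:
    index[word] += 5
for word in parser.body_words:
    index[word] += 1
print(parser.meta.get("description"), index.most_common(5))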
Robot Visitation Strategies • Many Indexing Services Also Allow Web Developers to Submit a URL Manually • Which Is Queued • Visited by the Robot • Exact Process Depends on the Robot Service • Many Services Have a Link to a URL Submission Form on Their Search Page • Certain Aggregators Exist • Which Purport to Submit to Many Robots at Once • For Example: http://www.submit-it.com/
Determining Robot Activity • Examine Server Logs • Examine User-Agent, If Available • Examine Host Name or IP Address • Check for Many Accesses in Short Time Period • Check for Robot Exclusion Document Access • Found at: /robots.txt
Apache Access Log Snippet
"GET /robots.txt HTTP/1.0" 200 0 "-" "Scooter-3.2.EX"
"GET / HTTP/1.0" 200 4591 "-" "Scooter-3.2.EX"
"GET /robots.txt HTTP/1.0" 200 64 "-" "ia_archiver"
"GET / HTTP/1.1" 200 4205 "-" "libwww-perl/5.63"
"GET /robots.txt HTTP/1.0" 200 64 "-" "FAST-WebCrawler/3.5 (atw-crawler at fast dot no; http://fast.no/support.php?c=faqs/crawler)"
"GET /robots.txt HTTP/1.0" 200 64 "-" "Mozilla/3.0 (Slurp/si; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"
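A minimal Python sketch for spotting robot activity in a log like the snippet above; the log file name is a placeholder and the regular expression assumes the request/status/referer/user-agent layout shown.

# Flag clients that request /robots.txt and count requests per User-Agent.
import re
from collections import Counter

log_pattern = re.compile(r'"(GET|HEAD|POST) (\S+) [^"]*" \d+ \d+ "[^"]*" "([^"]*)"')

requests_per_agent = Counter()
robots_txt_agents = set()

with open("access_log") as log:                  # placeholder log file name
    for line in log:
        match = log_pattern.search(line)
        if not match:
            continue
        method, path, user_agent = match.groups()
        requests_per_agent[user_agent] += 1
        if path == "/robots.txt":
            robots_txt_agents.add(user_agent)

print("Likely robots:", sorted(robots_txt_agents))
print("Busiest user-agents:", requests_per_agent.most_common(5))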
After Robot Visitation • Some Webmasters Panic After Being Visited • Generally Not a Problem • Generally a Benefit • No Relation to Viruses • Little Relation to Hackers • Close Relation to Lots of Visits
Controlling Robot Access • Excluding Robots Is Feasible Using Server Access Control Techniques • .htaccess File & Directives • Deny from 0.0.0.0 (Substitute the Robot's IP Address) • SetEnvIf User-Agent Robot is_a_robot • Deny from env=is_a_robot • Can Increase Server Load • Seldom Required • More Often (Mis)Desired
Robot Exclusion Standard • Robot Exclusion Standard Exists • Consists of Single Site-wide File • /robots.txt • Contains Directives, Comment Lines, & Blank Lines • Not a Locked Door • More of a "No Entry" Sign • Represents a Declaration of Owner's Wishes • May Be Ignored by Incoming Traffic • Much Like a Red Traffic Light • If Everyone Follows The Rules, The World's a Better Place
Sample robots.txt File
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
Exclusion Standard Syntax
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism
• Lines Beginning With '#' Are Comments • Comment Lines Are Ignored • Comments May Not Appear Mid-Line
Exclusion Standard Syntax
User-agent: webcrawler
Disallow:
• Specifies That the Robot Named 'webcrawler' • Has Nothing Disallowed • It May Go Anywhere on This Site
Exclusion Standard Syntax
User-agent: lycra
Disallow: /
• Specifies That the Robot Named 'lycra' • Has All URLs Starting With '/' Disallowed • It May Go Nowhere on This Site • Because All URLs on This Server • Begin With a Slash
Exclusion Standard Syntax
User-agent: *
Disallow: /tmp
Disallow: /logs
• Specifies That All Other Robots • Have URLs Starting With '/tmp' & '/logs' Disallowed • They May Not Access Any URLs Beginning With Those Strings • Note: The '*' Is a Special Token • Meaning "Any Other User-agent" • Regular Expressions Cannot Be Used
Exclusion Standard Syntax • Two Common Configuration Errors • Wildcards Are Not Supported • Do Not Use 'Disallow: /tmp/*' • Use 'Disallow: /tmp' • Put Only One Path on Each Disallow Line • This May Change in a Future Version of the Standard
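Python's standard library ships a parser for this standard, urllib.robotparser; a minimal sketch of a polite robot checking a URL before fetching it, with a placeholder robot name and site URL.

# Fetch a site's /robots.txt and ask whether a given URL may be crawled.
from urllib import robotparser

rules = robotparser.RobotFileParser()
rules.set_url("http://www.example.com/robots.txt")
rules.read()                       # fetch and parse the site's /robots.txt

if rules.can_fetch("ExampleBot", "http://www.example.com/tmp/report.html"):
    print("Allowed to fetch")
else:
    print("Disallowed by /robots.txt")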
robots.txt File Location • The Robot Exclusion File Must Be Placed at the Server's Document Root • For Example:
Site URL -> Corresponding robots.txt URL
http://www.w3.org/ -> http://www.w3.org/robots.txt
http://www.w3.org:80/ -> http://www.w3.org:80/robots.txt
http://www.w3.org:1234/ -> http://www.w3.org:1234/robots.txt
http://w3.org/ -> http://w3.org/robots.txt
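A minimal Python sketch of this mapping: the robots.txt URL is simply the site URL with its path replaced by "/robots.txt". The example URLs mirror the table above.

# Derive the robots.txt URL from a site URL.
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(site_url):
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

for site in ("http://www.w3.org/", "http://www.w3.org:1234/", "http://w3.org/"):
    print(site, "->", robots_txt_url(site))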
Common Mistakes • URLs Are Case Sensitive • "/robots.txt" Must Be All Lower-Case • Pointless robots.txt URLs:
http://www.w3.org/admin/robots.txt
http://www.w3.org/~timbl/robots.txt
• On a Server With Multiple Users • Like linus.ulltra.com • robots.txt Cannot Be Placed in Individual Users' Directories • It Must Be Placed in the Server Root • By the Server Administrator
For Non-System Administrators • Sometimes Users Have Insufficient Authority to Install a /robots.txt File • Because They Don't Administer the Entire Server • Use a META Tag in Individual HTML Documents to Exclude Robots:
<META NAME="ROBOTS" CONTENT="NOINDEX">
• Prevents the Document From Being Indexed
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
• Prevents the Document's Links From Being Followed
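A minimal Python sketch, not from the original slides, of how a robot might honor these directives; the sample document is illustrative. It collects the ROBOTS META content before deciding whether to index the page or follow its links.

# Look for <META NAME="ROBOTS" ...> directives in a document.
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").upper() == "ROBOTS":
            content = attrs.get("content") or ""
            self.directives.update(
                token.strip().upper() for token in content.split(",")
            )


document = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head></html>'
parser = RobotsMetaParser()
parser.feed(document)

may_index = "NOINDEX" not in parser.directives
may_follow = "NOFOLLOW" not in parser.directives
print("may index:", may_index, "may follow links:", may_follow)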
Bottom Line • Use Robots Exclusion to Prevent Time Variant Content From Being Improperly Indexed • Don't Use It to Exclude Visitors • Don't Use It to Secure Sensitive Content • Use Authentication If It's Important • Use SSL If It's Really Important