
The Easiest Tutorial To Learn About Robots.txt File

In this tutorial I am going to explain everything about the robots.txt file for absolute beginners. For more detailed information, visit www.crackaloud.com



  1. LEARN ABOUT ROBOTS.TXT FILE

  2. What is Robots.txt? • Robots.txt is a plain text file that is uploaded to the root directory of a website. • When the web spiders (also called ants, bots, or indexers) that index your webpages reach your site, they look at the robots.txt file first and process it. • In other words, robots.txt tells the spider which pages to crawl. • Also read: How to Create robots.txt and Upload
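
  As a concrete illustration, here is a minimal sketch of what a typical robots.txt might look like; the blocked paths and the sitemap URL are hypothetical placeholders, not taken from any real site:

      User-agent: *
      Disallow: /admin/    # hypothetical private area
      Disallow: /tmp/      # hypothetical temporary files
      Sitemap: https://example.com/sitemap.xml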

  3. REP Explained: • To communicate with web crawlers and other web robots, websites use a standard called the robots exclusion protocol (REP), or robots.txt. • It is a simple text file created by webmasters to instruct search engine robots how to crawl and index pages on their website. • The /robots.txt file is a de facto standard and is not owned by any standards body.
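
  To see REP from the crawler's side, Python's standard library ships a robots.txt parser; a minimal sketch, assuming a hypothetical page on example.com:

      from urllib import robotparser

      # Load and parse the site's robots.txt file.
      rp = robotparser.RobotFileParser()
      rp.set_url("https://example.com/robots.txt")
      rp.read()

      # Ask whether a given user agent may fetch a given URL.
      print(rp.can_fetch("*", "https://example.com/private/page.html"))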

  4. The Simplest Syntax: The simplest version of a robots.txt file is: • User-agent: * • Disallow: • The first line indicates that the following lines apply to all agents. • The second line indicates that nothing is restricted. • This robots.txt file does nothing: it allows user agents to see everything on the site.
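
  For contrast, changing the second line to “Disallow: /” inverts the meaning: it asks all user agents to stay away from the entire site.

      User-agent: *
      Disallow: /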

  5. Important rules: • In most cases, the meta robots tag with the parameters “noindex, follow” should be employed as a way to restrict crawling or indexation (see the sketch after this slide). • It is important to note that malicious crawlers are likely to ignore robots.txt completely; as such, this protocol is not a good security mechanism. • Only one “Disallow:” line is allowed for each URL. • Each subdomain on a root domain uses its own separate robots.txt file. • The filename is case sensitive: use “robots.txt”, not “Robots.TXT”. • Spacing is not an accepted way to separate query parameters. For example, “/category/ /product page” would not be honored by robots.txt.
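
  For reference, here is a minimal sketch of the meta robots tag mentioned in the first rule; it goes in the <head> section of the individual page you want to restrict:

      <head>
        <!-- Keep this page out of the index, but let crawlers follow its links. -->
        <meta name="robots" content="noindex, follow">
      </head>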

  6. Robotic HTTP: • A robot is an HTTP client program like any other. • Many robots implement only the minimum amount of HTTP needed to request the content they seek. • It is recommended that robot implementers send some basic header information to notify the site of the robot's capabilities, the robot's identity, and where it originated.

  7. Identifying request headers: • User-Agent – tells the server the robot's name. • From – gives the email address of the robot's user/administrator. • Accept – tells the server what media types are okay to send (e.g. only fetch text and sound). • Referer – tells the server how the robot found links to this site's content (the header's official spelling is “Referer”, a historical misspelling of “referrer”).
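
  A minimal sketch of a polite robot sending these identifying headers, using Python's standard library; the robot name, email address, and URLs here are hypothetical:

      import urllib.request

      # Identifying headers a well-behaved robot should send.
      headers = {
          "User-Agent": "ExampleBot/1.0",           # the robot's name
          "From": "admin@example.com",              # contact for the robot's admin
          "Accept": "text/html",                    # media types we can handle
          "Referer": "https://example.com/start",   # where we found the link
      }

      req = urllib.request.Request("https://example.com/page.html", headers=headers)
      with urllib.request.urlopen(req) as resp:
          body = resp.read()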

  8. Misbehaving Robots: • Runaway robots – issue HTTP requests as fast as they can, placing heavy load on servers. • Stale URLs – robots visit old lists of URLs and request pages that no longer exist. • Long, wrong URLs – may reduce a web server's performance, clutter the server's access logs, and even crash the server. • Nosy robots – some robots may fetch URLs that point to private data and make that data easily accessible through search engines. • Dynamic gateway access – robots don't always know what they are accessing and may request expensive-to-generate content from dynamic gateways.
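
  A common safeguard against the “runaway robot” problem is simple rate limiting on the crawler's side; a minimal sketch, where the one-second delay and the URL list are arbitrary assumptions:

      import time
      import urllib.request

      urls = [
          "https://example.com/a.html",
          "https://example.com/b.html",
      ]

      for url in urls:
          with urllib.request.urlopen(url) as resp:
              resp.read()
          # Pause between requests so the robot never hammers the server.
          time.sleep(1.0)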

  9. How to check your robots.txt file? • You can check this file on your blog by appending /robots.txt to your blog's URL in the browser. For example: http://example.blogspot.com/robots.txt
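
  The same check can be scripted; a minimal sketch in Python, using the example blog URL from above:

      import urllib.request

      # Print the raw robots.txt of the example blog.
      with urllib.request.urlopen("http://example.blogspot.com/robots.txt") as resp:
          print(resp.read().decode("utf-8"))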
