White Hat Cloaking – Six Practical Applications Presented by Hamlet Batista
Why white hat cloaking?

• “Good” vs “bad” cloaking is all about your intention
• Always weigh the risks versus the rewards of cloaking
• Ask permission, or just don’t call it cloaking! (“Cloaking” vs “IP delivery”)
A crash course in white hat cloaking: practical scenarios where good cloaking makes sense

1. When to cloak?
2. Practical scenarios and alternatives
3. How do we cloak?
4. How can cloaking be detected?
5. Risks and next steps
When is it practical to cloak?

• Content accessibility
  - Search-unfriendly Content Management Systems
  - Rich media sites
  - Content behind forms: membership sites, free and paid content
• Site structure improvements: an alternative to PageRank sculpting via “nofollow”
• Geolocation/IP delivery
• Multivariate testing
Practical scenario #1: Proprietary website management systems that are not search-engine friendly

Regular users see:
• URLs with many dynamic parameters
• URLs with session IDs
• URLs with canonicalization issues
• Missing titles and meta descriptions

The search engine robot sees:
• Search-engine friendly URLs
• URLs without session IDs (see the sketch below)
• URLs with a consistent naming convention
• Automatically generated titles and meta descriptions
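A minimal sketch of that URL cleanup in Python. It assumes the visitor has already been identified as a robot (detection techniques are covered on the following pages), and the session parameter names are hypothetical examples of what a proprietary CMS might append:

    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    # Hypothetical session parameters a proprietary CMS might append
    SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}

    def url_for_robot(url):
        # Drop session IDs and sort the remaining parameters so every
        # crawl sees one consistent naming convention per page.
        parts = urlparse(url)
        params = [(k, v) for k, v in parse_qsl(parts.query)
                  if k.lower() not in SESSION_PARAMS]
        return urlunparse(parts._replace(query=urlencode(sorted(params))))

    # url_for_robot("http://example.com/item?b=2&sid=A9&a=1")
    # -> "http://example.com/item?a=1&b=2"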
Practical scenario #2: Sites built completely in Flash, Silverlight or any other rich media technology

The search engine robot sees:
• A text representation of all graphical elements (images)
• A text representation of all motion elements (video)
• A text transcription of all audio in the rich media content
Practical scenario #3: Membership sites

Search users see:
• Snippets of premium content on the SERPs
• A registration form when they land on the site

Members see:
• The same content search engine robots see (a sketch of this delivery logic follows below)
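A minimal sketch of that delivery logic in Python, along the lines of Google’s “First Click Free” program (linked on the last page). The is_robot flag, the HTML arguments, and the referrer tokens are assumptions for illustration:

    def response_for(headers, is_robot, article_html, registration_html):
        # Robots index the full article; users clicking through from a
        # SERP also get it ("first click free"), while everyone else
        # is shown the registration form.
        referrer = headers.get("Referer", "")
        from_serp = any(se in referrer for se in ("google.", "yahoo.", "msn."))
        return article_html if (is_robot or from_serp) else registration_html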
Practical scenario #4: Sites requiring massive site structure changes to improve index penetration

• Regular users follow a link structure designed for ease of navigation
• Search engine robots follow a link structure designed for ease of crawling and deeper index penetration of the most important content

[Diagram: the same five steps linked in two different orders, one path per audience]
Practical scenario #5: Sites using geolocation technology

Regular users see:
• Content tailored to their geographical location and/or language

The search engine robot sees:
• The same content consistently (see the sketch below)
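A minimal sketch of that rule in Python. The lookup function is a stand-in for a real GeoIP database, and the locale values are hypothetical:

    DEFAULT_LOCALE = "en-US"

    def lookup_locale(ip):
        # Stand-in for a real GeoIP database lookup (assumption)
        return "en-GB" if ip.startswith("81.") else DEFAULT_LOCALE

    def locale_for(ip, is_robot):
        # Robots always get the default locale, so the indexed copy of
        # the page never flip-flops between regional variants.
        return DEFAULT_LOCALE if is_robot else lookup_locale(ip)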
Practical scenario #6: Split testing organic search landing pages

Each regular user sees:
• One of the content experiment alternatives

The search engine robot sees:
• The same content consistently (a bucketing sketch follows below)
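The same idea applied to experiment bucketing, as a minimal Python sketch; the variant file names are hypothetical:

    import zlib

    VARIANTS = ["original.html", "variant_a.html", "variant_b.html"]

    def template_for(visitor_id, is_robot):
        # Robots are pinned to the control page so the indexed content
        # stays stable; regular visitors are bucketed deterministically.
        if is_robot:
            return VARIANTS[0]
        return VARIANTS[zlib.crc32(visitor_id.encode()) % len(VARIANTS)]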
How do we cloak?

Cloaking is performed with a web server script or module.

Search robot detection:
• By HTTP user agent
• By IP address
• By HTTP cookie test
• By JavaScript/CSS test
• By double DNS check
• By visitor behavior
• By combining all of the techniques

Content delivery:
• Presenting the equivalent of the inaccessible content to robots
• Presenting the search-engine friendly content to robots
• Presenting the content behind forms to robots
Robot detection by HTTP user agent

A very simple robot detection technique: the robot identifies itself in the User-Agent header of its HTTP request. A search robot request in the access log:

66.249.66.1 - - [04/Mar/2008:00:20:56 -0500] "GET /2007/11/13/game-plan-what-marketers-can-learn-from-strategy-games/ HTTP/1.1" 200 61477 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
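A minimal sketch of the check in Python; the token list is an assumption covering the major crawlers of the day:

    # Assumption: the tokens below cover the crawlers you care about
    ROBOT_UA_TOKENS = ("googlebot", "slurp", "msnbot")

    def ua_says_robot(user_agent):
        ua = (user_agent or "").lower()
        return any(token in ua for token in ROBOT_UA_TOKENS)

    # ua_says_robot("Mozilla/5.0 (compatible; Googlebot/2.1; "
    #               "+http://www.google.com/bot.html)")  ->  True

Because anyone can send any User-Agent string, this check is best treated as a first hint rather than proof.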
Robot detection by HTTP cookie test

Another simple, but weaker, robot detection technique: browsers return the cookies a site sets, while most robots do not. Note the missing cookie info at the end of this log entry:

66.249.66.1 - - [04/Mar/2008:00:20:56 -0500] "GET /2007/11/13/game-plan-what-marketers-can-learn-from-strategy-games/ HTTP/1.1" 200 61477 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "Missing cookie info"
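A minimal sketch in Python; the cookie name is hypothetical:

    def cookie_says_robot(headers):
        # Browsers echo back the cookie set on an earlier response; most
        # crawlers never do. First-time human visitors also lack it,
        # which is why this signal is weaker on its own.
        return "seen_before=1" not in headers.get("Cookie", "")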
Robot detection by JavaScript/CSS test

Another option for robot detection.

HTML code:

<div id="header"><h1><a href="http://www.example.com" title="Example Site">Example site</a></h1></div>

The CSS code is straightforward: it swaps out anything in the h1 tag in the header with an image, so robots read the text while regular users see the graphic.

CSS code:

/* CSS image replacement */
#header h1 { margin: 0; padding: 0; }
#header h1 a {
  display: block;
  padding: 150px 0 0 0;
  background: url(path to image) top right no-repeat;
  overflow: hidden;
  font-size: 1px;
  line-height: 1px;
  height: 0px !important;
  height /**/: 150px;
}
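The JavaScript half of the test can be checked server-side, as in this minimal Python sketch; the probe markup and cookie name are hypothetical:

    JS_PROBE = '<script>document.cookie = "js=1; path=/";</script>'

    def passes_js_test(headers):
        # Browsers execute the probe above and return the cookie on
        # later requests; crawlers that do not run JavaScript never will.
        return "js=1" in headers.get("Cookie", "")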
Robot detection by IP address

A more robust robot detection technique: the request’s source IP address is checked against a list of known search robot addresses. The same request again; here 66.249.66.1 is what gets checked:

66.249.66.1 - - [04/Mar/2008:00:20:56 -0500] "GET /2007/11/13/game-plan-what-marketers-can-learn-from-strategy-games/ HTTP/1.1" 200 61477 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
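A minimal sketch using Python’s standard library; the network range is one block Googlebot has crawled from, but keeping such a list current is the hard part:

    import ipaddress

    # Assumption: a maintained list of published crawler IP ranges
    ROBOT_NETWORKS = [ipaddress.ip_network("66.249.64.0/19")]

    def ip_says_robot(ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in ROBOT_NETWORKS)

    # ip_says_robot("66.249.66.1") -> True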
Robot detection by double DNS check

A more robust robot detection technique. For the search robot request from 66.249.66.1:

$ nslookup 66.249.66.1
Name: crawl-66-249-66-1.googlebot.com
Address: 66.249.66.1

$ nslookup crawl-66-249-66-1.googlebot.com
Non-authoritative answer:
Name: crawl-66-249-66-1.googlebot.com
Address: 66.249.66.1

The reverse lookup must resolve to the search engine’s domain, and the forward lookup of that name must return the original IP; a forged reverse record fails the second half of the check.
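The same check as a minimal Python sketch, using only the standard library:

    import socket

    def dns_verify_googlebot(ip):
        # Reverse lookup: the PTR record must end in googlebot.com ...
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith(".googlebot.com"):
            return False
        # ... and the forward lookup of that name must return the same
        # IP, which defeats forged reverse records.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False

    # dns_verify_googlebot("66.249.66.1") -> True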
Robot detection by visitor behavior

Robots differ substantially from regular users when visiting a website: for example, they typically fetch robots.txt first, request far more pages per visit, and do not load images or execute JavaScript.
Combining the best of all techniques

• Label as a robot anything that identifies itself as such (user agent check)
• Label as a possible robot any visitor with suspicious behavior (visitor behavior check)
• Confirm robots and suspected robots with the IP address check and a double DNS check
• Maintain a cache with a list of known search robots to reduce the number of verification attempts

A sketch combining the checks from the previous pages follows below.
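A minimal sketch in Python, reusing ua_says_robot, cookie_says_robot and dns_verify_googlebot from the earlier pages; a plain dict stands in for a real expiring cache:

    ROBOT_CACHE = {}  # ip -> verified flag, so each IP is verified once

    def is_search_robot(ip, user_agent, headers):
        if ip in ROBOT_CACHE:
            return ROBOT_CACHE[ip]
        # Anything that identifies as a robot, or behaves like one (the
        # missing-cookie signal stands in for behavior here), is a suspect.
        suspect = ua_says_robot(user_agent) or cookie_says_robot(headers)
        # Suspects are confirmed with the double DNS check.
        verified = suspect and dns_verify_googlebot(ip)
        ROBOT_CACHE[ip] = verified
        return verified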
Clever cloaking detection

A clever detection technique is to check the caches at the newest datacenters:

• IP-based detection techniques rely on an up-to-date list of robot IPs
• Search engines change IPs on a regular basis
• It is possible to identify those new IPs and check the cache
Risks of cloaking

Search engines do not want to accept any type of cloaking.

Survival tips:
• The safest way to cloak is to ask for permission from each of the search engines that you care about
• Refer to it as “IP delivery”

How Google defines it: “Cloaking: Serving different content to users than to Googlebot. This is a violation of our webmaster guidelines. If the file that Googlebot sees is not identical to the file that a typical user sees, then you're in a high-risk category. A program such as md5sum or diff can compute a hash to verify that two different files are identical.”
http://googlewebmastercentral.blogspot.com/2008/06/how-google-defines-ip-delivery.html
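Google’s own wording suggests a sanity check you can run on your setup: hash what a browser user agent receives against what Googlebot’s user agent receives. A minimal Python sketch with the standard library; the URL is a placeholder, and dynamic pages (timestamps, rotating ads) can hash differently even without cloaking:

    import hashlib
    import urllib.request

    def page_hash(url, user_agent):
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req) as resp:
            return hashlib.md5(resp.read()).hexdigest()

    GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                    "+http://www.google.com/bot.html)")
    url = "http://www.example.com/"
    same = page_hash(url, "Mozilla/5.0") == page_hash(url, GOOGLEBOT_UA)
    print("identical" if same else "different: high-risk category")

This only exercises user-agent cloaking; an IP-based setup would serve your test client the regular page either way.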
Next steps

Make sure clients understand the risks and rewards of implementing white hat cloaking.

More information and how to get started:
• How Google defines IP delivery, geolocation and cloaking: http://googlewebmastercentral.blogspot.com/2008/06/how-google-defines-ip-delivery.html
• First Click Free: http://googlenewsblog.blogspot.com/2007/09/first-click-free.html
• Good Cloaking, Evil Cloaking and Detection: http://searchengineland.com/070301-065358.php
• YADAC: Yet Another Debate About Cloaking Happens Again: http://searchengineland.com/070304-231603.php
• Cloaking is OK Says Google: http://blog.venture-skills.co.uk/2007/07/06/cloaking-is-ok-says-google/
• Advanced Cloaking Technique: How to feed password-protected content to search engine spiders: http://hamletbatista.com/2007/09/03/advanced-cloaking-technique-how-to-feed-password-protected-content-to-search-engine-spiders/
Feel free to contact me; I would be happy to help.

• Blog: http://hamletbatista.com
• LinkedIn: http://www.linkedin.com/in/hamletbatista
• Facebook: http://www.facebook.com/people/Hamlet_Batista/613808617
• Twitter: http://twitter.com/hamletbatista
• E-mail: hamlet@hamletbatista.com