Heritrix Mobile
Keith Enlow
Introduction
• Heritrix 3.1
• Mobile Finder Web Service
• 2 Options
  • Crawl desktop web pages (default)
  • Crawl mobile web pages using Mobile Finder and exclude mobile web pages that use media queries
Experiment
[Diagram: Heritrix, Decision Making, Web Service (Mobile Finder)]
• Modified Heritrix 3.1 to include two options for crawling
  • Option 0: Crawl with desktop user agent
  • Option 1: Crawl with mobile user agent using Mobile Finder
• Added a built-in mobile user agent adapted from Googlebot
• Crawled a small set of URLs
• Used Mobile Finder to check whether a given URL has a mobile version
• Wrote a small script to discover differences between the mobile and desktop versions
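The slides do not show the comparison script itself. As a rough illustration only, a line-based comparison of two already-fetched page bodies could look like the sketch below. The function name `summarize_differences` and the choice of Python's `difflib` are assumptions made for this example, not the method used in the experiment.

```python
import difflib


def summarize_differences(desktop_html: str, mobile_html: str) -> dict:
    """Compare two page bodies line by line (hypothetical helper)."""
    desktop_lines = desktop_html.splitlines()
    mobile_lines = mobile_html.splitlines()

    # Ratio in [0, 1]: 1.0 means the two line sequences are identical.
    matcher = difflib.SequenceMatcher(None, desktop_lines, mobile_lines)

    # Unified diff; empty when there are no differences.
    diff = list(
        difflib.unified_diff(
            desktop_lines,
            mobile_lines,
            fromfile="desktop",
            tofile="mobile",
            lineterm="",
        )
    )
    return {
        "identical": desktop_lines == mobile_lines,
        "similarity": matcher.ratio(),
        "diff": diff,
    }
```

A site that serves the same markup to both user agents would report `identical` as true, which is one possible way to spot pages that have no separate mobile version.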
<property name="userAgentTemplate"
          value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
<property name="userAgentTemplateMobile"
          value="Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
<!--
  Option # = Description
  0 [Default] Crawl using desktop user agent
  1           Crawl using mobile user agent + Mobile Finder Web Service
-->
<property name="CrawlOption" value="0"/>
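To make the effect of `CrawlOption` concrete, here is a minimal sketch of how the option could select a template and fill in its placeholders. It is written in Python for brevity and is not the modified Heritrix code; `build_user_agent`, the constant names, and the example version and contact URL are all assumptions for illustration.

```python
# Templates mirroring the two configuration properties (placeholder spacing assumed).
DESKTOP_TEMPLATE = (
    "Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"
)
MOBILE_TEMPLATE = (
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) "
    "AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 "
    "Safari/6531.22.7 "
    "(compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"
)


def build_user_agent(crawl_option: int, version: str, contact_url: str) -> str:
    """Pick a template by crawl option and substitute its placeholders."""
    if crawl_option == 0:
        template = DESKTOP_TEMPLATE  # default: desktop crawl
    elif crawl_option == 1:
        template = MOBILE_TEMPLATE  # mobile crawl with Mobile Finder
    else:
        raise ValueError(f"unknown crawl option: {crawl_option}")
    return (
        template.replace("@VERSION@", version)
        .replace("@OPERATOR_CONTACT_URL@", contact_url)
    )


# Example with made-up values for the version and operator contact URL.
print(build_user_agent(0, "3.1.0", "http://example.org/crawler"))
```

Rejecting unknown option values is a design choice of this sketch; the slides only define options 0 and 1.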
URLs Crawled

Desktop URL               Mobile URL
www.huffingtonpost.com    m.huffpost.com
www.foxnews.com           foxnews.mobi
www.nbcnews.com           www.nbcnews.com
www.whitehouse.gov        m.whitehouse.gov
www.nasa.gov              mobile.nasa.gov
www.ssa.gov               www.ssa.gov/mobile
www.cornell.edu           m.cornell.edu/#home
www.stanford.edu          m.stanford.edu
www.mit.edu               m.mit.edu/mobile.mit.edu
Redirection/Delivery
• 200 Response (server side redirect)
• 302 “Temporary” relocation
• 301 “Permanent” relocation
• JavaScript Redirection (client side redirect)
• Media Queries
• Style Sheets
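The delivery mechanisms listed above can be told apart, roughly, from the HTTP status code and a static scan of the response body. The sketch below is a heuristic illustration only: `classify_delivery` and both regular expressions are assumptions for this example, and pattern matching can miss or misreport real pages because no JavaScript or CSS is actually evaluated.

```python
import re

# Heuristic: assignments to location / location.href, or location.replace(...).
JS_REDIRECT = re.compile(
    r"\blocation(?:\.href)?\s*=(?!=)|\blocation\.(?:replace|assign)\s*\(",
    re.IGNORECASE,
)

# Heuristic: an @media rule with a feature expression, or a <link media="...(...)">.
MEDIA_QUERY = re.compile(
    r"""@media\b[^{]*\(|<link\b[^>]*\bmedia\s*=\s*["'][^"']*\(""",
    re.IGNORECASE,
)


def classify_delivery(status: int, body: str = "") -> str:
    """Guess how a site delivers its mobile version (hypothetical helper)."""
    if status == 301:
        return "301 permanent"
    if status == 302:
        return "302 temporary"
    if status == 200:
        if JS_REDIRECT.search(body):
            return "javascript redirect"
        if MEDIA_QUERY.search(body):
            return "media queries"
        return "200 response"
    return "other"
```

Checking for a JavaScript redirect before media queries is an arbitrary ordering in this sketch; a page can use both.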
Minor Limitations
• No JavaScript Engine
  • Heritrix cannot execute JavaScript code
  • It cannot catch client side redirection and will instead continue to crawl the desktop version of the web page. Note: The Mobile Finder Web Service will find the mobile page, and therefore Heritrix will continue the crawl.
• www.nasa.gov
• www.ssa.gov
• www.cornell.edu
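The note above can be sketched as a small lookup step: when the crawl uses option 1, the crawler asks a mobile lookup for the mobile URL instead of relying on a client side redirect it cannot execute. The interface of the Mobile Finder Web Service is not shown in the slides, so the lookup is passed in as a plain callable; `resolve_crawl_target`, `find_mobile_url`, and the stub table are hypothetical names for this example.

```python
from typing import Callable, Optional


def resolve_crawl_target(
    url: str,
    crawl_option: int,
    find_mobile_url: Callable[[str], Optional[str]],
) -> str:
    """Return the URL to crawl (hypothetical helper).

    find_mobile_url stands in for a call to the Mobile Finder Web Service;
    it returns the mobile URL, or None when no mobile version is known.
    """
    if crawl_option != 1:
        return url  # option 0: crawl the desktop page as given
    mobile_url = find_mobile_url(url)
    return mobile_url if mobile_url else url


# Stub lookup built from a few pairs in the crawled URL list.
known_mobile = {
    "www.nasa.gov": "mobile.nasa.gov",
    "www.ssa.gov": "www.ssa.gov/mobile",
    "www.cornell.edu": "m.cornell.edu/#home",
}

print(resolve_crawl_target("www.nasa.gov", 1, known_mobile.get))
```

Falling back to the original URL when the lookup returns nothing is an assumption of this sketch.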