
Using the Web Efficiently: Mobile Crawlers

August 7, 1999. Joachim Hammer, Database Center, University of Florida, jhammer@cise.ufl.edu


Presentation Transcript


  1. Using the Web Efficiently: Mobile Crawlers August 7, 1999 Joachim Hammer Database Center University of Florida jhammer@cise.ufl.edu

  2. Presentation Outline • Problem Statement and Goal • Web Crawling Techniques • Traditional Web Crawling • Mobile Web Crawling • Mobile Crawling Architecture • Distributed Runtime Environment • Application Framework • Performance Evaluation • Summary and Future Work Joachim Hammer - UF

  3. What’s Wrong with the Web? • Web represents large distributed hypertext system • 1.6 million Web sites • 320 million Web documents • 40% of the Web content changes within a month • Exponential growth rate • Lacks structure (i.e., no strict hierarchy)

  4. Web Indices and Search Engines • Search engine statistics: • Index size: 30-110 million pages (approx. 700GB) • Web coverage: 10%-35% and decreasing! • Daily crawl: 3-10 million pages (approx. 60GB) • Year 2000 estimates: • Index size 880 million pages (approx. 5.6TB) • Daily crawl 80 million pages (approx. 480GB) • Traditional Web crawling will experience severe scaling problems in the near future
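As a rough consistency check on the figures above, the following sketch computes the average page size they imply. It assumes decimal units (1 GB = 10^9 bytes), which the slide does not state, and it pairs the single 60GB daily figure with both ends of the 3-10 million page range, which is also an assumption.

```python
# Figures are taken from the slide; decimal units are an assumption.
GB = 10**9
TB = 10**12


def avg_page_kb(total_bytes: float, pages: float) -> float:
    """Return the average page size in KB (1 KB = 1000 bytes)."""
    return total_bytes / pages / 1000


# 1999 daily crawl: 60 GB spread over 3 to 10 million pages.
daily_1999_low = avg_page_kb(60 * GB, 10e6)
daily_1999_high = avg_page_kb(60 * GB, 3e6)

# Year 2000 estimates.
daily_2000 = avg_page_kb(480 * GB, 80e6)
index_2000 = avg_page_kb(5.6 * TB, 880e6)

print(f"1999 daily crawl: {daily_1999_low:.1f} to {daily_1999_high:.1f} KB per page")
print(f"2000 daily crawl: {daily_2000:.1f} KB per page")
print(f"2000 index:       {index_2000:.2f} KB per page")
```

Under these assumptions the year 2000 estimates work out to roughly 6 KB per page, so the projected growth comes from the number of pages rather than from larger pages.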

  5. Goals Long-term • Overlay the distributed Web structure with a centralized information system which allows efficient resource discovery • Turn Web into an effectively organized and cataloged “digital library” • Topic-specific search engines, e.g., self-health care, consumer electronics, etc. Project • Find an alternative to the current “brute-force” approach to Web crawling/indexing

  6. Traditional Web Crawling Approach Based on the Google search engine (www.google.com)

  7. Traditional Web Crawling • Characteristics of traditional Web crawling: • Remote data access • Focus on rapid data retrieval • Centralized, database oriented architecture • Resource intensive • Traditional Web crawling techniques do not exploit information about the pages being crawled • “Download first–process later” approach

  8. (Our) Mobile Crawling Approach

  9. Mobile Web Crawling • Crawler code migrates to host sites where pages are located • Mobility allows a Web crawler to analyze Web pages before investing Web resources for their transmission • Characteristics: • Focus on effective data retrieval • Distributed, data source oriented architecture: access data where it is stored • Intelligent downloading of only relevant Web content • Resource preserving approach • “Process first–download later” approach

  10. State-of-the-Art in Mobile Crawling • Many search engines, little information • e.g., WWW Worm, WebCrawler, Lycos, AltaVista, Infoseek, Excite, HotBot, Google • Distributed search engines • e.g., Harvest • Mobile code • Simplest form: downloadable Java applets • Code migration, software agents: very active research area • e.g., IBM Aglet infrastructure • Crawling algorithms

  11. Mobile Crawling Architecture

  12. Architecture Highlights • Distributed Crawler Runtime Environment • Platform independent execution environment for crawlers • Virtual machine for remote crawler execution • Communication layer to provide crawler transport service • Application Framework • Communication layer to provide crawler transport service • Crawler manager to support crawler creation and configuration, controls crawler migration • Web site selection • Query engine as crawler/application (database) interface • Archive manager as database connectivity framework

  13. A Word About Mobile Crawlers • A crawler is a user-defined set of rules that executes on a virtual machine and collects facts (about Web pages) • Use CLIPS to represent crawler data (e.g., page-facts) and user-defined crawling strategies (as rules) • CLIPS - C Language Integrated Production System • Advantages of rule-based approach • Easier to specify crawling rules than to devise a crawling algorithm • No need to model control flow • Rule-based programs have simple runtime states

  14. Crawling Strategies • General-purpose search engines use simple strategies • Crawl and index all pages, e.g., depth-first • For subject-specific crawling, strategy is important • Find as many important pages as possible while crawling the fewest pages overall • Page importance [Cho et al., Stanford University] • Keyword frequency and location • Backlink count • PageRank • Figure out when to return to crawler manager • Memory management issue
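Two of the page-importance measures listed above can be sketched in a few lines. This is only an illustration, not the metric definitions from Cho et al.: the function names, the title weight of 3.0, and the sample data are all hypothetical, and PageRank is left out for brevity.

```python
from collections import Counter


def keyword_score(title: str, body: str, keywords: set[str],
                  title_weight: float = 3.0) -> float:
    """Count keyword hits; hits in the title count more (hypothetical weight).

    `keywords` is expected to contain lower-case words.
    """
    def hits(text: str) -> int:
        return sum(1 for word in text.lower().split() if word in keywords)

    return title_weight * hits(title) + hits(body)


def backlink_counts(links: dict[str, list[str]]) -> Counter:
    """For each page, count how many distinct other pages link to it."""
    counts: Counter = Counter()
    for source, targets in links.items():
        for target in set(targets):      # ignore repeated links on one page
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a", "c"]}
score = keyword_score("Mobile Crawlers", "a crawler visits pages crawlers",
                      {"mobile", "crawlers"})
print(score)
print(backlink_counts(links).most_common())
```

A subject-specific crawler could combine such scores to decide which pages to keep and which links to follow next.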

  15. Crawler Virtual Machine • How to execute a rule-based crawler specification? • Crawler execution = rule application upon fact base • Use inference engine (JESS) for the rule application process • JESS is platform independent and extensible 1. Initialization • Insert rules and facts into inference engine 2. Rule application • Start rule application process within inference engine 3. Finalization • Extract rules and facts once rule application has stopped • Store back into crawler
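The three phases above can be sketched with a toy forward-chaining engine. This is a stand-in written for illustration only: it does not use the JESS or CLIPS APIs, and `TinyEngine`, `Crawler`, `execute`, and the `mark_hot` rule are hypothetical names.

```python
from typing import Callable, Iterable

Fact = tuple                              # e.g. ("page", url, text)
Rule = Callable[[set], Iterable[Fact]]    # derives new facts from the fact base


class TinyEngine:
    """Toy forward-chaining engine; a stand-in for a real inference engine."""

    def __init__(self) -> None:
        self.rules: list[Rule] = []
        self.facts: set[Fact] = set()

    def run(self) -> None:
        # Apply every rule repeatedly until no rule adds a new fact.
        changed = True
        while changed:
            changed = False
            for rule in self.rules:
                new_facts = set(rule(self.facts)) - self.facts
                if new_facts:
                    self.facts |= new_facts
                    changed = True


class Crawler:
    """A crawler is just its rules plus the facts it has collected."""

    def __init__(self, rules: Iterable[Rule], facts: Iterable[Fact]) -> None:
        self.rules = list(rules)
        self.facts = set(facts)


def execute(crawler: Crawler) -> None:
    engine = TinyEngine()
    # 1. Initialization: insert rules and facts into the engine.
    engine.rules.extend(crawler.rules)
    engine.facts |= crawler.facts
    # 2. Rule application: run until no more rules fire.
    engine.run()
    # 3. Finalization: extract rules and facts and store them back.
    crawler.rules = list(engine.rules)
    crawler.facts = set(engine.facts)


def mark_hot(facts: set) -> list:
    """Hypothetical rule: a page whose text mentions 'crawler' is a hot page."""
    return [("hot", f[1]) for f in facts if f[0] == "page" and "crawler" in f[2]]


crawler = Crawler([mark_hot], {("page", "/a.html", "mobile crawler intro"),
                               ("page", "/b.html", "campus map")})
execute(crawler)
print(sorted(crawler.facts))
```

Because the crawler's whole state is its rules plus its facts, storing both back after the run is enough to let it move to the next site.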

  16. Crawler Virtual Machine

  17. Crawler Manager • Centralized control unit for mobile crawlers • Create new crawlers based on user-defined crawler specification • Provide initial destination using seed URLs • Determine subsequent itinerary for crawlers • Requires knowledge about mobile-crawler-enabled sites • Estimate the importance of Web sites, e.g., Backlink count, hot page count, ... • Coordinate many crawlers running in parallel

  18. Crawler Query Engine • Used to access crawler contents after returning • “Hot pages” • Characteristics • Provide a query facility to query the crawler fact base • Implement an SQL subset as query language • Represent query result as data tuples, not as facts • Allows the user to reason about crawling results • Query engine implementation uses inference engine • Query engine serves as the primary interface between the user application and the mobile crawler • SQL Database which holds Web index
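To illustrate the idea of querying a returned crawler's facts with SQL and getting plain tuples back, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in; the slides describe a query engine built on the inference engine, so this is not how the original system works internally. The `page` table layout and the sample facts are hypothetical.

```python
import sqlite3

# Hypothetical page facts brought back by a crawler: (url, title, hot flag).
page_facts = [
    ("/a.html", "Mobile Crawlers", 1),
    ("/b.html", "Campus Map", 0),
    ("/c.html", "Crawler Manager", 1),
]


def load_facts(facts: list) -> sqlite3.Connection:
    """Load page facts into an in-memory table so they can be queried."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE page (url TEXT, title TEXT, hot INTEGER)")
    db.executemany("INSERT INTO page VALUES (?, ?, ?)", facts)
    return db


db = load_facts(page_facts)

# The result is a list of plain data tuples, not facts.
hot_pages = db.execute(
    "SELECT url, title FROM page WHERE hot = 1 ORDER BY url"
).fetchall()
print(hot_pages)
```

Returning tuples rather than facts lets the user application pass results straight to a database that holds the Web index.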

  19. Crawler Query Engine [Diagram with components labeled: Crawler Object, Crawler Rules, Crawler Facts, Query Engine, Rule Compiler, Inference Engine, User Query, Query Result Tuples]

  20. Mobile Crawling Advantages • Remote page selection • Determine significance of a page prior to transmission • Applicable for topic-specific search engines • Remote page filtering • Control the granularity of the retrieved data • Applicable for non-fulltext search engines • Remote page compression • Compress page data prior to transmission • Applicable for all search engines
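The three remote operations above can be chained into a "process first, download later" pipeline that runs at the host site. The sketch below is an assumption-laden illustration: the function names, the keyword test used for selection, and the tag-stripping filter are hypothetical, and `zlib` merely stands in for whatever compression the original system used.

```python
import re
import zlib


def is_relevant(html: str, keywords: list[str]) -> bool:
    """Remote page selection: keep a page only if it mentions a keyword."""
    text = html.lower()
    return any(keyword in text for keyword in keywords)


def filter_text(html: str) -> str:
    """Remote page filtering: drop markup and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return " ".join(without_tags.split())


def pack(pages: list[str], keywords: list[str]) -> bytes:
    """Select, filter, then compress; only the result crosses the network."""
    kept = [filter_text(page) for page in pages if is_relevant(page, keywords)]
    return zlib.compress("\n".join(kept).encode("utf-8"))


pages = [
    "<html><body><h1>Mobile crawler</h1><p>Agents migrate.</p></body></html>",
    "<html><body>Campus map</body></html>",
]
payload = pack(pages, ["crawler"])

# On the receiving side the payload is unpacked again.
print(zlib.decompress(payload).decode("utf-8"))
```

Each stage is optional: a full-text search engine could skip filtering and still benefit from selection and compression.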

  21. Performance Evaluation Setup • Two virtual machines (local and remote) plus crawler management system • Set up for mobile as well as traditional Web crawling

  22. Performance Evaluation • Focus on proving viability of mobile crawling approach • Not focusing on analyzing crawling strategies [Cho97] • Controlled environment setup • Static HTML data set with known properties - subset of University of Florida intranet • Apache HTTP server, unshared communication channel • Breadth-first crawling strategy - predictable crawler behavior • Measurements 1. Network load for traditional (stationary) crawler 2. Network load for mobile crawler without page compression 3. Network load for mobile crawler with page compression
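A breadth-first strategy gives predictable behavior because, on a static data set, the visiting order depends only on the link structure and the seed. A minimal sketch follows, assuming the site is given as an in-memory link graph; the graph and the function name are hypothetical, and a real crawler would fetch and parse pages instead.

```python
from collections import deque


def bfs_order(links: dict[str, list[str]], seed: str) -> list[str]:
    """Return pages in breadth-first visiting order, starting from the seed."""
    order = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:       # visit every page at most once
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


# Hypothetical static site: page -> pages it links to.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/"], "/c": []}
print(bfs_order(site, "/"))
```

Running the same traversal for the stationary and the mobile crawler keeps the set of visited pages identical, so differences in network load can be attributed to the crawling approach.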

  23. Benefit of Remote Page Selection • Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection

  24. Benefit of Remote Page Filtering • Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% page data preserved)

  25. Benefit of Page Compression • Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages

  26. Cost Benefit Analysis • Overhead • Overhead due to crawler migration (<5K) • Overhead due to fact-based data representation (6%) • Benefits without page compression • As soon as less than 85% of each page needs to be preserved • As long as less than 90% of all pages are transmitted • Benefits with page compression • Reduction in network load by a factor of 4.5

  27. Summary and Conclusion • Mobile crawling advantages: • Natural fit for distributed web environment • Well suited for topic-specific search engines • Small network overhead due to crawler mobility • Solves scaling problems of traditional crawling approach by allowing filtering operations to be performed remotely • Approach provides a base for smart Web crawling • Currently improving crawler memory management • Completing more realistic testbed consisting of ~10 mobile crawler-enabled Web sites within UF intranet

  28. Ongoing/Future Work • Security • Crawler identification based on digital signatures • Restrict crawler execution to positively identified crawlers • Implement virtual machine as a secure sandbox • Crawler mobility support • Integrate virtual machine into web servers • Comparison with other infrastructures, e.g., IBM Aglet infrastructure (currently ongoing) • Mobile crawling algorithms (currently ongoing) • Optimize crawling algorithms, site relocation algorithms • Carry out analysis
