Using the Web Efficiently: Mobile Crawlers August 7, 1999 Joachim Hammer Database Center University of Florida jhammer@cise.ufl.edu
Presentation Outline • Problem Statement and Goal • Web Crawling Techniques • Traditional Web Crawling • Mobile Web Crawling • Mobile Crawling Architecture • Distributed Runtime Environment • Application Framework • Performance Evaluation • Summary and Future Work Joachim Hammer - UF
What’s Wrong with the Web? • The Web is a large, distributed hypertext system • 1.6 million Web sites • 320 million Web documents • 40% of the Web content changes within a month • Exponential growth rate • Lacks structure (i.e., no strict hierarchy) Joachim Hammer - UF
Web Indices and Search Engines • Search engine statistics: • Index size: 30-110 million pages (approx. 700GB) • Web coverage: 10%-35% and decreasing! • Daily crawl: 3-10 million pages (approx. 60GB) • Year 2000 estimates: • Index size 880 million pages (approx. 5.6TB) • Daily crawl 80 million pages (approx. 480GB) • Traditional Web crawling will experience severe scaling problems in the near future Joachim Hammer - UF
Goals Long-term • Overlay the distributed Web structure with a centralized information system that allows efficient resource discovery • Turn the Web into an effectively organized and cataloged “digital library” • Topic-specific search engines, e.g., self-health care, consumer electronics, etc. Project • Find an alternative to the current “brute-force” approach to Web crawling/indexing Joachim Hammer - UF
Traditional Web Crawling Approach Based on the Google search engine (www.google.com) Joachim Hammer - UF
Traditional Web Crawling • Characteristics of traditional Web crawling: • Remote data access • Focus on rapid data retrieval • Centralized, database oriented architecture • Resource intensive • Traditional Web crawling techniques do not exploit information about the pages being crawled • “Download first–process later” approach Joachim Hammer - UF
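To make the “download first–process later” point concrete, here is a minimal sketch of a stationary crawler; the class name, seed URL, and processAndIndex placeholder are invented for illustration and are not part of any actual search engine.

```java
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical sketch of a traditional "download first, process later" crawler. */
public class StationaryCrawler {
    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add("http://www.example.edu/");   // placeholder seed URL

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!seen.add(url)) continue;

            // Every page crosses the network before any analysis happens.
            byte[] page = download(url);

            // Only now, at the central search engine, is the page inspected,
            // indexed, and scanned for new links to add to the frontier.
            processAndIndex(url, page, frontier);
        }
    }

    static byte[] download(String url) throws Exception {
        try (InputStream in = new URL(url).openStream()) {
            return in.readAllBytes();
        }
    }

    static void processAndIndex(String url, byte[] page, Queue<String> frontier) {
        // Placeholder for HTML parsing, link extraction, and indexing.
    }
}
```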
(Our) Mobile Crawling Approach Joachim Hammer - UF
Mobile Web Crawling • Crawler code migrates to the host sites where pages are located • Mobility allows a Web crawler to analyze Web pages before investing network resources in their transmission • Characteristics: • Focus on effective data retrieval • Distributed, data-source-oriented architecture: access data where it is stored • Intelligent downloading of only relevant Web content • Resource-preserving approach • “Process first–download later” approach Joachim Hammer - UF
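A minimal sketch of the “process first–download later” idea, assuming the crawler is already executing at the Web site and can read pages from a local document root; the path, keyword list, and class name are placeholders, not details of the actual crawler.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a crawler running *at* the Web site, next to the data. */
public class MobileCrawlerSketch {
    public static void main(String[] args) throws Exception {
        List<String> keywords = List.of("database", "crawler");   // topic of interest (assumed)

        List<Path> localPages;
        try (var paths = Files.walk(Path.of("/var/www/html"))) {  // assumed document root
            localPages = paths.filter(p -> p.toString().endsWith(".html")).toList();
        }

        List<String> relevant = new ArrayList<>();
        for (Path p : localPages) {
            String html = Files.readString(p, StandardCharsets.ISO_8859_1);
            // Analysis happens locally; only matching pages would ever use the network.
            if (keywords.stream().anyMatch(k -> html.toLowerCase().contains(k))) {
                relevant.add(html);
            }
        }
        // In the real system the crawler would now filter, compress, and migrate
        // back to the crawler manager carrying only the relevant pages.
        System.out.println(relevant.size() + " of " + localPages.size() + " pages selected for transmission");
    }
}
```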
State-of-the-Art in Mobile Crawling • Many search engines, little published information • e.g., WWW Worm, WebCrawler, Lycos, AltaVista, Infoseek, Excite, HotBot, Google • Distributed search engines • e.g., Harvest • Mobile code • Simplest form: downloadable Java applets • Code migration and software agents: a very active research area • e.g., IBM Aglet infrastructure • Crawling algorithms Joachim Hammer - UF
Mobile Crawling Architecture Joachim Hammer - UF
Architecture Highlights • Distributed Crawler Runtime Environment • Platform-independent execution environment for crawlers • Virtual machine for remote crawler execution • Communication layer to provide crawler transport service • Application Framework • Crawler manager to support crawler creation and configuration; controls crawler migration and Web site selection • Query engine as crawler/application (database) interface • Archive manager as database connectivity framework Joachim Hammer - UF
A Word About Mobile Crawlers • A crawler is a user-defined set of rules that executes on a virtual machine and collects facts (about Web pages) • Use CLIPS to represent crawler data (e.g., page-facts) and user-defined crawling strategies (as rules) • CLIPS - C Language Integrated Production System • Advantages of the rule-based approach • Easier to specify crawling rules than to devise a crawling algorithm • No need to model control flow • Rule-based programs have simple runtime states Joachim Hammer - UF
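A hedged illustration of what such a rule-based crawler specification might look like: CLIPS-style rules carried as data inside a crawler object. The template, slot names, and threshold are invented for this example and are not taken from the actual system.

```java
import java.util.List;

/** Illustrative crawler-as-data: CLIPS-style templates and rules held as strings. */
public class CrawlerSpecification {
    // Fact template describing a crawled page (illustrative slot names).
    static final String PAGE_TEMPLATE =
        "(deftemplate page (slot url) (slot size) (slot keyword-hits))";

    // A crawling rule: keep any page with more than three keyword hits.
    static final String KEEP_RELEVANT_RULE =
        "(defrule keep-relevant " +
        "  (page (url ?u) (keyword-hits ?h&:(> ?h 3))) " +
        "  => (assert (relevant ?u)))";

    final List<String> rules = List.of(PAGE_TEMPLATE, KEEP_RELEVANT_RULE);
    final List<String> facts = List.of();   // page-facts collected while crawling

    public static void main(String[] args) {
        new CrawlerSpecification().rules.forEach(System.out::println);
    }
}
```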
Crawling Strategies • General-purpose search engines use simple strategies • Crawl and index all pages, e.g., depth-first • For subject-specific crawling, the strategy is important • Find as many of the important pages as possible while crawling the fewest pages overall • Page importance [Cho et al., Stanford University] • Keyword frequency and location • Backlink count • PageRank • Figure out when to return to the crawler manager • Memory management issue Joachim Hammer - UF
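One way these importance signals could be folded into a single score; the weights and the linear combination below are assumptions for illustration, not the metric of Cho et al. or of this project.

```java
import java.util.regex.Pattern;

/** Sketch: weighted mix of keyword frequency, backlink count, and PageRank. */
public class PageImportance {
    static double score(String text, String keyword, int backlinks, double pageRank) {
        String lower = text.toLowerCase();
        int hits = lower.split(Pattern.quote(keyword.toLowerCase()), -1).length - 1;
        int words = lower.isBlank() ? 1 : lower.split("\\s+").length;
        double keywordFreq = (double) hits / words;
        double backlinkScore = backlinks / (backlinks + 10.0);   // squash counts into [0,1)
        return 0.5 * keywordFreq + 0.3 * backlinkScore + 0.2 * pageRank;   // assumed weights
    }

    public static void main(String[] args) {
        System.out.println(score("health care and self health tips", "health", 42, 0.15));
    }
}
```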
Crawler Virtual Machine • How to execute a rule-based crawler specification? • Crawler execution = rule application upon the fact base • Use an inference engine (JESS) for the rule application process • JESS is platform-independent and extensible 1. Initialization • Insert rules and facts into the inference engine 2. Rule application • Start the rule application process within the inference engine 3. Finalization • Extract rules and facts once rule application has stopped • Store them back into the crawler Joachim Hammer - UF
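A sketch of the three phases, written against a hypothetical InferenceEngine interface rather than the real JESS API, whose exact method signatures are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the crawler virtual machine's execution cycle. */
public class CrawlerVirtualMachine {

    /** Minimal stand-in for a mobile crawler: rules plus a fact base. */
    static class Crawler {
        final List<String> rules = new ArrayList<>();
        final List<String> facts = new ArrayList<>();
    }

    /** Hypothetical engine abstraction; the real system delegates to JESS. */
    interface InferenceEngine {
        void load(List<String> rules, List<String> facts); // 1. Initialization
        void run();                                        // 2. Rule application
        List<String> extractFacts();                       // 3. Finalization
    }

    static void execute(Crawler crawler, InferenceEngine engine) {
        engine.load(crawler.rules, crawler.facts);  // insert rules and facts into the engine
        engine.run();                               // fire rules against the fact base
        crawler.facts.clear();                      // store the (grown) fact base back
        crawler.facts.addAll(engine.extractFacts()); // so the crawler can migrate or return
    }
}
```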
Crawler Virtual Machine Joachim Hammer - UF
Crawler Manager • Centralized control unit for mobile crawlers • Create new crawlers based on a user-defined crawler specification • Provide the initial destination using seed URLs • Determine subsequent itineraries for crawlers • Requires knowledge about mobile-crawler-enabled sites • Estimate the importance of Web sites, e.g., backlink count, hot page count, ... • Coordinate many crawlers running in parallel Joachim Hammer - UF
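A sketch of the dispatching idea, assuming a registry of crawler-enabled sites ranked by an importance estimate; the site URLs, backlink numbers, and round-robin assignment are invented for illustration.

```java
import java.util.Comparator;
import java.util.List;

/** Sketch: rank known crawler-enabled sites and hand out seed URLs to parallel crawlers. */
public class CrawlerManager {

    record Site(String url, int estimatedBacklinks) {}

    public static void main(String[] args) {
        List<Site> crawlerEnabledSites = List.of(          // hypothetical registry
            new Site("http://www.cise.ufl.edu/", 120),
            new Site("http://www.ufl.edu/", 300),
            new Site("http://www.example.org/", 15));

        int parallelCrawlers = 2;

        // Rank sites by estimated importance (here, just a backlink count) ...
        List<Site> ranked = crawlerEnabledSites.stream()
            .sorted(Comparator.comparingInt(Site::estimatedBacklinks).reversed())
            .toList();

        // ... and assign seed URLs round-robin to the crawlers running in parallel.
        for (int i = 0; i < ranked.size(); i++) {
            System.out.println("crawler " + (i % parallelCrawlers) + " -> " + ranked.get(i).url());
        }
    }
}
```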
Crawler Query Engine • Used to access a crawler’s contents (its “hot pages”) after it returns • Characteristics • Provides a query facility for the crawler fact base • Implements an SQL subset as the query language • Represents query results as data tuples, not as facts • Allows the user to reason about crawling results • The query engine implementation uses the inference engine • The query engine serves as the primary interface between the user application and the mobile crawler • An SQL database holds the Web index Joachim Hammer - UF
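A sketch of the fact-to-tuple step: the fact layout and the hard-coded filter and projection below stand in for the SQL subset mentioned above, whose exact syntax is not shown in the slides.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch: turn crawler page-facts into query-result tuples. */
public class QueryEngineSketch {
    public static void main(String[] args) {
        // Page-facts as the crawler might hand them back, one map per fact (assumed layout).
        List<Map<String, Object>> facts = List.of(
            Map.of("url", "http://www.cise.ufl.edu/a.html", "keywordHits", 5, "size", 2048),
            Map.of("url", "http://www.cise.ufl.edu/b.html", "keywordHits", 1, "size", 512));

        // Roughly: SELECT url, size FROM page WHERE keywordHits > 3
        List<List<Object>> tuples = facts.stream()
            .filter(f -> (int) f.get("keywordHits") > 3)
            .map(f -> List.of(f.get("url"), f.get("size")))
            .collect(Collectors.toList());

        tuples.forEach(System.out::println);   // results as tuples, not as facts
    }
}
```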
Crawler Query Engine [Diagram of the query engine: the crawler object (crawler rules, crawler facts), the query engine with its inference engine and rule compiler, the user query, and the resulting query result tuples] Joachim Hammer - UF
Mobile Crawling Advantages • Remote page selection • Determine significance of a page prior to transmission • Applicable for topic-specific search engines • Remote page filtering • Control the granularity of the retrieved data • Applicable for non-fulltext search engines • Remote page compression • Compress page data prior to transmission • Applicable for all search engines Joachim Hammer - UF
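A sketch of remote page compression, using GZIP as a stand-in codec since the slides do not name the compression scheme actually used; only the compressed bytes would cross the network back to the search engine.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Sketch: compress page data at the Web site, ship only the compressed bytes. */
public class RemoteCompression {
    static byte[] compress(String page) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(page.getBytes(StandardCharsets.ISO_8859_1));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body>" + "Florida ".repeat(500) + "</body></html>";
        byte[] compressed = compress(page);
        System.out.printf("before: %d bytes, after: %d bytes%n", page.length(), compressed.length);
    }
}
```

A real deployment would of course need the matching decompression step at the crawler manager before indexing.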
Performance Evaluation Setup • Two virtual machines (local and remote) plus crawler management system • Set up for mobile as well as traditional Web crawling Joachim Hammer - UF
Performance Evaluation • Focus on proving viability of mobile crawling approach • Not focusing on analyzing crawling strategies [Cho97] • Controlled environment setup • Static HTML data set with known properties - subset of University of Florida intranet • Apache HTTP server, unshared communication channel • Breadth-first crawling strategy - predictable crawler behavior • Measurements 1. Network load for traditional (stationary) crawler 2. Network load for mobile crawler without page compression 3. Network load for mobile crawler with page compression Joachim Hammer - UF
Benefit of Remote Page Selection • Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection Joachim Hammer - UF
Benefit of Remote Page Filtering • Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% page data preserved) Joachim Hammer - UF
Benefit of Page Compression • Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages Joachim Hammer - UF
Cost-Benefit Analysis • Overhead • Overhead due to crawler migration (&lt;5 KB) • Overhead due to fact-based data representation (6%) • Benefits without page compression • As soon as less than 85% of each page needs to be preserved • As long as less than 90% of all pages are transmitted • Benefits with page compression • Reduction in network load by a factor of 4.5 Joachim Hammer - UF
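A back-of-the-envelope model plugging in the figures from this slide (5 KB migration overhead, 6% fact-representation overhead, compression factor 4.5); the 100 MB site size and 30% selection rate are invented purely to make the arithmetic concrete.

```java
/** Rough network-load comparison using the slide's overhead and compression figures. */
public class CostModel {
    public static void main(String[] args) {
        double siteBytes    = 100e6;  // hypothetical amount of page data at one site
        double selected     = 0.30;   // fraction of pages the crawler keeps (assumed)
        double factOverhead = 1.06;   // +6% for the fact-based representation (from slide)
        double migration    = 5e3;    // < 5 KB of crawler code per hop (from slide)
        double compression  = 4.5;    // measured reduction with page compression (from slide)

        double stationary = siteBytes;                                          // download everything
        double mobile     = migration + selected * siteBytes * factOverhead;    // remote selection only
        double mobileGz   = migration + selected * siteBytes * factOverhead / compression;

        System.out.printf("stationary: %.1f MB%n", stationary / 1e6);
        System.out.printf("mobile:     %.1f MB%n", mobile / 1e6);
        System.out.printf("mobile+gz:  %.1f MB%n", mobileGz / 1e6);
    }
}
```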
Summary and Conclusion • Mobile crawling advantages: • Natural fit for the distributed Web environment • Well suited for topic-specific search engines • Small network overhead due to crawler mobility • Solves the scaling problems of the traditional crawling approach by allowing filtering operations to be performed remotely • The approach provides a base for smart Web crawling • Currently improving crawler memory management • Completing a more realistic testbed consisting of ~10 mobile-crawler-enabled Web sites within the UF intranet Joachim Hammer - UF
Ongoing/Future Work • Security • Crawler identification based on digital signatures • Restrict crawler execution to positively identified crawlers • Implement the virtual machine as a secure sandbox • Crawler mobility support • Integrate the virtual machine into Web servers • Comparison with other infrastructures, e.g., the IBM Aglet infrastructure (currently ongoing) • Mobile crawling algorithms (currently ongoing) • Optimize crawling and site relocation algorithms • Carry out analysis Joachim Hammer - UF