Rudd Stevens CS 690 Spring 2006 SLASHPack: Collector Performance Improvement and Evaluation SLASHPack Collector - 5/4/2006
SLASHPack Collector - 5/4/2006 Outline 1. Introduction, system overview and design. 2. Performance modifications, re-factoring and re-structuring. 3. Performance testing results and evaluation.
SLASHPack Collector - 5/4/2006 Introduction
SLASHPack Toolkit (Semi-LArge Scale Hypertext Package): sponsored by Prof. Chris Brooks, engineered for initial clients Nancy Montanez and Ryan King.
Collector component: a framework for collecting documents.
Project goal: evaluate and improve its performance.
SLASHPack Collector - 5/4/2006 Contact and Information Sources
Contact Information: Rudd Stevens, rstevens (at) cs.usfca.edu
Project Website: http://www.cs.usfca.edu/~rstevens/slashpack/collector/
Project Sponsor: Professor Christopher Brooks, Department of Computer Science, University of San Francisco, cbrooks (at) cs.usfca.edu
SLASHPack Collector - 5/4/2006 Stages
1. Addition of a protocol module for the Weblog data set.
2. Performance testing using the Weblog and HTTP modules; identify problem areas.
3. Modify the Collector to improve scalability and performance.
4. Repeat the performance testing and evaluate the performance improvements.
SLASHPack Collector - 5/4/2006 Implementation
Language: Python, 2.4 or later.
Platform: any Python-supported OS (developed and tested under Linux).
Progress: fully built, newly re-factored for performance and usability.
SLASHPack Collector - 5/4/2006 High level design
SLASHPack is designed as a framework: modular components that contain sub-modules.
The Collector is pluggable for protocol modules, parsers, filters, output writers, etc.
SLASHPack Collector - 5/4/2006 High level design (cont.)
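As a rough illustration of the pluggable design described above, here is a minimal Python 2.4-era sketch of how a framework core might drive interchangeable protocol modules, parsers, and output writers. The class and function names are hypothetical and are not the Collector's actual API.

class ProtocolModule:
    """Interface a pluggable protocol module might expose (illustrative only)."""
    def fetch(self, url):
        """Return the raw document found at url."""
        raise NotImplementedError

class HttpProtocol(ProtocolModule):
    """Toy HTTP module built on the Python 2.4 standard library."""
    def fetch(self, url):
        import urllib2
        return urllib2.urlopen(url).read()

def collect(urls, protocol, parsers, writers):
    """Core loop: fetch each URL, run the parsers, hand the result to each writer."""
    for url in urls:
        doc = protocol.fetch(url)
        for parse in parsers:
            doc = parse(doc)
        for write in writers:
            write(url, doc)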
SLASHPack Collector - 5/4/2006 Outline 1. Introduction, system overview and design. 2. Performance modifications, re-factoring and re-structuring. 3. Performance testing results and evaluation.
SLASHPack Collector - 5/4/2006 Performance Testing
Workloads: large-scale text collection, the Weblog data set, and long web crawls.
Performance monitoring: Python profiling and integrated statistics.
Functionality testing: Python logging and functionality test runs.
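A minimal sketch of the kind of profiling run used for performance monitoring, built on the profile and pstats modules that ship with Python 2.4. The collector module and its run() entry point are hypothetical names, not the Collector's real interface.

import profile
import pstats
import collector   # hypothetical module name

# Profile one collection run, save the raw timings, then report the
# twenty most expensive calls by cumulative time.
profile.run('collector.run()', 'collector.prof')
stats = pstats.Stats('collector.prof')
stats.sort_stats('cumulative').print_stats(20)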
SLASHPack Collector - 5/4/2006 Collector Runtime Statistics
UrlSentry
  Urls filtered using robots: 38
  Urls filtered for depth: 9
  Urls Processed: 5881
  Urls filtered using filters: 165
UrlBookkeeper
  Duplicate Urls: 1557
  Urls recorded: 4104
UrlFrontier
  Url Frontier size, current number of links: 3465
  Urls requested from frontier: 659
  Url Frontier, current number of server queues: 78
  Urls delivered from frontier: 639
Collector
  Documents per second: 3.70328865405
  Total runtime: 2 Minutes 31.4869761467 Seconds
SLASHPack Collector - 5/4/2006 Collector Runtime Statistics
DocFingerprinter
  Documents Written: 386
  Average Document Size (bytes): 20570
  HTTP Status Responses: 200: 394, 204: 10, 301: 8, 302: 25, 404: 91, 403: 7, 401: 1, 400: 24, 500: 1
  Duplicate Documents: 51
  Total Documents Collected: 561
  Documents by mimetype: text/xml: 1, image/jpeg: 1, text/html: 451, image/gif: 1, text/plain: 106, application/octet-stream: 1
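A hedged sketch of how integrated runtime statistics like those above could be tallied inside a collector; the class name and counter names are illustrative and not taken from the Collector's source.

import time

class RuntimeStats:
    """Tallies named counters and reports a documents-per-second figure."""
    def __init__(self):
        self.start = time.time()
        self.counts = {}

    def incr(self, name, amount=1):
        self.counts[name] = self.counts.get(name, 0) + amount

    def report(self):
        lines = ['%s: %d' % (name, value)
                 for name, value in sorted(self.counts.items())]
        elapsed = time.time() - self.start
        docs = self.counts.get('Total Documents Collected', 0)
        lines.append('Documents per second: %f' % (docs / elapsed))
        lines.append('Total runtime: %f seconds' % elapsed)
        return '\n'.join(lines)

Components would call incr('Duplicate Documents') and the like as they work, and report() produces output along the lines of the statistics slides above.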
SLASHPack Collector - 5/4/2006 Challenges
Large text (XML) files: 21 one-gigabyte XML files, ~450,000 files per XML file, ~10 million files after processing.
Memory/Storage: disk space, and memory usage during (XML) processing.
SLASHPack Collector - 5/4/2006 Weblog raw data
<post>
  <weblog_url> http://www.livejournal.com/users/chuckdarwin </weblog_url>
  <weblog_title> ""Evolve!"" </weblog_title>
  <permalink> http://www.livejournal.com/users/chuckdarwin/1001264.html </permalink>
  <post_title> Flickr </post_title>
  <author_name> Darwin (chuckdarwin) </author_name>
  <date_posted> 2005-07-09 </date_posted>
  <time_posted> 000000 </time_posted>
  <content> <html><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/><title>""Evolve!""</title></head><body> <div style="text-align: center;"><font size="+1"><a href="http://www.nytimes.com/2005/07/09/arts/09boxe.html?ei=5088&amp;en=61cfcd5835008b1a&amp;ex=1278561600&amp;partner=rssnyt&amp;emc=rss&amp;pagewanted=print">7/7 and 9/11?</a></font></div></body></html> </content>
  <outlinks>
    <outlink>
      <url> http://www.nytimes.com/2005/07/09/arts/09boxe.html </url>
      <site> http://www.nytimes.com </site>
      <type> Press </type>
    </outlink>
  </outlinks>
</post>
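One common way to process XML files this large without a DOM (and without holding a gigabyte in memory) is a streaming SAX parse. The sketch below, written against the xml.sax module in the Python 2.4 standard library, collects the flat text fields of each <post> shown above and hands each finished record to a callback. It illustrates the streaming approach only; it is not the parser the Collector actually switched to, and it ignores nested structure such as <outlinks>.

import xml.sax

class PostHandler(xml.sax.ContentHandler):
    """Builds a dictionary per <post> so only one post is in memory at a time."""
    def __init__(self, callback):
        xml.sax.ContentHandler.__init__(self)
        self.callback = callback   # called once per completed <post>
        self.post = None           # fields of the post being read
        self.field = None          # name of the element currently open
        self.chars = []

    def startElement(self, name, attrs):
        if name == 'post':
            self.post = {}
        elif self.post is not None:
            self.field = name
            self.chars = []

    def characters(self, data):
        if self.field is not None:
            self.chars.append(data)

    def endElement(self, name):
        if name == 'post':
            self.callback(self.post)
            self.post = None
        elif self.post is not None and name == self.field:
            self.post[name] = ''.join(self.chars).strip()
            self.field = None

def process(path, callback):
    """Stream one large XML file, invoking callback once per <post>."""
    xml.sax.parse(path, PostHandler(callback))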
SLASHPack Collector - 5/4/2006 Weblog processed data
<spdata>
  <url>http://www.livejournal.com/users/chuckdarwin</url>
  <date>20060212</date>
  <crawlname>WeblogPosts20050709</crawlname>
  <weblog>
    <weblog_title>""Evolve!""</weblog_title>
    <permalink>http://www.livejournal.com/users/chuckdarwin/1001264.html</permalink>
    <post_title>Flickr</post_title>
    <author_name>Darwin (chuckdarwin)</author_name>
    <date_posted>2005-07-09</date_posted>
    <time_posted>000000</time_posted>
    <outlinks>
      <outlink>
        <type>Press</type>
        <url>http://www.nytimes.com/2005/07/09/arts/09boxe.html</url>
        <site>http://www.nytimes.com</site>
      </outlink>
    </outlinks>
  </weblog>
  <tags></tags>
  <size>493</size>
  <mimetype>text/plain</mimetype>
  <fingerprint>9949bba4ac535d18c3f11db66cdb194e</fingerprint>
  <content>Jmx0O2h0bWwmZ3Q7CiZsdDtoZWFkJmd0OwombHQ7bWV0YSBjb250ZW50P ….</content>
</spdata>
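The <size>, <fingerprint>, and <content> fields of the processed record can be derived from the raw text alone: the fingerprint is an md5sum of the content (see the field table later in the talk) and <content> holds a base64-encoded copy of the raw text. A small illustrative helper using the md5 and base64 modules from the Python 2.4 standard library; the function name is hypothetical and the surrounding XML writing is omitted.

import md5
import base64

def derived_fields(content):
    """Return the size, fingerprint, and encoded content for one document."""
    return {
        'size':        str(len(content)),
        'fingerprint': md5.new(content).hexdigest(),   # md5sum hash of content
        'content':     base64.b64encode(content),      # stored in <content>
    }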
SLASHPack Collector - 5/4/2006 Original Design
SLASHPack Collector - 5/4/2006 Problems to Address
Overall collection performance: streamline processing.
Robot file look-up: incredibly slow and inefficient. (Not mine!) A caching sketch follows below.
Thread interaction: efficient use of threads and queues to process data.
Inefficient code: Python code is not always the fastest; miniDom XML parsing.
Faster data structures: re-work the collection protocols and DNS pre-fetch; re-structure the URL Frontier and URL Bookkeeper.
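For the robots problem, one standard remedy is to fetch and parse each host's robots.txt only once and answer later look-ups from memory. The sketch below uses the robotparser and urlparse modules from the Python 2.4 standard library; the class name and user-agent string are assumptions, and this is not the Collector's actual fix, which (per the next slides) targeted string creation and debug calls in the robot parser.

import robotparser
import urlparse

class RobotCache:
    """Fetch and parse each host's robots.txt once, then answer from memory."""
    def __init__(self, agent='SLASHPack'):
        self.agent = agent
        self.parsers = {}   # host name -> RobotFileParser

    def allowed(self, url):
        host = urlparse.urlsplit(url)[1]        # network location of the URL
        parser = self.parsers.get(host)
        if parser is None:
            parser = robotparser.RobotFileParser()
            parser.set_url('http://%s/robots.txt' % host)
            parser.read()                       # single network fetch per host
            self.parsers[host] = parser
        return parser.can_fetch(self.agent, url)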
SLASHPack Collector - 5/4/2006 New Design
SLASHPack Collector - 5/4/2006 Performance Modifications
Structure re-design (threading): more queues, more independence.
Robot Parser: string creation, debug calls.
URL Frontier: more efficient data structures.
Protocol Modules: more efficient data structures; re-factoring for reliable collection.
XML parsing: switch to a faster parser, removal of the DOM parser.
DNS Pre-fetching: more efficient structuring (a sketch follows below).
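As an illustration of the DNS pre-fetching idea, the sketch below resolves upcoming host names in a background thread and stores the addresses for fetch threads to use later, relying only on the socket, threading, and Queue modules available in Python 2.4. It shows the general technique, not the Collector's exact structuring.

import socket
import threading
import Queue

resolved = {}   # host name -> IP address, shared with the fetch threads

def dns_prefetch(host_queue):
    """Resolve hosts ahead of time so fetch threads do not block on DNS."""
    while True:
        host = host_queue.get()
        if host is None:               # sentinel value: shut the worker down
            return
        try:
            resolved[host] = socket.gethostbyname(host)
        except socket.error:
            pass                       # unresolvable hosts are simply skipped

hosts = Queue.Queue()
worker = threading.Thread(target=dns_prefetch, args=(hosts,))
worker.setDaemon(True)
worker.start()
hosts.put('www.example.com')           # hosts are queued as URLs enter the frontier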
SLASHPack Collector - 5/4/2006 New data structures
Dictionary fields for the base data type (must be implemented by any data protocol). The document is now passed as a dictionary to the storage component; an illustrative example follows the table.

  Key          Value                        Type
  datatype     user-defined datatype name   string
  status       HTTP document status         string
  url          URL of document              string
  date         collection date              string
  crawlname    name of current crawl        string
  size         byte length of content       string
  mimetype     mime type of document        string
  fingerprint  md5sum hash of content       string
  content      raw text of document         string
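An example of the dictionary a protocol module might hand to the storage component, populated with the fields from the table above. All values here are placeholders, and the storage.write() call is hypothetical.

document = {
    'datatype':    'http',                              # user-defined datatype name
    'status':      '200',                               # HTTP document status
    'url':         'http://www.example.com/',           # URL of document
    'date':        '20060212',                          # collection date
    'crawlname':   'ExampleCrawl20060212',              # name of current crawl
    'size':        '1363',                              # byte length of content
    'mimetype':    'text/html',                         # mime type of document
    'fingerprint': '9949bba4ac535d18c3f11db66cdb194e',  # md5sum hash of content
    'content':     '<html>...</html>',                  # raw text of document
}
storage.write(document)   # hypothetical call into the storage component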
SLASHPack Collector - 5/4/2006 Outline 1. Introduction, system overview and design. 2. Performance modifications, re-factoring and re-structuring. 3. Performance testing results and evaluation.
SLASHPack Collector - 5/4/2006 Performance Comparison
Initial results:
  Weblog data set, w/o parsing, robots: 161 doc/s, 50 min.
  Weblog data set, w/ parsing, robots: 3.9 doc/s, 162 min. (killed)
  HTTP web crawl, 100 docs, w/ parsing, robots: 0.2 doc/s, 16 min 13 s.
  HTTP web crawl, 150 docs, w/ parsing, robots: 0.3 doc/s, 21 min 3 s.
Modified results:
  Weblog data set, w/o parsing, robots: 170 doc/s, 42 min.
  Weblog data set, w/ parsing, robots: 186 doc/s, 63 min.
  HTTP web crawl, 100 docs, w/ parsing, robots: 2.2 doc/s, 1 min 10 s.
  HTTP web crawl, 150 docs, w/ parsing, robots: 2.9 doc/s, 1 min 14 s.
SLASHPack Collector - 5/4/2006 Performance Comparison (cont.)
Hardware considerations: HTTP web crawl for 500 documents.
  Pentium 4 2.4 GHz, 1 GB RAM: 3.7 doc/s, 3 min 18 s, 728 docs total (faster connection).
  Pentium 4 2.0 GHz, 1 GB RAM: 3.7 doc/s, 4 min 25 s, 725 docs total.
  Pentium 4 3.2 GHz HT, 2 GB RAM: 4.3 doc/s, 2 min 47 s, 717 docs total (faster connection).
SLASHPack Collector - 5/4/2006 Performance Comparison (cont.)
• Comparison to other web crawlers (published results, 1999):
  • Google: 33.5 doc/s
  • Internet Archive: 46.3 doc/s
  • Mercator: 112 doc/s
• Consideration of functionality:
  • More than just a web crawler.
  • Handles multiple MIME types.
SLASHPack Collector - 5/4/2006 Available Documentation
API documentation (pydoc-style), generated with Epydoc.
Use and configuration guide (README).
Quick start guide.
Full report: full specification of the Collector, its use, configuration, and development background.
SLASHPack Collector - 5/4/2006 Future Work Addition of pluggable modules. Improved fingerprint sets. Improved Python memory management and threading.
SLASHPack Collector - 5/4/2006 References
Allan Heydon and Marc Najork. Mercator: A scalable, extensible web crawler. http://research.compaq.com/SRC/mercator/papers/www/paper.pdf
Soumen Chakrabarti. Mining the Web, 2002. Ch. 2, pages 17-43.
Heritrix, Internet Archive. http://crawler.archive.org/
Python Performance Tips. http://wiki.python.org/moin/PythonSpeed/PerformanceTips
Prof. Chris Brooks and the SLASHPack Team.
SLASHPack Collector - 5/4/2006 Conclusion • Four stages: • Addition of protocol module for Weblog data set. • Performance testing and identifying problem areas. • Modify Collector to improve scalability and performance. • Repeat performance testing and evaluate performance improvements. • Results: • Expanded functionality for data types. • Modifications improved performance. • More stable and flexible design.