(Web) Crawlers Domain
Presented by: Or Shoham, Amit Yaniv, Guy Kroupp, Saar Kohanovitch
Crawlers - Presentation 2 - April 2008
Crawlers
1. Crawlers: Background
2. Unified Domain Model
3. Individual Applications
  3.1 WebSphinx
  3.2 WebLech
  3.3 Grub
  3.4 Aperture
4. Summary and Conclusions
Crawlers – Background
• What is a crawler?
  • Collects information about internet pages
  • There is a near-infinite number of web pages and no central directory
  • Uses the links contained in pages to discover new pages to visit
• How do crawlers work?
  • Pick a starting page URL (the seed)
  • Load the starting page from the internet
  • Find all links in the page and enqueue them
  • Extract any desired information from the page
  • Loop
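The loop above can be sketched in a few lines of Java. This is a minimal illustration, not code from any of the crawlers discussed: the "web" is a hard-coded map from page to links, standing in for the fetch-and-parse step.

```java
import java.util.*;

public class CrawlLoopSketch {
    // Hypothetical in-memory "web": page URL -> links it contains.
    static final Map<String, List<String>> WEB = Map.of(
        "a", List.of("b", "c"),
        "b", List.of("c"),
        "c", List.of("a"));

    // The generic crawl loop from the slide: seed, fetch, extract links,
    // enqueue unseen ones, repeat until the frontier is empty.
    static List<String> crawl(String seed) {
        Deque<String> frontier = new ArrayDeque<>(List.of(seed));
        Set<String> seen = new HashSet<>(List.of(seed));
        List<String> visited = new ArrayList<>();
        while (!frontier.isEmpty()) {
            String url = frontier.poll();                           // pick next URL
            List<String> links = WEB.getOrDefault(url, List.of());  // "fetch" + "parse"
            visited.add(url);                                       // process the page
            for (String link : links)                               // enqueue new links
                if (seen.add(link)) frontier.add(link);
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a")); // [a, b, c]
    }
}
```

The `seen` set is what keeps the loop from running forever on the link cycle a → c → a.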
Crawlers – Background
• Rules that apply across the domain:
  • All crawlers have a URL Fetcher
  • All crawlers have a Parser (Extractor)
  • Crawlers are multi-threaded processes
  • All crawlers have a Crawler Manager
  • All crawlers have a Queue structure
• Strongly related to the search engine domain
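The "manager + queue + threads" rules can be sketched as a pool of worker threads draining a shared frontier. All names here are illustrative, not taken from any specific crawler's API, and the fetch/parse/store work is reduced to recording the URL.

```java
import java.util.*;
import java.util.concurrent.*;

public class ManagerSketch {
    // Crawler-manager sketch: seeds go into a shared queue, and a fixed
    // pool of worker threads drains it concurrently.
    public static Set<String> run(List<String> seeds, int threads) throws Exception {
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>(seeds);
        Set<String> fetched = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                String url;
                while ((url = frontier.poll()) != null) {
                    fetched.add(url); // stand-in for fetch + parse + store
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return fetched;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(List.of("u1", "u2", "u3"), 2).size()); // 3
    }
}
```

A real crawler's workers would also push newly extracted links back onto the frontier, which is why termination detection is harder in practice than in this sketch.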
Unified Domain Class Diagram
• Common features; some classes added by code modeling
• Classes: SpiderConfig, Scheduler, Spider, ExternalDB, Merger, Queue, Thread, DB, Robots, StorageManager, Extractor, Filter, Fetcher, PageData, CrawlerHelper
Unified Domain Sequence Diagram
• Phases: pre-crawling, pre-fetching, fetching and extracting, post-processing, finish crawling
• A main loop spans the fetching/extracting and post-processing phases; some objects are optional
Unified Domain – Applications
• For the User Modeling group, the applications were the first chance to see things in practice
• For the entire group, the applications provided a fresh view of the domain, which led to many changes (Assignment 2)
• With everyone viewing the applications in the domain context, most differences could be explained as application-specific
• Interesting experiment: let a new Code Modeling group use the applications as the basis for the domain?
WebSphinx
• WebSphinx: Website-Specific Processors for HTML INformation eXtraction (2002)
• The WebSphinx class library provides support for writing web crawlers in Java
• Designation: small-scope crawls for mirroring, offline viewing, and hyperlink trees
• Extensible to saving information about page elements
WebSphinx Hyperlink Tree
WebSphinx
• Diagram classes: Spider, Queue (Configuration), Settings, Robots, Filters, Mirror, Extractor, Fetcher, PageData, StorageManager, Thread, Link, Scheduler, Element
• Link: a link is a type of element, usually <A HREF=""></A>, which points to a specific page or file. Storing information about each link relative to our seeds can help us analyze results.
• Mirror: a collection of files (pages) intended to provide a perfect copy of another website.
• Element: web pages are composed of many elements (<element></element>). Elements can be nested (for example, <body> will have many child elements).
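The Link concept above can be illustrated with a toy extractor. WebSphinx itself parses pages into a full element tree; the regex-based version below is only a hypothetical stand-in that handles the simple quoted `<a href="...">` form.

```java
import java.util.*;
import java.util.regex.*;

public class LinkSketch {
    // Toy Link extraction: find <a href="..."> elements and collect the
    // page or file each one points to. Not WebSphinx's actual parser.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern
            .compile("<a\\s+href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE)
            .matcher(html);
        while (m.find()) links.add(m.group(1));
        return links;
    }

    public static void main(String[] args) {
        String page = "<body><a href=\"p1.html\">one</a><a href=\"p2.html\">two</a></body>";
        System.out.println(extractLinks(page)); // [p1.html, p2.html]
    }
}
```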
WebSphinx
WebLech
• WebLech allows you to "spider" a website and recursively download all the pages on it.
WebLech
• WebLech is a fully featured website download/mirror tool in Java, which supports:
  • downloading websites
  • emulating standard web-browser behavior
• WebLech is multithreaded and will feature a GUI console.
WebLech
• Open-source MIT license means it's totally free and you can do what you want with it
• Pure Java code means you can run it on any Java-enabled computer
• Multi-threaded operation for downloading lots of files at once
• Supports basic HTTP authentication for accessing password-protected sites
• HTTP referrer support maintains link information between pages (needed to spider some websites)
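The last two features boil down to two request headers. A minimal sketch, with made-up credentials and URL and no actual network call: Basic authentication sends `user:password` base64-encoded, and the referrer is a plain `Referer` header naming the linking page.

```java
import java.util.Base64;

public class HeaderSketch {
    // Build the HTTP Basic authentication header for a user/password pair.
    static String basicAuth(String user, String pass) {
        String token = Base64.getEncoder()
            .encodeToString((user + ":" + pass).getBytes());
        return "Authorization: Basic " + token;
    }

    public static void main(String[] args) {
        // Hypothetical credentials and referring page, for illustration only.
        System.out.println(basicAuth("bob", "secret"));
        System.out.println("Referer: http://example.org/index.html");
    }
}
```

(The header name really is spelled "Referer", a misspelling preserved by the HTTP specification.)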
WebLech
• Lots of configuration options:
  • Depth-first or breadth-first traversal of the site
  • Candidate URL filtering, so you can stick to one web server, one directory, or just spider the whole web
  • Configurable caching of downloaded files allows restart without needing to download everything again
  • URL prioritization, so you can get interesting files first and leave boring files till last (or ignore them completely)
  • Checkpointing, so you can snapshot spider state in the middle of a run and restart without lots of reprocessing
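The depth-first vs. breadth-first option is just a frontier policy: the same crawl loop is breadth-first with a FIFO queue and depth-first with a LIFO stack. A sketch over a tiny made-up link graph (this illustrates the concept, not WebLech's implementation):

```java
import java.util.*;

public class TraversalSketch {
    // Hypothetical link graph: page -> pages it links to.
    static final Map<String, List<String>> LINKS = Map.of(
        "root", List.of("a", "b"),
        "a", List.of("a1"),
        "b", List.of("b1"));

    // One loop, two traversals: pollFirst gives FIFO (breadth-first),
    // pollLast gives LIFO (depth-first).
    static List<String> crawl(String seed, boolean depthFirst) {
        Deque<String> frontier = new ArrayDeque<>(List.of(seed));
        Set<String> seen = new HashSet<>(List.of(seed));
        List<String> order = new ArrayList<>();
        while (!frontier.isEmpty()) {
            String url = depthFirst ? frontier.pollLast() : frontier.pollFirst();
            order.add(url);
            for (String link : LINKS.getOrDefault(url, List.of()))
                if (seen.add(link)) frontier.addLast(link);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawl("root", false)); // [root, a, b, a1, b1]
        System.out.println(crawl("root", true));  // [root, b, b1, a, a1]
    }
}
```

Breadth-first visits pages level by level from the seed; depth-first dives down one branch before backtracking.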
Class Diagram
Sequence Diagram
Common Features
Common Features
Unique Features
Grub Crawler
• A little bit about the SETI@home distributed computing project
• What are distributed crawlers?
• Why distributed crawlers?
• Pros & cons of distributed crawlers
Class Diagram
Class Diagram (2): Spider & Thread; Config & Robots
Class Diagram (3): Extractor; Queue & StorageManager; Fetcher
Sequence Diagram
Sequence Diagram
Use Case
Aperture
• Development year: 2005
• Designation: crawling and indexing
  • Crawls different information systems
  • Handles many common file formats
  • Flexible architecture
• Main process phases:
  • Fetch information from a chosen source
  • Identify the source type (MIME type)
  • Extract full text and metadata
  • Store and index the information
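The four phases above can be sketched as a straight-line pipeline. The content sniffing below is a toy stand-in for Aperture's real MIME-type identification, and the tag-stripping "extractor" and the map used as a store are illustrative only.

```java
import java.util.*;

public class PipelineSketch {
    // Phase 2 stand-in: guess the MIME type from the first bytes.
    static String identify(byte[] data) {
        String head = new String(data, 0, Math.min(5, data.length));
        return head.startsWith("<html") ? "text/html" : "text/plain";
    }

    // Phase 3 stand-in: pick extraction logic based on the MIME type.
    static String extract(String mime, byte[] data) {
        if (mime.equals("text/html"))
            return new String(data).replaceAll("<[^>]*>", ""); // strip tags
        return new String(data);
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();                    // stand-in index
        byte[] fetched = "<html><body>hello</body></html>".getBytes();  // phase 1: fetch
        String mime = identify(fetched);                                // phase 2: identify
        String text = extract(mime, fetched);                           // phase 3: extract
        store.put("http://example.org/", text);                         // phase 4: store
        System.out.println(mime + " -> " + text); // text/html -> hello
    }
}
```

The point of the MIME step is dispatch: the identified type decides which of the many extractors handles the fetched bytes.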
Aperture Web Demo
• Go to: http://www.dfki.unikl.de/ApertureWebProject/
Aperture Class Diagram
• Domain classes covered: Spider, SpiderConfig, Queue, Thread, Scheduler, Robots, StorageManager, DB, Fetcher, CrawlerHelper, Extractor
• Interface CrawlReport (unique to Aperture): helps the crawler keep the necessary information about the crawl's changing status, failures, and successes
• Class Mime (unique to Aperture): identifies the source type in order to choose the correct extractor. Aperture offers many extractors able to extract data and metadata from files, email, sites, calendars, etc.
• Classes DataObject and RDFContainer (unique to Aperture): represent a source object after fetching it. The object includes the source's data and metadata in RDF format.
• Aperture offers a crawler for each data source; our domain focuses on web crawling!
Aperture Sequence Diagram
Summary – ADOM
• ADOM was helpful in establishing domain requirements
• With a better understanding of ADOM, abstraction became easier; the level of abstraction improved with each assignment
• Using XOR and OR constraints on relations was helpful in creating the domain class diagram
• It was difficult not to get carried away with "it's optional, no harm in adding it" decisions
Summary – Domain Modeling
• Difficulty in modeling functional entities: functions are often contained within another class
• Difficult to model when many optional entities exist, some of which heavily impact class relations and sequences
• Vast differences in application scale
• Next time, we'll pick a different domain…
Crawlers
• Thank you
• Any questions?