Web Categorization Crawler – Part I. Mohammed Agabaria Adam Shobash Supervisor: Victor Kulikov Winter 2009/10 Final Presentation Sep. 2010. Contents. Crawler Overview Introduction and Basic Flow Crawling Problems Project Technologies Project Main Goals System High Level Design
Web Categorization Crawler – Part I
Mohammed Agabaria, Adam Shobash
Supervisor: Victor Kulikov
Winter 2009/10
Final Presentation, Sep. 2010
Contents
• Crawler Overview
  • Introduction and Basic Flow
  • Crawling Problems
• Project Technologies
• Project Main Goals
• System High Level Design
• System Design
  • Crawler Application Design
  • Frontier Structure
  • Worker Structure
  • Database Design - ERD of DB
  • Storage System Design
  • Web Application GUI
• Summary
Crawler Overview – Intro.
• A Web Crawler is a computer program that browses the World Wide Web in a methodical, automated manner
• The Crawler starts with a list of URLs to visit, called the seeds list
• The Crawler visits these URLs, identifies all the hyperlinks in each page, and adds them to the list of URLs to visit, called the frontier
• URLs from the frontier are recursively visited according to a predefined set of policies
Crawler Overview – Basic Flow
• The basic flow of a standard crawler is as follows:
  • The Frontier, which contains the URLs to visit, is initialized with the seed URLs
  • A URL is picked from the Frontier and the page at that URL is fetched from the internet
  • The fetched page is parsed in order to:
    • Extract hyperlinks from the page
    • Process the page
  • The extracted URLs are added to the Frontier
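The basic flow can be sketched in a few lines. This is an illustrative Python sketch rather than the project's C# code; the names `crawl`, `LinkExtractor` and `FAKE_WEB`, the `max_pages` limit and the in-memory "web" are all assumptions made so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Run the basic flow; `fetch(url)` returns HTML text or None."""
    frontier = deque(seeds)          # 1. initialise the frontier with the seeds
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()     # 2. pick a URL from the frontier
        html = fetch(url)            #    and fetch its page
        if html is None:
            continue
        visited.append(url)          # 3. "process" the page (here: record it)
        parser = LinkExtractor()
        parser.feed(html)            # 4. extract hyperlinks
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:     # 5. add new URLs to the frontier
                seen.add(link)
                frontier.append(link)
    return visited


# A tiny in-memory "web" (hypothetical) stands in for real HTTP requests.
FAKE_WEB = {
    "http://a.example/": '<a href="/b">b</a> <a href="http://c.example/">c</a>',
    "http://a.example/b": '<a href="/">home</a>',
    "http://c.example/": "no links here",
}
```

Calling `crawl(["http://a.example/"], FAKE_WEB.get)` visits the three pages in breadth-first order; a real crawler would replace `FAKE_WEB.get` with an HTTP fetcher.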
Crawling Problems
• The World Wide Web contains a large volume of data
  • A crawler can only download a fraction of the Web's pages
  • Thus there is a need to prioritize and speed up downloads, and to crawl only the relevant pages
• Dynamic page generation
  • May cause duplication in the content retrieved by the crawler
  • Can also create crawler traps
    • Endless combinations of HTTP requests to the same page
• Fast rate of change
  • Pages that were downloaded may have changed since the last time they were visited
  • Some crawlers may need to revisit pages in order to keep their data up to date
Project Technologies
• C# (C Sharp), a simple, modern, general-purpose, object-oriented programming language
• ASP.NET, a web application framework
• A relational database
• SQL, a database language for managing data
• SVN, a revision control system that maintains current and historical versions of files
Project Main Goals
• Design and implement a scalable and extensible crawler
• Use a multi-threaded design in order to utilize all the system resources
• Increase the crawler’s performance by implementing efficient algorithms and data structures
• Design the Crawler in a modular way, with the expectation that new functionality will be added by others
• Build a friendly web application GUI that includes all the features needed to follow the crawl progress
System High Level Design
• There are 3 major parts in the System
  • Crawler (Server Application)
  • Storage System
  • Web Application GUI (User)
[Diagram: the Crawler (a Frontier plus worker1, worker2, ...) loads configurations from the Storage System and stores results in it; the Main GUI stores configurations and views results through the Storage System, which sits in front of the Data Base]
Crawler Application Design
• Maintains and activates both the Frontier and the Workers
  • The Frontier is the data structure that holds the urls to visit
  • A Worker’s role is to fetch and process pages
• Multi-threaded
  • There is a predefined number of Worker threads
  • There is a single Frontier thread
  • Shared resources must be protected from simultaneous access
    • The resource shared between the Workers and the Frontier is the Queue that holds the urls to visit
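The need to protect the shared queue can be illustrated with a small sketch. It is written in Python rather than the project's C#, and `run_workers`, the worker count and the "processing" step (simply recording the URL) are assumptions for illustration, not the project's design.

```python
import queue
import threading


def run_workers(urls, num_workers=3):
    """Process every URL exactly once using a fixed pool of worker threads."""
    shared = queue.Queue()             # thread-safe: locking is handled internally
    for url in urls:
        shared.put(url)

    processed = []
    processed_lock = threading.Lock()  # explicit protection for the result list

    def worker():
        while True:
            try:
                url = shared.get_nowait()
            except queue.Empty:
                return                 # nothing left to do
            with processed_lock:       # only one thread appends at a time
                processed.append(url)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

Because several threads drain the queue concurrently, the order of the returned list is not deterministic; only its contents are.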
Frontier Structure
• Maintains the data structure that contains all the urls that have not been visited yet
  • FIFO Queue*
• Distributes the urls uniformly between the workers
[Diagram: a request leaves the Frontier Queue and passes an Is Seen Test; if the test is false (F) the request is routed to one of the Worker Queues, if it is true (T) the request is deleted]
(*) first implementation
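A minimal sketch of such a frontier, in Python rather than the project's C#. The class and method names are hypothetical, and round-robin routing is only one assumed way to distribute urls uniformly; the slides do not say which scheme the project uses.

```python
from collections import deque


class Frontier:
    """FIFO frontier with a seen test and round-robin routing (a sketch)."""

    def __init__(self, num_workers):
        self.queue = deque()                      # frontier queue (FIFO)
        self.seen = set()                         # urls already routed
        self.worker_queues = [deque() for _ in range(num_workers)]
        self._next = 0                            # next worker to receive a url

    def add(self, url):
        """Accept a new request; the seen test is applied when routing."""
        self.queue.append(url)

    def route_one(self):
        """Route the oldest unseen request to the next worker queue."""
        while self.queue:
            url = self.queue.popleft()
            if url in self.seen:                  # seen test is True: delete
                continue
            self.seen.add(url)                    # seen test is False: route
            self.worker_queues[self._next].append(url)
            self._next = (self._next + 1) % len(self.worker_queues)
            return url
        return None
```

With two workers and the requests a, b, a, c, the duplicate a is dropped and the worker queues end up holding [a, c] and [b].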
Worker Structure
• The Worker fetches a page from the Web and processes the fetched page with the following steps:
  • Extracting all the hyperlinks from the page
  • Filtering out part of the extracted urls
  • Ranking the url*
  • Categorizing the page*
  • Writing the results to the database
  • Writing the extracted urls back to the frontier
[Diagram: worker components: Worker Queue, Fetcher, Page, Extractor, URL filter, Ranker, Categorizer, DB, Frontier Queue]
(*) will be implemented in Part II
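The extract, filter and write-back steps can be sketched as follows, in Python rather than the project's C#. The function names, the regular-expression link extraction (cruder than a real HTML parser) and the host-based filter are assumptions; the filter is loosely modelled on the allowed and restricted networks that the database slide lists as task properties. Ranking and categorizing are left out.

```python
import re
from urllib.parse import urljoin, urlparse

HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def extract_links(base_url, html):
    """Return every href in the page as an absolute URL."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(html)]


def keep_url(url, allowed_hosts, restricted_hosts):
    """Hypothetical filter: http(s) only, honouring allow and deny lists."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname or ""
    if host in restricted_hosts:
        return False
    # An empty allow list means "any host that is not restricted".
    return not allowed_hosts or host in allowed_hosts


def process_page(url, html, frontier, results,
                 allowed_hosts=(), restricted_hosts=()):
    """Extract, filter, record a result, and write urls back to the frontier."""
    links = extract_links(url, html)
    kept = [u for u in links if keep_url(u, allowed_hosts, restricted_hosts)]
    results.append({"url": url, "links_found": len(links)})  # stands in for the DB
    frontier.extend(kept)                                    # write back
    return kept
```

Here `frontier` and `results` are plain lists standing in for the frontier queue and the database.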
Class Diagram of Worker
[Figure: class diagram of the Worker, shown over three slides]
ERD of Data Base
• Tables in the Data Base:
  • Task, contains basic details about the task
  • TaskProperties, contains the following properties of a task: seed list, allowed networks, restricted networks*
  • Results, contains details about the results that the crawler has reached
  • Category, contains details about all the categories that have been defined
  • Users, contains details about the users of the system**
(*) Any other properties can be added and used easily
(**) Not used in the current GUI
Storage System
• The Storage System is the connector class that links both the GUI and the Crawler to the DB
• Through the Storage System you can save data into the database or extract data from it
• The Crawler uses the Storage System to extract the configurations of a task from the DB, and to save the results to the DB
• The GUI uses the Storage System to save the configurations of a task into the DB, and to extract the results from the DB
Class Diagram of Storage System
[Figure: class diagram of the Storage System]
Web Application GUI
• Simple and convenient to use
• User friendly
• The user can do the following:
  • Edit and create a task
  • Launch the Crawler
  • View the results that the crawler has reached
  • Stop the Crawler
Web Categorization Crawler – Part II
Mohammed Agabaria, Adam Shobash
Supervisor: Victor Kulikov
Spring 2009/10
Final Presentation, Dec. 2010
Contents
• Reminder From Part I
  • Crawler Overview
  • System High Level Design
  • Worker Structure
  • Frontier Structure
• Project Technologies
• Project Main Goals
• Categorizing Algorithm
• Ranking Algorithm
  • Motivation
  • Background
  • Ranking Algorithm
• Frontier Structure – Enhanced
  • Ranking Trie
  • Basic Flow
• Summary
Reminder: Crawler Overview
• A Web Crawler is a computer program that browses the World Wide Web in a methodical, automated manner
• The Crawler starts with a list of URLs to visit, called the seeds list
• The Crawler visits these URLs, identifies all the hyperlinks in each page, and adds them to the list of URLs to visit, called the frontier
• URLs from the frontier are recursively visited according to a predefined set of policies
Reminder: System High Level Design
• There are 3 major parts in the System
  • Crawler (Server Application)
  • Storage System
  • Web Application GUI (User)
[Diagram: the Crawler (a Frontier plus worker1, worker2, ...) loads configurations from the Storage System and stores results in it; the Main GUI stores configurations and views results through the Storage System, which sits in front of the Data Base]
Reminder: Worker Structure
• The Worker fetches a page from the Web and processes the fetched page with the following steps:
  • Extracting all the hyperlinks from the page
  • Filtering out part of the extracted urls
  • Ranking the url
  • Categorizing the page
  • Writing the results to the database
  • Writing the extracted urls back to the frontier
[Diagram: worker components: Worker Queue, Fetcher, Page, Extractor, URL filter, Ranker, Categorizer, DB, Frontier Queue]
Reminder: Frontier Structure
• Maintains the data structure that contains all the urls that have not been visited yet
  • FIFO Queue*
• Distributes the urls uniformly between the workers
[Diagram: a request leaves the Frontier Queue and passes an Is Seen Test; if the test is false (F) the request is routed to one of the Worker Queues, if it is true (T) the request is deleted]
(*) first implementation
Project Technologies
• C# (C Sharp), a simple, modern, general-purpose, object-oriented programming language
• ASP.NET, a web application framework
• A relational database
• SQL, a database language for managing data
• SVN, a revision control system that maintains current and historical versions of files
Project Main Goals
• Support categorization of web pages, which tries to match the given content to predefined categories
• Support ranking of web pages, which means building a ranking algorithm that evaluates the relevance (rank) of an extracted link based on the content of its parent page
• Provide a new implementation of the frontier that passes on requests according to their rank, built on a fast and memory-efficient data structure
Categorization Algorithm
• Tries to match the given content to predefined categories
• Every category is described by a list of keywords
• The final match result has two factors:
  • Match Percent, which describes the match between the category keywords and the given content*: [formula]
  • Non-Zero match, which describes how many different keywords appeared in the content: [formula]
• The total match level of the content to the category is the sum of the two factors above: [formula]
(*) each keyword has a maximum number of appearances that are counted; any additional appearances are ignored
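A sketch of how the two factors might be combined, in Python rather than the project's C#. The slide's formulas are not available here, so the definitions below are illustrative assumptions only: Match Percent is taken as the capped keyword occurrences divided by the number of words, and the Non-Zero factor as a bonus scaled by the fraction of distinct keywords present. The names `match_level`, `max_count` and `nonzero_bonus` are hypothetical.

```python
import re


def match_level(content, keywords, max_count=3, nonzero_bonus=0.5):
    """Return an assumed total match level of `content` to one category."""
    words = re.findall(r"[a-z0-9]+", content.lower())
    if not words or not keywords:
        return 0.0
    counted = 0   # keyword occurrences, capped per keyword
    present = 0   # number of distinct keywords that appear at all
    for kw in keywords:
        n = words.count(kw.lower())
        counted += min(n, max_count)   # extra appearances are not counted
        if n > 0:
            present += 1
    match_percent = counted / len(words)                 # assumed definition
    non_zero = nonzero_bonus * present / len(keywords)   # assumed definition
    return match_percent + non_zero                      # sum of the two factors
```

For the content "python python python python code" and the keywords python and java, the fourth "python" is ignored by the cap, giving 3/5 for the match part and 0.5 × 1/2 for the non-zero part under these assumed definitions.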
Categorization Algorithm cont.
• Overall categorization process when matching a certain page to a specific category
[Diagram: the Page Content is turned into a WordList, which is compared with the Category Keywords (Keyword1 ... Keyword n); a Matcher Calculator produces the Match Percent, a NonZero Calculator produces the NonZero Bonus, and the two are combined into the Total Match Level]
Ranking Algorithm - Motivation
• The World Wide Web contains a large volume of data
  • A crawler can only download a fraction of the Web's pages
  • Thus there is a need to prioritize downloads, and to crawl only the relevant pages
• Solution:
  • Give every extracted url a rank according to its relevance to the categories defined by the user
  • The frontier passes on the urls with higher rank first
    • Relevant pages will be visited first
  • The quality of the Crawler depends on the correctness of the ranker
Ranking Algorithm - Background
• Ranking is a kind of prediction
  • The rank must be given to the url when it is extracted from a page
    • It is meaningless to give the page a rank after we have downloaded it
  • The content behind the url is unavailable when the url is extracted
    • The crawler has not downloaded it yet
  • The only information we can rely on when the url is extracted is the page from which it was extracted (the parent page)
• Ranking is done according to the following factors*
  • The rank given to the parent page
  • The relevance of the parent page content
  • The relevance of the text near the extracted url
  • The relevance of the anchor of the extracted url
    • The anchor is the text that appears on the link
(*) Based on the SharkSearch algorithm
Ranking Algorithm – The Formula*
• Predicts the relevance of the content of the page behind the extracted url
• The final rank of the url depends on the following factors
  • Inherited, which describes the relevance of the parent page to the categories: [formula]
  • Neighborhood, which describes the relevance of the nearby text and the anchor of the url: [formula]
    • where ContextRank is given by: [formula]
• The total rank given to the extracted url is obtained from the factors above: [formula]
(*) Based on the SharkSearch algorithm
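Since the slides say the formula is based on Shark-Search, the scoring scheme of that published algorithm gives a rough idea of the shape of the computation. The Python sketch below (the project itself is in C#) follows the Shark-Search structure of an inherited score, a neighborhood score and a weighted sum; it is not the project's own formula. The `relevance` function is a hypothetical stand-in for the categorizer, and the constants `delta`, `beta` and `gamma` are assumed values.

```python
def relevance(text, keywords):
    """Hypothetical similarity: fraction of keywords that occur in the text."""
    if not keywords:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for kw in keywords if kw.lower() in words) / len(keywords)


def rank_url(parent_rank, parent_text, anchor_text, nearby_text, keywords,
             delta=0.5, beta=0.8, gamma=0.5):
    """Predict the rank of an extracted url from its parent page."""
    # Inherited: decayed relevance of the parent page or, if the parent is
    # irrelevant, the decayed rank that the parent itself was given.
    parent_rel = relevance(parent_text, keywords)
    inherited = delta * parent_rel if parent_rel > 0 else delta * parent_rank

    # Neighborhood: the anchor text combined with the text around the link.
    anchor_rel = relevance(anchor_text, keywords)
    # A relevant anchor makes the context count as fully relevant.
    context_rank = 1.0 if anchor_rel > 0 else relevance(nearby_text, keywords)
    neighborhood = beta * anchor_rel + (1 - beta) * context_rank

    # Total rank: weighted combination of the two factors.
    return gamma * inherited + (1 - gamma) * neighborhood
```

A link with a relevant anchor on a relevant parent page scores high, while a "click here" link on an irrelevant page receives only the decayed rank inherited from its parent.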
Frontier Structure – Ranking Trie
• A customized data structure that stores the url requests efficiently
• Holds two sub data structures
  • Trie, a data structure that holds url strings efficiently for the already-seen test
  • RankTable, an array of entries in which each entry holds a list of all the url requests that share the rank level given by the array index
• Supports the url seen test in O(|urlString|)
  • Every seen url is saved in the trie
• Supports passing on the urls with higher rank first in O(1)
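The two sub-structures can be sketched as below, in Python rather than the project's C#. The class layout, the number of rank levels and the mapping of a rank in [0, 1] to a level are assumptions. With a fixed number of levels, finding the highest non-empty level takes a bounded number of steps, which is one way to read the O(1) claim.

```python
from collections import deque


class RankingTrie:
    """Trie for the seen test plus a table of request lists per rank level."""

    def __init__(self, levels=10):
        self.root = {}                                  # nested dicts, one per char
        self.rank_table = [deque() for _ in range(levels)]

    def add(self, url, rank):
        """Store the request unless the url was already seen."""
        node = self.root
        for ch in url:                                  # O(len(url)) walk
            node = node.setdefault(ch, {})
        if None in node:                                # end marker: already seen
            return False
        node[None] = True                               # mark the url as seen
        top = len(self.rank_table) - 1
        level = max(0, min(int(rank * len(self.rank_table)), top))
        self.rank_table[level].append(url)
        return True

    def pop_highest(self):
        """Return a request from the highest non-empty rank level, or None."""
        for level in range(len(self.rank_table) - 1, -1, -1):
            if self.rank_table[level]:
                return self.rank_table[level].popleft()
        return None
```

The key `None` marks the end of a stored url; it cannot collide with a character key. Updating the rank of an already stored request is not modelled in this sketch.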
Frontier Structure - Overall
• The Frontier is based on the RankingTrie data structure
  • Saves or updates all newly forwarded requests in the RankingTrie
• When a new url request arrives, the frontier simply adds it to the RankingTrie
• When the frontier needs to route a request, it takes the highest-ranked request stored in the RankingTrie and routes it to the suitable worker queue
[Diagram: requests from the Frontier Queue are stored in the Ranking Trie, and Route Request moves them on to the Worker Queues]
Summary
• Goals achieved:
  • Understanding ranking methods
    • Especially Shark Search
  • Implementing a categorizing algorithm
  • Implementing an efficient frontier that supports ranking
  • Implementing a multithreaded Web Categorization Crawler with full functionality