Web Search – Summer Term 2006
IV. Web Search - Crawling
(c) Wolfgang Hürst, Albert-Ludwigs-University
General Web Search Engine Architecture (cf. [1], Fig. 1)
[Figure: components include the client (queries / results), the query engine and ranking, the WWW, the crawler(s) and crawl control, the page repository, the indexer module and the collection analysis module, the indexes (text, structure, utility), and usage feedback.]
Crawler (Robots, Spiders) - 1. Intro
Goal: Get web pages for indexing
Basic procedure (simplified):
1. Given: an initial set of URLs U (in some order)
2. Get the next URL u from U
3. Download the web page p(u)
4. Extract all URLs from p(u) and add them to U
5. Send p(u) to the indexer
6. Continue with 2. until U is empty (or some stop criterion is fulfilled)
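A minimal Python sketch of this basic loop, under the assumption that hypothetical helpers fetch_page, extract_urls, and index_page are supplied by the caller (they are not part of the lecture material):

```python
from collections import deque

def basic_crawl(seed_urls, fetch_page, extract_urls, index_page, max_pages=1000):
    """Simplified crawl loop: fetch a page, extract its links, index it, repeat."""
    frontier = deque(seed_urls)           # step 1: initial set of URLs U
    seen = set(seed_urls)                 # avoid enqueuing a URL twice
    while frontier and max_pages > 0:     # step 6: stop criterion
        url = frontier.popleft()          # step 2: next URL u from U
        page = fetch_page(url)            # step 3: download p(u)
        if page is None:
            continue
        for link in extract_urls(page):   # step 4: extract URLs, add them to U
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        index_page(url, page)             # step 5: send p(u) to the indexer
        max_pages -= 1
```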
1. Introduction (Cont.)
Problem: The web is too big and changes too fast
Page selection based on:
- Coverage (absolute vs. relative)
- Quality (e.g. index "good" pages)
- Efficiency (e.g. no duplicates)
- Etiquette (e.g. minimize server load)
- Freshness (How often to update? What to update?)
Pragmatic issues:
- Parallelization of the crawling process
- Parsing web pages
- Defending against spam
2. Page Selection Rules
Which pages should we download?
Goal: Download only "important" pages
Questions:
- How can we describe importance?
- How can we estimate importance?
- How can we judge the quality of different crawlers?
To answer these questions we need:
1. A mathematical model / measure of importance
2. A selection criterion that maximizes importance based on this measure
3. A measure to compare the performance of different crawlers
2.1 Importance Metrics
Interest-driven metric IS(P): Index pages that are of interest to your users
- Use the traditional vector model; problem: requires queries Q and estimated IDFs
- Alternatively: use a hierarchy of topics (topic estimated from the link structure)
Popularity-driven metric IB(P): Index popular pages
- Popularity based on (e.g.) backlinks or PageRank
Location-driven metric IL(P): Index based on local information (the URL)
- Examples: suffix (.com, .edu, ...), number of slashes, ...
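As an illustration of a location-driven metric, here is a toy Python scoring function; the specific features and weights below are made-up assumptions for the sketch, not values from the lecture:

```python
from urllib.parse import urlparse

def il_score(url):
    """Toy location-driven importance IL(P): score a URL from its form alone.
    Feature choices and weights are illustrative assumptions."""
    parsed = urlparse(url)
    score = 0.0
    if parsed.netloc.endswith((".edu", ".gov")):
        score += 1.0                       # prefer certain suffixes
    elif parsed.netloc.endswith(".com"):
        score += 0.5
    depth = parsed.path.count("/")         # fewer slashes = closer to the site root
    score += max(0.0, 1.0 - 0.2 * depth)
    return score
```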
2.2 Ordering Metrics
Goal: Sort the URLs in such a way that we end up with the most important subset of pages
Problem: Requires an estimate of the importance of the respective web page
For popularity-driven metrics IB(P):
- E.g. use the number of backlinks seen so far
For location-driven metrics IL(P):
- All required information is available!
For similarity-/interest-driven metrics IS(P):
- Needs queries, estimated IDFs, and a guess about the page's content (e.g. via the anchor text or the text surrounding the link)
Example of a Crawling Algorithm (see Figure 1 in [3])
ENQUEUE(URL_QUEUE, STARTING_URL);
WHILE (NOT EMPTY(URL_QUEUE)) {
    URL = DEQUEUE(URL_QUEUE);
    PAGE = CRAWL_PAGE(URL);
    ENQUEUE(CRAWLED_PAGES, (URL, PAGE));
    URL_LIST = EXTRACT_URLS(PAGE);
    FOR EACH U IN URL_LIST {
        ENQUEUE(LINKS, (URL, U));
        IF (U NOT IN URL_QUEUE) AND ((U, -) NOT IN CRAWLED_PAGES)
            THEN ENQUEUE(URL_QUEUE, U);
    }
    REORDER_QUEUE(URL_QUEUE);
}
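One possible realization of REORDER_QUEUE is to keep the frontier ordered by the popularity-driven ordering metric from 2.2, i.e. the number of backlinks seen so far as an estimate of IB(P). The Python sketch below is a hypothetical illustration of that idea, not the implementation used in [3]:

```python
from collections import defaultdict

class BacklinkFrontier:
    """Toy URL frontier that approximates IB(P) by the number of backlinks
    observed so far and always hands out the best-scoring URL next."""

    def __init__(self):
        self.backlinks = defaultdict(int)   # url -> backlinks seen so far
        self.enqueued = set()               # URLs waiting to be crawled

    def add_link(self, src, dst):
        """Record a link src -> dst and make dst available for crawling."""
        self.backlinks[dst] += 1
        self.enqueued.add(dst)

    def pop_best(self):
        """Return the enqueued URL with the most observed backlinks."""
        if not self.enqueued:
            return None
        best = max(self.enqueued, key=lambda u: self.backlinks[u])
        self.enqueued.remove(best)
        return best
```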
2.3 Quality Metrics
Quality metric to describe the performance of a crawler: distinguish two cases
1. Crawl & Stop: The crawler gets K pages
- Perfect crawler: delivers the K most important pages R1, R2, ..., RK with I(R1) ≥ I(R2) ≥ ... ≥ I(RK)
- Real crawler: only delivers M ≤ K of these pages Ri
Definition: Performance P of crawler C: P(C) = (M * 100) / K
Random crawler: expected P(C) = (K * 100) / T, with T = number of pages in the web
2. Crawl & Stop with Threshold: Define an importance target G and get pages with I(P) > G
(see e.g. [1], Section 2.1.2)
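A quick numerical illustration of the Crawl & Stop metric; the numbers below are made up for the example, not taken from the lecture:

```python
def crawl_and_stop_performance(m, k):
    """P(C) = (M * 100) / K: percentage of the K most important ('hot')
    pages that the crawler actually delivered (M of them)."""
    return 100.0 * m / k

def random_crawler_performance(k, t):
    """Expected performance of a random crawler picking K out of T pages."""
    return 100.0 * k / t

# Made-up example: K = 1,000 pages crawled, M = 650 of them are hot pages,
# and the web has T = 1,000,000 pages in total.
print(crawl_and_stop_performance(650, 1_000))       # 65.0
print(random_crawler_performance(1_000, 10**6))     # 0.1
```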
Example: Stanford WebBase Crawler
Database: 225,000 Stanford University web pages
Crawler: Stanford WebBase crawler (with different ordering metrics)
Importance metric: IB(P)
Quality metric: Crawl & Stop with Threshold
(see [1])
3. Page Refresh (Update Rules)
Problem: The web is continuously changing
Goal: Index and update pages in a way that keeps the index as fresh and young as possible (given the limited resources)
Distinguish between:
- Periodic crawlers: download K pages and stop; repeat after some time t and replace the old collection with the new one
- Incremental crawlers: continuously crawl the web and incrementally update the collection
3.1 Change Frequency of the Web
Experiment (Stanford) to answer the following questions:
- How long is the lifespan of a web page?
- How often do web pages change?
- How long does it take until (e.g.) 50% of all web pages have changed?
- Are there any mathematical models to describe these changes?
Experiment with a data set of:
- 720,000 pages from 270 sites
- ca. 3,000 pages per site ("window of pages")
- Sites selected based on popularity (PageRank) and only with the owner's permission
- Crawled once daily over 4 months
(Source of the following diagrams: Cho & Garcia-Molina [4])
3.1 Change Frequency of the Web
How often do web pages change?
[Charts: overall and per domain]
Observations:
- Pages change rather frequently
- Significant differences between domains (.com, .org, .edu, .gov)
3.1 Change Frequency of the Web
How long is the lifespan of a web page?
[Charts: overall and per domain]
Note: Only the "visible" lifespan is observed here
Two methods were used to estimate the lifespan over the 4 months
3.1 Change Frequency of the Web
How long does it take until (e.g.) 50% of all web pages have changed?
[Charts: per domain and overall]
Conclusion: Especially the clear differences between domains suggest taking the change frequency into account during crawling
3.1 Change Frequency of the Web
Are there mathematical models to describe the changes?
Assumption: Page changes follow a Poisson process
With this: estimate the probability that a page changes within a given time t
[Charts: change intervals of pages, for pages that change every 10 days on average and every 20 days on average]
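Under this Poisson-process assumption (the model used in [4]), a page whose average change rate is λ changes within an interval of length t with the probability sketched below:

```latex
% Poisson process model of page changes (cf. [4]):
% \lambda = average change rate of the page (changes per unit time),
% t = length of the observation interval.
P(\text{page changes within time } t) = 1 - e^{-\lambda t}
```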
References - Web Crawler
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001. Chapter 2 (Crawling web pages)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998. Chapter 4.3 (Crawling the web)
[3] J. Cho, H. Garcia-Molina, L. Page: "Efficient Crawling Through URL Ordering", WWW 1998
[4] J. Cho, H. Garcia-Molina: "The Evolution of the Web and Implications for an Incremental Crawler", Proceedings of the 26th Intl. Conf. on Very Large Data Bases (VLDB 2000)