
Web Search – Summer Term 2006 IV. Web Search - Crawling (part 2)

Presentation Transcript


  1. Web Search – Summer Term 2006, IV. Web Search - Crawling (part 2), (c) Wolfgang Hürst, Albert-Ludwigs-University

  2. Crawling - Recap from last time
  General procedure: Continuously process a list of URLs and collect the respective web pages and the links that come along.
  Two problems: size and frequent changes.
  Page selection: based on metrics, i.e.
  - Importance Metric (goal)
  - Ordering Metric (selection)
  - Quality Metric (evaluation)
  Experimental verification with a representative test collection.

  6. Crawling - Recap from last time (cont.)
  Page refresh: Estimating the rate of change: see last lecture (note: other studies exist, e.g. [5]).
  Observations:
  - Frequent changes
  - Significant differences, e.g. among domains
  Hence: an update rule is necessary.

  7. 3. Page Refresh (Update Rules)
  Problem: The web is continuously changing.
  Goal: Index and update pages in a way that keeps the index as fresh and young as possible (given the limited resources).
  Distinguish between (see the sketch below):
  - Periodic crawlers: Download K pages and stop; repeat this after some time t and replace the old collection with the new one.
  - Incremental crawlers: Continuously crawl the web and incrementally update your collection.
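
  The loop structure behind these two variants can be sketched roughly as follows; fetch(), extract_urls() and select_next() are assumed placeholder helpers standing in for the download, link-extraction and ordering components, not functions from the lecture.

```python
import time

def periodic_crawl(seed_urls, k, pause, fetch, extract_urls, select_next):
    """Periodic crawler: download K pages, stop, wait some time t (= pause),
    then crawl again and replace the old collection with the new one."""
    while True:
        frontier, new_collection = list(seed_urls), {}
        while frontier and len(new_collection) < k:
            url = select_next(frontier)        # assumed to remove and return one URL
            page = fetch(url)
            new_collection[url] = page
            frontier.extend(u for u in extract_urls(page) if u not in new_collection)
        yield new_collection                   # replaces the previous collection wholesale
        time.sleep(pause)

def incremental_crawl(seed_urls, collection, fetch, extract_urls, select_next):
    """Incremental crawler: never stops, keeps refining one and the same collection."""
    frontier = list(seed_urls)
    while True:
        url = select_next(frontier)            # assumed to remove and return one URL
        page = fetch(url)
        collection[url] = page                 # update or add the page in place
        frontier.extend(extract_urls(page))
```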

  8. 3.2 Incremental Crawlers
  Main goal: Keep the local collection up-to-date.
  Two measures: freshness and age.
  - Freshness of a page p_i at time t
  - Freshness of a local collection P at time t
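
  The two formulas were shown as images on the slide and did not survive the transcript; following the standard definitions from [6], they read (for a collection P of N pages):

```latex
F(p_i; t) =
  \begin{cases}
    1 & \text{if the local copy of } p_i \text{ is up-to-date at time } t,\\
    0 & \text{otherwise,}
  \end{cases}
\qquad
F(P; t) = \frac{1}{N} \sum_{i=1}^{N} F(p_i; t)
```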

  9. 3.2 Incremental Crawlers
  Main goal: Keep the local collection up-to-date.
  Two measures: freshness and age.
  - Age of a page p_i at time t
  - Age of a local collection P at time t
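
  Again the formulas are missing from the transcript; following [6], age measures how outdated a stale copy is, with t_mod(p_i) denoting the time of the first change not yet reflected in the local copy:

```latex
A(p_i; t) =
  \begin{cases}
    0 & \text{if the local copy of } p_i \text{ is up-to-date at time } t,\\
    t - t_{\mathrm{mod}}(p_i) & \text{otherwise,}
  \end{cases}
\qquad
A(P; t) = \frac{1}{N} \sum_{i=1}^{N} A(p_i; t)
```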

  10. 3.2 Incremental Crawlers
  Main goal: Keep the local collection up-to-date.
  Two measures: freshness and age.
  - Time average of the freshness of a page p_i at time t
  - Time average of the freshness of a local collection P at time t
  (Time average of age: analogous)
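
  The missing formulas, reconstructed along the lines of [6]: the time averages integrate the instantaneous values up to time t,

```latex
\bar{F}(p_i; t) = \frac{1}{t} \int_{0}^{t} F(p_i; \tau)\, d\tau ,
\qquad
\bar{F}(P; t) = \frac{1}{t} \int_{0}^{t} F(P; \tau)\, d\tau
```

  and [6] then considers the limit t → ∞ of these averages; the time average of age is defined analogously.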

  11. Example for Freshness and Age (figure, source: [6]): timelines showing when the element is changed and when it is synchronized, with the corresponding freshness (dropping from 1 to 0 on a change) and age (growing from 0 until the next synchronization).

  12. Design alternative 1: Batch-mode vs. steady crawler
  (figure: freshness over time, in months, for a batch-mode crawler and for a steady crawler)
  - Batch-mode crawler: periodic update of all pages of a collection
  - Steady crawler: continuous update
  Note: Assuming Poisson-distributed changes, one can prove that the average freshness over time is identical in both cases (for the same average crawling speed!).

  13. Design alternative 2: In-place vs. shadowing
  Replace the old version of a page with the new one either in place or via shadowing, i.e. only after all new versions of one crawl have been downloaded.
  Shadowing keeps two collections: the crawler's collection and the current collection.
  (figure: collection updates for a batch-mode crawler and for a steady crawler)
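
  A minimal sketch of the shadowing idea, assuming an in-memory dict as the collection and a hypothetical fetch() helper (neither is from the lecture): the crawler writes into a shadow copy while readers keep seeing the current collection, and the two are swapped in a single step once the crawl is complete.

```python
class ShadowedCollection:
    def __init__(self):
        self.current = {}    # what the index / queries see
        self._shadow = {}    # what the crawler writes into

    def crawl(self, urls, fetch):
        for url in urls:
            self._shadow[url] = fetch(url)   # `current` stays untouched

    def swap(self):
        # one switch after the whole crawl, instead of many in-place updates
        self.current, self._shadow = self._shadow, {}
```

  With in-place updates the same loop would write directly into `current`, so readers may see a mixture of old and new page versions while the crawl is still running.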

  14. Design alternative 3: Fixed vs. variable frequency
  - Fixed frequency / uniform refresh policy: the same access rate for all pages (independent of their actual rate of change)
  - Variable frequency: access pages depending on their rate of change; example: proportional refresh policy (see the sketch below)
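
  A small, hedged illustration of the difference (the page names, change rates and budget below are assumed values, not from the slide): a fixed budget of downloads per day is spread either uniformly or in proportion to each page's estimated rate of change.

```python
change_rates = {"a.html": 9.0, "b.html": 1.0, "c.html": 0.5}   # est. changes per day (assumed)
budget = 6.0                                                   # downloads per day (assumed)

# Uniform refresh policy: every page gets the same revisit frequency.
uniform = {url: budget / len(change_rates) for url in change_rates}

# Proportional refresh policy: revisit frequency proportional to change rate.
total = sum(change_rates.values())
proportional = {url: budget * rate / total for url, rate in change_rates.items()}

print(uniform)        # each page revisited 2.0 times per day
print(proportional)   # a.html ~5.14, b.html ~0.57, c.html ~0.29 times per day
```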

  15. Variable frequency update
  Obvious assumption for a good strategy: visit a page that changes frequently more often. Wrong!
  The optimum update strategy (if we assume Poisson-distributed changes) looks like this:
  (figure: optimum update time as a function of a page's rate of change)
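
  For reference, a hedged sketch of the reasoning behind the curve (the formula is not on the slide, but follows from the Poisson model used in [6]): a page with change rate λ that is synchronized with frequency f has an expected time-averaged freshness of

```latex
\bar{F}(\lambda, f) \;=\; \frac{f}{\lambda}\left(1 - e^{-\lambda/f}\right)
```

  Maximizing the sum of these terms over all pages under a fixed total download budget gives the optimum: the update frequency first grows with the rate of change, but for pages that change far faster than they can be revisited the freshness gained per download becomes so small that the optimum revisits them rarely, which is why the curve falls off again instead of growing proportionally.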

  16. Variable frequency update (cont.)
  Why is this a better strategy? Illustration with a simple example of two pages p1 and p2 (figure).
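
  A minimal simulation sketch of the two-page intuition (the concrete numbers are illustrative assumptions, not values from the lecture): p1 changes roughly nine times per day, p2 roughly once per day, and the crawler can afford exactly one download per day, always spent on the same page.

```python
import random

def average_freshness(change_rates, refreshed_page, days=2000, steps_per_day=100):
    """Average collection freshness when only `refreshed_page` is
    re-downloaded, once at the start of every day (Poisson-style changes)."""
    fresh = [True] * len(change_rates)
    total = 0.0
    dt = 1.0 / steps_per_day
    for _ in range(days):
        fresh[refreshed_page] = True              # spend the single daily download
        for _ in range(steps_per_day):
            for i, rate in enumerate(change_rates):
                if random.random() < rate * dt:   # page i changed in this step
                    fresh[i] = False
            total += sum(fresh) / len(fresh)
    return total / (days * steps_per_day)

random.seed(0)
rates = [9.0, 1.0]   # assumed change rates per day for p1 and p2
print("always refresh p1:", round(average_freshness(rates, 0), 3))   # roughly 0.06
print("always refresh p2:", round(average_freshness(rates, 1), 3))   # roughly 0.32
```

  Although p1 changes nine times as often, spending the budget on p2 keeps the collection markedly fresher: a freshly downloaded copy of p1 goes stale again almost immediately, so those downloads buy very little freshness.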

  17. Summary of different design alternatives
  - Steady vs. batch-mode crawler
  - In-place update vs. shadowing
  - Variable vs. fixed frequency

  18. 3.3 Expl. for an Incremental Crawler
  Two main goals:
  - Keep the local collection fresh: regular, best-possible updates of the pages in the index
  - Continuously improve the quality of the collection: replace existing pages of low quality with new pages of higher quality

  19. 3.3 Expl. for an Incremental Crawler
    WHILE (TRUE)
      URL = SELECT_TO_CRAWL (ALL_URLS);
      PAGE = CRAWL (URL);
      IF (URL IN COLL_URLS) THEN
        UPDATE (URL, PAGE)
      ELSE
        TMP_URL = SELECT_TO_DISCARD (COLL_URLS);
        DISCARD (TMP_URL);
        SAVE (URL, PAGE);
        COLL_URLS = (COLL_URLS - {TMP_URL}) U {URL}
      NEW_URLS = EXTRACT_URLS (PAGE);
      ALL_URLS = ALL_URLS U NEW_URLS;
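
  A hedged Python rendering of the loop above; select_to_crawl, select_to_discard, fetch and extract_urls are assumed stand-ins for the ranking, discard, download and link-extraction components, and the collection is modelled as a plain dict.

```python
def incremental_crawler(all_urls, coll_urls, collection,
                        select_to_crawl, select_to_discard,
                        fetch, extract_urls):
    """all_urls: set of all known URLs; coll_urls: set of URLs currently kept;
    collection: dict mapping kept URLs to their pages."""
    while True:
        url = select_to_crawl(all_urls)            # pick the next URL to (re)visit
        page = fetch(url)
        if url in coll_urls:
            collection[url] = page                 # refresh a page we already keep
        else:
            victim = select_to_discard(coll_urls)  # make room: drop a low-quality page
            coll_urls.discard(victim)
            collection.pop(victim, None)
            coll_urls.add(url)
            collection[url] = page
        all_urls.update(extract_urls(page))        # feed newly found links back in
```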

  20. 3.3 Expl. for an Incremental Crawler (architecture diagram): a RankingModule (add/remove, scan), an UpdateModule (checksum, update/save, discard, scan) and a CrawlModule (crawl, add_urls) operating on the ALL_URLS list (pop, push back), the COLL_URLS list and the collection.

  21. References - Web Crawler
  [1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001 - Chapter 2 (Crawling web pages)
  [2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998 - Chapter 4.3 (Crawling the web)
  [3] Cho, Garcia-Molina, Page: "Efficient Crawling Through URL Ordering", WWW 1998
  [4] Cho, Garcia-Molina: "The Evolution of the Web and Implications for an Incremental Crawler", Proceedings of the 26th Intl. Conf. on Very Large Data Bases (VLDB 2000)
  [5] Fetterly, Manasse, Najork, Wiener: "A Large-Scale Study of the Evolution of Web Pages", WWW 2003
  [6] Cho, Garcia-Molina: "Synchronizing a Database to Improve Freshness", ACM SIGMOD 2000
