Web Search – Summer Term 2006
IV. Web Search - Crawling (part 2)
(c) Wolfgang Hürst, Albert-Ludwigs-University
Crawling - Recap from last time
General procedure: Continuously process a list of URLs and collect the respective web pages and the links they contain.
Two problems: size and frequent changes.
Page selection: based on metrics, i.e.
- Importance metric (goal)
- Ordering metric (selection)
- Quality metric (evaluation)
Experimental verification with a representative test collection.
Page refresh: Estimating the rate of change: see last lecture (note: other studies exist, e.g. [5]).
Observations:
- Frequent changes
- Significant differences, e.g. among domains
Hence: an update rule is necessary.
3. Page Refresh (Update Rules)
Problem: The web is continuously changing.
Goal: Index and update pages in a way that keeps the index as fresh (and as young, i.e. with low age) as possible, given the limited resources.
Distinguish between:
- Periodic crawlers: Download K pages and stop; repeat this after some time t and replace the old collection with the new one.
- Incremental crawlers: Continuously crawl the web and incrementally update the collection.
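A minimal Python sketch, not the lecture's implementation, contrasting the two control flows; fetch_page and extract_links are hypothetical stubs standing in for the real download and link-extraction steps:

import time
from typing import Dict, List

def fetch_page(url: str) -> str:
    # Stub standing in for an HTTP download.
    return "<html>dummy content of %s</html>" % url

def extract_links(page: str) -> List[str]:
    # Stub standing in for link extraction from HTML.
    return []

def periodic_crawl(seeds: List[str], k: int, pause_seconds: float, rounds: int) -> Dict[str, str]:
    """Periodic crawler: download up to K pages, stop, wait for some time t,
    then rebuild the collection from scratch and replace the old one."""
    collection: Dict[str, str] = {}
    for _ in range(rounds):
        new_collection: Dict[str, str] = {}
        frontier = list(seeds)
        while frontier and len(new_collection) < k:
            url = frontier.pop(0)
            if url in new_collection:
                continue
            page = fetch_page(url)
            new_collection[url] = page
            frontier.extend(extract_links(page))
        collection = new_collection            # old collection replaced as a whole
        time.sleep(pause_seconds)
    return collection

def incremental_crawl(seeds: List[str], steps: int) -> Dict[str, str]:
    """Incremental crawler: one endless loop; each step refreshes an existing
    page or adds a newly discovered one, and the collection is never thrown away."""
    collection: Dict[str, str] = {}
    frontier: List[str] = list(seeds)
    for _ in range(steps):                     # 'while True' in a real crawler
        if not frontier:
            frontier = sorted(collection)      # nothing new to visit: revisit known pages
        url = frontier.pop(0)
        page = fetch_page(url)
        collection[url] = page                 # update in place
        frontier.extend(extract_links(page))
    return collection

print(len(periodic_crawl(["http://example.org/"], k=5, pause_seconds=0.0, rounds=2)), "page(s), periodic")
print(len(incremental_crawl(["http://example.org/"], steps=5)), "page(s), incremental")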
3.2 Incremental Crawlers
Main goal: keep the local collection up-to-date
Two measures: Freshness and Age
- Freshness of a page p_i at time t
- Freshness of a local collection P at time t
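As defined in [6]:

F(p_i; t) = \begin{cases} 1 & \text{if } p_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}

F(P; t) = \frac{1}{N} \sum_{i=1}^{N} F(p_i; t)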
3.2 Incremental Crawlers
Main goal: keep the local collection up-to-date
Two measures: Freshness and Age
- Age of a page p_i at time t
- Age of a local collection P at time t
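Following the definitions in [6], where t_mod(p_i) denotes the time of the first change of p_i that has not yet been synchronized:

A(p_i; t) = \begin{cases} 0 & \text{if } p_i \text{ is up-to-date at time } t \\ t - t_{mod}(p_i) & \text{otherwise} \end{cases}

A(P; t) = \frac{1}{N} \sum_{i=1}^{N} A(p_i; t)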
3.2 Incremental Crawlers
Main goal: keep the local collection up-to-date
Two measures: Freshness and Age
- Time average of the freshness of a page p_i
- Time average of the freshness of a local collection P
(Time average of age: analogous)
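As in [6]:

\bar{F}(p_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(p_i; \tau)\, d\tau

\bar{F}(P) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(P; \tau)\, d\tau

(analogously \bar{A}(p_i) and \bar{A}(P) for age)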
Example for Freshness and Age (figure, source: [6]): while the element is synchronized, its freshness is 1 and its age is 0; when the element changes, freshness drops to 0 and age grows linearly until the next synchronization.
Design alternative 1: Batch mode vs. steady crawler
(Figure: freshness over time in months for a batch-mode crawler vs. a steady crawler.)
- Batch-mode crawler: periodic update of all pages of the collection
- Steady crawler: continuous update
Note: Assuming Poisson-distributed page changes, one can prove that the average freshness over time is identical in both cases (for the same average crawling speed!).
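A small simulation sketch of this claim, not from the lecture: the page count, the change rate of 2 changes per month, the time step, and the idealization that a batch crawl happens instantaneously at each month boundary are assumptions made here. Both crawlers synchronize every page once per month on average, and their time-averaged freshness comes out essentially equal (about (1 - e^{-2})/2, roughly 0.43), even though the batch crawler's freshness oscillates over the month while the steady crawler's stays roughly constant.

import random

def average_freshness(sync_times_for_page, n_pages=100, change_rate=2.0,
                      months=24, dt=0.002, seed=42):
    """Discrete-time simulation: every page changes according to a Poisson process
    with `change_rate` changes per month; sync_times_for_page(i) returns the times
    (in months) at which page i is synchronized. Returns the time-averaged
    freshness of the whole collection."""
    rng = random.Random(seed)
    schedules = [sorted(sync_times_for_page(i)) for i in range(n_pages)]
    next_sync = [0] * n_pages
    fresh = [True] * n_pages              # all local copies start out fresh
    steps = int(months / dt)
    total = 0.0
    for step in range(steps):
        t = step * dt
        for i in range(n_pages):
            # process any synchronization scheduled up to time t
            while next_sync[i] < len(schedules[i]) and schedules[i][next_sync[i]] <= t:
                fresh[i] = True
                next_sync[i] += 1
            # a change occurs in this small interval with probability ~ rate * dt
            if rng.random() < change_rate * dt:
                fresh[i] = False
        total += sum(fresh) / n_pages
    return total / steps

N, MONTHS = 100, 24
# Batch-mode crawler: all pages synchronized at the start of each month.
batch = average_freshness(lambda i: [m for m in range(MONTHS)], n_pages=N, months=MONTHS)
# Steady crawler: the same number of synchronizations, spread evenly over each month.
steady = average_freshness(lambda i: [m + i / N for m in range(MONTHS)], n_pages=N, months=MONTHS)
print("time-averaged freshness, batch-mode:", round(batch, 3))
print("time-averaged freshness, steady:    ", round(steady, 3))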
Design alternative 2: In-place vs. shadowing
Replace the old version of a page with the new one either in-place or via shadowing, i.e. only after all pages of one crawl have been downloaded.
Shadowing keeps two collections: the crawler's collection and the current collection.
(Figure: shadowing in a batch-mode crawler vs. a steady crawler.)
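A minimal sketch of the difference, using a dict as the collection and a hypothetical fetch function; illustrative only, not the lecture's implementation:

from typing import Callable, Dict, Iterable

def update_in_place(collection: Dict[str, str], url: str,
                    fetch: Callable[[str], str]) -> None:
    """In-place: the live collection is modified as soon as a page is downloaded."""
    collection[url] = fetch(url)

def crawl_with_shadowing(urls: Iterable[str],
                         fetch: Callable[[str], str]) -> Dict[str, str]:
    """Shadowing: downloads go into a separate crawler's collection; the current
    collection is swapped out only after the whole crawl has finished."""
    shadow: Dict[str, str] = {}
    for url in urls:
        shadow[url] = fetch(url)
    return shadow

# Usage sketch:
fetch = lambda u: "new content of " + u
current = {"http://example.org/a": "old content"}
update_in_place(current, "http://example.org/a", fetch)          # immediate replacement
current = crawl_with_shadowing(["http://example.org/a",
                                "http://example.org/b"], fetch)  # swap after full crawl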
Design alternative 3: Fixed vs. variable frequency
- Fixed frequency / uniform refresh policy: same access rate for all pages (independent of their actual rate of change)
- Variable frequency: access pages depending on their rate of change; example: proportional refresh policy
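In symbols, with \lambda_i the change rate of page p_i and f_i its refresh rate (a plain restatement of the two policies, not from the slide):

f_i = f \quad \text{for all pages } p_i \qquad \text{(uniform / fixed-frequency policy)}

f_i = c \cdot \lambda_i \quad \text{for some constant } c, \ \text{i.e. } f_i / \lambda_i \text{ is the same for all pages} \qquad \text{(proportional policy)}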
Variable frequency update
Obvious assumption for a good strategy: visit a page that changes frequently more often. Wrong!
The optimum update strategy (assuming Poisson-distributed changes) looks like this:
(Figure: optimum update time plotted against the rate of change of a page.)
Variable frequency update (cont.)
Why is this a better strategy? Illustration with a simple example (figure: two pages p1 and p2 with different change rates).
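One way to see the intuition, a simplified version of the argument in [6] with numbers chosen here for illustration: suppose p_1 changes on average once per day, p_2 changes once per second, and the crawler can download only one page per day.

- Refreshing p_2: the new copy is expected to stay fresh for only about one second before p_2 changes again, so its contribution to the day's average freshness is roughly 1/86400 \approx 10^{-5}.
- Refreshing p_1 at the start of the day: with Poisson changes at rate \lambda = 1 per day, the copy is still fresh at time t with probability e^{-t}, so the expected fraction of the day it stays fresh is \int_0^1 e^{-t}\, dt = 1 - e^{-1} \approx 0.63.

The single download is therefore far better spent on p_1. Pages that change much faster than the crawler can revisit them contribute almost nothing to freshness, which is why the optimum update frequency does not simply grow with the rate of change.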
Summary of different design alternatives:
- Steady vs. batch-mode
- In-place update vs. shadowing
- Variable frequency vs. fixed frequency
3.3 Example of an Incremental Crawler
Two main goals:
- Keep the local collection fresh: regular, best-possible updates of the pages in the index
- Continuously improve the quality of the collection: replace existing low-quality pages with new pages of higher quality
3.3 Example of an Incremental Crawler

WHILE (TRUE)
  URL = SELECT_TO_CRAWL(ALL_URLS);
  PAGE = CRAWL(URL);
  IF (URL IN COLL_URLS) THEN
    UPDATE(URL, PAGE)
  ELSE
    TMP_URL = SELECT_TO_DISCARD(COLL_URLS);
    DISCARD(TMP_URL);
    SAVE(URL, PAGE);
    COLL_URLS = (COLL_URLS - {TMP_URL}) U {URL}
  NEW_URLS = EXTRACT_URLS(PAGE);
  ALL_URLS = ALL_URLS U NEW_URLS;
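A runnable Python sketch of the same loop; the selection, crawling, link-extraction, and quality functions are hypothetical stubs, and the endless loop is capped at a fixed number of steps so the sketch terminates:

import random
from typing import Dict, Set

def select_to_crawl(all_urls: Set[str]) -> str:
    # Stub for the ranking module: pick the URL deemed most valuable to (re)visit.
    return random.choice(sorted(all_urls))

def crawl(url: str) -> str:
    # Stub for the crawl module: download the page behind `url`.
    return "<html>content of %s</html>" % url

def select_to_discard(coll_urls: Set[str]) -> str:
    # Stub: pick the lowest-quality page currently in the collection.
    return random.choice(sorted(coll_urls))

def extract_urls(page: str) -> Set[str]:
    # Stub for link extraction.
    return set()

def incremental_crawler(seed_urls: Set[str], max_size: int, steps: int) -> Dict[str, str]:
    all_urls: Set[str] = set(seed_urls)   # every URL the crawler knows about
    coll_urls: Set[str] = set()           # URLs whose pages are kept in the collection
    collection: Dict[str, str] = {}       # URL -> stored page
    for _ in range(steps):                # 'WHILE (TRUE)' in the pseudocode
        url = select_to_crawl(all_urls)
        page = crawl(url)
        if url in coll_urls:
            collection[url] = page        # UPDATE: refresh the stored copy
        else:
            # The pseudocode always discards one page to make room; here we only
            # do so once the collection has reached its size limit, so the sketch
            # also works while the collection is still filling up.
            if len(coll_urls) >= max_size:
                victim = select_to_discard(coll_urls)
                coll_urls.discard(victim)         # DISCARD
                collection.pop(victim, None)
            collection[url] = page                # SAVE
            coll_urls.add(url)
        all_urls |= extract_urls(page)            # ALL_URLS = ALL_URLS U NEW_URLS
    return collection

# Usage sketch:
pages = incremental_crawler({"http://example.org/"}, max_size=100, steps=10)
print(len(pages), "page(s) in the collection")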
3.3 Example of an Incremental Crawler
(Architecture figure: a Ranking Module, an Update Module, and a Crawl Module operate on the URL lists ALL_URLS and COLL_URLS and on the page collection; operations shown include scan, pop, push back, add/remove, discard, checksum comparison, crawl, update/save, and add_urls.)
References - Web Crawler
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001 - Chapter 2 (Crawling web pages)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998 - Chapter 4.3 (Crawling the web)
[3] Cho, Garcia-Molina, Page: "Efficient Crawling Through URL Ordering", WWW 1998
[4] Cho, Garcia-Molina: "The Evolution of the Web and Implications for an Incremental Crawler", Proceedings of the 26th Intl. Conf. on Very Large Data Bases (VLDB 2000)
[5] Fetterly, Manasse, Najork, Wiener: "A Large-Scale Study of the Evolution of Web Pages", WWW 2003
[6] Cho, Garcia-Molina: "Synchronizing a Database to Improve Freshness", ACM SIGMOD 2000