(c) Wolfgang Hürst, Albert-Ludwigs-University

Web Search – Summer Term 2006. III. Web Search - Introduction (Cont.). Jeff Dean, Google's Systems Lab: http://www.researchchannel.org/prog/displayevent.asp?rid=2459

Presentation Transcript


  1. Web Search – Summer Term 2006
     III. Web Search - Introduction (Cont.)
     - Jeff Dean, Google's Systems Lab: http://www.researchchannel.org/prog/displayevent.asp?rid=2459
     (c) Wolfgang Hürst, Albert-Ludwigs-University

  2. IR vs. Web Search
     [Diagram labels: INFORMATION, INFORMATION NEED, DATA / DOCUMENTS, QUERY]
     The initial problem is similar to traditional IR ...
     ... but basic conditions and characteristics differ significantly:
     - The number of users is huge. Very huge.
     - The web is huge. Very huge.
     - Big variety in users
     - Big variety in data
     - Users don't cooperate (short queries, ...)
     - Document authors don't cooperate (spam, ...)

  3. Classic IR vs. Web Search: Documents
     - Huge amount of data, continuous growth, high rate of change
     - Huge variability and heterogeneity
       - Quality, credibility and reputation of the source
       - Static vs. dynamic docs
       - Different media types (text, pics, audio, video)
       - Different formats (HTML, Flash, PDF, ...)
       - Miscellaneous topics
       - Continuous text vs. note form / keywords
       - Different languages, encodings
     - Spam and advertisements
     - Web-specific characteristics
       - Hypertext, linking
       - Broken links
       - Unstructured, not always conforming to standards
     - Redundancy (syntactic and semantic)
     - Distributed (documents need to be collected automatically)
     - Different popularity and access frequency

  4. Classic IR vs. Web Search: Users
     - Different needs and aims, e.g. users might want
       - to learn something ("informational")
       - to go to a particular site ("navigational")
       - to do something, e.g. shopping, download, ... ("transactional")
       - to do other, miscellaneous things, e.g. finding hubs, "exploratory search", ...
     - Different premises, qualifications, languages, ...
     - Different network connections / bandwidths
     - Imprecise, unspecific queries: short, ambiguous, inexact, incorrect, no usage of operators or special syntax
     Classic IR vs. Web Search: Bottom line
     - Different characteristics that cause lots of problems
     - But there's also good news: we can take advantage of some of these characteristics (e.g. links, statistics, ...)

  5. References
     [1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001. Chapter 1 (Introduction, general architecture)
     [2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998. Chapter 1 (Introduction), Chapter 4.1 (Google Architecture Overview)

  6. General Web Search Engine Architecture (cf. [1], Fig. 1)
     [Diagram components: WWW, CRAWLER(S), CRAWL CONTROL, PAGE REPOSITORY, INDEXER MODULE, COLLECTION ANALYSIS MODULE, INDEXES (UTILITY, STRUCTURE, TEXT), QUERY ENGINE, RANKING, CLIENT (QUERIES, RESULTS), USAGE FEEDBACK]
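The component names on this slide can be read as a pipeline: crawlers fill the page repository, the indexer module builds indexes from it, and the query engine answers client queries with a ranked result list. The following is a minimal Python sketch of that data flow under strong assumptions: `TOY_WEB`, the `example` URLs, and all function names are hypothetical, only a text index is built, and the ranking is a naive term-frequency sum rather than anything described in [1].

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for the WWW: URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("web search engines crawl the web",
                          ["http://b.example/"]),
    "http://b.example/": ("an index maps terms to pages",
                          ["http://a.example/", "http://c.example/"]),
    "http://c.example/": ("ranking orders search results", []),
}


def crawl(seed_urls):
    """Crawler: follow links from the seeds and fill the page repository."""
    repository, frontier = {}, list(seed_urls)
    while frontier:
        url = frontier.pop(0)
        if url in repository or url not in TOY_WEB:
            continue  # already stored, or unknown in the toy web
        text, links = TOY_WEB[url]
        repository[url] = text
        frontier.extend(links)
    return repository


def tokenize(text):
    """Very small term processing step: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_text_index(repository):
    """Indexer module: map each term to {url: term frequency}."""
    index = defaultdict(dict)
    for url, text in repository.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][url] = tf
    return index


def search(index, query):
    """Query engine with a naive ranking: sum of matching term frequencies."""
    scores = Counter()
    for term in tokenize(query):
        for url, tf in index.get(term, {}).items():
            scores[url] += tf
    # Highest score first; ties broken by URL for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [url for url, _ in ranked]


repository = crawl(["http://a.example/"])
index = build_text_index(repository)
results = search(index, "web search")
```

Crawl control, the collection analysis module, the structure and utility indexes, and usage feedback are left out of the sketch; in the figure they steer the crawler and feed additional signals into ranking.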

  7. Recap: IR System & Tasks Involved
     [Diagram labels: INFORMATION NEED, User Interface, QUERY, QUERY PROCESSING (PARSING & TERM PROCESSING), LOGICAL VIEW OF THE INFORM. NEED, DOCUMENTS, SELECT DATA FOR INDEXING, PARSING & TERM PROCESSING, INDEX, SEARCHING, RANKING, RESULTS, DOCS., RESULT REPRESENTATION, PERFORMANCE EVALUATION]

  8. The Google Search Engine
     - Founded 1998 (1996) by two Stanford students
     - Originally an academic / research project that later became a commercial tool
     - Distinguishing features (then!?):
       - Special (and better) ranking
       - Speed
       - Size

  9. Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)
     [Diagram components: URL SERVER, CRAWLERS, STORE SERVER, REPOSITORY, INDEXER, ANCHORS, URL RESOLVER, LINKS, PAGERANK, DOC INDEX, BARRELS, SORTERS, LEXICON, DUMPLEXICON, SEARCHER]
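The slide only names the components. According to [2], the indexer writes word occurrences ("hits") into barrels ordered by document ID, and the sorters re-sort the barrels by word ID to obtain the inverted index used by the searcher. The following Python sketch illustrates just that re-sorting step. The data layout, the IDs, and the function name `invert` are assumptions made for illustration and do not reproduce the on-disk formats of the original system.

```python
from itertools import groupby

# Hypothetical forward barrel: doc_id -> list of (word_id, position) hits.
forward_barrel = {
    1: [(10, 0), (20, 1), (10, 2)],
    2: [(20, 0), (30, 1)],
}


def invert(barrel):
    """Re-sort a doc_id-ordered barrel by word_id (the sorter's job)."""
    # Flatten to (word_id, doc_id, position) tuples and sort by word_id first.
    postings = sorted(
        (word_id, doc_id, pos)
        for doc_id, hits in barrel.items()
        for word_id, pos in hits
    )
    inverted = {}
    # Group consecutive tuples that share a word_id into one posting list.
    for word_id, group in groupby(postings, key=lambda p: p[0]):
        doclist = {}
        for _, doc_id, pos in group:
            doclist.setdefault(doc_id, []).append(pos)
        inverted[word_id] = doclist
    return inverted


inverted_index = invert(forward_barrel)
```

With this layout, looking up a word ID returns every document that contains the word together with the positions where it occurs, which is what a searcher needs to answer a query term.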

  10. Schedule
      Web Search:
      - Introduction
      - Crawling
      - Page Repository
      - Indexing
      - Ranking (PageRank, HITS)
      - Exercises for web search basics
      - Advanced / additional web search topics
      In parallel:
      - Programming project (Lucene)

