How to Build a Search Engine 樂倍達數位科技股份有限公司 范綱岷(Kung-Ming Fung) kmfung@doubleservice.com 2008/04/01
Outline • Introduction • Different Kinds of Search Engine • Architecture • Robot, Spider, Crawler • HTML and HTTP • Indexing • Keyword Search • Evaluation Criteria • Related Work • Discussion • About Google • Ajax: A New Approach to Web Applications • References
Different Kinds of Search Engine • Directory Search • Full Text Search • Web pages • News • Images • … • Meta Search
Number of pages indexed: Directory < Full-text < Meta • Directory search • ODP: Open Directory Project, http://dmoz.org/ • Full-text search • Google, http://www.google.com/
Meta search • MetaCrawler, http://www.metacrawler.com/ • Aibang (愛幫), http://www.aibang.com/
Simplified control flow of the meta search engine. Reference: Context and Page Analysis for Improved Web Search, http://www.neci.nec.com/~lawrence/papers.html
Architecture • Simple architecture: WWW → Robot/Spider/Crawler → Database → Indexing → Keyword Search
Typical high-level architecture of a Web crawler Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
Typical anatomy of a large-scale crawler. Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.
High Level Google Architecture Reference: A Survey On Web Information Retrieval Technologies
The architecture of a standard meta search engine. Reference: Web Search – Your Way
The architecture of a meta search engine. Reference: Web Search – Your Way
Cyclic architecture for search engines Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
Robot, Spider, Crawler • The robot is the software component of a search engine responsible for data collection; it is also known as a spider or crawler. It automatically and periodically collects web pages from sites within a configured time frame, typically starting from a set of predefined seed sites and traversing the sites they link to, repeating this recursively to chain the collection together. • A major performance stress is DNS lookup.
Goal • Resolving the hostname in the URL to an IP address using DNS (Domain Name System). • Connecting a socket to the server and sending the request. • Receiving the requested page in response. A minimal sketch of these steps follows.
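A minimal sketch of these three steps in Python, using only the standard library (the host, path, and timeout below are illustrative assumptions):

import socket

def fetch(host, path="/", port=80):
    # Step 1: resolve the hostname to an IP address via DNS.
    ip = socket.gethostbyname(host)
    # Step 2: connect a socket to the server and send an HTTP/1.0 request.
    with socket.create_connection((ip, port), timeout=10) as sock:
        request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
        sock.sendall(request.encode("ascii"))
        # Step 3: receive the requested page in response.
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # server closed the connection: response complete
                break
            chunks.append(data)
    return b"".join(chunks)       # raw response: status line, headers, then HTML

# Usage (hypothetical host):
# print(fetch("www.example.com")[:200])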
Amount of static and dynamic pages at a given depth (dynamic pages: 5 levels; static pages: 15 levels). Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
Policy • A selection policy that states which pages to download. • A re-visit policy that states when to check for changes to the pages. • A politeness policy that states how to avoid overloading Web sites. • A parallelization policy that states how to coordinate distributed Web crawlers. Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
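As a sketch of how the politeness policy above can be enforced, the frontier below keeps one queue per host and refuses to return a URL for a host that was contacted too recently (the 5-second delay and the host-keying scheme are illustrative assumptions, not from the slides):

import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteFrontier:
    """URL frontier enforcing a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)   # host -> URLs waiting to be crawled
        self.last_fetch = {}               # host -> time of the last request

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        # Pick any host whose politeness delay has already elapsed.
        now = time.monotonic()
        for host, pending in self.queues.items():
            if pending and now - self.last_fetch.get(host, 0.0) >= self.delay:
                self.last_fetch[host] = now
                return pending.popleft()
        return None  # every host was contacted too recently; caller should wait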
The view of a Web crawler. Reference: Structural abstractions of hypertext documents for Web-based retrieval
Flow of a basic sequential crawler Reference: Crawling the Web.
A multi-threaded crawler model Reference: Crawling the Web.
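A compact sketch of this model: several worker threads share one thread-safe frontier queue and one visited set. Here fetch_page() is an assumed helper that downloads a URL and returns its HTML text (the socket sketch above, plus header stripping and decoding), and extract_links() is sketched under "HTML Tag Tree" below; both names are our own.

import queue
import threading

def crawl_worker(frontier, seen, seen_lock):
    """One crawler thread: fetch a page, extract its links, enqueue unseen URLs."""
    while True:
        try:
            url = frontier.get(timeout=5)   # give up once the frontier stays empty
        except queue.Empty:
            return
        try:
            html = fetch_page(url)          # assumed helper returning HTML text
            for link in extract_links(html, base_url=url):
                with seen_lock:
                    if link not in seen:
                        seen.add(link)
                        frontier.put(link)
        finally:
            frontier.task_done()

def crawl(seeds, num_threads=8):
    frontier = queue.Queue()
    seen, seen_lock = set(seeds), threading.Lock()
    for url in seeds:
        frontier.put(url)
    workers = [threading.Thread(target=crawl_worker, args=(frontier, seen, seen_lock))
               for _ in range(num_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()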
HTML and HTTP • HTML – Hypertext Markup Language • HTTP – Hypertext Transfer Protocol • TCP – Transmission Control Protocol • HTTP is built on top of TCP. • Hyperlink • A hyperlink is expressed as an anchor tag with an href attribute. • <a href="http://www.ntust.edu.tw/">NTUST</a> • URL – Uniform Resource Locator (http://www.ntust.edu.tw/)
GET / HTTP/1.0

HTTP/1.1 200 OK
Date: Sat, 13 Jan 2001 09:01:02 GMT
Server: Apache/1.3.0 (Unix) PHP/3.0.4
Last-Modified: Wed, 20 Dec 2000 13:18:38 GMT
Accept-Ranges: bytes
Content-Length: 5437
Connection: Close
Content-Type: text/html

<html>
<head>
<title>NTUST</title>
</head>
<body>
…
</body>
</html>
For checking a URL Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
Operation of a crawler. Reference: Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering.
Get new URLs Reference: Crawling on the World Wide Web.
HTML Tag Tree Reference: Crawling the Web.
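To show how a crawler parses a page into tags and harvests its hyperlinks, here is a minimal link extractor built on Python's standard html.parser module (the extract_links name is our own):

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every anchor tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url=""):
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page's own URL.
    return [urljoin(base_url, link) for link in parser.links]

# extract_links('<a href="http://www.ntust.edu.tw/">NTUST</a>')
# -> ['http://www.ntust.edu.tw/']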
Strategies • Breadth-first • Backlink-count • Batch-pagerank • Partial-pagerank • OPIC(On-line Page Importance Computation ) • Larger-sites-first Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
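Breadth-first is the simplest of these orderings: the frontier is a FIFO queue, so pages are visited level by level outward from the seeds. A sketch (get_outlinks is an assumed helper that fetches a page and returns its links):

from collections import deque

def breadth_first_order(seeds, get_outlinks, max_pages=1000):
    """Visit pages breadth-first from the seed URLs; return the visit order."""
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()          # FIFO queue => breadth-first ordering
        order.append(url)
        for link in get_outlinks(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order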
Re-visit policy • Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as: F_p(t) = 1 if p is equal to the local copy at time t, and 0 otherwise. Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.
Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as: A_p(t) = 0 if p is not modified at time t, and t - (modification time of p) otherwise. Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.
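Transcribed directly into code (a literal reading of the two definitions, with timestamps as plain numbers):

def freshness(local_copy, live_copy):
    # 1 if the repository copy equals the live page at time t, 0 otherwise.
    return 1 if local_copy == live_copy else 0

def age(t, modification_time=None):
    """Age of the local copy of p at time t.
    modification_time is when the live page changed after our copy was taken;
    None means it has not changed, so the copy is not outdated yet."""
    return 0 if modification_time is None else t - modification_time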
Robot Exclusion http://www.robotstxt.org/wc/exclusion.html • The robots exclusion protocol • The robots META tag
The Robots Exclusion Protocol - /robots.txt • Where to create the robots.txt file? In the top-level directory of the web server, so that it sits at the root of the site. Example: for http://www.w3.org/, the file lives at http://www.w3.org/robots.txt.
URLs are case-sensitive, and "/robots.txt" must be all lower-case. • Examples: • To exclude all robots from the entire server:
User-agent: *
Disallow: /
• To exclude all robots from part of the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
To exclude a single robot:
User-agent: BadBot
Disallow: /
• To allow a single robot (and exclude all others):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
To exclude all files except one: the original protocol has no Allow field, so the easy way is to put all files to be disallowed into a separate directory and disallow that directory:
User-agent: *
Disallow: /~joe/docs/
Alternatively, explicitly disallow each page:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
A sample robots.txt file:
# AltaVista Search
User-agent: AltaVista Intranet V2.0 W3C Webreq
Disallow: /Out-Of-Date/

# Exclude some access-controlled areas
User-agent: *
Disallow: /Team/
Disallow: /Project/
Disallow: /Systems/
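Python's standard library ships a parser for this protocol, urllib.robotparser, so a crawler can check every URL before fetching it (the user-agent string and URLs below are illustrative):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.w3.org/robots.txt")   # the site's robots.txt location
rp.read()                                    # download and parse the file

# A polite crawler asks before every fetch:
if rp.can_fetch("MyCrawler", "http://www.w3.org/Team/index.html"):
    print("allowed: fetch the page")
else:
    print("disallowed: skip it")   # with the sample file above, /Team/ is disallowed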
The Robots META Tag • <meta name="robots" content="noindex,nofollow"> • Like any META tag it should be placed in the HEAD section of an HTML page:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>...
Examples: • <meta name="robots" content="index,follow"> • <meta name="robots" content="noindex,follow"> • <meta name="robots" content="index,nofollow"> • <meta name="robots" content="noindex,nofollow"> • INDEX: whether an indexing robot should index the page • FOLLOW: whether a robot should follow the links on the page • The defaults are INDEX and FOLLOW.
Indexing • In general, an index is built by storing every word or phrase that appears in a page into a keyword index file. Besides the page content itself, the keywords that page authors define in meta tags are often included in the index as well. • TF, IDF, reverse (inverted) index • Stop words
(b) is an inverted index of (a). Reference: Supporting web query expansion efficiently using multi-granularity indexing and query processing
d1: My(1) care(2) is(3) loss(4) of(5) care(6) with(7) old(8) care(9) done(10). • d2: Your(1) care(2) is(3) gain(4) of(5) care(6) with(7) new(8) care(9) won(10). • tid: token ID • did: document ID • pos: position Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.
Variant 1 (document-level index):
my -> d1
care -> d1; d2
is -> d1; d2
loss -> d1
of -> d1; d2
with -> d1; d2
old -> d1
done -> d1
your -> d2
gain -> d2
new -> d2
won -> d2

Variant 2 (positional index, term -> document/positions):
my -> d1/1
care -> d1/2,6,9; d2/2,6,9
is -> d1/3; d2/3
loss -> d1/4
of -> d1/5; d2/5
with -> d1/7; d2/7
old -> d1/8
done -> d1/10
your -> d2/1
gain -> d2/4
new -> d2/8
won -> d2/10

Two variants of the inverted index data structure. Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.
Usually stored on disk • Implemented using a B-tree or a hash table
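A sketch of how the positional variant can be built in Python; the in-memory dict stands in for the on-disk B-tree or hash table:

from collections import defaultdict

def build_positional_index(docs):
    """docs: document ID -> text. Returns term -> {doc ID -> [positions]}, 1-based."""
    index = defaultdict(lambda: defaultdict(list))
    for did, text in docs.items():
        for pos, token in enumerate(text.lower().split(), start=1):
            term = token.strip(".,")      # crude normalization, enough for the sketch
            index[term][did].append(pos)
    return index

docs = {
    "d1": "My care is loss of care with old care done.",
    "d2": "Your care is gain of care with new care won.",
}
index = build_positional_index(docs)
# index["care"] -> {"d1": [2, 6, 9], "d2": [2, 6, 9]}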
Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled. Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.
Keyword Search • The retrieval software is the key factor in whether a search engine sees widespread use: users can mostly judge a system only by its search speed and search results, and both fall within the retrieval software's scope. • Artificial intelligence, natural language • Ranking: PageRank, HITS • Query Expansion
WAIS: • The Wide Area Information System (WAIS) is a software suite that builds full-text indexes and provides full-text retrieval over network resources; it consists of three main parts: a server, a client, and a protocol. • Query modes: • Keyword • Concept-based • Fuzzy • Natural language
PageRank • A page can have a high PageRank if there are many pages pointing to it, or if there are some pages pointing to it that themselves have a high PageRank. Reference: A Survey On Web Information Retrieval Technologies
We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. Google usually sets d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1 - d) + d * (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) Reference: A Survey On Web Information Retrieval Technologies
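A small power-iteration sketch of exactly this formula over an explicit link graph (the example graph and iteration count are illustrative):

def pagerank(graph, d=0.85, iterations=50):
    """graph: page -> list of pages it links to.
    Iterates PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))."""
    pr = {page: 1.0 for page in graph}
    for _ in range(iterations):
        incoming = {page: 0.0 for page in graph}
        for t, outlinks in graph.items():
            if outlinks:                       # C(T): number of links going out of T
                share = pr[t] / len(outlinks)
                for target in outlinks:
                    if target in incoming:
                        incoming[target] += share
        pr = {page: (1 - d) + d * incoming[page] for page in graph}
    return pr

# Example (hypothetical three-page web):
# pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]})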