Introduction to Search Engines Institute of Computer Science and Information Engineering, National Chung Cheng University Associate Professor 吳昇 (Sheng Wu, sw@cs.ccu.edu.tw)
What is a search engine? • A web service site for Internet users to find information in the Internet cyberspace • The software that provides the web search service
Uses of search engines? • Search for the URL of a company/website • Look for the contact info of a person or an organization • Search for information related to a term, e.g., to collect information about 櫻花鉤吻鮭 (the Formosan landlocked salmon) • Look for news regarding a given topic • Treat the search engine as a big dictionary • ...
Types of search engines • Directory browse/search • Web page search • USENET news search • FTP search • People/organization search • Daily-life information search • Library search • Commercial product search
Example search engines • Yahoo, Google, AltaVista, MSN, Excite, Lycos, ... • YAM, Kimo, PCHome • GAIS, Openfind, ... • DejaNews, Archie, ...
Portal Services • Directory / Search • Daily information: weather, maps, TV, ... • Free email, free pages, calendar • Personalized services, channel subscription • Web chat • E-commerce • Content aggregation • ...
Directory implementation • Each URL entry is a record • The URL data is managed by a database system • A search function is provided for searching the data in the directory tree
Directory implementation • The search is in general for locating a website or a category of websites • The data is input through manual registration by the website owner or the surfer • Managing the directory tree requires intensive labor by people with the relevant domain knowledge
The Advantages/Disadvantages of a directory search engine • Advantages: • The data is manually maintained, so it contains less noise and is more precise • The search output can be categorized and better organized • Can support search within a category
The Advantages/Disadvantages of a directory search engine • Disadvantages: • The data coverage is limited, so sometimes the wanted information cannot be found • Does not support relevance ranking • Labor intensive
Implementation of a webpage search engine 1. Feature consideration 2. Data gathering 3. Data preprocessing 4. Data indexing 5. Query processing 6. Interaction 7. Service tools 8. Personalization
Requirements for webpage search engines 0. The quality of the search results basically depends on • a. the quality of the underlying data • b. the search techniques, such as ranking techniques 1. Data coverage should be large enough 2. Data needs to be filtered, e.g., removing redundant pages
Requirements for webpage search engines 3. Full-text search capability should be provided 4. A relevance ranking mechanism should be provided 5. Search speed should be fast enough 6. Search features; i.e., the evaluation points are quality, speed, scale, robustness, and features
Data Gatherer • Also known as a spider, crawler, robot, ... • Periodically travels the web space to collect web pages • Needs list management to decide which pages to collect and when • Needs a link analyzer to generate new URL lists • Needs to decide what to collect and what not to
Data Gatherer • A get-file function over the HTTP protocol is the basic building block • A webpage parser module extracts link info from a retrieved page • A URL bank manager module manages the URLs to be fetched • A robot-controller module manages data collection across multiple clients (a minimal sketch of these modules follows below)
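To make the gatherer modules concrete, here is a minimal single-threaded sketch in Python. It is illustrative only, not the actual robot implementation: the hypothetical `LinkParser` plays the role of the webpage parser module, the `frontier` deque plays the role of the URL bank, and the fetch is the basic get-file function over HTTP.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Webpage parser module: extracts href links from a fetched page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_urls, max_pages=100):
    """URL bank + get-file loop: BFS over the URL frontier."""
    frontier = deque(seed_urls)          # URLs waiting to be fetched
    seen = set(seed_urls)                # never enqueue the same URL twice
    pages = {}                           # url -> raw HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                     # skip unreachable or malformed URLs
        pages[url] = html
        parser = LinkParser(url)
        parser.feed(html)
        for link in parser.links:        # new URLs go into the URL bank
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

A BFS order, as noted on the next slide, tends to reach important pages early; a production robot would add robots.txt handling, politeness delays, and multiple concurrent clients.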
Robot Issues • Site-based vs. URL-based • Site-based robots are popular, e.g., wget and Teleport • robots.txt is easier to honor in a site-based robot • A URL-based robot is more appropriate for large-scale search engines • Retrieval schedule: BFS is better • Incremental retrieval
Robot Issues • What to gather and what not to? • Hidden web data collection • Focused crawling • targets specialized web-page content • suitable for specialized search engines • evaluated by precision and recall
Data Preprocessing • Remove redundant pages • Transform each page into the internal data format • Perform web cross-link analysis to generate a URL databank • Filter out data that should not be indexed • Partition the data space
Redundancy removal • 15% to 20% of web pages are replicated across different websites, e.g., tutorials for Java, Perl, Python, ... • Can be implemented by partitioned hashing or external sorting (a content-hashing sketch follows below)
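A small sketch of redundancy removal by content hashing, assuming all pages fit in memory; at search-engine scale the same idea would be applied with partitioned hashing or external sorting over the hash values. The function name and the whitespace normalization are illustrative choices.

```python
import hashlib

def remove_duplicates(pages):
    """pages: dict of url -> text; keeps one URL per distinct content hash."""
    seen_hashes = {}
    unique = {}
    for url, text in pages.items():
        # normalize whitespace so trivially reformatted copies also collide
        digest = hashlib.md5(" ".join(text.split()).encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes[digest] = url
            unique[url] = text
    return unique
```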
Ranking the URLs • Link analysis is done to count the mutual references between web pages • A URL receiving a higher number of references gets a higher score • weighted links • discount internal links (such as links back to the home page) • Order the web pages by score so that a page with higher rank gets a lower ID
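A rough sketch of this link-analysis scoring. The weighting constants and the hostname test for internal links are assumptions for illustration, not the engine's actual formula.

```python
from collections import defaultdict
from urllib.parse import urlparse

def rank_urls(link_graph, internal_weight=0.1, external_weight=1.0):
    """link_graph: dict of source_url -> list of target_urls.
    Returns url -> document ID; higher-scoring pages get lower IDs."""
    score = defaultdict(float)
    for src, targets in link_graph.items():
        src_site = urlparse(src).netloc
        for dst in targets:
            same_site = urlparse(dst).netloc == src_site
            # internal links (e.g. "back to home") count far less
            score[dst] += internal_weight if same_site else external_weight
    ordered = sorted(score, key=score.get, reverse=True)
    return {url: doc_id for doc_id, url in enumerate(ordered)}
```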
Data Partition • The data is partitioned by language type • The language partition can be done as follows: • for each known language, collect a certain amount of webpages in that language • build a high-frequency term set for each language from the sample data • determine the language type of a page by term analysis
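A sketch of the language-partition procedure listed above, using naive whitespace tokenization (adequate for space-delimited languages only; CJK text would need a different tokenizer). The `top_k` cutoff is an illustrative parameter.

```python
from collections import Counter

def build_term_sets(samples, top_k=500):
    """samples: dict of language -> list of sample documents."""
    term_sets = {}
    for lang, docs in samples.items():
        counts = Counter(tok for doc in docs for tok in doc.lower().split())
        term_sets[lang] = {term for term, _ in counts.most_common(top_k)}
    return term_sets

def detect_language(text, term_sets):
    """Assign the language whose high-frequency term set overlaps the page most."""
    tokens = set(text.lower().split())
    return max(term_sets, key=lambda lang: len(tokens & term_sets[lang]))
```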
Indexer • In general, an inverted file is used to build the index • Needs a large data space for the indexing task • For each indexed term, a posting list records the files/locations where the term appears • Needs about the same space as the original data, or more
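A minimal in-memory inverted file, to show what the posting lists look like. A real indexer would build these lists with external sorting to stay within the space limits discussed on the next slides; the structure here is only a sketch.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text.
    Returns term -> posting list of (doc_id, [positions]) sorted by doc_id."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            postings[token].setdefault(doc_id, []).append(pos)
    return {term: sorted(by_doc.items()) for term, by_doc in postings.items()}
```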
Indexer - implementation issues • A data filter module copes with different data sources • The inversion module is the kernel module • Needs to be scalable to handle continuously growing data sizes • hundreds of gigabytes • terabytes • Distributed/concurrent indexing
Indexer - implementation issues • Temporary space minimization • Indexing speed is crucial • Memory can be utilized to improve indexing performance • Hashing and sorting are the key!
Query Processing • Use a dictionary/stop list to preprocess the query string • Parse the query into expressions of tokens • Use the index structure to locate the matched documents • Use a TF*IDF-style technique to score the matched documents • Combine URL scores to rank the results (a scoring sketch follows below)
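A sketch of the scoring step, combining a TF*IDF-style text score with a precomputed URL (link) score. It assumes the inverted index sketched earlier; the blending weights and the exact formula are illustrative, not the engine's actual ranking function.

```python
import math

def search(query, index, n_docs, url_score=None,
           text_weight=0.8, link_weight=0.2):
    """index: term -> posting list of (doc_id, [positions]).
    url_score: optional dict of doc_id -> link-analysis score."""
    url_score = url_score or {}
    text_score = {}
    for term in query.lower().split():
        posting = index.get(term, [])
        if not posting:
            continue
        idf = math.log(n_docs / len(posting))    # rarer terms weigh more
        for doc_id, positions in posting:
            tf = len(positions)                  # term frequency in the document
            text_score[doc_id] = text_score.get(doc_id, 0.0) + tf * idf
    # blend text relevance with the document's link score, then rank
    return sorted(text_score,
                  key=lambda d: text_weight * text_score[d]
                                + link_weight * url_score.get(d, 0.0),
                  reverse=True)
```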
Search CGI programs • search agent CGI: • parses the query and forks a searcher process to do the search (or uses IPC to query the searcher) • when the searcher returns, analyzes and formats the result for output • stores the processed result in a temporary result store • logs the query and some status info • CGI for view-next-page • showmatch CGI
Output control • Site grouping: group pages from the same website together • Title grouping: group pages with similar titles • Sort the output according to certain criteria
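A small sketch of site grouping over a ranked result list; using the hostname as the site key is an assumption for illustration.

```python
from collections import OrderedDict
from urllib.parse import urlparse

def group_by_site(ranked_urls):
    """Collapse a ranked URL list into site -> [urls], preserving rank order."""
    groups = OrderedDict()
    for url in ranked_urls:
        groups.setdefault(urlparse(url).netloc, []).append(url)
    return groups
```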
Interaction • Term Suggestion: • Related terms • thesaurus • term-expansion • error correction • phonetic • spelling
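One way to sketch the error-correction part of term suggestion: propose vocabulary terms that are close to the query term. `difflib` is used here as a stand-in for a real spelling or phonetic model; the cutoff value is illustrative.

```python
import difflib

def suggest_terms(term, vocabulary, max_suggestions=5):
    """vocabulary: list of terms known to the index."""
    return difflib.get_close_matches(term, vocabulary,
                                     n=max_suggestions, cutoff=0.75)
```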
Personalization • Keep track of a user's interests so that the search results can be tuned to improve user satisfaction • Query tracking and classification
Service tools • A query cache improves search performance for queries that have already been served • Use a memory-cached file system to reduce disk access overhead • Mechanism for special-case handling • Log analyzer
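A sketch of a query cache for previously served queries, using an in-memory LRU policy as a stand-in for a memory-cached file system; `run_search` is a hypothetical placeholder for the real searcher call.

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_search(query):
    """Served queries are answered from the cache; misses run the real search."""
    return tuple(run_search(query))     # tuples are hashable and immutable

def run_search(query):
    # hypothetical placeholder for forking / IPC to the searcher process
    return []
```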
Research Issues • Hidden web data collection • Distributed index/search • Index minimization, incremental indexing • Smart robots • Intelligent retrieval • Automatic classification/clustering of output results • Data source clustering/classification • classifying/clustering the whole web
Conclusion • Size does matter • We are still searching for a better engine!