Search Engine Survey Hongfei Yan 2/15/2007
Outline • Background Information • Definition, history, how search engines work • General Search Engines • Interface, databases, features • Google, Yahoo!, Baidu, Live • Open Source Search Engines • Lucene, SWISH-E • Metasearch, Visual, and Answer Search Engines
Definition of Search Engine • A search engine is an information retrieval system designed to help find information stored on a computer system, such as on the Web, inside a corporate or proprietary network, or in a personal computer. • The search engine allows one to ask for content meeting specific criteria (typically those containing a given word or phrase) and retrieves a list of items that match those criteria. • This list is often sorted with respect to some measure of relevance of the results. • Search engines use regularly updated indexes to operate quickly and efficiently. • In common usage, "search engine" usually refers to a Web search engine, which searches for information on the public Web.
Timeline of Search Engines • [Figure: timeline of search engines; labeled eras include “Full text” crawler-based engines, followed by link popularity and PageRank]
How search engines work • Web crawling • A crawler is an automated Web browser that follows every link it sees. Pages can be excluded through robots.txt. • Indexing • The contents of each page are analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). • Searching • When a user submits a query, the engine looks it up in the index and returns a list of the best-matching web pages according to its criteria.
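The three stages on this slide can be sketched in a few lines of Python. This is a toy illustration only: the `WEB` dictionary, the `a.example` URLs, and the `DISALLOWED` set (standing in for robots.txt exclusions) are hypothetical, and real engines rank results by relevance rather than returning them in sorted order.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example/": ("open source search engines",
                   ["a.example/lucene", "a.example/private"]),
    "a.example/lucene": ("lucene full text search library", ["a.example/"]),
    "a.example/private": ("internal search notes", []),
}
DISALLOWED = {"a.example/private"}  # stands in for robots.txt exclusions


def crawl(seed):
    """Crawling: follow every link reachable from seed, skipping exclusions."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url in DISALLOWED or url not in WEB:
            continue
        text, links = WEB[url]
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Indexing: build an inverted index, word -> set of URLs containing it."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, query):
    """Searching: return URLs that contain every query word, sorted by URL."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


pages = crawl("a.example/")
index = build_index(pages)
print(search(index, "lucene search"))
```

The excluded page is never fetched, so its words never reach the index and it cannot appear in any result list.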
Storage costs and crawling time • Storage costs are not the limiting resource in search engine implementation. • Simply storing 10 billion pages of 10 kbytes each (compressed) requires 100TB, plus another 100TB or so for indexes, giving a total hardware cost of under $200k: 100 cheap PCs, each with four 500GB disk drives. • A public search engine requires considerably more resources than this to calculate query results and to provide high availability. • Also, the costs of operating a large server farm are not trivial. • Crawling 10B pages with 100 machines, each crawling at 100 pages/second, would take 1M seconds, or 11.6 days, on a very high capacity Internet connection.
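The figures on this slide can be checked with a short calculation. The sketch below assumes decimal units (1 kbyte = 1,000 bytes, 1 TB = 10^12 bytes) and that each of the 100 machines crawls at 100 pages/second; the variable names are illustrative.

```python
PAGES = 10_000_000_000      # 10 billion pages
PAGE_BYTES = 10_000         # 10 kbytes per compressed page (decimal units assumed)
TB = 10**12                 # bytes per terabyte (decimal)

# Storage: raw pages plus roughly the same again for indexes.
page_store_tb = PAGES * PAGE_BYTES / TB
total_tb = page_store_tb + 100

# Capacity of 100 PCs, each with four 500GB drives.
disk_tb = 100 * 4 * 500 * 10**9 / TB

# Crawl time: 100 machines at 100 pages/second each.
crawl_rate = 100 * 100                  # pages per second overall
crawl_seconds = PAGES / crawl_rate
crawl_days = crawl_seconds / 86_400     # seconds per day

print(page_store_tb, total_tb, disk_tb)
print(crawl_seconds, round(crawl_days, 1))
```

The disk capacity of the 100 PCs matches the 200TB needed for pages plus indexes, with no headroom for replication.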
General Search Engine • Primary Search Engines • They are well known and well used. • They can potentially generate a great deal of traffic. * Google * Yahoo! * Baidu * Live • Secondary Web Search Engines • These are either smaller or not the primary search engine for access to databases from the Providers of Search listed below. * Exalead * Gigablast * WiseNut • Dead Search Engines • These search engines used to offer their own database or unique search features. They have all abandoned their position in search, although they may still have some kind of search functionality. * AlltheWeb * AltaVista * Excite * Infoseek * Inktomi
GSE: Databases • Web: • Indexed Web pages (including URLs that have not been fully indexed) • Additional file types in the Web database include PDF, .ps, .doc, .xls, .txt, .ppt, .rtf, .asp, and more. • Ads: Paid advertisements, usually shown on the right side (or top) under a "Sponsored Links" heading
GSE: Features • A large, unique search engine database • Includes cached copies of pages • Uses not only PageRank but more than 150 criteria to determine relevancy • Default operation: Multiple search terms are processed as an AND operation by default. Phrase matches are ranked higher (proximity searching). • No truncation is available. • Case sensitivity: None. Using either lower or upper case returns the same hits.
GSE: Features contd. • Field searching • Language limits: Default is all languages. 30+ language limits are available. • Stop words: Almost all words are searched, except for operators such as AND. • Display: • The display includes the title, • URL, • a brief extract showing text near the search terms, • the file size, • and for many hits, a link to a cached copy of the page.
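The default query behavior listed in these feature slides (implicit AND, higher rank for phrase matches, no case sensitivity) can be illustrated with a minimal sketch. The `rank` function and the sample documents are hypothetical; this is not how any particular engine implements ranking, which also weighs many other criteria.

```python
def rank(docs, query):
    """Return ids of documents containing ALL query terms, phrase matches first.

    docs maps a document id to its text. Matching is case-insensitive.
    """
    terms = query.lower().split()
    if not terms:
        return []
    phrase = " " + " ".join(terms) + " "
    scored = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        # Implicit AND: every term must occur somewhere in the document.
        if not all(term in words for term in terms):
            continue
        # Exact phrase (terms adjacent, in order) sorts ahead of other matches.
        has_phrase = phrase in " " + " ".join(words) + " "
        scored.append((0 if has_phrase else 1, doc_id))
    return [doc_id for _, doc_id in sorted(scored)]


docs = {
    "d1": "Search Engine survey",
    "d2": "engine for search",
    "d3": "search tools",
}
print(rank(docs, "search engine"))
print(rank(docs, "SEARCH ENGINE"))
```

Both queries return the same list because the text and the query are lowercased before comparison, and `d3` is dropped because it lacks one of the terms.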
Review of Google • In Feb. 1999 Google moved from its Alpha test version to Beta, and it officially launched on Sept. 21, 1999. • Since that time it has made its mark with its relevance ranking based on link analysis, cached pages, and aggressive growth. • Since its beta release, it has had phrase searching and the - for NOT, but it did not add an OR operation until Oct. 2000. • In Dec. 2000, it added title searching. • In June 2000 it announced a database of over 560 million pages, which grew to over 600 million by the end of 2000 and then 1.5 billion in Dec. 2001. • The 2+ billion reported on its home page as of April 2002 includes indexed pages, unindexed URLs, and other file formats. By Nov. 2002, it moved its claim up to 3 billion, and in Feb. 2004 it went to 4 billion. • While no official claim is given, 20+ billion is one current estimate.
Review of Yahoo! • The two founders of Yahoo!, David Filo and Jerry Yang, Ph.D. candidates in Electrical Engineering at Stanford University, started their guide in a campus trailer in February 1994 as a way to keep track of their personal interests on the Internet. Before long they were spending more time on their home-brewed lists of favourite links than on their doctoral dissertations. Eventually, Jerry and David's lists became too long and unwieldy, and they broke them out into categories. When the categories became too full, they developed subcategories ... and the core concept behind Yahoo! was born. • In 2002, Yahoo! acquired Inktomi, and in 2003, Yahoo! acquired Overture, which owned AlltheWeb and AltaVista. • In 2004, Yahoo! launched its own search engine based on the combined technologies of its acquisitions, providing a service that gave pre-eminence to the Web search engine over the directory.
Review of Live • Live Search is the successor to MSN Search. This is the Microsoft Web search engine. Launched in September 2006, it uses its own, unique database. • In 2004 it debuted a beta version of its own results, powered by its own web crawler (called msnbot). • In early 2005 it started showing its own results live. At the same time, Microsoft ceased using results from Inktomi, now owned by Yahoo!. • In 2006, Microsoft migrated to a new search platform - Windows Live Search, retiring the "MSN Search" name in the process.
Review of Baidu • Baidu (Chinese: 百度; pinyin: bǎi dù) is a popular Chinese search engine which launched in 2000 and can search text and images. As of January 2007, and since at least May 2006, it has ranked fourth in Alexa's internet rankings, with a market share of 52 percent. • Baidu provides an index of over 1 billion web pages.
Lucene, lucene.apache.org • Lucene is a free and open source information retrieval API, originally implemented in Java by Doug Cutting. Lucene has been ported to programming languages including Perl, C#, C++, Python, Ruby and PHP. • It is suitable for any application that requires full-text indexing and searching capability. • At the core of Lucene's logical architecture is the notion of a document containing fields of text. This flexibility allows Lucene's API to be agnostic of file format. Text from PDFs, HTML, Microsoft Word documents, and many other formats can be indexed as long as their textual information can be extracted.
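The idea of a document as a set of named text fields can be illustrated independently of any library. The sketch below is plain Python and does not use or reproduce Lucene's API; `index_documents`, `search_field`, and the sample documents are hypothetical names chosen for illustration. It assumes the text has already been extracted from the original file format.

```python
def index_documents(docs):
    """Build an index mapping (field, term) -> set of document ids.

    docs maps a document id to a dict of field name -> extracted text.
    """
    index = {}
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index.setdefault((field, term), set()).add(doc_id)
    return index


def search_field(index, field, term):
    """Return sorted ids of documents whose given field contains the term."""
    return sorted(index.get((field, term.lower()), set()))


# The source format (PDF, HTML, ...) no longer matters once text is extracted.
docs = {
    1: {"title": "Lucene overview", "body": "text extracted from a pdf file"},
    2: {"title": "Notes", "body": "lucene indexes text extracted from html"},
}
index = index_documents(docs)
print(search_field(index, "title", "lucene"))
print(search_field(index, "body", "lucene"))
```

Keying the index on the field as well as the term is what makes field searching possible: the same word can match in a title but not in a body, or the reverse.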
SWISH-E, swish-e.org • Swish-e stands for Simple Web Indexing System for Humans - Enhanced. It is used to index collections of up to one million documents and includes import filters for many document types. • Many sites use Swish-e.
Visual Search Engine • A search returns both a list of search results and a tag cloud. The tag cloud contains the original search terms surrounded by related tags. The closer a suggested keyword is to the search terms and the larger it appears (in both font size and boldness), the more relevant it is deemed. Holding the mouse over a term displays a new set of results in the bottom window and shows another keyword cloud overlaying the original.
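One simple way to make more relevant tags appear larger is to map each tag's relevance score linearly onto a font-size range. The sketch below is an assumption about how such a cloud could be sized, not a description of any specific visual search engine; `font_sizes`, the point-size limits, and the scores are hypothetical.

```python
def font_sizes(relevance, min_pt=10, max_pt=24):
    """Map relevance scores to font sizes in points, scaled linearly."""
    if not relevance:
        return {}
    lowest, highest = min(relevance.values()), max(relevance.values())
    span = highest - lowest
    sizes = {}
    for tag, score in relevance.items():
        # If all scores are equal, give every tag the largest size.
        fraction = (score - lowest) / span if span else 1.0
        sizes[tag] = round(min_pt + fraction * (max_pt - min_pt))
    return sizes


sizes = font_sizes({"lucene": 1.0, "index": 0.5, "java": 0.0})
print(sizes)
```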
Metasearch Engines • Unlike search engines, metacrawlers don't crawl the web themselves to build listings. Instead, they allow searches to be sent to several search engines all at once. The results are then blended together onto one page.
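Blending can be done in many ways. A minimal sketch, assuming each underlying engine returns a ranked list of URLs, is to interleave the lists round-robin and drop duplicates. The `blend` function and the sample URLs are hypothetical, and real metasearch engines may weight engines or scores differently.

```python
from itertools import zip_longest


def blend(result_lists):
    """Interleave ranked result lists round-robin, keeping the first copy of each URL."""
    merged, seen = [], set()
    # zip_longest yields the rank-1 results of every engine, then rank-2, and so on.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example"]
print(blend([engine_a, engine_b]))
```

A URL returned by several engines appears once, at the position of its best rank across the lists.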
Answer-based search engines • Answers.com: presents reference content in over four million entries, collected from multiple sources.