Search Engines: The players and the field • The mechanics of a typical search. • The search engine wars. • Statistics from search engine logs. • The architecture of a search engine. • The query engine.
Search on the Web • Corpus: The publicly accessible Web: static + dynamic • Goal: Retrieve high-quality results relevant to the user’s need (not docs!) • Need • Informational – want to learn about something • Navigational – want to go to that page • Transactional – want to do something (web-mediated) • Access a service • Downloads • Shop • Gray areas • Find a good hub • Exploratory search “see what’s there” • Example queries: Tampere weather; Mars surface images; Nikon CoolPix; Low hemoglobin; United Airlines; Car rental Finland; Abortion morality
Search Engines as Info Gatekeepers • Search engines are becoming the primary entry point for discovering web pages. • Ranking of web pages influences which pages users will view. • Exclusion of a site from search engines will cut off the site from its intended audience. • The privacy policy of a search engine is important. • See: Introna & Nissenbaum, “Defining the Web: The Politics of Search Engines”; Hindman et al., “Googlearchy: How a few Heavily-Linked Sites Dominate Politics on the Web”
Search Engine Wars • The battle for domination of the web search space is heating up! • The competition is good news for users! • Crucial: advertising is combined with search results! • What if one search engine manages to dominate the space?
Yahoo! • Synonymous with the dot-com boom, probably the best-known brand on the web. • Started off as a web directory service in 1994, acquired leading search engine technology in 2003. • Has very strong advertising and e-commerce partners.
Lycos! • One of the pioneers of the field • Introduced innovations that inspired the creation of Google
Google • The verb “google” has become synonymous with searching for information on the web. • Has raised the bar on search quality. • Has been the most popular search engine in the last few years. • Had a very successful IPO in August 2004. • Is innovative and dynamic. • Has restored the glamour that CS lost in the dot-com bust.
Live Search (was: MSN Search) • Microsoft is synonymous with PC software. • Remember its victory in the browser wars with Netscape. • Developed its own search engine technology only recently, officially launched in Feb. 2005. • May link web search into its next version of Windows.
Ask Jeeves • Specialises in natural language question answering. • Search driven by Teoma.
Cuil • The latest kid on the block • Claims to have indexed 120B pages! • So far, it does not rank!
Experiment with query syntax • Default is AND, e.g. “computer chess” is normally interpreted as “computer AND chess”, i.e. both keywords must be present in every hit. • “+chess” in a query means the user insists that “chess” be present in every hit. • “computer OR chess” means at least one of the keywords must be present in every hit. • “"computer chess"” (with the quotation marks typed as part of the query) means that the exact phrase “computer chess” must be present in every hit.
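The operator rules above can be sketched as a small matcher over plain text. This is only an illustration under stated assumptions: the function name `matches` is hypothetical, OR is recognised only in upper case, and no commercial engine is claimed to parse queries this way.

```python
import re

def matches(query: str, text: str) -> bool:
    """Return True if `text` satisfies `query` under the rules listed above."""
    words = re.findall(r"\w+", text.lower())
    joined = " ".join(words)

    # Quoted phrases must occur as consecutive words in the text.
    phrases = re.findall(r'"([^"]+)"', query)
    rest = re.sub(r'"[^"]+"', " ", query)
    for phrase in phrases:
        phrase_words = " ".join(re.findall(r"\w+", phrase.lower()))
        if f" {phrase_words} " not in f" {joined} ":
            return False

    # Split what is left on OR; each side is an implicit AND group.
    groups = [g.split() for g in re.split(r"\bOR\b", rest)]
    # A leading "+" only restates the default (term required), so drop it.
    groups = [[t.lstrip("+").lower() for t in g] for g in groups]
    groups = [g for g in groups if g]
    if not groups:
        return True  # the query consisted only of phrases
    return any(all(t in words for t in g) for g in groups)

print(matches("computer chess", "A chess computer"))      # AND: word order is free
print(matches('"computer chess"', "A chess computer"))    # phrase: word order matters
print(matches("computer OR chess", "I like chess"))       # OR: one keyword is enough
```

The phrase check is the only rule that needs word order, which is why a real index also has to store word positions.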
Web search Users • Ill-defined queries • Short length • Imprecise terms • Sub-optimal syntax (80% of queries have no operator) • Low effort in defining queries • Wide variance in • Needs • Expectations • Knowledge • Bandwidth • Specific behavior • 85% look over one result screen only • mostly above the fold • 78% of queries are not modified • 1 query/session • Follow links – “the scent of information” ...
Query Distribution Power law: few popular broad queries, many rare specific queries
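To make the head-versus-tail shape concrete, here is a rough sketch that assumes query frequency falls off as rank to the power of minus alpha. The exponent `alpha = 1.0`, the count of 100,000 distinct queries, and the name `zipf_shares` are illustrative assumptions, not figures from any search engine log.

```python
def zipf_shares(n_queries: int, alpha: float = 1.0) -> list[float]:
    """Share of total query volume for each rank under f(rank) ~ rank**-alpha."""
    weights = [rank ** -alpha for rank in range(1, n_queries + 1)]
    total = sum(weights)
    return [w / total for w in weights]

shares = zipf_shares(100_000)        # assumed number of distinct queries
head = sum(shares[:100])             # the 100 most popular queries
tail = sum(shares[1000:])            # everything beyond rank 1000

print(f"top 100 queries:   {head:.1%} of volume")
print(f"beyond rank 1000:  {tail:.1%} of volume")
```

Under these assumptions a tiny set of popular queries carries a large share of the traffic, while the long tail of rare queries still adds up to a substantial share of its own.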
How far do people look for results? (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Architecture of a Search Engine (diagram components: User, Search, Web spider, Indexer, The Web, Indexes, Ad indexes)
Rate of web content change • 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999 [Cho00] • Mathematically, what does this seem to be? • What does this suggest for crawling policy?
Diversity • Languages/Encodings • Hundreds of languages, W3C encodings: 55 (Jul01) [W3C01] • Home pages (1997): English 82%, Next 15: 13% [Babe97] • Google (mid 2001): English: 53%, JGCFSKRIP: 30% • Document & query topic
Popular Query Topics (from 1 million Google queries, Apr 2000)
Arts 14.6% | Arts: Music 6.1%
Computers 13.8% | Regional: North America 5.3%
Regional 10.3% | Adult: Image Galleries 4.4%
Society 8.7% | Computers: Software 3.4%
Adult 8% | Computers: Internet 3.2%
Recreation 7.3% | Business: Industries 2.3%
Business 7.2% | Regional: Europe 1.8%
… | …
Search Index - Inverted File Frequency • Also store position of word in web page (“offset”) and information on HTML structure.
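A minimal sketch of an inverted file that keeps word offsets, assuming pages are already plain text. The name `build_index` and the two sample pages are made up for illustration, and the HTML-structure information mentioned above is left out.

```python
import re
from collections import defaultdict

def build_index(pages: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map each word to {page_id: [offsets of that word within the page]}."""
    index: dict[str, dict[str, list[int]]] = defaultdict(dict)
    for page_id, text in pages.items():
        for offset, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(page_id, []).append(offset)
    return dict(index)

pages = {
    "p1": "computer chess and chess engines",
    "p2": "chess openings",
}
index = build_index(pages)

print(index["chess"])             # {'p1': [1, 3], 'p2': [0]}
print(len(index["chess"]["p1"]))  # term frequency of "chess" in p1: 2
```

Storing offsets means the frequency comes for free as the length of each offset list, and adjacent offsets can later be used to check phrases.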
The query engine • The interface between the search index, the user and the web. • Algorithmic details of commercial search engines are kept as trade secrets. • First step is retrieval of potential results from the index. • Second step is the ranking of the results based on their “relevance” to the query.
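The two steps can be sketched as follows. Because commercial ranking algorithms are trade secrets, the score used here (the summed count of query terms in a page) is only a stand-in, and `search` and `tokenize` are hypothetical names. For brevity the word counts are recomputed on every call rather than read from a prebuilt index.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

def search(query: str, pages: dict[str, str]) -> list[str]:
    terms = tokenize(query)
    counts = {pid: Counter(tokenize(text)) for pid, text in pages.items()}

    # Step 1: retrieval. Keep pages that contain every query term.
    candidates = [pid for pid, c in counts.items()
                  if all(c[t] > 0 for t in terms)]

    # Step 2: ranking. Order by a toy relevance score, highest first;
    # ties are broken by page id so the result is deterministic.
    def score(pid: str) -> int:
        return sum(counts[pid][t] for t in terms)

    return sorted(candidates, key=lambda pid: (-score(pid), pid))

pages = {
    "a": "chess chess computer",
    "b": "computer chess",
    "c": "chess only",
}
print(search("computer chess", pages))  # ['a', 'b']
```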
Crawling the Web • Mode of crawl: BFS • Frequency of crawl: important • robots.txt gives explicit directions on what not to crawl • Parallel machines crawl all the time
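A rough sketch of a breadth-first crawl that honours robots.txt, using Python's standard `urllib.robotparser`. To stay runnable offline, the link graph is an in-memory dict standing in for fetched pages. The URLs, the agent name `example-bot`, and the function `bfs_crawl` are all illustrative assumptions, and a single robots.txt for a single host is assumed.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

def bfs_crawl(seed: str, links: dict[str, list[str]], robots_txt: str,
              agent: str = "example-bot", limit: int = 100) -> list[str]:
    """Visit pages breadth-first from `seed`, skipping disallowed URLs."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue  # disallowed: neither fetched nor expanded
        order.append(url)
        for nxt in links.get(url, []):  # stand-in for parsing the fetched page
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

robots_txt = "User-agent: *\nDisallow: /private/\n"
links = {
    "http://example.com/": ["http://example.com/a",
                            "http://example.com/private/x",
                            "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/private/x": ["http://example.com/hidden"],
    "http://example.com/b": ["http://example.com/c"],
}
crawled = bfs_crawl("http://example.com/", links, robots_txt)
print(crawled)
```

The FIFO queue is what makes the traversal breadth-first: every page one link away from the seed is visited before any page two links away. Crawl frequency and parallel machines are outside the scope of this sketch.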