Project Topic: Performance and Cost Tradeoffs in Web Search
Nick Craswell, Francis Crimmins, David Hawking, Alistair Moffat
Reviewed By: Johnny Sia (csia005), Allen Wang (awan015), Li Li (lli057), Hui Zhang (hzha113)
Outline
• Motivation and Introduction by Johnny
• Background information by Allen
• Case Study and cost model analysis (Research Finder) by Li
• New hybrid approach and Conclusion by Henry
Motivation
• Web search engines crawl the web to gather the data that they index
• Slowly crawling the web to download pages from many websites results in a large amount of data being transferred across networks
• These network costs must be paid for!
• In the case of Google, ONE crawl of the 3 billion web pages it indexes would have a network cost of over $1.5 million
Introduction
• Two standard approaches to providing a search service
  • Periodic crawling (e.g. Google)
  • Metasearch (e.g. MetaCrawler)
• A new alternative
  • A crawl-metasearch hybrid model
Aim
• Aim: To find the most cost-effective way to support web search services.
• Where does the cost come from? Nothing is FREE!! Data traffic costs a lot!!
• Two common approaches:
  • Web crawling
  • Metasearch
Web-Crawling
• What is a crawler? A: A program that automatically collects Web pages to create a local index.
• Pros
  • Less query processing time required
  • Fast response to users
  • Fixed cost
• Cons
  • Expensive!!!
  • Indexed data becomes stale
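To make the crawler definition above concrete, here is a minimal sketch of a breadth-first crawl that builds a local inverted index and counts the bytes it fetches. The `TOY_WEB` dictionary, its URLs and its page texts are invented for illustration; a real crawler would fetch pages over HTTP and parse links out of HTML, which this sketch does not attempt.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# All pages here are made up; nothing is fetched over the network.
TOY_WEB = {
    "a.example/": ("web search cost", ["a.example/crawl", "b.example/"]),
    "a.example/crawl": ("crawl the web", ["a.example/"]),
    "b.example/": ("metasearch cost model", []),
}


def crawl(seed, web):
    """Breadth-first crawl from seed; returns (inverted index, bytes fetched)."""
    index = {}              # term -> set of URLs containing that term
    seen = {seed}
    queue = deque([seed])
    bytes_fetched = 0
    while queue:
        url = queue.popleft()
        text, links = web[url]
        # Every fetched page adds to the traffic that has to be paid for.
        bytes_fetched += len(text.encode("utf-8"))
        for term in text.split():
            index.setdefault(term, set()).add(url)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index, bytes_fetched


index, traffic = crawl("a.example/", TOY_WEB)
print(sorted(index["cost"]), traffic)
```

Queries are then answered from `index` alone, which is why responses are fast and the traffic cost is fixed per crawl regardless of how many queries arrive.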
Metasearch
How does it work?
[Diagram: users send a QUERY to the metasearch engine; query wrapping forwards the QUERY to each local search engine (University of Auckland, AUT, MIT, MSDN (Microsoft)); each local search engine returns Results; results merging combines them into the Final Results returned to the users.]
Metasearch
• Pros
  • Cheap to maintain (really?)
  • “Fresh” data
• Cons
  • Quality of the search depends on the local servers
  • Needs a “wrapper” for each server to forward queries
  • Results from the various servers need to be merged
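The wrapper and merging steps can be sketched as follows. The two local search functions, their result URLs and their scoring scales are all hypothetical, and merging by a normalised score is only one simple assumption; the slides do not say which merging method is used.

```python
# Hypothetical local search engines. Each returns (score, url) pairs on its
# own scale; the names, documents and scores are invented for illustration.
def search_auckland(query):
    return [(0.9, "auckland.example/ir"), (0.4, "auckland.example/db")]


def search_mit(query):
    # This engine scores on a 0-100 scale, so its wrapper must normalise.
    return [(80, "mit.example/search"), (20, "mit.example/misc")]


# A "wrapper" forwards the query to one server and maps its results onto a
# common 0-1 scale so that they can be compared with other servers' results.
WRAPPERS = {
    "auckland": lambda q: search_auckland(q),
    "mit": lambda q: [(s / 100.0, u) for s, u in search_mit(q)],
}


def metasearch(query, wrappers, top_k=3):
    """Send the query to every wrapped server and merge by normalised score."""
    merged = []
    for name, wrapper in wrappers.items():
        for score, url in wrapper(query):
            merged.append((score, url, name))
    merged.sort(key=lambda r: r[0], reverse=True)
    return merged[:top_k]


results = metasearch("web search cost", WRAPPERS)
for score, url, server in results:
    print(f"{score:.1f} {url} (from {server})")
```

Each query triggers one request per wrapped server, so traffic grows with the query rate rather than with the amount of indexed data.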
Case Study
• Panoptic (Research Finder)
• Searchable full-text, index-based retrieval system
• Based on a regular crawl
• The newest version also introduces a metasearch model
• Operated for a range of Australian research institutions
• The eight largest Australian universities contribute more data to the Panoptic crawl
Case Study (cont’d)
• Rate of change
  • (a) Pages which have disappeared
  • (b) Pages which have changed so much => bad results
  • (c) Pages which have changed a little => good answers
• Changes of type (c) can be ignored
• Changes of types (a) and (b) are the most important in a search system. Why? As a crawl becomes stale, users are more likely to see an embarrassing result.
Case Study (cont’d)
• Over an eight-day period:
  • Disappearance: 1.6%
  • Small changes: 8.2%
  • Large changes: 6.4%
  • No changes: 83.8%
• Normally, pages in the .com domain change more frequently than those in the .edu domain (result from 151 million pages)
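As a quick check on the figures above, the short sketch below adds up the two harmful categories (disappeared pages and large changes) to estimate how much of an eight-day-old crawl could produce an embarrassing result. The dictionary keys are names chosen for this example; the percentages are the ones on the slide.

```python
# Percentage of pages in each category over the eight-day period (from the slide).
changes = {
    "disappeared": 1.6,
    "small_change": 8.2,
    "large_change": 6.4,
    "unchanged": 83.8,
}

total = sum(changes.values())

# Disappeared pages and large changes are the cases that lead to bad results;
# small changes still give good answers and are ignored here.
at_risk = changes["disappeared"] + changes["large_change"]

print(f"categories sum to {total:.1f}%")
print(f"pages likely to give a bad result after eight days: {at_risk:.1f}%")
```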
Cost Model
• Fc: crawl frequency. Unit: crawls/month. Value: 1
• Sd: combined data size. Unit: GB. Value: 33.3
• So: crawling overhead. Unit: B_fetched/B_indexed. Value: 1.7
• Ct: data transmission cost. Unit: $/GB. Value: 22.5 (= 0.0225 $/MB)
• Fq: query arrival rate. Unit: queries/month. Value: 10,000
• Sq: size of query response page. Unit: GB/query. Value: 2x10^(-5) (= 20 kB)
• Nc: number of servers being federated. Value: 175
• For crawling: Fc x Sd x So x Ct
• For answering queries: Fq x Sq x (Nc + 1) x Ct
• Currency note: 1 NZD = 0.9368 AUD, so 0.07 NZD/MB = 0.066 AUD/MB
Cost Model (cont’d)
• Crawling: Fc x Sd x So x Ct
• Metasearch (answering queries): Fq x Sq x (Nc + 1) x Ct
• When the number of queries per month is low, metasearch is cost-effective
• When the number of queries per month is high, crawling is cost-effective
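Plugging the slide's parameter values into the two formulas gives a worked example of the tradeoff. The function names below are chosen for this sketch, and the break-even query rate is derived here by setting the two formulas equal (Ct cancels); it is not a figure quoted on the slides.

```python
# Parameter values as given on the cost model slide.
FC = 1        # crawl frequency, crawls/month
SD = 33.3     # combined data size, GB
SO = 1.7      # crawling overhead, bytes fetched per byte indexed
CT = 22.5     # data transmission cost, $/GB
FQ = 10_000   # query arrival rate, queries/month
SQ = 2e-5     # size of a query response page, GB/query (20 kB)
NC = 175      # number of servers being federated


def crawl_cost(fc, sd, so, ct):
    """Monthly traffic cost of crawling: Fc x Sd x So x Ct."""
    return fc * sd * so * ct


def metasearch_cost(fq, sq, nc, ct):
    """Monthly traffic cost of answering queries: Fq x Sq x (Nc + 1) x Ct."""
    return fq * sq * (nc + 1) * ct


def break_even_queries(fc, sd, so, sq, nc):
    """Query rate at which both costs are equal (Ct cancels out)."""
    return (fc * sd * so) / (sq * (nc + 1))


crawl = crawl_cost(FC, SD, SO, CT)
meta = metasearch_cost(FQ, SQ, NC, CT)
threshold = break_even_queries(FC, SD, SO, SQ, NC)

print(f"crawl:      ${crawl:,.2f} per month")
print(f"metasearch: ${meta:,.2f} per month")
print(f"break-even: {threshold:,.0f} queries per month")
```

With these values metasearch is the cheaper option at 10,000 queries per month, and crawling only becomes cheaper once the query rate passes roughly 16,000 queries per month.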
Performance and Cost Tradeoffs in Web Search
• New hybrid approach
• A full index is suitable for a large query load; metasearch is better when the query arrival rate is lower
• Metasearches the largest organizations and crawls the others
• Can reduce the crawl cost by approximately half
  • e.g. proof of concept demonstration at: http://thylacine.panopticsearch.com/hybriddemo/index.cgi
• Still faces the disadvantages of metasearch
  • e.g. the need to write wrappers, response time issues, and reliance on the quality of local search services
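A rough way to see why the hybrid can pay off is to combine the two cost formulas from the cost model slide. This is only a sketch under stated assumptions, not the authors' model: it assumes eight large servers are metasearched (`N_META = 8`), that they hold half of the data (`META_DATA_SHARE = 0.5`, matching the "approximately half" crawl saving), and that each query is forwarded to those servers only.

```python
# Parameter values from the cost model slide.
FC, SD, SO, CT = 1, 33.3, 1.7, 22.5   # crawls/month, GB, overhead ratio, $/GB
FQ, SQ = 10_000, 2e-5                 # queries/month, GB/query

# ASSUMPTIONS for this sketch (not stated on the slides):
N_META = 8              # number of large servers that are metasearched
META_DATA_SHARE = 0.5   # fraction of all data held by those servers


def hybrid_cost(fc, sd, so, ct, fq, sq, n_meta, meta_share):
    """Crawl only the non-metasearched data; forward queries to n_meta servers."""
    crawl_part = fc * sd * (1.0 - meta_share) * so * ct
    query_part = fq * sq * (n_meta + 1) * ct
    return crawl_part + query_part


full_crawl = FC * SD * SO * CT
hybrid = hybrid_cost(FC, SD, SO, CT, FQ, SQ, N_META, META_DATA_SHARE)

print(f"full crawl: ${full_crawl:,.2f} per month")
print(f"hybrid:     ${hybrid:,.2f} per month")
```

Under these assumptions the query traffic to a handful of large servers is small compared with the crawl traffic saved, but the result depends entirely on the assumed data share and server count.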
Performance and Cost Tradeoffs in Web Search
• Conclusion
• The group presented useful cost models and discussed several alternative approaches
• Regrettably, many of the discussed options are not currently feasible
• The most promising cost-reduction alternative in the current situation seems to be incremental, variable-frequency crawling
• This model could be incorporated into a hybrid metasearch model with further savings, provided result merging can be performed sufficiently well in the future
• No single solution is reasonable for all operational search systems
• This research area remains a challenging and attractive subject