WebGather: Design and Implementation • Hongfei Yan • Network Group, CST, PKU • Dec. 15, 2000 • Email: yhf@net.cs.pku.edu.cn • http://net.cs.pku.edu.cn/~yhf
Outline • Introduction to search engines • WebGather • Conclusion
Introduction: Search Engine Sizes (Search Engine Watch, Nov. 8, 2000) • GG = Google • WT = WebTop.com • AV = AltaVista • FAST = FAST • NL = Northern Light • EX = Excite • INK = Inktomi • Go = Go (Infoseek)
Introduction: a new study by Inktomi and the NEC Research Institute, Inc., Feb. 2000 • Number of indexable pages on the web: over 1 billion • Number of servers discovered: 6,409,521 • Number of mirrors in servers discovered: 1,457,946 • Number of sites (total servers minus mirrors): 4,951,247 • Number of good sites (reachable over a 10-day period): 4,217,324 • Number of bad sites (unreachable): 733,923 • Web pages per good site: 1,000,000,000 / 4,217,324 ≈ 237.1
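The per-site average on this slide can be reproduced in a few lines. This is only a sanity check of the slide's arithmetic; it treats "over 1 billion" as exactly 1,000,000,000, which is an assumption, so the result is a lower bound. The variable names are illustrative.

```python
# Figures quoted on the slide (Inktomi / NEC Research Institute study, Feb. 2000).
indexable_pages = 1_000_000_000  # "over 1 billion", taken as a round lower bound
good_sites = 4_217_324           # reachable over a 10-day period
bad_sites = 733_923              # unreachable
total_sites = 4_951_247          # as reported on the slide

# Good and bad sites together account for every site reported.
assert good_sites + bad_sites == total_sites

# Average number of indexable pages per reachable site.
pages_per_good_site = indexable_pages / good_sites
print(round(pages_per_good_site, 1))  # 237.1
```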
Introduction: Inktomi search engine cluster • In the picture: 9 × 8 × 2 = 144
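WebGather: Introduction • The "Tianwang" (天网) Chinese-English search engine system, developed by the Network and Distributed Systems Laboratory of the Department of Computer Science at Peking University, is a research result of the national Ninth Five-Year Plan key science and technology project "Chinese Encoding and Distributed Chinese-English Information Discovery". It began offering web information navigation services to Internet users on CERNET on October 29, 1997. While in service, Tianwang has broadly adopted users' comments and suggestions and continually improved its service quality; to date it has received more than 8 million visits. The Tianwang search engine research group, newly formed in early 2000 and funded by the National 973 Key Basic Research Development Program, carries on the fine tradition of the original development team and is devoted to exploring and researching the key technologies of Chinese-English search engine systems, in order to give users faster, more accurate, more comprehensive, and more up-to-date navigation of massive web information. We welcome comments and suggestions from all users. • http://e.pku.edu.cn/ • "Though we lack the paired wings of the colorful phoenix, our hearts are linked as one."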
WebGather: as of Dec. 1, 2000 • 2.5-million-page scale • Indexes 2.5 million web pages • More than 200,000 web pages every day • Ten days to update all data • Three PCs
WebGather: Design goals for a distributed web-crawling system • Collect all the web pages in China • Keep pace with the rapid growth of Chinese web information • 238 × 40,000 = 9,520,000
WebGather 2.0: architecture [Diagram components: WWW, Gatherer, Gather Database, Indexer, Retrieve Database, Retriever, Client, Client, User behavior, log database]
WebGather 1.2: architecture of gather subsystem 1/4 [Diagram components: Main Control, Gather1, Gather2, …, GatherN]
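The slide does not say how Main Control divides work among Gather1 … GatherN. One common scheme, shown here purely as an assumption, is to hash each URL's host name so that every page of a site goes to the same gatherer. The functions `assign_gatherer` and `dispatch` are hypothetical names for illustration, not WebGather's API.

```python
import hashlib
from urllib.parse import urlsplit


def assign_gatherer(url: str, num_gatherers: int) -> int:
    """Return the 0-based index of the gatherer responsible for this URL's host.

    Assumption: work is partitioned by host name, so one site is always
    crawled by the same gatherer. The slide does not confirm this.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as a stable integer hash.
    return int.from_bytes(digest[:4], "big") % num_gatherers


def dispatch(urls, num_gatherers):
    """Hypothetical main-control step: split a URL list into one queue per gatherer."""
    queues = [[] for _ in range(num_gatherers)]
    for url in urls:
        queues[assign_gatherer(url, num_gatherers)].append(url)
    return queues


seed_urls = [
    "http://net.cs.pku.edu.cn/~yhf",
    "http://net.cs.pku.edu.cn/index.html",
    "http://e.pku.edu.cn/",
    "http://sohu.com/",
]
queues = dispatch(seed_urls, 3)
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps the assignment the same across restarts, which matters if several machines must agree on who owns a host.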
WebGather: technologies in gather subsystem 1/4 • Distributed system architecture • High availability • … • Load balancing • Low bandwidth • Scalability • Re-configurability • … • Word segmentation ("cut words") • Position relativity • Anchor text, link popularity
WebGather: architecture of indexer subsystem 2/4 [Diagram, panels A and B: mappings between web pages (webpage1, webpage2, …, webpageN) and features (feature1, feature2, …, featureN)]
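The diagram on this slide did not survive extraction, so the following is a guess at what it depicts: a forward index (web page → features) being turned into an inverted index (feature → web pages), which is the usual core of an indexer. The function `invert` and the sample data are hypothetical; only the `webpageK` / `featureK` labels come from the slide.

```python
from collections import defaultdict


def invert(forward_index):
    """Build an inverted index (feature -> pages) from a forward index (page -> features).

    Assumption: this is the transformation the slide's diagram illustrates.
    """
    inverted = defaultdict(list)
    for page, features in forward_index.items():
        # De-duplicate features so each page appears once per posting list.
        for feature in sorted(set(features)):
            inverted[feature].append(page)
    return dict(inverted)


forward = {
    "webpage1": ["feature1", "feature2"],
    "webpage2": ["feature2"],
    "webpage3": ["feature1", "feature3"],
}
inverted = invert(forward)
print(inverted["feature2"])  # ['webpage1', 'webpage2']
```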
WebGather: technologies in retriever subsystem 3/4 • Traditional IR (vector space model, VSM) • Query cache, hot clicks • Word segmentation ("cut words") • Anchor text, link popularity
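As a reminder of what "traditional IR (VSM)" usually means, here is a minimal vector-space ranking sketch: documents and the query become TF-IDF vectors and are compared by cosine similarity. The weighting formula, the function names, and the sample documents are assumptions for illustration; the slides do not specify WebGather's actual scoring. Documents are given as already-segmented term lists, standing in for the "cut words" step.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return ({doc_id: {term: weight}}, idf) for pre-segmented documents."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    # Smoothed IDF (an assumed variant) so terms found in every document keep weight.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in Counter(terms).items()}
        for doc_id, terms in docs.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_terms, docs):
    """Return ids of documents with non-zero similarity, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_terms).items()}
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0.0]


docs = {
    "d1": ["web", "search", "engine"],
    "d2": ["web", "crawler"],
    "d3": ["chinese", "poetry"],
}
print(rank(["web", "crawler"], docs))  # ['d2', 'd1']
```

A query cache, as listed on the slide, could sit in front of `rank` and reuse results for repeated queries; that part is not sketched here.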
WebGather: technologies in user behavior subsystem 4/4 • Link popularity • Replica popularity • User popularity
Conclusion • Search engines are becoming more and more important. • The Web is a good experimental object; we can do a lot of R&D on it.