This lecture discusses the challenges and strategies of web search engines, economic models, components of a web search service, and considerations for effective information retrieval. It also includes a case study on Google's architecture and performance.
CS 430: Information Discovery
Lecture 18
Web Search Engines: Google
Web Search
Goal
Provide information discovery for large amounts of open access material on the web.
Challenges
• Volume of material -- several billion items, growing steadily
• Items created dynamically or held in databases
• Great variety -- length, formats, quality control, purpose, etc.
• Inexperience of users -- wide range of needs
• Economic models to pay for the service
Strategies
Subject hierarchies
• Yahoo! -- use of human indexing
Web crawling + automatic indexing
• General -- Google, AltaVista, Ask Jeeves, NorthernLight, ...
• Subject-based -- Psychcrawler, PoliticalInformation.Com, Inomics.Com, ...
Mixed models
• Human-directed web crawling and automatic indexing -- BBC News
Components of a Web Search Service
Components
• Web crawler
• Indexing system
• Search system
Considerations
• Economics
• Scalability
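As a rough illustration of how the three components fit together, here is a minimal sketch in Python; the URLs and page contents are invented, and a real service implements each component as a large distributed subsystem.

```python
# Minimal sketch of crawler, indexing system, and search system, using an
# invented in-memory "web" so it runs without network access.
import re
from collections import defaultdict

PAGES = {  # hypothetical pages standing in for crawled content
    "http://example.edu/a": "web search engines crawl and index pages",
    "http://example.edu/b": "an inverted index maps terms to pages",
}

def crawl(seed_urls):
    """Crawler component: fetch page text for each URL (here, a dict lookup)."""
    return {url: PAGES[url] for url in seed_urls if url in PAGES}

def build_index(corpus):
    """Indexing component: build an inverted index from term to URLs."""
    index = defaultdict(set)
    for url, text in corpus.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index

def search(index, query):
    """Search component: return URLs containing every query term."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

corpus = crawl(PAGES.keys())
index = build_index(corpus)
print(search(index, "inverted index"))   # {'http://example.edu/b'}
```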
Economic Models
Subscription
Monthly fee with logon provides unlimited access (introduced by InfoSeek).
Advertising
Access is free, with display advertisements (introduced by Lycos). Can lead to distortion of results to suit advertisers.
Licensing
The company's costs are covered by fees, licensing of software, and specialized services.
Cost Example (Google)
People
• 85 people; 50% technical, 14 with a Ph.D. in Computer Science
Equipment
• 2,500 Linux machines
• 80 terabytes of spinning disk
• 30 new machines installed daily
Reported by Larry Page, Google, March 2000. At that time, Google was handling 5.5 million searches per day, increasing at 20% per month. By fall 2002, Google had grown to over 400 people.
Indexing Goals: Precision
Short queries applied to very large numbers of items lead to large numbers of hits. Usability requires:
• Ranking hits in an order that fits the user's requirements
• Effective presentation: helpful summary records, removal of duplicates, grouping of results from a single site
Completeness of the index is not the most important factor.
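One of the usability steps above, grouping results from a single site, can be sketched as collapsing a ranked list by host so that one site does not dominate the first page; the ranked list below is invented for illustration.

```python
# Keep at most `per_site` hits per host, preserving rank order.
from urllib.parse import urlparse
from collections import defaultdict

ranked_hits = [  # hypothetical ranked result list
    "http://www.law.cornell.edu/topics/sports.html",
    "http://www.law.cornell.edu/topics/contracts.html",
    "http://www.espn.com/college-sports/",
    "http://www.law.cornell.edu/topics/antitrust.html",
]

def group_by_site(hits, per_site=1):
    seen = defaultdict(int)
    kept = []
    for url in hits:
        host = urlparse(url).netloc
        if seen[host] < per_site:
            kept.append(url)
            seen[host] += 1
    return kept

print(group_by_site(ranked_hits))
# ['http://www.law.cornell.edu/topics/sports.html', 'http://www.espn.com/college-sports/']
```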
Effective Information Retrieval
Comprehensive metadata with Boolean retrieval (e.g., a monograph catalog). Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.
Full-text indexing with ranked retrieval (e.g., news articles). Excellent for relatively homogeneous material, but requires that the full text be available.
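A minimal sketch of full-text indexing with ranked retrieval, using tf-idf weights and a simple document-length correction; the documents are invented, and production weighting schemes are considerably more elaborate.

```python
# Rank documents by summed tf * idf over query terms, divided by length.
import math
import re
from collections import Counter

docs = {  # hypothetical documents
    "d1": "sports law encompasses many areas of law",
    "d2": "news articles about college sports",
    "d3": "monograph catalog records with metadata",
}

def tokenize(text):
    return re.findall(r"\w+", text.lower())

doc_terms = {d: Counter(tokenize(t)) for d, t in docs.items()}
N = len(docs)
df = Counter()
for terms in doc_terms.values():
    df.update(terms.keys())

def idf(term):
    return math.log(N / df[term]) if df[term] else 0.0

def score(query, doc_id):
    terms = doc_terms[doc_id]
    length = sum(terms.values())          # crude length correction
    return sum(terms[t] * idf(t) for t in tokenize(query)) / length

query = "sports law"
ranking = sorted(docs, key=lambda d: score(query, d), reverse=True)
print(ranking)   # d1 ranks first for this query
```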
Effective Information Retrieval (cont.)
Full-text indexing with contextual information and ranked retrieval (e.g., Google). Excellent for mixed textual information with rich structure.
Contextual information (without the non-textual material itself) and ranked retrieval (e.g., Google image retrieval). Promising, but still experimental.
Google: Ranking
1. Paid advertisers
2. Manually created classification
3. Vector space ranking with corrections for document length
4. Extra weighting for specific fields, e.g., title, anchors, etc.
5. PageRank
The balance between 3, 4, and 5 is not made public.
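Item 5, PageRank, can be sketched as a power iteration over the link graph; the four-page graph below is invented, and the damping factor of 0.85 is the value commonly quoted for the original algorithm.

```python
# Power-iteration sketch of PageRank on a tiny invented link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:              # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))   # C, with the most in-links, ranks highest
```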
Usability: Dynamic Abstracts
Query: Cornell sports
LII: Law about... Sports... sports law: an overview. Sports Law encompasses a multitude of areas of law brought together in unique ways. Issues ... vocation. Amateur Sports. ... www.law.cornell.edu/topics/sports.html
Query: NCAA Tarkanian
LII: Law about... Sports ... purposes. See NCAA v. Tarkanian, 109 US 454 (1988). State action status may also be a factor in mandatory drug testing rules. On ... www.law.cornell.edu/topics/sports.html
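A query-biased ("dynamic") abstract can be sketched as choosing the window of page text that covers the most query terms; the page text below paraphrases the Cornell LII example and is not the exact snippet.

```python
# Pick the fixed-size word window containing the most query terms.
import re

def dynamic_abstract(text, query, window=12):
    words = text.split()
    q = {t.lower() for t in re.findall(r"\w+", query)}
    best_start, best_hits = 0, -1
    for start in range(max(1, len(words) - window + 1)):
        span = words[start:start + window]
        hits = sum(1 for w in span if re.sub(r"\W", "", w).lower() in q)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return "... " + " ".join(words[best_start:best_start + window]) + " ..."

page = ("Sports Law encompasses a multitude of areas of law brought together "
        "in unique ways. See NCAA v. Tarkanian for state action status.")
print(dynamic_abstract(page, "NCAA Tarkanian"))   # window around the citation
print(dynamic_abstract(page, "Cornell sports"))   # window around "Sports Law"
```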
Limitations of Web Crawling
• Time delay. Typically a monthly cycle. Crawlers are ineffective with sites that change rapidly, e.g., news.
• Pages not linked to. Crawlers find only those pages that are reachable by paths of links from their seeds.
• Depth of crawl. Crawlers do not index every page on a site (algorithms to avoid crawler traps).
but ...
Creators of information are increasingly organizing it to be accessible to web search services (e.g., Springer-Verlag).
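The depth-of-crawl limitation can be sketched as a breadth-first crawl with a depth bound and a visited set, which also defuses simple crawler traps (cycles); the link graph is invented, and a real crawler would fetch pages over HTTP with politeness rules.

```python
# Breadth-first crawl bounded by depth, over an invented link graph.
from collections import deque

LINK_GRAPH = {  # hypothetical site structure
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["seed"],      # cycle back to the seed
    "c": ["d"],
    "d": ["e"],         # beyond the depth limit below
    "e": [],
}

def crawl(seed, max_depth=2):
    visited = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                      # do not follow links any deeper
        for link in LINK_GRAPH.get(url, []):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order

print(crawl("seed"))   # ['seed', 'a', 'b', 'c'] -- 'd' and 'e' are too deep
```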
Scalability 10,000,000,000 1,000,000,000 100,000,000 10,000,000 1,000,000 100,000 10,000 1,000 100 10 1 1994 1997 2000 The growth of the web
Scalability
Web search services are centralized systems.
• Over the past 3-5 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra function.
• Will this continue?
• Possible areas for concern are telecommunications costs and disk access rates.
Case Study: Google
• Python with C/C++
• Linux
• Module-based architecture
• Multi-machine
• Multi-threaded
Performance
Storage
• Scales with the size of the Web
• Repository is comparatively small
• Good/fast compression and decompression
System
• Crawling, indexing, and sorting; the last two run simultaneously
Searching
• Bounded by disk I/O
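The storage point can be sketched with a generic compressor: pages are stored compressed so the repository stays comparatively small, and must decompress quickly at serving time. zlib stands in here as an assumption; the actual codec choice is a trade-off between ratio and speed.

```python
# Store a page compressed, then restore it for serving.
import zlib

page = b"<html><body>Web Search Engines: Google ...</body></html>" * 100

compressed = zlib.compress(page, 6)
restored = zlib.decompress(compressed)

assert restored == page
print(f"original: {len(page)} bytes, compressed: {len(compressed)} bytes")
```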
Image Search: indexing by contextual information only
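A sketch of indexing images by contextual information only: each image's index entry is built from its alt text and nearby caption text rather than from the pixels; the image URLs and text fields below are invented.

```python
# Build an inverted index for images from surrounding text, not pixels.
import re
from collections import defaultdict

images = [  # (image URL, alt text, nearby text) -- hypothetical values
    ("http://example.edu/clock.jpg", "McGraw Tower clock", "photo of the Cornell clock tower"),
    ("http://example.edu/lab.jpg", "server room", "racks of Linux machines"),
]

index = defaultdict(set)
for url, alt, context in images:
    for term in re.findall(r"\w+", (alt + " " + context).lower()):
        index[term].add(url)

print(index["clock"])   # found via contextual text, without analyzing the image
```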
Conclusion
Google:
• Scalable search engine
• Complete architecture
• Many research ideas arise from it
• Always something to improve
High-quality search is the dominant factor:
• precision
• presentation of results