This lecture discusses the challenges and strategies of web search engines, economic models, components of a web search service, and considerations for effective information retrieval. It also includes a case study on Google's architecture and performance.
CS 430: Information Discovery
Lecture 18
Web Search Engines: Google
Web Search
Goal
Provide information discovery for large amounts of open access material on the web.
Challenges
• Volume of material -- several billion items, growing steadily
• Items created dynamically or held in databases
• Great variety -- length, formats, quality control, purpose, etc.
• Inexperience of users -- wide range of needs
• Economic models to pay for the service
Strategies
Subject hierarchies
• Yahoo! -- use of human indexing
Web crawling + automatic indexing
• General -- Google, AltaVista, Ask Jeeves, NorthernLight, ...
• Subject-based -- Psychcrawler, PoliticalInformation.Com, Inomics.Com, ...
Mixed models
• Human-directed web crawling and automatic indexing -- BBC News
Components of a Web Search Service
Components
• Web crawler
• Indexing system
• Search system
Considerations
• Economics
• Scalability
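As a rough illustration of how the three components fit together, here is a minimal sketch in Python; the URLs and page contents are invented, and a real service implements each component as a large distributed subsystem.

```python
# Minimal sketch of crawler, indexing system, and search system, using an
# invented in-memory "web" so it runs without network access.
import re
from collections import defaultdict

PAGES = {  # hypothetical pages standing in for crawled content
    "http://example.edu/a": "web search engines crawl and index pages",
    "http://example.edu/b": "an inverted index maps terms to pages",
}

def crawl(seed_urls):
    """Crawler component: fetch page text for each URL (here, a dict lookup)."""
    return {url: PAGES[url] for url in seed_urls if url in PAGES}

def build_index(corpus):
    """Indexing component: build an inverted index from term to URLs."""
    index = defaultdict(set)
    for url, text in corpus.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index

def search(index, query):
    """Search component: return URLs containing every query term."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

corpus = crawl(PAGES.keys())
index = build_index(corpus)
print(search(index, "inverted index"))   # {'http://example.edu/b'}
```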
Economic Models
Subscription
Monthly fee with logon provides unlimited access (introduced by InfoSeek).
Advertising
Access is free, with display advertisements (introduced by Lycos). Can lead to distortion of results to suit advertisers.
Licensing
The company's costs are covered by fees, licensing of software, and specialized services.
Cost Example (Google)
People
• 85 people; 50% technical, 14 with a Ph.D. in Computer Science
Equipment
• 2,500 Linux machines
• 80 terabytes of spinning disk
• 30 new machines installed daily
Reported by Larry Page, Google, March 2000. At that time, Google was handling 5.5 million searches per day, increasing at 20% per month. By fall 2002, Google had grown to over 400 people.
Indexing Goals: Precision
Short queries applied to very large numbers of items lead to large numbers of hits. Usability requires:
• Ranking hits in an order that fits the user's requirements
• Effective presentation: helpful summary records, removal of duplicates, grouping of results from a single site
Completeness of the index is not the most important factor.
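One of the usability steps above, grouping results from a single site, can be sketched as collapsing a ranked list by host so that one site does not dominate the first page; the ranked list below is invented for illustration.

```python
# Keep at most `per_site` hits per host, preserving rank order.
from urllib.parse import urlparse
from collections import defaultdict

ranked_hits = [  # hypothetical ranked result list
    "http://www.law.cornell.edu/topics/sports.html",
    "http://www.law.cornell.edu/topics/contracts.html",
    "http://www.espn.com/college-sports/",
    "http://www.law.cornell.edu/topics/antitrust.html",
]

def group_by_site(hits, per_site=1):
    seen = defaultdict(int)
    kept = []
    for url in hits:
        host = urlparse(url).netloc
        if seen[host] < per_site:
            kept.append(url)
            seen[host] += 1
    return kept

print(group_by_site(ranked_hits))
# ['http://www.law.cornell.edu/topics/sports.html', 'http://www.espn.com/college-sports/']
```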
Effective Information Retrieval
Comprehensive metadata with Boolean retrieval (e.g., a monograph catalog). Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.
Full-text indexing with ranked retrieval (e.g., news articles). Excellent for relatively homogeneous material, but requires that the full text be available.
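A minimal sketch of full-text indexing with ranked retrieval, using tf-idf weights and a simple document-length correction; the documents are invented, and production weighting schemes are considerably more elaborate.

```python
# Rank documents by summed tf * idf over query terms, divided by length.
import math
import re
from collections import Counter

docs = {  # hypothetical documents
    "d1": "sports law encompasses many areas of law",
    "d2": "news articles about college sports",
    "d3": "monograph catalog records with metadata",
}

def tokenize(text):
    return re.findall(r"\w+", text.lower())

doc_terms = {d: Counter(tokenize(t)) for d, t in docs.items()}
N = len(docs)
df = Counter()
for terms in doc_terms.values():
    df.update(terms.keys())

def idf(term):
    return math.log(N / df[term]) if df[term] else 0.0

def score(query, doc_id):
    terms = doc_terms[doc_id]
    length = sum(terms.values())          # crude length correction
    return sum(terms[t] * idf(t) for t in tokenize(query)) / length

query = "sports law"
ranking = sorted(docs, key=lambda d: score(query, d), reverse=True)
print(ranking)   # d1 ranks first for this query
```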
Effective Information Retrieval (cont.)
Full-text indexing with contextual information and ranked retrieval (e.g., Google). Excellent for mixed textual information with rich structure.
Contextual information (without the non-textual material itself) and ranked retrieval (e.g., Google image retrieval). Promising, but still experimental.
Google: Ranking
1. Paid advertisers
2. Manually created classification
3. Vector space ranking with corrections for document length
4. Extra weighting for specific fields, e.g., title, anchors, etc.
5. PageRank
The balance between 3, 4, and 5 is not made public.
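Item 5, PageRank, can be sketched as a power iteration over the link graph; the four-page graph below is invented, and the damping factor of 0.85 is the value commonly quoted for the original algorithm.

```python
# Power-iteration sketch of PageRank on a tiny invented link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:              # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))   # C, with the most in-links, ranks highest
```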
Usability: Dynamic Abstracts
Query: Cornell sports
LII: Law about... Sports... sports law: an overview. Sports Law encompasses a multitude of areas of law brought together in unique ways. Issues ... vocation. Amateur Sports. ... www.law.cornell.edu/topics/sports.html
Query: NCAA Tarkanian
LII: Law about... Sports ... purposes. See NCAA v. Tarkanian, 109 US 454 (1988). State action status may also be a factor in mandatory drug testing rules. On ... www.law.cornell.edu/topics/sports.html
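A query-biased ("dynamic") abstract can be sketched as choosing the window of page text that covers the most query terms; the page text below paraphrases the Cornell LII example and is not the exact snippet.

```python
# Pick the fixed-size word window containing the most query terms.
import re

def dynamic_abstract(text, query, window=12):
    words = text.split()
    q = {t.lower() for t in re.findall(r"\w+", query)}
    best_start, best_hits = 0, -1
    for start in range(max(1, len(words) - window + 1)):
        span = words[start:start + window]
        hits = sum(1 for w in span if re.sub(r"\W", "", w).lower() in q)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return "... " + " ".join(words[best_start:best_start + window]) + " ..."

page = ("Sports Law encompasses a multitude of areas of law brought together "
        "in unique ways. See NCAA v. Tarkanian for state action status.")
print(dynamic_abstract(page, "NCAA Tarkanian"))   # window around the citation
print(dynamic_abstract(page, "Cornell sports"))   # window around "Sports Law"
```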
Limitations of Web Crawling
• Time delay. Typically a monthly cycle. Crawlers are ineffective with sites that change rapidly, e.g., news.
• Pages not linked to. Crawlers find only those pages that are reachable by paths of links from their seeds.
• Depth of crawl. Crawlers do not index every page on a site (algorithms to avoid crawler traps).
but ...
Creators of information are increasingly organizing it to be accessible to web search services (e.g., Springer-Verlag).
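The depth-of-crawl limitation can be sketched as a breadth-first crawl with a depth bound and a visited set, which also defuses simple crawler traps (cycles); the link graph is invented, and a real crawler would fetch pages over HTTP with politeness rules.

```python
# Breadth-first crawl bounded by depth, over an invented link graph.
from collections import deque

LINK_GRAPH = {  # hypothetical site structure
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["seed"],      # cycle back to the seed
    "c": ["d"],
    "d": ["e"],         # beyond the depth limit below
    "e": [],
}

def crawl(seed, max_depth=2):
    visited = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                      # do not follow links any deeper
        for link in LINK_GRAPH.get(url, []):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order

print(crawl("seed"))   # ['seed', 'a', 'b', 'c'] -- 'd' and 'e' are too deep
```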
Scalability 10,000,000,000 1,000,000,000 100,000,000 10,000,000 1,000,000 100,000 10,000 1,000 100 10 1 1994 1997 2000 The growth of the web
Scalability
Web search services are centralized systems.
• Over the past 3-5 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra function.
• Will this continue?
• Possible areas for concern are telecommunications costs and disk access rates.
Case Study: Google
• Python with C/C++
• Linux
• Module-based architecture
• Multi-machine
• Multi-threaded
Performance
Storage
• Scales with the size of the Web
• Repository is comparatively small
• Good/fast compression and decompression
System
• Crawling, indexing, and sorting; the last two run simultaneously
Searching
• Bounded by disk I/O
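The storage point can be sketched with a generic compressor: pages are stored compressed so the repository stays comparatively small, and must decompress quickly at serving time. zlib stands in here as an assumption; the actual codec choice is a trade-off between ratio and speed.

```python
# Store a page compressed, then restore it for serving.
import zlib

page = b"<html><body>Web Search Engines: Google ...</body></html>" * 100

compressed = zlib.compress(page, 6)
restored = zlib.decompress(compressed)

assert restored == page
print(f"original: {len(page)} bytes, compressed: {len(compressed)} bytes")
```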
Image Search: indexing by contextual information only
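A sketch of indexing images by contextual information only: each image's index entry is built from its alt text and nearby caption text rather than from the pixels; the image URLs and text fields below are invented.

```python
# Build an inverted index for images from surrounding text, not pixels.
import re
from collections import defaultdict

images = [  # (image URL, alt text, nearby text) -- hypothetical values
    ("http://example.edu/clock.jpg", "McGraw Tower clock", "photo of the Cornell clock tower"),
    ("http://example.edu/lab.jpg", "server room", "racks of Linux machines"),
]

index = defaultdict(set)
for url, alt, context in images:
    for term in re.findall(r"\w+", (alt + " " + context).lower()):
        index[term].add(url)

print(index["clock"])   # found via contextual text, without analyzing the image
```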
Conclusion
Google:
• Scalable search engine
• Complete architecture
• Many research ideas arise from it
• Always something to improve
High-quality search is the dominant factor:
• precision
• presentation of results