Inside Internet Search Engines: Fundamentals

Presentation Transcript


  1. Inside Internet Search Engines: Fundamentals. Jan Pedersen and William Chang. SIGIR '99

  2. Outline
     • Basic Architectures
       • Search
       • Directory
     • Term definitions: spidering, indexing, etc.
     • Business model

  3. Basic Architectures: Search (diagram)
     • Components: Web, Spider, Index, SE, Browser, Log
     • Annotations: 20M queries/day; Spam; Freshness; 24x7; Quality results; 800M pages?

  4. Basic Architectures: Directory (diagram)
     • Components: Web, Ontology, SE, Browser
     • Annotations: Url submission; Surfing; Reviewed Urls

  5. Spidering
     • Web HTML data
       • Hyperlinked
       • Directed, disconnected graph
       • Dynamic and static data
       • Estimated 800M indexable pages
     • Freshness
       • How often are pages revisited?
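
A minimal sketch of what spidering a directed, disconnected graph implies: a breadth-first traversal from seed URLs reaches only pages that some chain of links points to. The `crawl` function and the `toy_web` dictionary are hypothetical illustrations, not taken from the slides. A real spider would fetch pages over HTTP and extract links, where this sketch looks them up in a dictionary.

```python
from collections import deque


def crawl(link_graph, seeds):
    """Breadth-first traversal of a directed link graph from seed URLs.

    link_graph maps a URL to the list of URLs it links to (a toy stand-in
    for fetching a page and extracting its hyperlinks).
    Returns the URLs in the order they were visited.
    """
    visited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        visited.append(url)
        for target in link_graph.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical toy web: "d" links out, but no page links to "d".
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["a"],
}
reached = crawl(toy_web, ["a"])
```

Starting from `"a"`, the traversal never discovers `"d"`, because links are directed and nothing points to it. This is one reason the slides pair spidering with URL submission in the directory architecture.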

  6. Indexing
     • Size
       • From 50 to 150M URLs
       • 50 to 100% indexing overhead
       • 200 to 400 GB indices
     • Representation
       • Fields, meta-tags and content
       • NLP: stemming?
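
To make the "fields, meta-tags and content" representation concrete, here is a small sketch of a fielded inverted index, assuming whitespace tokenization and no stemming. The `build_index` function, the field names, and the sample documents are hypothetical. The slides do not describe any particular engine's index layout.

```python
from collections import defaultdict


def build_index(docs):
    """Build a fielded inverted index.

    docs maps a document id to a dict of field name -> text.
    The result maps (field, term) to the set of document ids
    whose field contains that term.
    """
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)].add(doc_id)
    return index


# Hypothetical sample documents with two fields each.
docs = {
    1: {"title": "Search Engines", "content": "spidering and indexing the web"},
    2: {"title": "Web Directory", "content": "editors review submitted urls"},
}
index = build_index(docs)
title_hits = index[("title", "web")]      # documents with "web" in the title
content_hits = index[("content", "web")]  # documents with "web" in the body
```

Keeping the field in the key lets a ranker weight a title match differently from a body match, at the cost of a larger index.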

  7. Search
     • Augmented vector-space
       • Ranked results with Boolean filtering
     • Quality-based reranking
       • Based on hyperlink data
       • or user behavior
     • Spam
       • Manipulation of content to improve placement
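
The three ideas on this slide can be sketched together: a Boolean AND filter, a simplified tf-idf score with length normalization standing in for the vector-space model, and a multiplicative quality factor for reranking. The `rank` function, the scoring formula, and the `quality` values are assumptions made for illustration. The slides do not specify how any engine computes relevance or quality.

```python
import math
from collections import Counter


def rank(query, docs, quality):
    """Return doc ids ordered by (relevance * quality), best first.

    docs maps doc id -> text; quality maps doc id -> a hypothetical
    quality factor (for example derived from hyperlink data).
    """
    terms = query.lower().split()
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}

    # Document frequency of each term across the collection.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    results = []
    for d, tokens in tokenized.items():
        # Boolean filter: keep only documents containing every query term.
        if not all(t in tokens for t in terms):
            continue
        tf = Counter(tokens)
        # Simplified tf-idf dot product, normalized by document length.
        score = sum(tf[t] * math.log(1 + n / df[t]) for t in terms)
        norm = math.sqrt(sum(c * c for c in tf.values()))
        results.append((score / norm * quality.get(d, 1.0), d))

    results.sort(reverse=True)
    return [d for _, d in results]


# Hypothetical pages: p2 repeats keywords (spam-like) and has low quality.
pages = {
    "p1": "web search engine",
    "p2": "web search web search spam spam",
    "p3": "directory of reviewed urls",
}
page_quality = {"p1": 1.0, "p2": 0.1}
ordered = rank("web search", pages, page_quality)
```

Here `p3` is removed by the Boolean filter, and the low quality factor pushes the keyword-stuffed `p2` below `p1`.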

  9. Queries
     • Short expressions of information need
       • 2.3 words on average
     • Relevance overload is a key issue
       • Users typically only view top results
     • Search is a high volume business
       • Yahoo! 50M queries/day
       • Excite 30M queries/day
       • Infoseek 15M queries/day

  10. Directory
     • Manual categorization and rating
     • Labor intensive
       • 20 to 50 editors
     • High quality, but low coverage
       • 200-500K URLs
     • Browsable ontology
     • Open Directory is a distributed solution

  12. Hybrid Services
     • Query is used for navigation
       • Directory placement
       • Recommended
     • Point of integration
       • Multiple data sources
       • Web, News, Shopping, Community, etc.

  14. Business Model
     • Advertising
       • Highly targeted, based on query
       • Keyword selling: between $3 and $25 CPM
     • Cost per query is critical
       • Between $0.50 and $1.00 per thousand
     • Distribution
       • Many portals outsource search
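
The figures on this slide can be combined into a rough margin per thousand queries. This is a back-of-the-envelope sketch, not a model from the talk: `margin_per_thousand_queries`, `sell_through` (the fraction of queries that carry a sold ad), and `ads_per_query` are hypothetical names and assumptions. Only the CPM and cost ranges come from the slide.

```python
def margin_per_thousand_queries(cpm, cost_per_thousand,
                                sell_through=1.0, ads_per_query=1):
    """Ad revenue minus serving cost, in dollars per thousand queries.

    cpm: price in dollars per thousand ad impressions.
    cost_per_thousand: serving cost in dollars per thousand queries.
    sell_through: assumed fraction of queries with a sold ad (0.0 to 1.0).
    ads_per_query: assumed number of ad impressions shown per query.
    """
    revenue = cpm * ads_per_query * sell_through
    return revenue - cost_per_thousand


# Worst case on the slide's ranges: lowest CPM, highest cost.
low = margin_per_thousand_queries(cpm=3.0, cost_per_thousand=1.0)
# Best case on the slide's ranges: highest CPM, lowest cost.
high = margin_per_thousand_queries(cpm=25.0, cost_per_thousand=0.5)
# With only a quarter of queries sold at the lowest CPM, margin turns negative.
partial = margin_per_thousand_queries(cpm=3.0, cost_per_thousand=1.0,
                                      sell_through=0.25)
```

Under these assumptions the margin ranges from $2.00 to $24.50 per thousand queries when every query carries a sold ad, and it falls below zero at a 25% sell-through with the lowest CPM. That sensitivity is consistent with the slide's point that cost per query is critical.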

  15. Basic Problem
     • Provide the highest quality search at the lowest possible cost
     • More traffic is better
       • More ad impressions
     • Targetable queries are better
       • Not all keywords are sold

  16. Web Resources
     • Search Engine Watch
       • www.searchenginewatch.com
     • “Analysis of a Very Large Alta Vista Query Log”; Silverstein et al.
       • SRC Tech note 1998-014
       • www.research.digital.com/SRC

  17. Web Resources
     • “The Anatomy of a Large-Scale Hypertextual Web Search Engine”; Brin and Page
       • google.stanford.edu/long321.htm
     • WWW conferences
       • www8.org
