Inside Internet Search Engines: Spidering and Indexing

Jan Pedersen and William Chang, SIGIR '99

Basic Architectures: Search
[Diagram: Web, Spider, Index, search engines (SE), Browser, Log. Annotations: 20M queries/day, Spam, Freshness, 24x7, Quality results, 800M pages?]

Basic Algorithm
• (1) Pick a URL from the pending queue and fetch it
• (2) Parse the document and extract hrefs
• (3) Place unvisited URLs on the pending queue
• (4) Index the document
• (5) Go to (1)
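
The five steps above can be sketched as a short loop. This is a minimal illustration, not the authors' spider: `FAKE_WEB`, `HrefExtractor`, and `crawl` are hypothetical names, the "web" is an in-memory dictionary so the sketch runs without network access, and the "index" is just a map from URL to raw text.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" (assumption, so no network is needed).
FAKE_WEB = {
    "http://a.example/": '<a href="/one">1</a> <a href="http://b.example/">b</a>',
    "http://a.example/one": '<a href="/">home</a>',
    "http://b.example/": "no links here",
}

class HrefExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def crawl(seed, fetch=FAKE_WEB.get):
    pending = deque([seed])   # pending queue
    seen = {seed}             # every URL ever queued
    index = {}                # url -> raw text, a stand-in for a real index
    while pending:
        url = pending.popleft()          # (1) pick a URL and fetch it
        doc = fetch(url)
        if doc is None:                  # unknown URL: nothing to index
            continue
        parser = HrefExtractor()
        parser.feed(doc)                 # (2) parse and extract hrefs
        for href in parser.hrefs:
            link = urljoin(url, href)    # resolve relative links
            if link not in seen:         # (3) queue unvisited URLs
                seen.add(link)
                pending.append(link)
        index[url] = doc                 # (4) index the document
    return index                         # (5) loop until the queue is empty

pages = crawl("http://a.example/")
print(sorted(pages))
```

A production spider never reaches an empty queue; here the loop stops only because the toy web is finite.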

Issues
• Queue maintenance determines behavior
• Depth vs breadth
• Spidering can be distributed
• but queues must be shared
• URLs must be revisited
• Status tracked in a database
• Revisit rate determines freshness
• SEs typically revisit every URL monthly
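
To illustrate how queue maintenance determines behavior, the sketch below runs the same loop over a small hypothetical link graph (`LINKS` is invented for the example) and changes only which end of the queue is popped. Taking from the front gives breadth-first order; taking from the back gives depth-first order.

```python
from collections import deque

# Hypothetical link graph: url -> outgoing links (assumption for illustration).
LINKS = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def visit_order(seed, breadth_first):
    pending = deque([seed])
    seen = {seed}
    order = []
    while pending:
        # FIFO pop -> breadth-first; LIFO pop -> depth-first.
        url = pending.popleft() if breadth_first else pending.pop()
        order.append(url)
        for link in LINKS[url]:
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return order

bfs_order = visit_order("root", breadth_first=True)
dfs_order = visit_order("root", breadth_first=False)
print(bfs_order)  # ['root', 'a', 'b', 'a1', 'a2', 'b1']
print(dfs_order)  # ['root', 'b', 'b1', 'a', 'a2', 'a1']
```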

Deduping
• Many URLs point to the same pages
• DNS aliasing
• Many pages are identical
• Site mirroring
• How big is my index, really?
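
One way to approach both problems is to canonicalize URLs before comparing them and to fingerprint page content. The sketch below is an assumption-laden illustration: `HOST_ALIASES` is a hypothetical hand-written table standing in for whatever a real spider learns from DNS, and exact SHA-1 matching of whitespace-normalized text catches only identical copies, not near-duplicates.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table (assumption); a real spider might derive it from DNS.
HOST_ALIASES = {"example.com": "www.example.com"}
DEFAULT_PORTS = {"http": 80, "https": 443}

def canonical_url(url):
    """Lowercase scheme and host, map aliases, drop default port and fragment."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))

def fingerprint(text):
    """Hash of the text with runs of whitespace collapsed."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def dedupe(pages):
    """pages: iterable of (url, text). Keeps one entry per URL and per content."""
    seen_urls, seen_content, kept = set(), set(), {}
    for url, text in pages:
        key = canonical_url(url)
        digest = fingerprint(text)
        if key in seen_urls or digest in seen_content:
            continue
        seen_urls.add(key)
        seen_content.add(digest)
        kept[key] = text
    return kept

fetched = [
    ("http://Example.com/index.html", "Hello  world"),
    ("http://www.example.com:80/index.html#top", "Hello world"),  # same URL
    ("http://mirror.example.org/index.html", "Hello world\n"),    # mirror
    ("http://www.example.com/other", "Different"),
]
unique = dedupe(fetched)
print(len(fetched), "fetched,", len(unique), "unique")
```

The gap between the two counts is the point of the slide's last question: the number of fetched URLs overstates the number of distinct documents.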

Smart Spidering
• Revisit rate based on modification history
• Rapidly changing documents visited more often
• Revisit queues divided by priority
• Acceptance criteria based on quality
• Only index quality documents
• Determined algorithmically
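
A revisit schedule driven by modification history might look like the sketch below. The halve-on-change, double-on-no-change rule and the bounds `MIN_DAYS` and `MAX_DAYS` are assumptions chosen for illustration; the slides do not say which policy any particular engine uses.

```python
import heapq

MIN_DAYS, MAX_DAYS = 1, 64   # assumed bounds on the revisit interval

def next_interval(current_days, changed):
    """Visit changing pages more often and stable pages less often."""
    if changed:
        return max(MIN_DAYS, current_days // 2)
    return min(MAX_DAYS, current_days * 2)

def schedule(history, start_days=8):
    """history: url -> list of bools (did the page change on each past visit?).

    Returns (interval_days, url) pairs, most urgent first.
    """
    heap = []
    for url, changes in history.items():
        interval = start_days
        for changed in changes:
            interval = next_interval(interval, changed)
        heapq.heappush(heap, (interval, url))   # priority = days until revisit
    return [heapq.heappop(heap) for _ in range(len(heap))]

plan = schedule({
    "news": [True, True, True],         # changed on every visit
    "archive": [False, False, False],   # never changed
    "blog": [True, False],
})
print(plan)  # [(1, 'news'), (8, 'blog'), (64, 'archive')]
```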

Spider Equilibrium
• URL queues do not increase in size
• New documents are discovered and indexed
• Spider keeps up with desired revisit rate
• Index drifts upward in size
• At equilibrium the index is Everyday Fresh
• As if every page were revisited every day
• Requires 10% daily revisit rates, on average
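
The 10% figure translates into a fetch throughput once an index size is fixed. The arithmetic below assumes the 800M-page figure that the architecture slide gives with a question mark; `required_fetch_rate` is a hypothetical helper, and the result is a rough order of magnitude, not a measured number.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def required_fetch_rate(index_pages, daily_revisit_fraction):
    """Return (fetches per day, fetches per second) for a given revisit rate."""
    per_day = index_pages * daily_revisit_fraction
    return per_day, per_day / SECONDS_PER_DAY

# Assumed inputs: 800M pages (slide figure, uncertain) and a 10% daily rate.
per_day, per_second = required_fetch_rate(800_000_000, 0.10)
print(f"{per_day:,.0f} fetches/day, about {per_second:,.0f} per second")
```

Under these assumptions the spider must sustain roughly 80 million fetches a day, or on the order of 900 per second, around the clock.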

Computational Constraints
• Equilibrium requires increasing resources
• Yet total disk space is a system constraint
• Strategies for dealing with space constraints
• Simple refresh: only revisit known URLs
• Prune URLs via stricter acceptance criteria
• Buy more disk

Special Collections
• Newswire
• Newsgroups
• Specialized services (Deja)
• Information extraction
• Shopping catalog
• Events; recipes, etc.

The Hidden Web
• Non-indexable content
• Behind passwords, firewalls
• Dynamic content
• Often searchable through local interface
• Network of distributed search resources
• How to access?
• Ask Jeeves!

Spam
• Manipulation of content to affect ranking
• Bogus meta tags
• Hidden text
• Jump pages tuned for each search engine
• Add URL is a spammer's tool
• 99% of submissions are spam
• It's an arms race

Representation
• For precision, indices must support phrases
• Phrases make best use of short queries
• The web is precision biased
• Document location also important
• Title vs summary vs body
• Meta tags offer a special challenge
• To index or not?
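
Phrase support usually means storing word positions alongside document ids. The sketch below is one minimal way to do that, assuming whitespace tokenization and lowercase matching; `build_positional_index` and `phrase_search` are hypothetical names, and no field weighting (title vs body) is attempted.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: doc_id -> text. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted doc ids in which the terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidates must contain every term somewhere.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        for start in index[terms[0]][doc_id]:
            # The i-th following term must sit at position start + i + 1.
            if all(start + i + 1 in positions for i, positions in enumerate(later)):
                hits.append(doc_id)
                break
    return hits

docs = {
    1: "new york city guide",
    2: "york is a new city",
    3: "moving to New York",
}
idx = build_positional_index(docs)
print(phrase_search(idx, "new york"))  # [1, 3]
```

Document 2 contains both words but not the phrase, which is the precision gain the slide refers to.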

Indexing Tricks
• Inverted indices are non-incremental
• Design for compactness and high-speed access
• Updated through merge with new indices
• Indices can be huge
• Minimize copying
• Use RAID for speed and reliability
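
Updating by merge can be pictured as combining two sets of sorted postings lists into a fresh index instead of editing the old one in place. In the sketch below, in-memory dictionaries stand in for on-disk index files, and `merge_indices` is a hypothetical name; real engines stream the merge from disk and compress the postings.

```python
import heapq

def merge_indices(old, new):
    """old, new: term -> sorted list of doc ids. Returns a new merged index."""
    merged = {}
    for term in sorted(old.keys() | new.keys()):
        postings = []
        # heapq.merge lazily interleaves two already-sorted sequences.
        for doc_id in heapq.merge(old.get(term, []), new.get(term, [])):
            if not postings or postings[-1] != doc_id:   # drop duplicates
                postings.append(doc_id)
        merged[term] = postings
    return merged

old_index = {"spider": [1, 4], "index": [2]}
new_index = {"spider": [4, 7], "queue": [5]}
merged_index = merge_indices(old_index, new_index)
print(merged_index)  # {'index': [2], 'queue': [5], 'spider': [1, 4, 7]}
```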

Truncation
• Search engines do not store all postings
• How could they?
• Tuned to return 10 good hits quickly
• Boolean queries evaluated conservatively
• Negation is a particular problem
• Some measurement methods depend on strong queries; how accurate can they be?
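
A toy example shows why negation suffers when postings are truncated. The scores, document ids, and the keep-the-best-scoring rule below are all assumptions made for illustration; the slides do not describe how any engine chooses which postings to drop.

```python
def truncate(postings, limit):
    """postings: list of (doc_id, score). Keep the `limit` best-scoring doc ids."""
    best = sorted(postings, key=lambda p: p[1], reverse=True)[:limit]
    return {doc_id for doc_id, _ in best}

def and_not(include, exclude):
    """Documents matching the first term but not the second."""
    return sorted(include - exclude)

# Hypothetical postings for two terms; document 3 contains both.
postings_a = [(1, 0.9), (2, 0.8), (3, 0.4)]
postings_b = [(5, 0.9), (6, 0.7), (3, 0.2)]

exact = and_not(truncate(postings_a, 3), truncate(postings_b, 3))
approx = and_not(truncate(postings_a, 3), truncate(postings_b, 2))
print(exact)   # [1, 2]
print(approx)  # [1, 2, 3]
```

Once the low-scoring posting for document 3 is dropped from the second list, the query "a NOT b" wrongly returns it, because the engine no longer knows the document contains the excluded term.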

The Role of NLP
• Many search engines do not stem
• Precision bias suggests conservative term treatment
• What about non-English documents?
• N-grams are popular for Chinese
• Language ID, anyone?
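
Character n-grams sidestep word segmentation, which is why they suit text written without spaces. The sketch below produces overlapping character bigrams; `char_ngrams` is a hypothetical helper, and the choice of n = 2 is an assumption, since the slide does not give a size.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Too short for a full n-gram: index whatever is there.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

bigrams = char_ngrams("搜索引擎")   # "search engine"
print(bigrams)  # ['搜索', '索引', '引擎']
```

Each bigram is indexed as if it were a term, and a query is split the same way, so matching needs no dictionary. The middle bigram here happens to be the word for "index", which also shows how overlapping n-grams can introduce spurious matches.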
