Challenges in Web Search
Amit Singhal
Web Search
• Crawl, Index, Search
• Crawl and Index
  • freshness
  • coverage (page selection, deep web)
• Search
  • adversarial IR, trust
  • evaluation
  • partitioning the query space
Crawl and Index
• Freshness
  • pages are deleted, created, changed
  • How to keep the index fresh?
• Coverage
  • which 2.5B pages to index?
  • a lot of useful information sits in databases
  • How to index “hidden” content?
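To make the three kinds of change on the freshness bullet concrete, here is a minimal Python sketch that compares two crawl snapshots and labels each URL as created, deleted, or changed. It is an illustration only, not the method described in the talk: the in-memory `old`/`new` dictionaries, the function names, and the use of a SHA-256 content hash as the change signal are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Set


def fingerprint(content: str) -> str:
    """Return a SHA-256 hex digest of the page content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_snapshots(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, Set[str]]:
    """Classify URLs as created, deleted, or changed between two crawls.

    `old` and `new` map URL -> page content. They are hypothetical
    in-memory snapshots; a real crawler would store fingerprints, not pages.
    """
    old_urls, new_urls = set(old), set(new)
    # A page counts as changed when it exists in both crawls
    # but its content fingerprint differs.
    changed = {
        url
        for url in old_urls & new_urls
        if fingerprint(old[url]) != fingerprint(new[url])
    }
    return {
        "created": new_urls - old_urls,
        "deleted": old_urls - new_urls,
        "changed": changed,
    }


if __name__ == "__main__":
    old = {"/a": "hello", "/b": "world", "/c": "gone soon"}
    new = {"/a": "hello", "/b": "world, edited", "/d": "brand new"}
    print(diff_snapshots(old, new))
```

An exact hash flags any byte-level difference, including changes a searcher would not care about (a rotating ad, a timestamp), so this only shows the bookkeeping, not a usable freshness policy.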
Search
• Adversarial IR
  • all useful signals are spammed
Search
• Trust
  • how much can we trust a site?
  • an article hosted at BBC is much more trustworthy than the same article hosted at yet-another-news-company.com
  • How trustworthy is a site, and how to use this information in ranking?
Search
• Evaluation
  • the collection changes continuously
  • relevant pages become non-relevant, and vice versa
  • can’t easily freeze a copy
  • relevance is a function of rendering
  • need all images, all redirects, CSS, …
  • linkage characteristics change over time
  • query space is huge (over 150M/day)
  • most popular query: 0.037%, 10th most popular: 0.011%
  • need a very large query set, expensive
  • How to evaluate given a changing collection and a very big query space?
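The query-share figures on this slide can be turned into absolute numbers with a little arithmetic. The sketch below uses only the three figures from the slide (150M queries/day, taken as a lower bound; 0.037% for the most popular query; 0.011% for the 10th). The function names are made up for the example, and the bounds rely on one assumption: that share does not increase as popularity rank goes down.

```python
DAILY_QUERIES = 150_000_000  # lower bound: the slide says "over 150M/day"
TOP_SHARE = 0.037            # percent of traffic, most popular query
TENTH_SHARE = 0.011          # percent of traffic, 10th most popular query


def daily_count(share_percent: float, total: int = DAILY_QUERIES) -> float:
    """Convert a percentage share of traffic into queries per day."""
    return total * share_percent / 100.0


def top10_share_bounds(top: float, tenth: float) -> tuple:
    """Bound the combined share (in percent) of the ten most popular queries.

    Ranks 2..9 are not given on the slide, so each is only known to lie
    between the 10th-rank share and the top share.
    """
    lower = top + 9 * tenth   # ranks 2..10 all as low as the 10th
    upper = 9 * top + tenth   # ranks 1..9 all as high as the 1st
    return lower, upper


if __name__ == "__main__":
    print(round(daily_count(TOP_SHARE)))    # queries/day for the top query
    print(round(daily_count(TENTH_SHARE)))  # queries/day for the 10th query
    print(top10_share_bounds(TOP_SHARE, TENTH_SHARE))
```

At 150M queries/day, the most popular query accounts for about 55,500 queries and the 10th for about 16,500. Under the stated assumption, the ten most popular queries together cover between roughly 0.136% and 0.344% of traffic, which is why a small hand-picked query set says little about overall quality.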
Search
• Ranking in a huge query space
  • specific methods work well for specific query types
  • e.g., strong proximity helps for people’s names
  • identify query type and use type-specific ranking algorithms
  • How to partition the query space into meaningful and useful partitions?
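The idea of identifying the query type and then applying a type-specific ranker can be sketched as a small dispatcher. Everything below is a toy built for illustration: the capitalization heuristic for spotting a person's name, the two scoring functions, and the names `classify_query`, `RANKERS`, and `rank` are assumptions, not the classifier or rankers used by any real engine.

```python
import itertools
import re
from typing import Callable, Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def classify_query(query: str) -> str:
    """Toy heuristic (assumption): 2-3 capitalized words look like a person's name."""
    words = query.split()
    is_name = 2 <= len(words) <= 3 and all(
        re.fullmatch(r"[A-Z][a-z]+", w) for w in words
    )
    return "person_name" if is_name else "general"


def proximity_score(query: str, doc: str) -> float:
    """Higher when all query terms occur close together in the document."""
    tokens = tokenize(doc)
    positions = [
        [i for i, tok in enumerate(tokens) if tok == term]
        for term in tokenize(query)
    ]
    if not positions or any(not p for p in positions):
        return 0.0
    # Brute-force smallest span covering one occurrence of every term;
    # acceptable for a toy example, too slow for long documents.
    best = min(max(c) - min(c) for c in itertools.product(*positions))
    return 1.0 / (1 + best)


def term_count_score(query: str, doc: str) -> float:
    """Baseline: total number of query-term occurrences in the document."""
    tokens = tokenize(doc)
    return float(sum(tokens.count(term) for term in tokenize(query)))


# Query type -> scoring function (hypothetical partition of the query space).
RANKERS: Dict[str, Callable[[str, str], float]] = {
    "person_name": proximity_score,
    "general": term_count_score,
}


def rank(query: str, docs: List[str]) -> List[str]:
    """Pick a scorer by query type, then sort documents by descending score."""
    scorer = RANKERS[classify_query(query)]
    return sorted(docs, key=lambda d: scorer(query, d), reverse=True)
```

For the query `Amit Singhal`, a document containing the two words side by side outranks one that mentions "Amit" and "Singhal" more often but as parts of different names, which is the behaviour the proximity bullet describes. The open question on the slide, how to choose the partitions themselves, is left untouched by this sketch.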
Web Search
• How to keep the index fresh?
• How to index “hidden” content?
• How trustworthy is a site, and how to use this information in ranking?
• How to evaluate given changing collection and a very big query space?
• How to partition the query space into meaningful and useful partitions?
• It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.
  Sir Arthur Conan Doyle (1859 - 1930)