Search • Stephen Robertson • Microsoft Research Cambridge • Moscow
MSR Cambridge • Andrew Herbert, Director • Cambridge Laboratory … • External Research Office – Stephen Emmott
MSR Cambridge • Systems & Networking – Peter Key • Operating Systems • Networking • Distributed Computing • Machine Learning & Perception – Christopher Bishop • Machine Learning • Computer Vision • Information Retrieval
MSR Cambridge • Programming Principles & Tools – Luca Cardelli • Programming Principles & Tools • Security • Computer-Mediated Living – Ken Wood • Human Computer Interaction • Ubiquitous Computing • Sensors and Devices • Integrated Systems
Search: a bit of history People sometimes assume that G**gle invented search … but of course this is false • Library catalogues • Scientific abstracts • Printed indexes • The 1960s to 80s: Boolean search • Free text queries and ranking – a long gestation • The web
Web search • The technology • Crawling • Indexing • Ranking • Efficiency and effectiveness • The business • Integrity of search • UI, speed • Ads • Ad ranking • Payment for clickthrough
Other search environments • Within-site • Specialist databases • Enterprise/intranet • Desktop
How search engines work • Crawl a lot of documents • Create a vast index • Every word in every document • Point to where it occurred • Allow documents to inherit additional text • From the URL • From anchors in other documents… • Index this as well • Also gather static information
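As a rough illustration of the indexing step just described, here is a minimal inverted-index builder in Python. The tokeniser, the input format and the data structures are illustrative assumptions only; a real index is far more compact and also carries the inherited text and static information mentioned above.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Crude illustrative tokeniser: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """documents: dict of doc_id -> full text (body plus any inherited
    text, e.g. from the URL or from anchors in other documents)."""
    postings = defaultdict(lambda: defaultdict(list))  # term -> doc_id -> positions
    doc_lengths = {}
    for doc_id, text in documents.items():
        terms = tokenize(text)
        doc_lengths[doc_id] = len(terms)
        for position, term in enumerate(terms):
            postings[term][doc_id].append(position)  # "point to where it occurred"
    return postings, doc_lengths
```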
How search engines work Given a query: • Look up each query word in the index • Throw all this information at the ranker Ranker: a computing engine which calculates a score for each document, and identifies the top n scoring documents. The score depends on a whole variety of features, and may include static information
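A matching sketch of the query side, assuming the postings structure from the indexing sketch above; score_fn and static_scores are hypothetical stand-ins for the ranker's features and static information, not any engine's actual interface.

```python
import heapq
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def top_n(query, postings, score_fn, static_scores, n=10):
    """Look up each query word, pool the candidate documents, score them,
    and return the n highest-scoring ones."""
    query_terms = tokenize(query)
    candidates = set()
    for term in query_terms:
        candidates |= set(postings.get(term, {}))
    scored = (
        (score_fn(query_terms, doc_id, postings) + static_scores.get(doc_id, 0.0), doc_id)
        for doc_id in candidates
    )
    return heapq.nlargest(n, scored)
```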
A core challenge: ranking • What features might be useful? • Features of the query-document pair • Features of the document • Maybe features of the query • Simple / transformed / compound • Combining features • Formulae • Weights and other free parameters • Tuning / training / learning
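One common way of combining features is a simple weighted sum, where the weights are exactly the free parameters to be tuned. The feature names below are invented for illustration, not drawn from any particular ranker.

```python
def combined_score(features, weights):
    # Weighted linear combination of feature values; real rankers may also
    # transform features (e.g. take logs) or build compound features first.
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical example: these weights are the free parameters that tuning adjusts.
weights  = {"bm25_body": 1.0, "bm25_anchor": 0.7, "static_rank": 0.3}
features = {"bm25_body": 4.2, "bm25_anchor": 1.1, "static_rank": 0.5}
score = combined_score(features, weights)
```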
Ranking algorithms • Based on probabilistic models • we are trying to predict relevance • … plus a little linguistic analysis • but this is secondary to the statistics • … plus a great deal of know-how, experience, experiment • Need: • Evidence from all possible sources • … combined appropriately
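As one concrete example of the kind of probabilistically motivated formula the slide refers to, here is a BM25-style term weight; a sketch of the general shape, not any particular engine's implementation.

```python
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """One query term's contribution to a BM25-style document score.
    tf: term frequency in the document; df: number of documents
    containing the term; k1 and b are the usual free parameters."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf
```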
Evaluation • User queries • Relevance judgements • by humans • yes-no or multilevel • Evaluation measures • How to evaluate a ranking? • Only the top end matters • Various different measures in use • Public bake-offs • TREC etc.
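One widely used measure that rewards putting relevant documents near the top is normalised discounted cumulative gain (NDCG); a minimal sketch, assuming graded human judgements:

```python
import math

def dcg_at_k(grades, k=10):
    # grades: human relevance judgements (0 = not relevant) in ranked order.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(grades[:k]))

def ndcg_at_k(grades, k=10):
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0
```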
Using evaluation data for training • Task: to optimise a set of parameters • E.g. weights of features • Optimisation is potentially very powerful • Can make a huge difference to effectiveness • But there are challenges…
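A deliberately crude sketch of what "using evaluation data for training" means: try candidate weight settings on judged queries and keep the best. rank_fn, metric_fn and the weight grid are assumed inputs; the real optimisation methods discussed next are far more sophisticated than this exhaustive search.

```python
def tune_weights(weight_grid, queries, judgements, rank_fn, metric_fn):
    """Exhaustive search over candidate weight settings, keeping the one
    with the best average evaluation score on the judged queries."""
    best_weights, best_score = None, float("-inf")
    for weights in weight_grid:
        scores = []
        for query in queries:
            ranking = rank_fn(query, weights)                    # ranked doc ids
            grades = [judgements[query].get(d, 0) for d in ranking]
            scores.append(metric_fn(grades))
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_weights, best_score = weights, avg
    return best_weights, best_score
```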
Challenge 1: Optimisation methods • Training is something of a black art • Not easy to write recipes for • Much work currently on optimisation methods • Some of it coming from the machine learning community
Challenge 2: a tradeoff • Many features require many parameters • From a machine learning point of view, the more the better • Many parameters mean a great deal of training • Human relevance judgements are expensive
Challenge 3: How specific? • How much does the environment matter? • Different features • E.g. characteristics of documents, file types, linkage, statistical properties… • Different kinds of queries • Or different mixes of the same kinds • Different factors affecting relevance • Access constraints • …
Challenge 3: How specific? • And if it does matter… How to train for the specific environment? • Web search: huge training effort • Enterprise: some might be feasible • Desktop: unlikely • Within-site / specialist databases: some might be feasible
Looking for alternatives If training is difficult… Some other possibilities: • Robustness – parameters with stable optima (probably means fewer features) • Training tool-kits (but remember the black art) • Auto-training – a system that trains itself on the basis of clickthrough (a long-term prospect)
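To make the auto-training idea slightly more concrete, a very rough sketch of turning a click log into implicit relevance labels that could then feed the tuning step above. The log format is an assumption, and real clickthrough data is heavily biased by position and presentation.

```python
def clicks_to_labels(click_log):
    """click_log: iterable of (query, ranked_doc_ids, clicked_doc_ids).
    Clicked results become noisy positives; unclicked ones default to 0."""
    labels = {}
    for query, ranking, clicked in click_log:
        per_query = labels.setdefault(query, {})
        for doc_id in ranking:
            if doc_id in clicked:
                per_query[doc_id] = max(per_query.get(doc_id, 0), 1)
            else:
                per_query.setdefault(doc_id, 0)
    return labels
```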
A little about Microsoft • Web search: MSN takes on Google and Yahoo • New search engine is closing the gap • Some MSRC input • Enterprise search: MS Search and SharePoint • New version is on its way • Much MSRC input • Desktop: also MS Search
Final thoughts • Search has come a long way since the library card catalogue • … but it is by no means a done deal • This is a very active field • both academically and commercially. I confidently expect that it will change as much in the next 16 years as it has since 1990