MetaSearch
R. Manmatha, Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst
Introduction
• MetaSearch / distributed retrieval
  • A well-defined problem.
  • Language models are a good way to solve these problems.
• Grand challenge
  • Massively distributed, multi-lingual retrieval.
MetaSearch
• Combine results from different search engines.
  • A single database, or highly overlapping databases (e.g., the web).
  • Multiple databases or multi-lingual databases.
• Challenges
  • Incompatible scores, even if the same search engine is used for different databases.
  • Collection differences and engine differences.
  • Document scores depend on the query, so combining on a per-query basis makes training difficult.
• Current solutions involve learning how to map scores between different systems.
• An alternative approach aggregates ranks. (A score-normalization and fusion sketch follows below.)
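A minimal sketch of the score-mapping idea, assuming each engine returns a dictionary of raw document scores for the same query. The min-max normalization, the engine weights, and the example scores are illustrative assumptions, not the specific method discussed in the talk.

```python
# Hedged sketch: normalize each engine's raw scores onto [0, 1],
# then fuse them with a weighted linear combination.

def min_max_normalize(scores):
    """Map raw engine scores onto [0, 1] so engines become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def linear_combination(result_lists, weights=None):
    """Fuse several engines' results with a weighted sum of normalized scores."""
    fused = {}
    for i, scores in enumerate(result_lists):
        w = 1.0 if weights is None else weights[i]
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Example: two engines score overlapping documents for the same query.
engine_a = {"d1": 12.3, "d2": 8.7, "d3": 2.1}
engine_b = {"d2": 0.91, "d3": 0.85, "d4": 0.10}
print(linear_combination([engine_a, engine_b], weights=[0.6, 0.4]))
```

The per-query nature of the scores is exactly why the weights are hard to train; in practice they are often set uniformly or tuned on held-out queries.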
Current Solutions for MetaSearch – Single Database Case
• Solutions
  • Reasonable solutions map scores by simple normalization, by equalizing score distributions, or by training.
  • Rank-based methods, e.g., Borda counts and Markov chains. (A Borda-count sketch follows below.)
  • Mapped scores are usually combined using linear weighting.
  • Performance improvement is about 5 to 10%.
  • The search engines need to be similar in performance, which may explain why simple normalization schemes work.
• Other approaches
  • A Markov chain approach has been tried; however, results on standard datasets are not available for comparison.
  • It should not be difficult to try more standard language-model approaches.
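A minimal Borda-count sketch of rank-based fusion, assuming each engine returns an ordered list of document ids. The candidate pool and the convention of giving unranked documents zero points are assumptions for illustration.

```python
# Hedged sketch: each engine acts as a voter; higher-ranked documents
# receive more points, and points are summed across engines.

def borda_fuse(rankings):
    """Each engine awards n-1 points to its top document, n-2 to the next, etc."""
    candidates = {doc for ranking in rankings for doc in ranking}
    n = len(candidates)
    points = {doc: 0 for doc in candidates}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            points[doc] += n - 1 - rank   # documents an engine leaves unranked get no points
    return sorted(points.items(), key=lambda x: x[1], reverse=True)

# Example: three engines rank four documents for the same query.
print(borda_fuse([["d1", "d2", "d3"],
                  ["d2", "d1", "d4"],
                  ["d2", "d3", "d1"]]))
```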
Challenges – MetaSearch for Single Databases
• Can one effectively combine search engines that differ a lot in performance?
  • Can performance improve even when poorly performing engines are included? How?
  • Or use a resource-selection-like approach to eliminate poorly performing engines on a per-query basis.
• Techniques from other fields
  • Voter-aggregation techniques from economics and the social sciences may be useful (Borda count, Condorcet, etc.). (A Condorcet-style sketch follows below.)
• Language-model approaches
  • May improve performance by characterizing scores at a finer granularity than, say, score distributions.
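A minimal Condorcet-style sketch using the Copeland simplification: documents are ordered by how many pairwise-majority contests they win across the engines' rankings. This is an illustrative variant, not the specific aggregation method advocated in the talk.

```python
# Hedged sketch: pairwise-majority (Copeland-style) rank aggregation.
from itertools import combinations

def copeland_fuse(rankings):
    """Rank documents by the number of pairwise-majority wins over other candidates."""
    candidates = sorted({doc for r in rankings for doc in r})

    def prefers(ranking, a, b):
        # A document absent from a ranking is treated as ranked below all present ones.
        ra = ranking.index(a) if a in ranking else len(ranking)
        rb = ranking.index(b) if b in ranking else len(ranking)
        return ra < rb

    wins = {doc: 0 for doc in candidates}
    for a, b in combinations(candidates, 2):
        votes_a = sum(prefers(r, a, b) for r in rankings)
        votes_b = sum(prefers(r, b, a) for r in rankings)
        if votes_a > votes_b:
            wins[a] += 1
        elif votes_b > votes_a:
            wins[b] += 1
    return sorted(wins.items(), key=lambda x: x[1], reverse=True)

print(copeland_fuse([["d1", "d2", "d3"],
                     ["d2", "d1", "d4"],
                     ["d2", "d3", "d1"]]))
```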
Multiple Databases
• Two main factors determine variation in document scores:
  • Search engine scoring functions.
  • Collection variations, which essentially change the IDF.
• Effective score normalization requires
  • Disregarding databases that are unlikely to have the answer (resource selection).
  • Normalizing out collection variations on a per-query basis.
  • Mostly ad hoc normalizing functions are used today.
• Language models
  • Resource descriptions already provide language models for the collections.
  • These could be used to factor out collection variations. (A resource-selection sketch follows below.)
  • Doing this for different search engines is tricky.
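A minimal sketch of language-model resource selection: each collection is described by a smoothed unigram language model built from its resource description, and collections are ranked by the query's log-likelihood under those models. The toy term counts, the pooled background model, and the Jelinek-Mercer smoothing weight are assumptions for illustration.

```python
# Hedged sketch: rank collections by query likelihood under their language models.
import math

def collection_lm(term_counts):
    total = sum(term_counts.values())
    return {t: c / total for t, c in term_counts.items()}

def query_log_likelihood(query_terms, coll_model, background, lam=0.7):
    """Jelinek-Mercer smoothed log P(query | collection)."""
    score = 0.0
    for t in query_terms:
        p = lam * coll_model.get(t, 0.0) + (1 - lam) * background.get(t, 1e-9)
        score += math.log(p)
    return score

# Toy resource descriptions: term counts gathered from two collections.
collections = {
    "news":    {"election": 120, "market": 40, "genome": 1},
    "biology": {"genome": 90, "protein": 70, "election": 2},
}
models = {name: collection_lm(tc) for name, tc in collections.items()}

# Background model pooled over all collections.
pooled = {}
for tc in collections.values():
    for t, c in tc.items():
        pooled[t] = pooled.get(t, 0) + c
background = collection_lm(pooled)

query = ["genome", "protein"]
ranking = sorted(models,
                 key=lambda n: query_log_likelihood(query, models[n], background),
                 reverse=True)
print(ranking)  # collections most likely to contain the answer come first
```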
Multi-lingual Databases
• Normalizing scores across multiple databases in different languages is a difficult problem.
• One possibility (sketched below):
  • Create language models for each database.
  • Use simple translation models to map across databases.
  • Use this mapping to normalize scores.
• Still difficult.
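A minimal sketch of the translation-model step: a simple lexical translation table maps a query language model into another database's language before scoring, so that scores from databases in different languages become more comparable. The translation probabilities and vocabulary here are illustrative assumptions.

```python
# Hedged sketch: map a query language model across languages
# with a simple lexical translation table.

def translate_query_model(query_model, translation_table):
    """P(f | query) = sum over e of P(f | e) * P(e | query)."""
    translated = {}
    for e, p_e in query_model.items():
        for f, p_f_given_e in translation_table.get(e, {}).items():
            translated[f] = translated.get(f, 0.0) + p_f_given_e * p_e
    return translated

# Toy English query model and an English-to-French translation table.
query_model = {"river": 0.5, "bank": 0.5}
translation_table = {
    "river": {"riviere": 0.9, "fleuve": 0.1},
    "bank":  {"banque": 0.7, "rive": 0.3},
}
print(translate_query_model(query_model, translation_table))
```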
Distributed Web Search
• Distribute web search over multiple sites/servers.
  • Localized / regional.
  • Domain dependent.
  • Possibly no central coordination.
• Server selection / database selection, with or without explicit queries.
• Research issues
  • Partial representations of the world.
  • Trust and reliability.
  • Peer-to-peer.
Challenges
• Formal methods for resource descriptions, ranking, and combination.
  • Example: language modeling.
  • Beyond treating collections as big documents.
• Multi-lingual retrieval
  • Combining the outputs of systems searching databases in many languages.
• Peer-to-peer systems
  • Beyond broadcasting simple keyword searches.
  • Non-centralized.
  • Networking considerations, e.g., availability, latency, transfer time.
• Distributed web search
  • Data, web data.