Metasearch Thanks to Eric Glover
Outline • The general search tool • Three C’s of Metasearch and other important issues • Metasearch engine architecture • Current metasearch projects • Advanced metasearch
A generic search tool [Diagram: Interface, Database, and Ordering Policy] • Query entered into an interface • Applied to a database of content • Results are scored/ordered • Displayed through the interface
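To make the pipeline concrete, here is a minimal sketch of a generic search tool; the scoring function and the sample database are toy inventions for illustration.

```python
# Minimal sketch of the generic pipeline: query -> database ->
# ordering policy -> display. The scoring function is a toy.

def score(query, doc):
    """Toy ordering policy: count query-term occurrences in the document."""
    terms = query.lower().split()
    words = doc.lower().split()
    return sum(words.count(t) for t in terms)

def search(query, database):
    """Score every document against the query and return them best-first."""
    return sorted(database, key=lambda doc: score(query, doc), reverse=True)

database = ["metasearch engines combine results from many services",
            "a search engine indexes pages on the web",
            "an auction site with proprietary content"]
for doc in search("search engine", database):
    print(doc)
```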
Why do metasearch? [Diagram: three separate search tools, each with its own Interface (1, 2, 3), Database (1, 2, 3), and Ordering Policy (1, 2, 3)]
Definition The word meta comes from the Greek meaning “situated beyond, transcending.” A metasearch tool is a tool which “transcends” the offerings of a single service by allowing you to access many different Web search tools from one site. More specifically, a metasearch tool permits you to launch queries to multiple search services via a common interface. From Carleton College Library Note: Metasearch is not necessarily limited to the Web
The three C’s of metasearch • Coverage - How much of the total content is accessible from a single interface • Consistency - One interface, resistant to single-search-service failures • Convenience - One-stop shop [Diagram: one Interface fronting Service1, Service2, and Service3]
Coverage • Coverage refers to the total accessible content -- specifically, what percentage of the total content is reachable • According to a July 1999 study in the journal Nature (Lawrence and Giles 99): • Web search engines collectively covered 42% of the web (estimated at 800 million indexable pages) • The best single engine covered only about 16% • Some search services have proprietary content, accessible only through their interface, e.g., an auction site • Search services have different rates of “refresh” • Special-purpose search engines are more “up to date” on their special topic
Consistency and Convenience • Consistency • One interface • User can access multiple DIFFERENT search services using the same interface • Works even if one search service goes down or is inconsistent • Convenience • One-stop shop • User need not know about all possible sources • Metasearch system owners automatically add new sources as needed • One interface improves convenience (as well as consistency) • User does not have to manually change the query for each source
Metasearch issues • What do you search: source selection • How do you search: query translation (syntax/semantics), query submission, use of specialized knowledge or actions • How to process results: fusion of results, actions to improve quality • How to evaluate: did the individual query succeed, were the correct decisions made, does this architecture work • Decisions after search: search again, or do something differently
Metasearch issues • Performance of underlying services: time, reliability, consistency • Result quality of a single search service: how “relevant” results are on average, duplicate detection • Freshness -- how often results are updated: update rate, dead links, changed content • Probability of finding relevant results: for the given query at each potential source, for the given need category (GLOSS and similar models) • How to evaluate: is it better than some other service (especially important with source selection), feedback for improving the fusion policy, learning for personalization
Some metasearch engines • Ask Jeeves -- Natural-language queries; combines local and outside content with a very simple interface • Sherlock -- Integrates web and local searching • Metacrawler -- Early web metasearch engine, some formalizations • SavvySearch -- Research on various methods of source selection • ProFusion -- Attempted to predict the subject of the query and considered the expected performance of search engines, both for relevance and time • Inquirus -- Content-based metasearch engine • Inquirus 2 -- Preference-based metasearch engine
Architecture [Diagram: the User Interface passes the Query to a Dispatcher, which forwards it to Service1, Service2, and Service3; a Result Retriever collects the Results, which are merged by a Fusion Policy and returned to the User Interface]
Architecture -- Dispatcher • Query translation • Each search service has a different interface (for queries) • Preserve the semantics while converting the syntax • Could result in loss of expressiveness • Query submission • Sends the query to the service • Source selection • Choose the sources to be queried • Some systems use wrapper code or agents as their dispatch mechanism
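A sketch of the query-translation step, assuming two made-up target syntaxes (real services each have their own grammar). Dropping the optional terms in the Boolean form is one way the loss of expressiveness mentioned above can occur.

```python
# Sketch of dispatcher-style query translation. The two engine
# syntaxes below are illustrative, not real services' grammars.

def to_plus_syntax(required, optional):
    """Engines where required terms are prefixed with '+'."""
    return " ".join(["+" + t for t in required] + optional)

def to_boolean_syntax(required, optional):
    """Engines with explicit AND/OR operators."""
    q = " AND ".join(required)
    if optional:
        q += " AND (" + " OR ".join(optional) + ")"
    return q

required, optional = ["metasearch"], ["fusion", "ranking"]
print(to_plus_syntax(required, optional))     # +metasearch fusion ranking
print(to_boolean_syntax(required, optional))  # metasearch AND (fusion OR ranking)
```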
Result processor • Accept results from a search service • Parse results and extract the relevant information -- e.g., title, URL, etc. • Can request more results (feedback to the dispatcher) • Advanced processors could get more information about the document, e.g., by using special-purpose tools
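A sketch of a result processor for one hypothetical service; the page format and the regular expression are invented, since in practice a wrapper is written per service.

```python
# Sketch of a result processor: parse a service's result page into
# (title, URL, summary) records. The page format is made up.
import re
from dataclasses import dataclass

@dataclass
class Result:
    title: str
    url: str
    summary: str

RESULT_RE = re.compile(
    r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>\s*<p>(?P<summary>[^<]*)</p>')

def parse_results(html):
    return [Result(m["title"], m["url"], m["summary"])
            for m in RESULT_RE.finditer(html)]

page = '<a href="http://example.com">Example</a> <p>A summary.</p>'
print(parse_results(page))
```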
Result Fusion • How to merge results in a meaningful manner? [Diagram: three ranked result lists A = [a1, a2, a3], B = [b1, b2, b3], C = [c1, c2, c3] to be merged into a single ordering]
Result Fusion • Not comparing apples to apples • Incomplete information -- only have a summary, not the full document • Each search service has its own ranking policy • Which is better, AltaVista #2 or Google #3? • Summaries and titles are not consistent between search services • Questions/issues • How much do you trust regular search engine ranks? • Could AltaVista #3 be better than AltaVista #2? • Is one search engine always better than another? • How do you integrate between search engines? • What about results returned by more than one search service?
Fusion policy - a typical approach • First, determine the “score” on a fixed range for each result from each search engine • In the early days most search engines returned their scores • The score could be a function of the rank, or occasionally based on the keywords in the title/summary/URL • Second, assign a weight to each search engine • Could be based on the predicted subject, stated preferences, or special knowledge about a particular search engine • Example: • Service1: A1=1.00, A2=1.00, A3=0.95, A4=0.50 • Service2: B1=0.95, B2=0.95, B3=0.89, B4=0.80 • W1=0.9, W2=1.0 -- the final ordering would be: • [B1, B2, A1, A2, B3, A3, B4, A4]
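The example above can be reproduced with a short fusion sketch: multiply each result's score by its engine's weight, then sort the merged list. The names and numbers come from the slide; the function itself is a minimal illustration.

```python
# Sketch of the weighted fusion described above.

def fuse(results_by_service, weights):
    merged = [(weights[svc] * score, name)
              for svc, results in results_by_service.items()
              for name, score in results]
    merged.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in merged]

results = {
    "Service1": [("A1", 1.00), ("A2", 1.00), ("A3", 0.95), ("A4", 0.50)],
    "Service2": [("B1", 0.95), ("B2", 0.95), ("B3", 0.89), ("B4", 0.80)],
}
weights = {"Service1": 0.9, "Service2": 1.0}
print(fuse(results, weights))
# ['B1', 'B2', 'A1', 'A2', 'B3', 'A3', 'B4', 'A4']
```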
Source Selection • Choosing which services to query • GLOSS -- Choose the databases most likely to contain “relevant” materials • SavvySearch (old) -- Choose the sources most likely to return the most valuable results, based on past responses • SavvySearch (new) -- Choose the sources most appropriate for the user’s category • ProFusion -- User chooses: • 1: Fastest sources • 2: Sources most likely to have “relevant” results, based on the predicted subject • 3: Explicit user selection • Watson -- Choose both general-purpose sources and the most “appropriate” special-purpose sources
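A sketch of category-driven source selection in this spirit: each source carries a score per category, and the top-k sources for the predicted category are queried. The engine names and scores are invented.

```python
# Sketch of category-based source selection. Scores are illustrative.

SOURCE_SCORES = {
    "engine_a": {"science": 0.8, "shopping": 0.2},
    "engine_b": {"science": 0.3, "shopping": 0.9},
    "engine_c": {"science": 0.6, "shopping": 0.5},
}

def select_sources(category, k=2):
    """Return the k sources best matched to the predicted category."""
    ranked = sorted(SOURCE_SCORES,
                    key=lambda s: SOURCE_SCORES[s].get(category, 0.0),
                    reverse=True)
    return ranked[:k]

print(select_sources("science"))  # ['engine_a', 'engine_c']
```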
Metacrawler • MetaCrawler • Used users’ result clicks -- implicit rather than explicit feedback • Not all pages clicked are relevant • Assumed pages not clicked were not relevant • Parameters examined • Comprehensiveness of individual search engines -- considered the Unique Document Percentage (UDP), which relates to coverage • As expected, low overlap (considering the first ten documents only) • Relative contribution of each search engine -- the Viewed Document Share (VDS) • As expected, all services contributed to the VDS: the maximum of the eight search services was 23% and the minimum 4%, with four of them contributing 15% or more
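A sketch of how these two metrics could be computed from a result log. The data, and the choice to credit a viewed duplicate to every engine that returned it, are illustrative assumptions, not MetaCrawler's exact definitions.

```python
# Sketch of UDP and VDS over a toy log of returned and viewed URLs.
from collections import Counter

returned = {                        # engine -> URLs it returned
    "e1": {"u1", "u2", "u3"},
    "e2": {"u2", "u4"},
}
viewed = ["u1", "u2", "u4", "u4"]   # click log: URLs users viewed

def udp(engine):
    """Fraction of this engine's results no other engine returned."""
    others = set().union(*(urls for e, urls in returned.items() if e != engine))
    return len(returned[engine] - others) / len(returned[engine])

def vds():
    """Share of viewed documents credited to each engine."""
    credits = Counter(e for u in viewed
                      for e, urls in returned.items() if u in urls)
    total = sum(credits.values())
    return {e: c / total for e, c in credits.items()}

print(udp("e1"))  # 2/3: u1 and u3 were unique to e1
print(vds())
```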
ProFusion • ProFusion • Focus was primarily on source selection • ProFusion considered: • Performance (time) • Each source’s ability to locate relevant results, via query subject prediction • Design • A set of 13 categories and a dictionary of 4000 terms used to predict the subject • For each category, each of the six search engines is “tested” and scored based on the number of relevant results retrieved • The individual search engine “scores” are used to produce a rank ordering by subject, and to fuse results • Parameters examined • Human judgements of some queries • Every search engine performed differently • ProFusion demonstrated improvements when the “auto-pick” mode was used
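A sketch of dictionary-based subject prediction along these lines: the query's terms are looked up in a term-to-category dictionary and the most-hit category wins. The tiny dictionary here is invented; the real system used 13 categories and about 4000 terms.

```python
# Sketch of query subject prediction from a term dictionary.
from collections import Counter

TERM_CATEGORY = {
    "protein": "science", "quark": "science",
    "mp3": "music", "concert": "music",
}

def predict_category(query):
    hits = Counter(TERM_CATEGORY[t] for t in query.lower().split()
                   if t in TERM_CATEGORY)
    return hits.most_common(1)[0][0] if hits else None

print(predict_category("quark protein folding"))  # science
```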
SavvySearch (early work) • Similar to ProFusion: choose sources based on the query • Assign “scores” to each query term based on previous query results • Formed a t×n matrix (terms by search engines) called a meta-index • The score is based on performance for each term: • Two “events” - No Results and Visits • Scores are adjusted for the number of query terms • Response time is also stored for each search engine • Search engines are chosen based on the query terms and past performance • System load determines the total number queried • Evaluated via a pilot study • Compared various variations of the sources chosen and their rank • As predicted, using the meta-index method was better than random • Also examined improvements over time in the no-result count • In the new version, the user chooses a “category” and the appropriate services are used
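A sketch of a meta-index in this style: a term-by-engine score table updated from the two events, with updates spread over the query's terms. The update magnitudes and normalization are illustrative assumptions.

```python
# Sketch of a SavvySearch-style meta-index: (term, engine) -> score,
# penalized on "no results" and rewarded on result "visits".
from collections import defaultdict

meta_index = defaultdict(float)

def record_no_results(query_terms, engine):
    for t in query_terms:              # penalty spread over the terms
        meta_index[(t, engine)] -= 1.0 / len(query_terms)

def record_visit(query_terms, engine):
    for t in query_terms:              # reward for a clicked result
        meta_index[(t, engine)] += 1.0 / len(query_terms)

def rank_engines(query_terms, engines):
    def score(e):
        return sum(meta_index[(t, e)] for t in query_terms)
    return sorted(engines, key=score, reverse=True)

record_visit(["jazz", "mp3"], "engine_a")
record_no_results(["jazz", "mp3"], "engine_b")
print(rank_engines(["jazz"], ["engine_a", "engine_b"]))  # engine_a first
```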
Advanced metasearch • Improvements • An ordering policy versus a fusion policy -- Inquirus, and some personal search agents • Using complete information -- download the document before scoring • All documents scored consistently regardless of source • Allows for improved relevance judgements • Can hurt performance • Query modification -- Inquirus 2, Watson, others? • User queries are modified to improve the ability to find “category-specific” results • Query modification lets general-purpose search engines serve special needs • Need-based scoring -- Inquirus 2, learning systems • Document scores are based on more than the query • Can use past user history, or other factors such as the subject
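A sketch of query modification in this spirit: append category-specific terms so that a general-purpose engine surfaces category-specific results. The expansion table is invented for illustration.

```python
# Sketch of need-based query modification.

MODIFIERS = {
    "research paper": '"abstract" "references"',
    "personal homepage": '"home page" "resume"',
}

def modify_query(query, need):
    """Expand the query with terms characteristic of the need category."""
    extra = MODIFIERS.get(need, "")
    return (query + " " + extra).strip()

print(modify_query("information extraction", "research paper"))
# information extraction "abstract" "references"
```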
Design/Research areas • Source selection • With and without complete information • Learning interfaces and predicting the contents of sources • Intelligent fusion • Without complete information • Considering the user’s preferences • Resource efficiency • Knowing how to “plan” a search • User interfaces • How to minimize loss of expressiveness • How to preserve the capabilities of select services • How to hide slow performance
Business issues for metasearch • How does one use others’ resources and not get blocked? • Skimming!