Alternatives to Federated Search
Presented by: Marc Krellenstein
Date: July 29, 2005
Why did we ever build federated search?
• No one search service or database had all relevant info, or ever could have
• It was too hard to know which databases to search
• Even if you knew which db's to search, it was too inconvenient to search them all
• Learning one simple interface was easier than learning many complex ones
No one service or db has all relevant info?
• Databases have grown bigger than ever imagined
  • Google: 8B documents; Google Scholar: 400M+?
  • Scirus: 200M
  • Web of Knowledge (Humanities, Social Sciences, Science): 28M
  • Scopus: 27M
  • PubMed: 14M
• Why?
  • Cheaper and larger hard disks
  • Faster hardware, better software
  • World-wide network availability…no need to duplicate
No one service or db has all relevant info?
• No maximum size in sight
  • A good thing, because content continues to grow
• The simplest technical model for search
  • Databases are logically single and central…
  • …but physically multiple and internally distributed
  • Google has ~160,000 servers
• The simplest user model for search
• The catch (but even worse for federated search):
  • Get the data
  • Keep search quality high
It's hard to know what services to search?
• Google/Google Scholar plus 1-2 vertical search tools
  • PubMed, Compendex, WoK, PsycINFO, Scopus, etc.
• For casual searches, Google alone is usually enough
• Specialized smaller db's where needed
  • Known to the researcher or librarian, or available from a list
• Ask a life science researcher what they use:
  • "All I need is Google and PubMed"
It's hard to know what services to search?
• Alerts, RSS, etc. eliminate some searches altogether
• Still…more than one search/source…but that inconvenience must be balanced against the costs of federated search:
  • Will still need to do multiple searches…federated search is not enough
  • Least-common-denominator search: few advanced features
  • Users are increasingly sophisticated
  • Duplicates
  • Slower response time
  • Broken connectors
  • The feeling that you're missing stuff…
One interface is easier to learn than many?
• Yes…studies suggest users like a common interface (if not a common search service)
• BUT Google has demonstrated the benefits of simplicity
• More products are adopting simple, similar interfaces
• There is still too much proprietary syntax, though advanced features and innovation justify some of it
So what are today's search challenges?
• Getting the data for centralized and large vertical search services
• Keeping search quality high for these large databases
• Answering hard search questions
Getting the data for centralized services
• Crawl it if it's free…
• …or make or buy it
  • Expensive, but usually worth the cost
  • Should still be cheaper for customers than subscribing to many separate services
• …or index multiple, maybe geographically separate, databases with a single search engine that supports distributed search
Distributed (local/remote) search
• Use a common metadata scheme (e.g., Dublin Core)
• The search engine provides parallel search and integrated ranking/results (a sketch follows below)
  • Google, Fast and Lucene already work this way, even for a 'single' database
• The separate databases can be maintained/updated separately
• Results are truly integrated…as if it's one search engine
  • One query syntax, advanced capabilities, no duplicates, fast
• Still requires a common technology platform
• Federated search standards may someday approximate this
  • Standard syntax, results metadata…ranking? Amazon's A9?
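To make the distributed model concrete, here is a minimal Python sketch: one query fans out in parallel to several independently maintained indexes ("shards") and the hits merge into a single ranked list. The Shard class, its in-memory index and the counting-based scorer are hypothetical illustrations, not any particular engine's API.

```python
# A minimal sketch of distributed search: one query, many shards, one
# merged ranking. Shard and its toy index are hypothetical stand-ins
# for real, possibly remote, search services.
from concurrent.futures import ThreadPoolExecutor

class Shard:
    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # {doc_id: text}, maintained separately per shard

    def search(self, query):
        # Toy scoring: count query-term occurrences in each document.
        terms = query.lower().split()
        hits = []
        for doc_id, text in self.docs.items():
            words = text.lower().split()
            score = sum(words.count(t) for t in terms)
            if score:
                hits.append((score, self.name, doc_id))
        return hits

def distributed_search(shards, query, k=10):
    # Query all shards in parallel, then merge into one globally ranked list.
    # Real engines also share global term statistics so scores are comparable.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda s: s.search(query), shards)
    merged = [hit for shard_hits in results for hit in shard_hits]
    return sorted(merged, reverse=True)[:k]

shards = [
    Shard("local", {"d1": "federated search of local databases"}),
    Shard("remote", {"d2": "distributed search across remote databases"}),
]
print(distributed_search(shards, "distributed search"))
```

Because every shard runs the same engine and shares one query syntax, the merged list behaves like a single database's results, which is exactly what federated search across heterogeneous engines struggles to achieve.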
Keeping search quality high in big db's
• Can interpret keyword, Boolean and pseudo-natural-language queries
• Spell checking, thesauri and stemming improve recall (and sometimes precision)
• You get lots of hits in a big db, but that's usually OK if there are good ones on top
Keeping search quality high in big db's
• Current best-practice relevancy ranking is pretty good (a sketch follows below):
  • Term frequency (TF): more hits count more
  • Inverse document frequency (IDF): hits on rarer search terms count more
  • Hits of search terms near each other count more
  • Hits on metadata count more
    • Use anchor text (referring text) as metadata
  • Items with more links/references to them count more
    • Authoritative links/referrers count yet more
  • Many other factors: length, date, etc.
• Sophisticated ranking is a weak point for federated search
• Google's genius: emphasize popularity to eliminate junk from the first pages (even if you don't always serve the best)
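As an illustration of the first two factors, here is a minimal TF-IDF scorer over a toy corpus. Real engines layer proximity, metadata/anchor-text boosts and link analysis on top; the documents and names here are assumptions for the sketch.

```python
# A minimal TF-IDF ranking sketch: more hits count more (TF), and hits
# on rarer terms count more (IDF). Corpus is hypothetical.
import math
from collections import Counter

docs = {
    "d1": "depression treatment in clinical psychology",
    "d2": "economic depression and market recovery",
    "d3": "seasonal depression symptoms and treatment options",
}
N = len(docs)
tokenized = {d: text.lower().split() for d, text in docs.items()}
# Document frequency: in how many documents does each term appear?
df = Counter(term for words in tokenized.values() for term in set(words))

def score(doc_words, query_terms):
    tf = Counter(doc_words)
    return sum(tf[t] * math.log(N / df[t]) for t in query_terms if df[t])

query = "depression treatment".split()
ranked = sorted(docs, key=lambda d: score(tokenized[d], query), reverse=True)
print(ranked)  # documents with more and rarer query-term hits rank first
```

Note how "depression", which appears in every toy document, contributes nothing (IDF of zero), while "treatment" separates the results; this is why hits on rarer terms matter.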
But search challenges remain
• Finding the best (not just good) documents
  • Popularity may not turn up the best, most recent, etc.
• Answering hard questions
  • Hard to match multiple criteria
    • Ex: find an experimental method like this one
  • Hard to get answers to complex questions
    • Ex: what precursors were common to World War I and World War II?
• Summarize, uncover relationships, analyze
• Long-term: understand any question…
• None of the above is helped by least-common-denominator federated search
Finding the best
• Don't rely too much on popularity
• Even then, relevancy ranking has its limits
  • "I need information on depression"
  • "OK…here are 2,352 articles and 87 books"
  • Need a dialog: "What kind of depression?"…"Psychological"…"What about it?"
• Underlying problem: most searches are under-specified
One solution: clustering documents
• Group results around common themes: same author, web site, journal, subject…
• Blurt out the largest/most interesting categories: the inarticulate librarian model
  • Depression → psychology, economics, meteorology, antiques…
  • Psychology → treatment of depression, depression symptoms, seasonal affective…
  • Psychology → Kocsis, J. (10), Berg, R. (8), …
• Themes can come from static metadata or dynamically from analysis of results text (a sketch of the dynamic case follows below)
  • Static: fixed, clear categories and assignments
  • Dynamic: doesn't require metadata/taxonomy
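A minimal sketch of the dynamic approach, assuming scikit-learn is available: result snippets are vectorized with TF-IDF and grouped with k-means. The corpus and cluster count are toy assumptions; production systems would also label each cluster with its representative terms.

```python
# A minimal sketch of dynamic result clustering: group search results by
# theme using TF-IDF vectors and k-means. No metadata or taxonomy needed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

results = [
    "treatment of clinical depression with antidepressants",
    "seasonal affective disorder and depression symptoms",
    "the great depression and economic recovery policy",
    "economic depression, unemployment and market collapse",
]
vectors = TfidfVectorizer(stop_words="english").fit_transform(results)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for cluster, text in zip(labels, results):
    print(cluster, text)  # the psychology and economics senses tend to separate
```

On this ambiguous "depression" query the two senses fall into separate clusters, which is the disambiguation benefit the next slide describes.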
Clustering benefits
• Disambiguates and refines search results to get to documents of interest quickly
  • Can navigate long result lists hierarchically
  • Would never offer thousands of choices as input…
  • Gives access to the bottom of the list…maybe just less common
  • Won't work with federated search, which retrieves only limited results from each source
• Discovery: new aspects or sources
• Can narrow results *after* the search
  • Start with the broadest area search; don't narrow by subject or other categories first
  • Easier, plus you can't guess wrong, miss useful categories, or pick unneeded ones…results-driven
  • Knee surgery → cartilage replacement, plastics, …
Answering hard questions
• The main problem is still short searches/under-specification
• One solution: relevance feedback, i.e., marking good and bad results (a sketch follows below)
  • A long-standing and proven search refinement technique
  • More information is better than less
  • Pseudo-relevance feedback is a research standard
  • Most commercial forms are not widely used…
  • …but PubMed is an exception
• A catch: must first find a good document to be similar to…which may be hard or impossible
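A minimal sketch of the classic Rocchio form of relevance feedback: the query vector moves toward documents the user marked relevant and away from non-relevant ones. The term-weight dicts and the alpha/beta/gamma values are conventional illustrative choices, not anything specified in the talk.

```python
# Rocchio relevance feedback: new_query = a*query + b*mean(relevant)
#                                        - g*mean(nonrelevant)
# Vectors are toy {term: weight} dicts.
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query) | {t for d in relevant + nonrelevant for t in d}
    new_query = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        neg = sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1)
        weight = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # keep only positively weighted terms
            new_query[t] = round(weight, 3)
    return new_query

query = {"depression": 1.0}
relevant = [{"depression": 0.8, "psychology": 0.6, "treatment": 0.5}]
nonrelevant = [{"depression": 0.7, "economics": 0.9}]
print(rocchio(query, relevant, nonrelevant))
# the refined query now favors 'psychology' and 'treatment'
```

Marking one good psychology result and one bad economics result is enough to pull the short query toward the intended sense, which is why more information beats less.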
One solution: descriptive search
• Let the user or situation provide the ideal "document" (a full problem description) as input in the first place (a sketch follows below)
  • Can enter free text or specific documents describing the need, e.g., an article, grant proposal or experiment description
  • Might draw on user or query context
• Use thesauri, domain knowledge and limited natural language processing to identify must-haves
• Uses lots of data and statistics to find the best matches
• Again, a problem for federated search with limited data access
• Should provide the best possible search short of real language understanding
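One way to picture descriptive search is a "more-like-this" reduction: take the full problem description, keep its most distinctive terms by a crude smoothed IDF over a toy corpus, and use those as the effective query. Everything here, corpus and description included, is a hypothetical sketch.

```python
# A minimal "more-like-this" sketch of descriptive search: a whole
# problem description becomes the query by keeping its most
# distinctive terms. Corpus and description are hypothetical.
import math
from collections import Counter

corpus = [
    "gene expression in mouse models of disease",
    "clinical trial design for oncology drugs",
    "statistical methods for microarray analysis",
]
description = (
    "we need an experimental method for measuring gene expression "
    "in mouse models using microarray analysis"
)

N = len(corpus)
df = Counter(t for doc in corpus for t in set(doc.split()))
tf = Counter(description.split())
# Smoothed TF-IDF weight for each description term.
scores = {t: tf[t] * math.log((N + 1) / (df[t] + 1)) for t in tf}
query_terms = sorted(scores, key=scores.get, reverse=True)[:5]
print(query_terms)  # distinctive terms become the effective query
```

With a full description as input there is far more statistical signal than in a two-word search, which is why the approach needs direct access to lots of data and fits poorly with federated search.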
Summarize, discover & analyze
• How do you summarize a corpus?
  • May want to report on what's present, numbers of occurrences, trends
  • Ex: what diseases are studied the most?
    • Must know all diseases and look one by one
• How do you find a relationship if you don't know what relationships exist?
  • Ex: does gene p53 relate to any disease?
    • Must check for each possible relationship
• Ad hoc analysis
  • How do all genes relate to this one disease? Over time? In what organisms has the gene been studied? Show me the document evidence
One solution: text mining
• Identify entities (things) in a text corpus
  • Examples: authors, universities…diseases, drugs, side effects, genes…companies, lawsuits, plaintiffs, defendants…
  • Use lexicons, patterns and NLP to find any or all instances of an entity (including new ones)
• Identify relationships (a co-occurrence sketch follows below):
  • Through co-occurrence
    • Relationship presumed from proximity
    • Example: author-university affiliation
  • Through limited natural language processing
    • Semantic relations: causes, is-part-of, etc.
    • Examples: drug-causes-disease, drug-treats-disease
    • Identify appropriate verbs, recognize active vs. passive voice, resolve anaphora ("…it causes…")
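A minimal sketch of the co-occurrence approach: known entities are spotted via lexicons, and two entities of different types sharing a sentence are presumed related. The lexicons and sentences are toy assumptions; labeling the relation (treats vs. causes) would need the limited NLP described above.

```python
# A minimal co-occurrence text-mining sketch: lexicon lookup plus
# same-sentence proximity as evidence of a drug-disease relationship.
from itertools import product

lexicons = {
    "drug":    {"aspirin", "ibuprofen"},
    "disease": {"headache", "inflammation", "ulcers"},
}
sentences = [
    "aspirin relieves headache in most patients",
    "long-term aspirin use causes ulcers",
    "ibuprofen reduces inflammation",
]

pairs = set()
for sentence in sentences:
    words = set(sentence.split())
    drugs = words & lexicons["drug"]
    diseases = words & lexicons["disease"]
    # Co-occurrence only presumes a relationship exists; NLP on the verb
    # would be needed to distinguish treats from causes.
    pairs.update(product(drugs, diseases))

print(sorted(pairs))
```

Note that the sketch correctly links aspirin to both headache and ulcers but cannot tell that one relation is "treats" and the other "causes"; that distinction is exactly what the NLP layer adds.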
Example figure: author teams in HIV research
Example figure: indirect links from leukemia to Alzheimer's via enzymes
Long-term: answer any question
• Must recognize multiple (any) entities and relationships
• Must recognize all forms of linguistic relationship
• Must have a background of common-sense information (or enough entities/relations?)
  • Ex: information on donors (to political parties)
• For now, building text miners domain by domain is perhaps the best we can do
  • Can build on preceding pieces…e.g., if you know drugs, diseases and drug-disease causation, you can try to recognize 'advancements in drug therapy'
Summary
• Federated search addressed the problems of a different time
  • A highly fragmented search space, limitations of individual db's, technical and interface problems, and the need to just get basic answers
• Today's search environment is increasingly centralized and robust
  • The range of content and the demands of users continue to increase
• Adequate search is a given…really good search is a challenge best served by new technologies that don't fit a least-common-denominator framework
  • Need to locate the best documents (sophisticated ranking, clustering)
  • Need to answer complex questions
  • Need to go beyond search, to overviews and relationship discovery