
Alternatives to Federated Search


Presentation Transcript


  1. Alternatives to Federated Search - Presented by: Marc Krellenstein Date: July 29, 2005

  2. Why did we ever build federated search? • No one search service or database had all relevant info or ever could have • It was too hard to know what databases to search • Even if you knew which db’s to search, it was too inconvenient to search them all • Learning one simple interface was easier than learning many complex ones

  3. Do we still need federated search? • No

  4. No one service or db has all relevant info? • Databases have grown bigger than ever imagined • Google: 8B documents, Google Scholar: 400M+ ? • Scirus: 200M • Web of Knowledge (Humanities, Social Sci, Science): 28M • Scopus: 27M • PubMed: 14M • Why? • Cheaper and larger hard disks • Faster hardware, better software • World-wide network availability…no need to duplicate

  5. No one service or db has all relevant info? • No maximum size in sight • A good thing, because content continues to grow • The simplest technical model for search • Databases are logically single and central • …but physically multiple and internally distributed • Google has ~160,000 servers • The simplest user model for search • The catch (but even worse for federated search): • Get the data • Keep search quality high

  6. It’s hard to know what services to search? • Google/Google Scholar plus 1-2 vertical search tools • PubMed, Compendex, WoK, PsycINFO, Scopus, etc. • For casual searches: Google alone is usually enough • Specialized smaller db’s where needed • Known to researcher or librarian, or available from list • Ask a life science researcher what they use -- • “All I need is Google and PubMed”

  7. It’s hard to know what services to search? • Alerts, RSS, etc. eliminate some searches altogether • Still…more than one search/source…but must balance inconvenience against costs of federated search: • Will still need to do multiple searches…federated not enough • Least common denominator search – few advanced features • Users are increasingly sophisticated • Duplicates • Slower response time • Broken connectors • The feeling that you’re missing stuff…

  8. One interface is easier to learn than many? • Yes…studies suggest users like a common interface (if not a common search service) • BUT Google has demonstrated the benefits of simplicity • More products are adopting simple, similar interfaces • There is still too much proprietary syntax – though advanced features and innovation justify some of it

  9. So what are today’s search challenges? • Getting the data for centralized and large vertical search services • Keeping search quality high for these large databases • Answering hard search questions

  10. Getting the data for centralized services • Crawl it if it’s free • …or make or buy it • Expensive, but usually worth the cost • Should still be cheaper for customers than many services • …or index multiple, maybe geographically separate databases with a single search engine that supports distributed search

  11. Distributed (local/remote) search • Use common metadata scheme (e.g., Dublin Core) • Search engine provides parallel search, integrated ranking/results • Google, Fast and Lucene already work this way even for ‘single’ database • The separate databases can be maintained/updated separately • Results are truly integrated…as if it’s one search engine • One query syntax, advanced capabilities, no duplicates, fast • Still requires common technology platform • Federated search standards may someday approximate this • Standard syntax, results metadata…ranking? Amazon’s A9?
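A minimal sketch of the query-time merge such a platform performs: each database (shard) is searched in parallel and the per-shard ranked lists are merged into one globally ordered list. The shard object and its search method are assumptions for illustration, not any particular product's API; the merge is only valid if shard scores are comparable, which is why a common technology platform matters.

    import heapq
    from concurrent.futures import ThreadPoolExecutor

    def distributed_search(shards, query, k=10):
        """Search every shard in parallel and merge the ranked results.

        Each shard is a hypothetical object whose search() returns
        (score, doc_id) pairs sorted best-first, with scores comparable
        across shards (same engine and statistics).
        """
        with ThreadPoolExecutor(max_workers=len(shards)) as pool:
            per_shard = pool.map(lambda s: s.search(query, limit=k), shards)
        # heapq.merge keeps the lists sorted while merging; negate the
        # score so "ascending by key" means "descending by relevance".
        merged = heapq.merge(*per_shard, key=lambda hit: -hit[0])
        return list(merged)[:k]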

  12. Keeping search quality high in big db’s • Can interpret keyword, Boolean and pseudo-natural language queries • Spell checking, thesauri and stemming to improve recall (and sometimes precision) • Get lots of hits in a big db, but that’s usually OK if there are good ones on top
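A toy sketch of the recall side of this, assuming a made-up two-entry thesaurus and a crude suffix-stripping stemmer (a stand-in for something like Porter stemming): the query is expanded before it is sent to the index.

    import re

    # Invented toy thesaurus, purely for illustration.
    THESAURUS = {"cancer": ["neoplasm", "tumor"], "heart": ["cardiac"]}

    def stem(word):
        """Crude suffix stripping; a real engine would use Porter or similar."""
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def expand_query(query):
        """Expand each query term with its stem and any thesaurus synonyms."""
        expanded = set()
        for term in re.findall(r"\w+", query.lower()):
            expanded.add(term)
            expanded.add(stem(term))
            expanded.update(THESAURUS.get(term, []))
        return expanded

    print(expand_query("heart diseases"))
    # {'heart', 'cardiac', 'diseases', 'disease'}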

  13. Keeping search quality high in big db’s • Current best practice relevancy ranking is pretty good: • Term frequency (TF): more hits count more • Inverse document frequency (IDF): hits of rarer search terms count more • Hits of search terms near each other count more • Hits on metadata count more • Use anchor text – referring text – as metadata • Items with more links/references to them count more • Authoritative links/referrers count yet more • Many other factors: length, date, etc. • Sophisticated ranking is a weak point for federated search • Google’s genius: emphasize popularity to eliminate junk from the first pages (even if you don’t always serve the best)
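The first two factors are the classic TF-IDF weighting. A minimal sketch of just that core, ignoring the proximity, metadata and link-based boosts also listed above:

    import math
    from collections import Counter

    def tf_idf_rank(query_terms, docs):
        """Rank docs (lists of tokens) by summed TF-IDF of the query terms."""
        n = len(docs)
        counts = [Counter(doc) for doc in docs]
        # Document frequency: in how many docs each query term appears.
        df = {t: sum(1 for c in counts if t in c) for t in query_terms}
        scored = []
        for i, c in enumerate(counts):
            score = 0.0
            for t in query_terms:
                if t in c:
                    tf = c[t] / len(docs[i])       # more hits count more
                    idf = math.log(n / df[t])      # rarer terms count more
                    score += tf * idf
            scored.append((score, i))
        return sorted(scored, reverse=True)

    docs = [["depression", "treatment"], ["depression", "economics"], ["weather"]]
    print(tf_idf_rank(["depression", "treatment"], docs))
    # best-first: doc 0, then doc 1, then doc 2 with score 0.0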

  14. But search challenges remain • Finding the best (not just good) documents • Popularity may not turn up the best, most recent, etc. • Answering hard questions • Hard to match multiple criteria • e.g., find an experimental method like this one • Hard to get answers to complex questions • e.g., what precursors were common to World War I and World War II? • Summarize, uncover relationships, analyze • Long-term: understand any question… • None of the above is helped by least-common-denominator federated search

  15. Finding the best • Don’t rely too much on popularity • Even then, relevancy ranking has its limits • “I need information on depression” • “OK…here are 2,352 articles and 87 books” • Need a dialog… “What kind of depression?” … “Psychological” … “What about it?” • Underlying problem: most searches are under-specified

  16. One solution: clustering documents • Group results around common themes: same author, web site, journal, subject… • Blurt out largest/most interesting categories: the inarticulate librarian model • Depression → psychology, economics, meteorology, antiques… • Psychology → treatment of depression, depression symptoms, seasonal affective… • Psychology → Kocsis, J. (10), Berg, R. (8), … • Themes could come from static metadata or dynamically by analysis of results text • Static: fixed, clear categories and assignments • Dynamic: doesn’t require metadata/taxonomy
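A toy sketch of the dynamic variant, with no metadata or taxonomy: results are grouped under terms that several results share but that are not ubiquitous, so themes like the ones above can emerge from the result text itself. The example documents are invented.

    from collections import Counter, defaultdict

    def cluster_results(results):
        """Group result docs (token lists) under candidate theme terms.

        A term is treated as a theme if it appears in more than one
        result but not in all of them (ubiquitous terms, like the query
        itself, separate nothing).
        """
        n = len(results)
        df = Counter(t for doc in results for t in set(doc))
        clusters = defaultdict(list)
        for i, doc in enumerate(results):
            for t in set(doc):
                if 1 < df[t] < n:
                    clusters[t].append(i)
        return dict(clusters)

    results = [["depression", "serotonin", "treatment"],
               ["depression", "serotonin", "symptoms"],
               ["depression", "economics", "recession"],
               ["depression", "economics", "trade"]]
    print(cluster_results(results))
    # {'serotonin': [0, 1], 'economics': [2, 3]}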

  17. Clustering benefits • Disambiguates and refines search results to get to documents of interest quickly • Can navigate long result lists hierarchically • Would never offer thousands of choices to choose from as input… • Access to bottom of list…maybe just less common • Won’t work with federated search that retrieves limited results from each source • Discovery – new aspects or sources • Can narrow results *after* search • Start with the broadest area search – don’t narrow by subject or other categories first • Easier, plus can’t guess wrong, miss useful categories, or pick unneeded ones…results-driven • Knee surgery → cartilage replacement, plastics, …

  18. Answering hard questions • Main problem is still short searches/under-specification • One solution: relevance feedback – marking good and bad results • A long-standing and proven search refinement technique • More information is better than less • Pseudo-relevance feedback is a research standard • Most commercial forms not widely used… • …but PubMed is an exception • A catch: must first find a good document to be similar to…may be hard or impossible
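A sketch of the classic Rocchio formulation of relevance feedback (the textbook technique, not anything specific from the slides; the weights are the commonly cited defaults): the query vector moves toward the marked-good results and away from the marked-bad ones.

    def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Rocchio feedback over term-weight dicts.

        Moves the query toward the centroid of relevant docs and away
        from the centroid of non-relevant docs; negative weights drop out.
        """
        terms = set(query)
        for doc in relevant + nonrelevant:
            terms.update(doc)
        new_query = {}
        for t in terms:
            pos = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
            neg = sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1)
            w = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
            if w > 0:
                new_query[t] = w
        return new_query

    good = [{"depression": 0.5, "serotonin": 0.8}]   # marked relevant
    bad = [{"depression": 0.4, "recession": 0.9}]    # marked non-relevant
    print(rocchio({"depression": 1.0}, good, bad))
    # approximately {'depression': 1.315, 'serotonin': 0.6}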

  19. One solution: descriptive search • Let the user or situation provide the ideal “document” – a full problem description – as input in the first place • Can enter free text or specific documents describing the need, e.g., an article, grant proposal or experiment description • Might draw on user or query context • Use thesauri, domain knowledge and limited natural language processing to identify must-haves • Use lots of data and statistics to find the best matches • Again, a problem for federated search with limited data access • Should provide the best possible search short of real language understanding
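A minimal bag-of-words sketch of the core idea, assuming nothing beyond cosine similarity: the full description is treated as the query vector and the corpus is ranked against it. Real versions would add the thesauri and NLP mentioned above.

    import math
    from collections import Counter

    def cosine(a, b):
        """Cosine similarity between two term-count vectors."""
        dot = sum(a[t] * b.get(t, 0) for t in a)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def descriptive_search(description, corpus, k=5):
        """Rank corpus docs against a full-text description used as the query."""
        q = Counter(description.lower().split())
        scored = [(cosine(q, Counter(doc.lower().split())), doc) for doc in corpus]
        return sorted(scored, reverse=True)[:k]

    corpus = ["gene p53 mutation in tumor suppression",
              "seasonal affective disorder treatment"]
    print(descriptive_search("we study p53 and tumor growth in mice", corpus))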

  20. Summarize, discover & analyze • How do you summarize a corpus? • May want to report on what’s present, numbers of occurrences, trends • Ex: what diseases are studied the most? • Must know all diseases and look one by one • How do you find a relationship if you don’t know what relationships exist? • Ex: does gene p53 relate to any disease? • Must check for each possible relationship • Ad hoc analysis • How do all genes relate to this one disease? Over time? What organisms has the gene been studied in? Show me the document evidence

  21. One solution: text mining • Identify entities (things) in a text corpus • Examples: authors, universities… diseases, drugs, side-effects, genes… companies, lawsuits, plaintiffs, defendants… • Use lexicons, patterns and NLP to find any or all instances of the entity (including new ones) • Identify relationships: • Through co-occurrence • Relationship presumed from proximity • Example: author-university affiliation • Through limited natural language processing • Semantic relations – causes, is-part-of, etc. • Examples: drug-causes-disease, drug-treats-disease • Identify appropriate verbs, recognize active vs. passive voice, resolve anaphora (…it causes…)
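A sketch of the lexicon-plus-co-occurrence approach: entities are found by dictionary lookup, and a relationship is presumed whenever a gene and a disease share a sentence. The two tiny lexicons are invented for the example; real miners use curated vocabularies and richer patterns.

    import re
    from itertools import product

    # Invented toy lexicons, for illustration only.
    GENES = {"p53", "brca1"}
    DISEASES = {"leukemia", "alzheimer's", "cancer"}

    def cooccurrence_relations(text):
        """Presume a gene-disease relationship when both occur in one sentence."""
        relations = set()
        for sentence in re.split(r"[.!?]", text):
            words = set(re.findall(r"[\w']+", sentence.lower()))
            for gene, disease in product(GENES & words, DISEASES & words):
                relations.add((gene, disease))
        return relations

    text = "Mutations of p53 are implicated in leukemia. BRCA1 is unrelated here."
    print(cooccurrence_relations(text))   # {('p53', 'leukemia')}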

  22. Gene-disease relationships?

  23. Relationships to p53

  24. Author teams in HIV research?

  25. Indirect links from leukemia to Alzheimer’s via enzymes

  26. Long-term: answer any question • Must recognize multiple (any) entities and relationships • Must recognize all forms of linguistic relationship • Must have background of common-sense information (or enough entities/relations?) • Example: information on donors (to political parties) • For now, building text miners domain by domain is perhaps the best we can do • Can build on preceding pieces… e.g., if you know drugs, diseases and drug-disease causation, can try to recognize ‘advancements in drug therapy’

  27. Summary • Federated search addressed problems of a different time • Had a highly fragmented search space, limitations of individual db’s, technical and interface problems and need to just get basic answers • Today’s search environment is increasingly centralized and robust • Range of content and demands of users continue to increase • Adequate search is a given…really good search is a challenge best served by new technologies that don’t fit into a least-common-denominator framework • Need to locate best documents (sophisticated ranking, clustering) • Need to answer complex questions • Need to go beyond search for overviews, relationship discovery
