Learn about modern search engines, relevancy ranking, challenges in search integration, distributed search architectures, and finding the best documents efficiently. Understand clustering benefits for refining search results.
Next generation search Marc Krellenstein VP, Search and Discovery Elsevier August 23, 2004 m.krellenstein@elsevier.com
Basic search is pretty good • Modern search engines are fast and scalable • Having the data (usually lots) is still key • Can interpret keyword, Boolean and pseudo-natural language queries • Ex: “how to make an international call” • Spell checking, thesauri and stemming to improve recall (and sometimes precision) • Recall = % of relevant documents found • Precision = % of returned documents that are relevant • Get lots of hits, but that’s usually OK if there are good ones on top
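The recall and precision definitions above can be made concrete with a small sketch. The document IDs and the `precision_recall` helper are hypothetical, introduced only for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    returned: IDs the engine returned; relevant: IDs judged relevant.
    """
    returned_set = set(returned)
    relevant_set = set(relevant)
    hits = returned_set & relevant_set  # relevant documents that were found
    precision = len(hits) / len(returned_set) if returned_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Toy judgments: 4 documents returned, 3 relevant overall, 2 in common.
returned = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Techniques such as stemming tend to add documents to `returned`, which can raise recall while leaving precision flat or lower; that is the trade-off the slide alludes to.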
Basic search is pretty good • Best practice relevancy ranking is good: • Term frequency (TF): more hits count more • Inverse document frequency (IDF): hits of rarer search terms count more • Ex: diabetes diagnosis and treatment • Hits of search terms near each other count more • Ex: penicillin allergy vs. “penicillin allergy” • Hits on metadata (title, subject, etc.) count more • Use anchor text – referring text – as metadata • Items with more links/references to them count more • Authoritative links/referrers count yet more • Many other factors: length, date, etc.
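A minimal sketch of the TF and IDF factors only, assuming whitespace tokenization and a smoothed IDF of log(1 + N/df). Both are illustrative choices rather than anything the slides specify, and proximity, metadata and link factors are left out. The toy documents are invented.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming.
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of TF * IDF."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if df[term] == 0:
                continue  # term appears nowhere in the collection
            idf = math.log(1 + n_docs / df[term])  # rarer terms weigh more
            score += tf[term] * idf
        scores[doc_id] = score
    return scores


docs = {
    "a": "diabetes diagnosis and treatment of diabetes",
    "b": "treatment of hypertension",
    "c": "diagnosis and treatment guidelines",
}
scores = tf_idf_scores("diabetes treatment", docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Here "treatment" appears in every document and so contributes little, while the rarer "diabetes" dominates the score of document `a`.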
Basic search is pretty good • Using these techniques search engines can locate specific documents, or good documents (if not the absolute best) around general or specific topics • But challenges remain…
Current challenges • Integrated search: Content still in silos • Silos getting bigger but there are still dozens • Finding the best (not just good) documents • Answering hard questions • Hard to match multiple criteria • find an experimental method like this one • Hard to get answers to complex questions • patient X with pre-existing conditions Y presents with Z…what information is relevant? • Summary, discovery and analysis • Summarize, uncover relationships, analyze • Long-term: understand any question…
The integration challenge • Two approaches: • Build bigger databases • Sometimes the easiest way… • …but can be difficult or impossible to secure appropriate rights and consolidate data • Distributed search: Search separately managed (or owned) large databases as if they are one • Technically more challenging, but a scalable and maintainable architecture
Distributed search • Index multiple (maybe geographically) separate databases with a single search engine that supports distributed search • Use common metadata scheme (e.g., Dublin Core) and/or determine other common fields or field mappings for each database • Search engine provides parallel search, integrated ranking and integrated results • The separate databases can be maintained and updated separately • Elsevier is currently unifying its own sources in such a model with a ‘web service’ architecture • Such services can also be offered externally
Distributed search • Simplifies some business issues, but still requires common technology platform • Where common platform not possible, can use federated search (i.e., metasearch) • Translate queries • Access and perform parallel search of multiple search engines (vs. multiple databases) • Integrate results as well as possible • Use standards to approximate distributed search • Uniform access, one query language (Z39.50, updated) • Add standards for relevancy ranking and results return? • NISO and its members are working on standards
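The federated (metasearch) steps above can be sketched roughly as follows. `engine_a` and `engine_b` are hypothetical stand-ins for remote search engines with incompatible score scales, and dividing by each engine's top score is just one simple assumption about how to "integrate results". Query translation is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def engine_a(query):
    # Hypothetical engine that scores hits on a 0-1 scale.
    return [("a1", 0.9), ("a2", 0.3)]


def engine_b(query):
    # Hypothetical engine that scores hits on a 0-100 scale.
    return [("b1", 80.0), ("b2", 60.0)]


def normalize(results):
    """Rescale one engine's scores so its best hit scores 1.0."""
    top = max((score for _, score in results), default=0.0)
    if top <= 0:
        return [(doc, 0.0) for doc, _ in results]
    return [(doc, score / top) for doc, score in results]


def federated_search(query, engines):
    """Query all engines in parallel, then merge into one ranked list."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged = [hit for results in result_lists for hit in normalize(results)]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)


merged = federated_search("penicillin allergy", [engine_a, engine_b])
for doc_id, score in merged:
    print(doc_id, round(score, 2))
```

Because each engine's scores mean something different, the merged order is only approximate, which is why the slide asks about standards for relevancy ranking and results return.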
Finding the best • More data can also make finding the best documents harder • For searches on rare items, more data is a win • For all other searches, it’s more likely your answer is in there…but can be a problem too • Why? Relevancy is good but… • Relevancy has its limits • “I need information on depression” • “Ok…here are 2,352 articles and 87 books” • Need a dialog… “what kind of depression?” … “psychological” … “what about it?” • Underlying problem: most searches are under-specified
One solution: clustering documents • Group results around common themes: same author, web site, journal, subject… • Blurt out largest/most interesting categories: the inarticulate librarian model • Depression: psychology, economics, meteorology, antiques… • Psychology: treatment of depression, depression symptoms, seasonal affective… • Psychology: Kocsis, J. (10), Berg, R. (8), … • Themes could come from static metadata or dynamically by analysis of results text • Static: fixed, clear categories and assignments • Dynamic: doesn’t require metadata/taxonomy
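A minimal sketch of the static-metadata variant: group a result list by a metadata field, report the largest categories, then narrow and regroup. The records, field names and helper functions are invented for illustration. Dynamic clustering from the results text would need a text-analysis step that is not shown.

```python
from collections import Counter

# Toy result list for the query "depression" (invented records).
results = [
    {"title": "Result 1", "subject": "psychology", "author": "Kocsis, J."},
    {"title": "Result 2", "subject": "psychology", "author": "Kocsis, J."},
    {"title": "Result 3", "subject": "psychology", "author": "Berg, R."},
    {"title": "Result 4", "subject": "economics", "author": "Author A"},
    {"title": "Result 5", "subject": "meteorology", "author": "Author B"},
]


def cluster_counts(results, field):
    """Return (value, count) pairs for a metadata field, largest first."""
    return Counter(r[field] for r in results if field in r).most_common()


def refine(results, field, value):
    """Keep only the results that fall in the chosen category."""
    return [r for r in results if r.get(field) == value]


by_subject = cluster_counts(results, "subject")
print(by_subject)

# The user picks "psychology"; the narrowed list is clustered again by author.
psychology = refine(results, "subject", "psychology")
by_author = cluster_counts(psychology, "author")
print(by_author)
```

Each refinement step plays the librarian's part of the dialog: it offers the biggest categories and lets the user choose one.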
Clustering benefits • Disambiguates and refines search results to get to documents of interest quickly • Can navigate long result lists hierarchically • Would never offer thousands of choices to choose from as input… • Access to bottom of list…maybe just less common • Discovery – new aspects or sources • Can narrow results *after* search • Start with the broadest area search – don’t narrow by subject or other categories first • Easier, and you can’t guess wrong, miss useful categories, or pick unneeded ones…results-driven • Knee surgery: cartilage replacement, plastics, …
Answering hard questions • Main problem is still short searches/under-specification • One solution: Relevance feedback – marking good and bad results • A long-standing and proven search refinement technique • More information is better than less • Pseudo-relevancy feedback is a research standard • Most commercial forms not widely used… • …but PubMed is an exception • A catch: Must first find a good document to be similar to…may be hard or impossible
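One classic way to turn marked results into a refined query is the Rocchio formulation. The slide does not name it, so treat it as a swapped-in illustration. The sketch assumes bag-of-words vectors, and the `alpha`, `beta` and `gamma` weights are conventional illustrative values, not tuned ones.

```python
from collections import Counter


def vectorize(text):
    # Assumption: simple bag-of-words term counts.
    return Counter(text.lower().split())


def rocchio(query_vec, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward documents marked good and away from bad ones."""
    terms = set(query_vec)
    for doc in good_docs + bad_docs:
        terms |= set(doc)
    refined = {}
    for term in terms:
        pos = sum(d[term] for d in good_docs) / len(good_docs) if good_docs else 0.0
        neg = sum(d[term] for d in bad_docs) / len(bad_docs) if bad_docs else 0.0
        weight = alpha * query_vec[term] + beta * pos - gamma * neg
        if weight > 0:  # drop terms pushed to zero or below
            refined[term] = weight
    return refined


query = vectorize("depression")
good = [vectorize("seasonal depression treatment")]  # marked relevant
bad = [vectorize("economic depression 1929")]        # marked not relevant
expanded = rocchio(query, good, bad)
print(sorted(expanded.items()))
```

The refined query picks up "seasonal" and "treatment" from the good document, which is the sense in which more information is better than less. It still depends on the user having found a good document to mark.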
One solution: descriptive search • Let the user or situation provide the ideal “document” – a full problem description – as input in the first place • Can enter free text or specific documents describing the need, e.g., an article, grant proposal or experiment description • Might draw on user or query context -- user characteristics (MD or nurse), patient record,… • Use thesauri, domain knowledge and limited natural language processing to identify must-haves • Main focus, pre-existing conditions, etc. • Should provide the best possible search short of real language understanding
Summarize, discover & analyze • How do you summarize a corpus? • May want to report on what’s present, numbers of occurrences, trends, etc. • Ex: What diseases are studied the most? • Must know all diseases and look one by one • How do you find a relationship if you don’t know what relationships exist? • Ex: does gene p53 relate to any disease? • Must check for each possible relationship • Ad hoc analysis • How do all genes relate to this one disease? Over time? What organisms has the gene been studied in? Show me the document evidence
One solution: text mining • Identify entities (things) in a text corpus • Examples: authors, universities… diseases, drugs, side-effects, genes…companies, lawsuits, plaintiffs, defendants… • Use lexicons, patterns, NLP for finding any or all instances of the entity (including new ones) • Identify relationships: • Through co-occurrence • Relationship presumed from proximity • Example: author-university affiliation • Through limited natural language processing • Semantic relations – causes, is-part-of, etc. • Examples: drug-causes-disease, drug-treats-disease • Identify appropriate verbs, recognize active vs. passive voice, resolve anaphora (…it causes…)
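A minimal sketch of the lexicon plus co-occurrence route, assuming tiny hand-made lexicons and treating "same sentence" as the proximity window. The corpus sentences are invented for illustration and are not drawn from any real abstracts. The NLP-based route (verbs, voice, anaphora) is not attempted here.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical lexicons; a real system would use curated vocabularies.
GENES = {"p53", "brca1"}
DISEASES = {"leukemia", "breast cancer"}


def find_entities(sentence, lexicon):
    """Return the lexicon terms that occur as whole words in the sentence."""
    text = sentence.lower()
    return {
        term for term in lexicon
        if re.search(r"\b" + re.escape(term) + r"\b", text)
    }


def cooccurrences(corpus):
    """Count gene-disease pairs that appear in the same sentence."""
    pairs = Counter()
    for doc in corpus:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            genes = find_entities(sentence, GENES)
            diseases = find_entities(sentence, DISEASES)
            pairs.update(product(sorted(genes), sorted(diseases)))
    return pairs


corpus = [
    "Mutations in p53 are frequent in leukemia. BRCA1 was also sequenced.",
    "p53 status was assessed in leukemia patients.",
    "BRCA1 carriers have elevated breast cancer risk.",
]
pairs = cooccurrences(corpus)
print(pairs.most_common())
```

Co-occurrence only presumes a relationship from proximity; it cannot say whether the gene causes, treats or is merely mentioned alongside the disease, which is where the limited natural language processing comes in.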
Elsevier pilot project • Goal: Demonstrate real value to a working expert in 90 days • Chose biomedical domain • Hired expert to help define entities and relationships • Used 25,000 abstracts from 23 Elsevier journals • Worked with text mining vendor to define and revise extraction of entities and relationships
Pilot scenarios • Answered real questions using real data – not a demo or mock-up • The user: • anyone involved in genomic academic research: a primary researcher, graduate student or post-doc • Scenario 1: Research about gene p53 • What journals should I publish in? • Who’s an expert I can ask for advice? • What connections have been made to my gene? • What organisms have my gene?
Pilot scenarios • Scenario 2: Disease research • What diseases are most researched? • What’s the time trend in HIV research? • What are the centers of HIV research? • Who are the author teams in HIV? • What gene-disease relationships are there? What were they at the start, in 1996? Through 1997? • (Note: Cannot practically answer the above with search alone)
Author teams in HIV research?
Pilot scenarios • Scenario 3: Connections between leukemia and Alzheimer’s • Are there direct connections between leukemia and Alzheimer’s? • What enzymatic activity is associated with leukemia? • Are there indirect connections between leukemia and Alzheimer’s mediated by enzymatic activity?
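Once disease-enzyme relations have been extracted, the indirect-connection question reduces to finding shared intermediates. The sketch below assumes the relations are available as simple pairs. The enzyme names are placeholders, not findings from the pilot.

```python
from collections import defaultdict

# Hypothetical extracted relations: (disease, associated enzyme).
relations = [
    ("leukemia", "enzyme_A"),
    ("leukemia", "enzyme_B"),
    ("alzheimers", "enzyme_B"),
    ("alzheimers", "enzyme_C"),
]


def indirect_links(relations, source, target):
    """Return the intermediates associated with both source and target."""
    neighbors = defaultdict(set)
    for disease, enzyme in relations:
        neighbors[disease].add(enzyme)
    return sorted(neighbors[source] & neighbors[target])


shared = indirect_links(relations, "leukemia", "alzheimers")
print(shared)
```

A search engine has no direct way to ask this, since the linking enzyme is not known in advance. With extracted relations it becomes a set intersection.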
Indirect links from leukemia to Alzheimer’s via enzymes
Legend: Red – Product • Pink – Reactant • Green – Reagent • Brown – Solvent • …
The power of text mining • Almost impossible to determine manually • Can provide completely unexpected relationships between source and target • Catch: must do the work domain by domain • Silver lining: can build on preceding work
Long-term: answer any question • Must recognize multiple (any) entities and relationships • Must recognize all forms of linguistic relationship • Must have background of common sense information (or enough entities/relations?) • Information on donors (to political parties) • For now, building text miners, domain by domain, is perhaps the best we can do • Can build on preceding pieces…e.g., if you know drugs, diseases and drug-disease causation, can try to recognize ‘advancements in drug therapy’
Summary • Need to search more broadly, more easily • Larger databases • Distributed search • Need to locate best documents in even larger (distributed) databases • Clustering to find documents of real interest • Need to answer complex questions • Descriptive search • Need to go beyond search for overviews, relationship discovery and analysis • Text-based data mining • Through text mining (perhaps), approach full natural language understanding