Explore the advancements in search and discovery technology, including integrated search, distributed search, and finding the best documents through clustering. Presented by Marc Krellenstein, VP of Search and Discovery Advanced Technology Group, on February 5, 2004.
Technology for integrated access and discovery Presented by: Marc Krellenstein Title: VP, Search and Discovery Advanced Technology Group Date: February 5, 2004
Basic search is pretty good • Modern search engines are fast and scalable • Having the data (usually lots) is still key • Can interpret keyword, Boolean and pseudo-natural language queries • Ex: “how to make an international call with my Blackberry” • Spell checking, thesauri and stemming to improve recall • Users are more experienced • More multi-term searches • Gets lots of hits, but that’s usually OK if good ones on top
Basic search is pretty good • Best practice relevancy ranking is good: • Term frequency (TF): more hits count more • Inverse document frequency (IDF): hits of rarer search terms count more • Ex: diabetes diagnosis and treatment • Hits of search terms near each other count more • Ex: penicillin allergy vs. “penicillin allergy” • Hits on metadata (title, subject, etc.) count more • Use anchor text – referring text – as metadata • Items with more links/references to them count more • Authoritative links/referrers count yet more • Many other factors: length, date, etc.
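The TF and IDF bullets can be illustrated with a minimal sketch. The toy documents, the function name `tf_idf_scores`, and the particular smoothed IDF formula (`log(N / df) + 1`) are assumptions made for illustration; real engines use their own variants and add the other factors listed on the slide.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document by summing TF * IDF over the query terms."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term appears nowhere, contributes nothing
            # Rarer terms get a larger weight (assumed smoothing: +1).
            idf = math.log(n_docs / df[term]) + 1.0
            score += tf[term] * idf
        scores.append(score)
    return scores

# Hypothetical three-document collection.
docs = [
    "diabetes treatment and diabetes diagnosis",
    "treatment of allergy",
    "diagnosis and treatment of fractures",
]
scores = tf_idf_scores("diabetes diagnosis treatment", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In this toy collection "treatment" occurs in every document and so adds little, while the rarer "diabetes" dominates the score of the first document.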
Basic search is pretty good • Using these techniques search engines can locate specific documents, or good documents (if not the absolute best) around general or specific topics • But challenges remain…
Current challenges • Integrated search: Content still exists in separate silos • Silos getting bigger but there are still too many • Library patrons have dozens of choices • Putting even more into Google is probably not sufficient to solve the problem • Finding the best/novel documents • Hard to perform complicated searches (e.g., research similar to one’s own) • Historians can’t define a profile… • Discovery • Hard to do more than search: summarize, uncover novelty and relationships, analyze
The integration challenge • Two approaches: • Build even bigger databases (well, yes…) • Not easy, but sometimes the easiest approach • Can be difficult to manage and secure appropriate rights • Distribute search: Search separately managed (or owned) large databases as if they are one • Technically more challenging, but a scalable and maintainable architecture
Distributed search • Index multiple (maybe geographically) separate databases with a single search engine that supports distributed search • Use common metadata scheme (e.g., Dublin Core) and/or determine other common fields or field mappings for each database • Search engine provides parallel search, integrated ranking and integrated results • The separate databases can be maintained and updated separately • Elsevier is currently unifying its own sources in such a model with a ‘web service’ architecture • Has contributed specifications to the public domain • Such services can also be offered externally
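A rough sketch of the idea on this slide: separately maintained databases are mapped to a common field, searched in parallel, and ranked as one list. Everything here is hypothetical (the shard names, the local field names, the term-count scoring); it assumes a single scoring function is applied to every database, which is what makes the scores directly comparable.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical databases, each with its own local name for the title field.
SHARDS = {
    "journals": {
        "title_field": "dc:title",
        "records": [
            {"dc:title": "Gene therapy for leukemia"},
            {"dc:title": "Leukemia gene expression and gene regulation"},
        ],
    },
    "books": {
        "title_field": "book_title",
        "records": [
            {"book_title": "Introduction to gene regulation"},
            {"book_title": "Clinical cardiology"},
        ],
    },
}

def search_shard(name, shard, terms):
    """Search one database; score = number of query-term hits in the title."""
    hits = []
    for record in shard["records"]:
        title = record[shard["title_field"]]  # field mapping to a common "title"
        words = title.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            hits.append((score, name, title))
    return hits

def distributed_search(shards, query, top_k=3):
    """Search all databases in parallel and return one integrated ranking."""
    terms = query.lower().split()
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_shard, name, shard, terms)
                   for name, shard in shards.items()]
        all_hits = [hit for future in futures for hit in future.result()]
    # Same scoring everywhere, so hits can be sorted together.
    all_hits.sort(key=lambda hit: (-hit[0], hit[1], hit[2]))
    return all_hits[:top_k]

results = distributed_search(SHARDS, "gene leukemia")
for score, source, title in results:
    print(score, source, title)
```

Each database can be updated independently; only the field mapping and the shared scoring need to stay in agreement.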
Distributed search • Simplifies some business issues, but still requires common technology platform • Where common platform not possible, add federated search (i.e., metasearch) • Translate queries • Access and perform parallel search of multiple search engines (vs. multiple databases) • Integrate results as well as possible • Use standards to approximate distributed search • Uniform access, one query language (Z39.50, updated) • Add standards for relevancy ranking and results return? • NISO and its members are working on standards
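In federated search each engine reports scores on its own scale, so "integrate results as well as possible" usually means some heuristic. The sketch below uses min-max normalization per engine, which is one common heuristic and not something the slides prescribe; the engine names, document IDs, and scores are made up.

```python
def normalize(hits):
    """Min-max scale one engine's raw scores into the range [0, 1]."""
    raw = [score for _, score in hits]
    low, high = min(raw), max(raw)
    if high == low:
        return [(doc, 1.0) for doc, _ in hits]
    return [(doc, (score - low) / (high - low)) for doc, score in hits]

def merge_results(results_by_engine):
    """Merge per-engine hit lists into one ranking of (doc, score) pairs."""
    merged = {}
    for hits in results_by_engine.values():
        if not hits:
            continue
        for doc, score in normalize(hits):
            # A document returned by several engines keeps its best score.
            merged[doc] = max(score, merged.get(doc, 0.0))
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical raw results: engine_a scores are unbounded, engine_b's are 0..1.
results_by_engine = {
    "engine_a": [("doc1", 12.0), ("doc2", 6.0), ("doc3", 3.0)],
    "engine_b": [("doc2", 0.9), ("doc4", 0.5), ("doc5", 0.1)],
}
merged = merge_results(results_by_engine)
for doc, score in merged:
    print(doc, round(score, 2))
```

The weakness is visible even here: the lowest hit from every engine is pushed to zero regardless of how good it actually is, which is one reason shared standards for relevancy ranking would help.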
Finding the best: Navigation • More data can also make finding the best or novel documents harder • For searches for rare items, more data is a win • For all other searches, it’s more likely your answer is in there…but it’s also more likely there’s lots of other stuff close but not as good • Why? Relevancy is good, but… • Relevancy has its limits…there may be many ‘good’ documents referring to different aspects of the search…which is the best? • Underlying problems: • User’s needs may not be that specific • Even long searches are under-specified
One solution: clustering documents • Group results around common themes: same subject, author, web site, journal,… • Show largest/most interesting categories • Depression psychology, economics, meteorology, antiques… • Psychology treatment of depression, depression symptoms, seasonal affective… • Psychology Kocsis, J. (10), Berg, R. (8), … • Themes could come from static metadata or dynamically by analysis of results text • Static: fixed, clear categories and assignments • Dynamic: doesn’t require metadata (or controlled vocabulary to draw from)
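The static-metadata variant can be sketched as a simple facet count over a result list, followed by narrowing on the chosen category. The records, field names, and counts below are invented for illustration; dynamic clustering from result text would need a real clustering algorithm and is not shown.

```python
from collections import Counter

def facet_counts(results, field, top_n=3):
    """Count results per value of a metadata field, largest groups first."""
    counts = Counter()
    for record in results:
        values = record.get(field, [])
        if isinstance(values, str):
            values = [values]  # allow single-valued fields
        counts.update(values)
    return counts.most_common(top_n)

def narrow(results, field, value):
    """Keep only the results that fall in the chosen category."""
    kept = []
    for record in results:
        values = record.get(field, [])
        if isinstance(values, str):
            values = [values]
        if value in values:
            kept.append(record)
    return kept

# Hypothetical result list for the query "depression".
results = [
    {"title": "Treatment of depression", "subject": "psychology", "author": "Kocsis, J."},
    {"title": "Depression symptoms", "subject": "psychology", "author": "Berg, R."},
    {"title": "Seasonal affective disorder", "subject": "psychology", "author": "Kocsis, J."},
    {"title": "The Great Depression", "subject": "economics", "author": "Smith, A."},
    {"title": "Depression-era banking", "subject": "economics", "author": "Smith, A."},
    {"title": "Tropical depressions", "subject": "meteorology", "author": "Lee, K."},
]

by_subject = facet_counts(results, "subject")
by_author = facet_counts(narrow(results, "subject", "psychology"), "author")
print(by_subject)
print(by_author)
```

The second call mirrors the slide's hierarchy: first pick the subject, then see the authors within it.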
Clustering benefits • Disambiguates and refines search results to get to documents of interest quickly • Can navigate long result lists hierarchically • Would never offer thousands of choices to choose from as input… • Access to bottom of list…maybe just less common • Discovery – new aspects or sources • Can narrow results *after* search • Start with the broadest area search – don’t narrow by subject or other categories first • Easier, plus can’t guess wrong, miss useful, or pick unneeded, categories…results-driven • Knee surgery cartilage replacement, plastics, …
Finding the best: Complex search • Main problem is still short searches/under-specification…which the keyword-based ‘enter a query’ paradigm encourages • One solution: Relevance feedback – marking good and bad results • A long-standing and proven search refinement technique • More information is better than less (longer queries are better) • Pseudo-relevance feedback is a research standard • Commercial forms (find-similar, etc.) not widely used (or well executed)… • …but successful in PubMed (different users)
Relevance feedback • One catch: Must first find a good document to be similar to • Solution: Let the user provide the ideal document – or a long query or problem statement – as input in the first place • Can enter free text or specific documents describing the interest, e.g., article, grant proposal, experiment description, etc. • Should provide the best possible matches
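Using a whole document or problem statement as the query can be sketched with a plain bag-of-words cosine similarity. This is a deliberately simple stand-in, not the method of any particular product; the texts and function names are assumptions, and a real system would add term weighting, stemming, and stop-word handling.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn text into a term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def find_similar(example_text, corpus):
    """Rank corpus documents by similarity to a user-supplied example."""
    query = vectorize(example_text)
    scored = [(cosine(query, vectorize(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical problem statement and corpus.
example = "cartilage replacement after knee surgery in athletes"
corpus = [
    "knee surgery and cartilage replacement outcomes",
    "plastics used in packaging",
    "seasonal affective depression symptoms",
]
ranked = find_similar(example, corpus)
for score, doc in ranked:
    print(round(score, 3), doc)
```

The longer the example, the more terms contribute to the match, which is the "more information is better than less" point in practice.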
Discovery challenge: Beyond search • How do you summarize a corpus? • May want to report on what’s present, numbers of occurrences, trends, etc. • Ex: What diseases are studied the most? • Must know all diseases and look one by one • How do you find a relationship if you don’t know what relationships exist? • Ex: Does gene p53 relate to any disease? • Must check for each possible relationship • Ad hoc analysis • How do all genes relate to this one disease? Over time? What organisms have the gene been studied in? Show me the document evidence…
One solution: entity extraction • Identify entities (things) in a text corpus • Examples: authors, universities… diseases, drugs, side-effects, genes…companies, law suits, plaintiffs, defendants… • Use lexicons, patterns, NLP for finding any or all instances of the entity • Identify relationships: • Through co-occurrence • Relationship presumed from proximity • Example: author-university affiliation • Through limited natural language processing • Semantic relations – causes, is-part-of, etc. • Examples: drug-causes-disease…drug-is-treatment-for-disease…a is suing b…
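A minimal sketch of the lexicon-plus-co-occurrence approach: entities are found by matching small lexicons, and a relationship is presumed whenever a gene and a disease appear in the same sentence. The lexicons and sentences are invented for illustration and make no biomedical claims; this is not ClearForest's method, and semantic relations such as "causes" would need real NLP.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical lexicons; real ones would hold thousands of terms and synonyms.
GENES = {"p53", "brca1"}
DISEASES = {"leukemia", "breast cancer"}

def extract(sentence, lexicon):
    """Return the lexicon terms that occur as whole words in the sentence."""
    text = sentence.lower()
    return {term for term in lexicon
            if re.search(r"\b" + re.escape(term) + r"\b", text)}

def cooccurrences(sentences):
    """Count gene-disease pairs that appear together in a sentence."""
    pairs = Counter()
    for sentence in sentences:
        genes = extract(sentence, GENES)
        diseases = extract(sentence, DISEASES)
        for gene, disease in product(genes, diseases):
            pairs[(gene, disease)] += 1
    return pairs

# Made-up example sentences.
sentences = [
    "Mutations in p53 are frequent in leukemia.",
    "BRCA1 is linked to breast cancer, and p53 may play a role.",
    "p53 expression was measured in leukemia patients.",
    "The weather was mild.",
]
pairs = cooccurrences(sentences)
for (gene, disease), count in sorted(pairs.items()):
    print(gene, disease, count)
```

Once pairs are counted like this, questions such as "does gene p53 relate to any disease?" become a lookup rather than one search per candidate disease.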
ClearForest pilot, Fall 2002 • Goal: Demonstrate real value to a working expert in 90 days • Chose biomedical domain • Hired expert to help define entities and relationships • Used 25,000 abstracts from 23 Elsevier journals • Worked with ClearForest to define and revise extraction of entities and relationships • Have related partnership with Stanford for text mining
Pilot scenarios • Answered real questions using real data – not a demo or mock-up • The user: • anyone involved in genomic academic research: a primary researcher, graduate student or post-doc • Scenario 1: Research about gene p53 • What journals should I publish in? • Who’s an expert I can ask for advice? • What connections have been made to my gene? • What organisms have my gene?
Pilot scenarios • Scenario 2: Disease research • What diseases are most researched? • What’s the time trend in HIV research? • What are the centers of HIV research? • Who are the author teams in HIV? • What gene-disease relationships are there? What were they to start in 1996? through 1997? • (Note: Cannot answer the above with search alone)
Author teams in HIV research?
Pilot scenarios • Scenario 3: Connections between leukemia and Alzheimer’s • Are there direct connections between leukemia and Alzheimer’s? • What enzymatic activity is associated with leukemia? • Are there indirect connections between leukemia and Alzheimer’s mediated by enzymatic activity?
Indirect links from leukemia to Alzheimer’s via enzymes
The power of indirect links • Almost impossible to determine manually • Can provide completely unexpected relationships between source and target
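Once relationships have been extracted, indirect links reduce to a graph query: find the intermediate entities connected to both the source and the target. The sketch below uses placeholder enzyme names so as not to suggest any real biomedical finding; the relation list and function names are hypothetical.

```python
from collections import defaultdict

def build_graph(relations):
    """Build an undirected graph from extracted (entity, entity) relations."""
    graph = defaultdict(set)
    for a, b in relations:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def directly_linked(graph, source, target):
    """True if the two entities were ever related directly."""
    return target in graph.get(source, set())

def indirect_links(graph, source, target):
    """Return the intermediates related to both source and target."""
    shared = graph.get(source, set()) & graph.get(target, set())
    return sorted(shared)

# Placeholder relations, as an extraction step might produce them.
relations = [
    ("leukemia", "enzyme_A"),
    ("leukemia", "enzyme_B"),
    ("alzheimers", "enzyme_B"),
    ("alzheimers", "enzyme_C"),
]
graph = build_graph(relations)
print(directly_linked(graph, "leukemia", "alzheimers"))
print(indirect_links(graph, "leukemia", "alzheimers"))
```

Doing this by hand would mean reading every document about each disease and cross-checking every enzyme mentioned, which is why such links are almost impossible to find manually.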
The value of analytics • Goes beyond search – summarizes, shows relationships, answers complex questions • A significant value-added service • Value of one new drug discovery?
Summary • Need to search more broadly, more easily • Larger databases • Distributed search • Need to locate best/novel documents in even larger (distributed) databases • Clustering to find documents of real interest • Find-similar, descriptive search • Need to go beyond search for overviews, relationships and discovery • Text-based data mining and entity extraction