
Unlocking Novel Hypotheses: Re-Conceptualizing Literature-Based Discovery

Explore the strategy of Literature-Based Discovery (LBD), an under-recognized tool for scientists to generate new hypotheses by combining information from various papers. Learn about the origins, research articles, software, and challenges in LBD development, along with advancements like the two-node search. Discover how implicit information can lead to groundbreaking discoveries in various fields.


Presentation Transcript


  1. Re-Conceptualizing Literature-Based Discovery. Neil R. Smalheiser, March 29, 2008

  2. What is LBD? A strategy for uncovering novel hypotheses
     • Advocated by Don Swanson
     • Examples: magnesium-migraine, fish oil-Raynaud’s
     • The key idea: putting together explicit assertions from different papers to form new implicit assertions
     • This holds regardless of how it is done, how the implicit assertions are assessed, or whether the implicit assertions are correct!

  3. What is LBD? A routine way of life for scientists, greatly under-recognized!
     Not just background reading, and not just identifying anomalies or “critical incidents” that appear (explicitly) in a paper.
     Since 1996: 8 papers with Swanson, 40 without (i.e. non-one node search); 24 are biological (i.e. non-informatics modeling): 9/24 = 3/8 > 1/3
     • Proteins in unexpected locations (Molec. Biol. Cell, 1996)
     • Expression of reelin in the blood (PNAS, 2000)
     • Reelin and schizophrenia (PNAS, 2000)
     • Fluoxetine and neurogenesis (Eur. J. Pharmacol., 2001)
     • RNAi and memory (Trends in Neurosci., 2001)
     • Bath toys (New Engl. J. Med., 2003)
     • Dicer and calpain (J. Neurochem., 2005)
     • Exosomal transfer of proteins & RNAs at synapses (Biol Direct, 2007)
     • microRNA machinery and regulation by phosphorylation (BBA, 2008)

  4. What is LBD? A body of research articles, software and websites
     • Mostly by information scientists and computer scientists
     • Mostly concerned with “open discovery” or “one node searches”: begin with a set of articles A that represents a problem
     • Mostly use “B-terms” present in A to expand the search and find disparate lits Ci that share B-terms with A
     • Try to find the Ci that is disparate yet “most similar” to A
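The one-node ("open discovery") search described on this slide can be sketched roughly as follows. This is a minimal illustration, not any published LBD system: the function name `open_discovery`, the inverted index `term_index`, and the `known_to_a` set (literatures already directly connected to A, hence not disparate) are all hypothetical, and counting shared B-terms is only a crude stand-in for "most similar".

```python
from collections import defaultdict

def open_discovery(a_terms, term_index, known_to_a=frozenset()):
    """Expand from the B-terms of literature A to candidate literatures Ci.

    a_terms:    set of B-terms found in literature A (hypothetical input).
    term_index: dict mapping a B-term to the set of literature names that use it.
    known_to_a: literatures already linked to A, excluded as not disparate.
    """
    links = defaultdict(set)
    for b_term in a_terms:
        for lit in term_index.get(b_term, ()):
            if lit not in known_to_a:
                links[lit].add(b_term)
    # Rank candidates by number of linking B-terms, ties broken by name.
    ranked = [(lit, sorted(b_terms)) for lit, b_terms in links.items()]
    ranked.sort(key=lambda item: (-len(item[1]), item[0]))
    return ranked

# Invented toy index, for illustration only.
index = {"b1": {"C1", "C2"}, "b2": {"C1", "C3"}, "b9": {"C4"}}
print(open_discovery({"b1", "b2"}, index, known_to_a={"C3"}))
# [('C1', ['b1', 'b2']), ('C2', ['b1'])]
```

Because every B-term can pull in many literatures, the candidate list grows quickly with the size of A, which is the combinatorial explosion raised later in the talk.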

  5. What is LBD? Other researchers employ implicit information too
     • Bioinformatics
     • Gene-gene interactions
     • Protein-protein interactions
     • Web search
     • Author disambiguation
     • Text mining
     Yet these are not viewed as examples of LBD for some reason!

  6. Has the LBD field stagnated and not fulfilled its promise?
     • Kostoff critique(s)
     • “What is a discovery” vs. an “innovation”
     • Argues against frequency-based ranking
     • Uses very high recall; hundreds of “discoveries” claimed per question
     • “Swanson’s legacy”: Swanson refs ended 2001!
     • Bork review: Swanson refs ended 1996!
     • Few gold standards are available (Mg, fish oil worn out)
     • Combinatorial explosion in the A – B – C search method
     • Impossible standards for what counts as an LBD prediction (never considered, never tested, must shatter a paradigm but must be proven experimentally??)
     • Excluding active approaches other than “one node search” as being LBD

  7. Well, what DO we know about progress in LBD?
     • The two-node search
     • http://arrowsmith.psych.uic.edu
     • Begin with two lits A and C that represent a known finding or a hypothesis (estrogen-AD)
     • Look for meaningful links (whether or not A and C are disparate)
     • We use B-terms extracted from titles
     • Could use abstracts, MeSH, triples…
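A two-node search over title B-terms can be sketched as a set intersection. This is a simplification under stated assumptions, not the Arrowsmith implementation: it uses single words only, a tiny hand-picked `STOPWORDS` set, and invented example titles; the helper names `title_terms` and `two_node_b_terms` are hypothetical.

```python
import re

# Small illustrative stopword list (an assumption, not a real tool's list).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def title_terms(titles):
    """Collect lowercase single-word terms from titles, minus stopwords."""
    terms = set()
    for title in titles:
        for word in re.findall(r"[a-z0-9]+", title.lower()):
            if word not in STOPWORDS:
                terms.add(word)
    return terms

def two_node_b_terms(a_titles, c_titles):
    """Return the terms that occur in titles of both literatures, sorted."""
    return sorted(title_terms(a_titles) & title_terms(c_titles))

# Invented titles standing in for literatures A and C.
a_titles = ["Estrogen and antioxidant defense in neurons"]
c_titles = ["Antioxidant therapy in Alzheimer disease"]
print(two_node_b_terms(a_titles, c_titles))  # ['antioxidant']
```

Swapping `title_terms` for a function that reads abstracts, MeSH headings, or triples would change the source of B-terms without changing the intersection step.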

  8. Modeling the Two-Node Search-1
     • Field testers, free-form use of the tool
     • Chose 6 two-node searches as gold standards: not too big or small, disparate, topically coherent, clean questions
     • E.g. for A = retinal detachment, C = aortic aneurysm: a) find diseases in which both features appear [not necessarily in the same person] or b) find surgical procedures that have been applied to both conditions
     • Manually marked relevant B-terms for a given query (sometimes several queries for the same two-node search)
     • Details in Bioinformatics (2007) paper

  9. Modeling the Two-Node Search-2
     • Used 8 complementary features to score each B-term (e.g. recency, frequency, semantic categories)
     • Created a single combined and weighted score for each B-term
     • Used a logistic regression model to optimally weight each feature so as to separate marked relevant B-terms from all others (mixed set)
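The idea of learning feature weights with logistic regression can be illustrated with a small pure-Python sketch. It is not the model from the Bioinformatics (2007) paper: it uses two invented features (recency and frequency, scaled to [0, 1]) instead of eight, invented training labels, and plain batch gradient descent; `predict` and `fit_logistic` are hypothetical names.

```python
import math

def predict(weights, features):
    """Estimated probability of relevance; weights[0] is the bias term."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit logistic-regression weights by batch gradient descent on log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict(weights, features) - label
            gradient[0] += error
            for j, x in enumerate(features):
                gradient[j + 1] += error * x
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights

# Invented toy data: each row is (recency, frequency);
# label 1 means a user marked the B-term as relevant.
rows = [(0.9, 0.8), (0.8, 0.6), (0.7, 0.9), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
labels = [1, 1, 1, 0, 0, 0]
weights = fit_logistic(rows, labels)
print(predict(weights, (0.85, 0.7)))  # should score high
print(predict(weights, (0.15, 0.2)))  # should score low
```

The fitted weights play the role of the "single combined and weighted score": each B-term's features collapse into one probability that can be sorted.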

  10. Modeling the Two-Node Search-3

  11. Two End-Points of this Research
      • For any two-node search, we can now rank the list of B-terms in order of estimated probability that they will be marked as relevant (meaningful) by SOME user for SOME query.
      • For any pair of lits A and C, we can now estimate the OVERALL shared implicit information between A and C (= % of B-terms that are predicted to be relevant).
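Both end-points reduce to simple operations once each B-term has an estimated probability. The sketch below assumes such probabilities already exist (the values are invented) and assumes a cutoff of 0.5 for "predicted to be relevant", which the slide does not specify; the function names are hypothetical.

```python
def rank_b_terms(probabilities):
    """Sort (term, probability) pairs from most to least likely relevant."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)

def shared_implicit_information(probabilities, threshold=0.5):
    """Percentage of B-terms whose estimated probability meets the threshold."""
    if not probabilities:
        return 0.0
    predicted = sum(1 for p in probabilities.values() if p >= threshold)
    return 100.0 * predicted / len(probabilities)

# Invented probabilities for four B-terms shared by literatures A and C.
probs = {"t1": 0.9, "t2": 0.2, "t3": 0.6, "t4": 0.1}
print(rank_b_terms(probs)[0])              # ('t1', 0.9)
print(shared_implicit_information(probs))  # 50.0
```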

  12. Relevance to the One Node Search
      We can re-conceptualize the one-node search as a series of two-node searches:
      • Choose A, then choose category C
      • Divide category C densely into many small, coherent Ci
      • For each Ci, score multi-dimensional features, including, but not limited to, features that relate A to Ci (e.g. number of B-terms in common or % predicted relevant B-terms)
      • Rank the Ci to identify the most promising lits (which are presumed to point to novel hypotheses or implicit information helpful when applied to A)
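The steps on this slide can be sketched as a loop of two-node searches, one per sub-literature Ci. Everything concrete here is an assumption for illustration: the term sets are invented, `relevance_prob` is a hypothetical stand-in for a trained B-term model, the 0.5 threshold is arbitrary, and ranking by percentage before count is just one possible ordering.

```python
def score_pair(a_terms, c_terms, relevance_prob, threshold=0.5):
    """Two-node features for one (A, Ci) pair."""
    b_terms = a_terms & c_terms
    relevant = [t for t in b_terms if relevance_prob(t) >= threshold]
    pct = 100.0 * len(relevant) / len(b_terms) if b_terms else 0.0
    return {"n_b_terms": len(b_terms), "pct_relevant": pct}

def rank_sub_literatures(a_terms, sub_lits, relevance_prob):
    """Score every Ci against A, then rank the Ci by those scores."""
    scored = [(name, score_pair(a_terms, terms, relevance_prob))
              for name, terms in sub_lits.items()]
    scored.sort(key=lambda item: (-item[1]["pct_relevant"],
                                  -item[1]["n_b_terms"], item[0]))
    return scored

# Invented term sets and probabilities.
a_terms = {"b1", "b2", "b3"}
sub_lits = {"C1": {"b1", "b2"}, "C2": {"b3"}, "C3": {"z"}}
toy_probs = {"b1": 0.9, "b2": 0.2, "b3": 0.8}
ranked = rank_sub_literatures(a_terms, sub_lits, lambda t: toy_probs.get(t, 0.0))
print([name for name, _ in ranked])  # ['C2', 'C1', 'C3']
```

The work grows linearly with the number of Ci, since each Ci is compared against A only once.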

  13. A is evaluated pairwise against each Ci in C = C1, C2, C3, C4, …
      • The link to C1 might involve B-terms; the link to C2 might not!
      • E.g. A = Huntington Disease; C = lifestyle factors, autophagy, or therapeutic agents

  14. “Interestingness” Measures
      • From the field of data mining.
      • This allows us to encode real-life priorities and strategies of working scientists:
      • Existing one node search looks for novelty, relevance, non-triviality, likelihood of being true …. [get low hanging fruit]
      • What about actionability, feasibility of follow-up, surprisingness, cross-discipline, presence of high experimental support, generalizability to other problems, or high potential impact?
      • A candidate Ci could be interesting because it is recently discovered and rapidly growing (e.g. microRNAs), well characterized, [for a disease] has an animal model, [for a protein] is connected to many other proteins, [for a drug] has FDA approval.
      • This not only re-conceptualizes the one node search (e.g., no combinatorial explosion) but also generalizes the ranking methods.
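One simple way to combine B-term measures with "interestingness" measures is a weighted sum per candidate Ci. The talk does not prescribe a formula, so this is only a sketch: the feature names, their values, and the weights below are all invented, and a weighted sum is an assumed choice.

```python
def interestingness(features, weights):
    """Weighted sum of feature values; features without a weight contribute 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def rank_candidates(candidates, weights):
    """Rank candidate literatures Ci by their combined score, highest first."""
    scored = [(name, interestingness(feats, weights))
              for name, feats in candidates.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored

# Hypothetical weights reflecting one scientist's priorities.
weights = {"pct_relevant_b_terms": 1.0, "has_animal_model": 0.5, "fda_approved": 0.5}
candidates = {
    "C1": {"pct_relevant_b_terms": 0.5, "has_animal_model": 1.0, "fda_approved": 0.0},
    "C2": {"pct_relevant_b_terms": 0.75, "has_animal_model": 0.0, "fda_approved": 0.0},
}
print(rank_candidates(candidates, weights))  # [('C1', 1.0), ('C2', 0.75)]
```

Here C1 outranks C2 despite a weaker B-term score because it has an animal model, which is the kind of priority the slide suggests encoding.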

  15. Gold Standards for One-Node Searches
      • Also, we can now envision preparing a series of gold standard searches, even automatically (cf. TREC 2006, 2007).
      • Use implicit assertions to reconstruct explicit knowledge.
      • Use review articles;
      • lists (e.g. in the virus study, the gold standard was a list of viruses that were thought to be at risk of being exploited for biological warfare);
      • time slices.
      • Avoids the paradox that one node searches must predict things that have no experimental support!
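The slide does not spell out how "time slices" would be used. One common reading is: run the search on literature published before a cutoff year, then treat pairs that are first explicitly linked after the cutoff as the gold standard. The sketch below follows that assumed reading with invented records; `time_slice_gold_standard` and `recall_at_k` are hypothetical names.

```python
def time_slice_gold_standard(cooccurrences, cutoff_year):
    """Pairs whose first explicit co-mention falls after the cutoff year.

    cooccurrences: iterable of (year, term_a, term_c) records.
    """
    first_seen = {}
    for year, term_a, term_c in cooccurrences:
        pair = (term_a, term_c)
        if pair not in first_seen or year < first_seen[pair]:
            first_seen[pair] = year
    return {pair for pair, year in first_seen.items() if year > cutoff_year}

def recall_at_k(ranked_predictions, gold, k):
    """Fraction of gold pairs found among the top-k predictions."""
    if not gold:
        return 0.0
    hits = sum(1 for pair in ranked_predictions[:k] if pair in gold)
    return hits / len(gold)

# Invented records: (A, C1) was already known before the cutoff.
records = [(1995, "A", "C1"), (2003, "A", "C2"), (2005, "A", "C2"), (2004, "A", "C3")]
gold = time_slice_gold_standard(records, cutoff_year=2000)
predictions = [("A", "C2"), ("A", "C1"), ("A", "C3")]
print(recall_at_k(predictions, gold, k=2))  # 0.5
```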

  16. Conclusions
      • LBD is (can be, will be) alive and well!
      • Need to incorporate the types of real-life priorities and strategies of working scientists
      • Re-conceptualize the one node search as a series of two-node searches
      • Use “interestingness” measures to supplement B-term measures

  17. Journal of Biomedical Discovery and Collaboration
      • Unique multi-disciplinary audience
      • People who engage in scientific discovery and collaboration
      • People who make tools that enhance scientific discovery and collaboration
      • People who study scientific discovery and collaboration
      • Hosted by BioMed Central
      • Fully peer-reviewed
      • RAPID review (<3 weeks is routine)
      • Open-access, indexed in PubMed Central et al.
      • Readership goes up 10-100-fold
      • Impact goes up too…
      • Article fee reduced or zeroed depending on institution

  18. Acknowledgements
      • Don Swanson
      • Vetle Torvik
      • Wei Zhou (Clement Yu)
      • Marc Weeber

  19. Ruminations
      • Should LBD analyses be user-friendly? Popular??
      • Don’t they overlook true divergent discoveries?
      • Should LBD be run automatically as a program in the background, with alerts of possible discoveries?
      • Does LBD bypass, or reinforce, good old-fashioned hypothesis-driven science?
