Re-Conceptualizing Literature-Based Discovery Neil R. Smalheiser March 29, 2008
What is LBD? A strategy for uncovering novel hypotheses • advocated by Don Swanson • Magnesium-migraine, Fish oil-Raynaud’s • The key idea is: putting together explicit assertions from different papers to form new implicit assertions • This holds regardless of how it is done, how the implicit assertions are assessed, or whether the implicit assertions are correct!
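A minimal sketch of the key idea, assuming each explicit assertion can be reduced to a directed pair of terms. The function name `implicit_assertions` and the toy pairs are illustrative assumptions, not taken from the talk; real assertions would come from different papers.

```python
def implicit_assertions(explicit):
    """Derive (a, c) pairs that are linked through some shared b
    but are never stated explicitly as a pair themselves."""
    explicit = set(explicit)
    derived = {}
    for a, b in explicit:
        for b2, c in explicit:
            if b == b2 and a != c and (a, c) not in explicit:
                # Remember which intermediate term supports the implicit link.
                derived.setdefault((a, c), set()).add(b)
    return derived


# Toy explicit assertions (illustrative only).
explicit = [
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's"),
    ("fish oil", "platelet aggregation"),
    ("platelet aggregation", "Raynaud's"),
]

for pair, links in implicit_assertions(explicit).items():
    print(pair, "via", sorted(links))
```

The sketch says nothing about whether a derived pair is correct or interesting; it only shows how two explicit statements combine into one implicit one.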
What is LBD? A routine way of life for scientists • Greatly under-recognized! • Not just background reading, and not just identifying anomalies or “critical incidents” that appear (explicitly) in a paper • Since 1996: 8 papers with Swanson, 40 without (i.e. non-one node search), 24 are biological (i.e. non-informatics modeling): 9/24 = 3/8 > 1/3 • Proteins in unexpected locations (Molec. Biol. Cell, 1996) • Expression of reelin in the blood (PNAS, 2000) • Reelin and schizophrenia (PNAS, 2000) • Fluoxetine and neurogenesis (Eur. J. Pharmacol., 2001) • RNAi and memory (Trends in Neurosci., 2001) • Bath toys (New Engl. J. Med., 2003) • Dicer and calpain (J. Neurochem., 2005) • Exosomal transfer of proteins & RNAs at synapses (Biol. Direct, 2007) • microRNA machinery and regulation by phosphorylation (BBA, 2008)
What is LBD? A body of research articles, software and websites • mostly by information scientists and computer scientists • Mostly concerned with “open discovery” or “one node searches”, begin with a set of articles A that represents a problem • Mostly use “B-terms” present in A to expand the search, find disparate lits Ci that share B-terms with A • Try to find the Ci that is disparate yet “most similar” to A
What is LBD? Other researchers employ implicit information too • Bioinformatics • gene-gene interactions • protein-protein interactions • web search • author disambiguation • text mining • Yet these are not viewed as examples of LBD for some reason!
Has the LBD field stagnated and not fulfilled its promise? • Kostoff critique(s) • “what is a discovery” vs. an “innovation” • argues against frequency-based ranking • Uses very high recall, hundreds of “discoveries” claimed per question • “Swanson’s legacy”: refs to Swanson ended 2001! • Bork review: refs to Swanson ended 1996! • Few gold standards are available (Mg, fish oil worn out) • Combinatorial explosion of the A – B – C search method • Impossible standards for what counts as an LBD prediction (never considered, never tested, must shatter a paradigm but must be proven experimentally??) • Excluding active approaches other than “one node search” from counting as LBD
Well, what DO we know about progress in LBD? • The two-node search • http://arrowsmith.psych.uic.edu • Begin with two lits A and C that represent a known finding or a hypothesis (estrogen-AD) • look for meaningful links • (whether or not A and C are disparate) • We use B-terms extracted from titles • Could use abstracts, MeSH, triples…
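A minimal sketch of extracting B-terms from titles, assuming single-word terms and a small hand-made stopword list. The tokenizer, the stopword list, and the toy titles are all illustrative assumptions; this is not the Arrowsmith implementation.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "with", "for", "on", "to"}


def title_terms(titles):
    """Return the set of lowercase single-word terms found in a list of titles."""
    terms = set()
    for title in titles:
        for word in re.findall(r"[a-z0-9]+", title.lower()):
            if word not in STOPWORDS:
                terms.add(word)
    return terms


def b_terms(titles_a, titles_c):
    """B-terms are the terms that occur in titles of both literature A and literature C."""
    return title_terms(titles_a) & title_terms(titles_c)


# Toy titles, invented for illustration (not real article titles).
lit_a = ["Magnesium deficiency and vascular tone", "Dietary magnesium in rats"]
lit_c = ["Vascular tone in migraine patients", "Migraine and platelet aggregation"]

print(sorted(b_terms(lit_a, lit_c)))  # ['tone', 'vascular']
```

The same intersection could be computed over abstracts, MeSH headings, or triples by swapping out `title_terms`.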
Modeling the Two-Node Search-1 • Field testers, free-form use of the tool • Chose 6 two-node searches as gold standards: not too big or small, disparate, topically coherent, clean questions • E.g. for A = retinal detachment, C = aortic aneurysm, a) find diseases in which both features appear [not necessarily in same person] or b) find surgical procedures that have been applied to both conditions. • Manually marked relevant B-terms for a given query (sometimes several queries for the same two node search) • Details in Bioinformatics (2007) paper
Modeling the Two-Node Search-2 • Used 8 complementary features to score each B-term (e.g. recency, frequency, semantic categories) • created a single combined and weighted score for each B-term • Used logistic regression model to optimally give weights to each feature so as to separate marked relevant B-terms from all others (mixed set)
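A minimal sketch of the weighting step, assuming only two features per B-term (the talk uses 8) and a plain gradient-descent fit written out by hand. The feature values, learning rate, and function names are illustrative assumptions; the published model is described in the Bioinformatics (2007) paper.

```python
import math


def predict(weights, x):
    """Estimated probability that a B-term with features x is relevant."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights (bias first) by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            error = predict(weights, x) - y
            grad[0] += error
            for j, value in enumerate(x):
                grad[j + 1] += error * value
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Toy training set: (recency, frequency) scaled to [0, 1]; 1 = marked relevant.
rows = [(0.9, 0.8), (0.8, 0.6), (0.7, 0.9), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
labels = [1, 1, 1, 0, 0, 0]

weights = fit_logistic(rows, labels)
for x, y in zip(rows, labels):
    print(x, y, round(predict(weights, x), 3))
```

In practice a library implementation would replace the hand-written fit; the point is only that one combined score per B-term falls out of the learned weights.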
Two End-Points of this Research • For any two-node search, we can now rank the list of B-terms in order of estimated probability that they will be marked as relevant (meaningful) by SOME user for SOME query. • For any pair of lits A and C, we can now estimate the OVERALL shared implicit information between A and C (= % of B-terms that are predicted to be relevant)
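A minimal sketch of both end-points, assuming each B-term already has an estimated probability of relevance and that "predicted to be relevant" means a probability at or above a cutoff. The 0.5 threshold, the function names, and the toy probabilities are assumptions for illustration.

```python
def rank_b_terms(probabilities):
    """Sort B-terms from most to least likely to be marked relevant."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)


def shared_implicit_information(probabilities, threshold=0.5):
    """Percentage of B-terms whose estimated probability meets the threshold."""
    if not probabilities:
        return 0.0
    relevant = sum(1 for p in probabilities.values() if p >= threshold)
    return 100.0 * relevant / len(probabilities)


# Toy probabilities for the B-terms of one A-C pair (illustrative only).
probs = {"calcium channel": 0.82, "serotonin": 0.64, "patients": 0.05, "study": 0.02}

print(rank_b_terms(probs)[0])              # ('calcium channel', 0.82)
print(shared_implicit_information(probs))  # 50.0
```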
Relevance to the One Node Search • We can re-conceptualize the one-node search as a series of two-node searches: • Choose A, then choose category C • Divide category C densely into many small, coherent Ci • For each Ci, score multi-dimensional features, including, but not limited to, features that relate A to Ci (e.g. number of B-terms in common or % of B-terms predicted relevant) • Rank the Ci to identify the most promising lits (which are presumed to point to novel hyps or implicit information helpful when applied to A)
[Diagram: A is evaluated pairwise against each Ci in C = {C1, C2, C3, C4, …}; C1 might involve B-terms, C2 might not! E.g. A = Huntington Disease; C = lifestyle factors, autophagy, or therapeutic agents]
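The pairwise evaluation of A against each Ci can be sketched as a loop of two-node searches. This is a sketch under stated assumptions: each literature is represented as a set of terms, `is_relevant` stands in for a trained B-term relevance model, and the sort key (percentage relevant, then count) is one arbitrary choice among many possible ranking rules.

```python
def rank_candidate_literatures(terms_a, candidates, is_relevant):
    """Run one two-node comparison per Ci and rank the Ci by B-term features."""
    results = []
    for name, terms_c in candidates.items():
        shared = terms_a & terms_c  # B-terms for this A-Ci pair (may be empty)
        relevant = {t for t in shared if is_relevant(t)}
        pct = 100.0 * len(relevant) / len(shared) if shared else 0.0
        results.append(
            {"literature": name, "n_b_terms": len(shared), "pct_relevant": pct}
        )
    # Highest percentage of predicted-relevant B-terms first; ties broken by count.
    results.sort(key=lambda r: (r["pct_relevant"], r["n_b_terms"]), reverse=True)
    return results


# Toy term sets (illustrative only).
terms_a = {"aggregation", "protein", "neuron", "study", "clearance"}
candidates = {
    "C1": {"aggregation", "clearance", "protein", "study"},
    "C2": {"study", "exercise"},
    "C3": {"neuron", "protein", "dose"},
}

for row in rank_candidate_literatures(terms_a, candidates, lambda t: t != "study"):
    print(row)
```

Because each Ci is scored independently against A, the work grows linearly with the number of Ci rather than with the number of A – B – C paths.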
“Interestingness” Measures • A concept from the field of data mining. • This allows us to encode real-life priorities and strategies of working scientists: • Existing one node search looks for novelty, relevance, non-triviality, likelihood of being true …. [get low hanging fruit] • What about actionability, feasibility of follow-up, surprisingness, cross-discipline, presence of high experimental support, generalizability to other problems, or high potential impact? • A candidate Ci could be interesting because it is recently discovered and rapidly growing (e.g. microRNAs), well characterized, [for a disease] has an animal model, [for a protein] is connected to many other proteins, [for a drug] has FDA approval. • This not only re-conceptualizes the one node search (e.g., no combinatorial explosion) but also generalizes the ranking methods.
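A minimal sketch of combining interestingness measures with a B-term measure, assuming a simple weighted sum over features scaled to [0, 1]. The feature names and weights are hypothetical; the talk does not specify how such measures would be combined.

```python
# Hypothetical weights, chosen only for illustration.
WEIGHTS = {
    "pct_relevant_b_terms": 0.5,  # B-term measure relating A to Ci
    "rapidly_growing": 0.2,       # e.g. a recently discovered, fast-growing field
    "has_animal_model": 0.15,     # for a disease
    "fda_approved": 0.15,         # for a drug
}


def interestingness(features, weights=WEIGHTS):
    """Weighted sum of feature values; features missing from the dict count as 0."""
    return sum(weights[name] * float(features.get(name, 0.0)) for name in weights)


# Toy candidates (illustrative only).
candidate_x = {"pct_relevant_b_terms": 0.4, "rapidly_growing": 1}
candidate_y = {"pct_relevant_b_terms": 0.6}

ranked = sorted(
    {"x": candidate_x, "y": candidate_y}.items(),
    key=lambda item: interestingness(item[1]),
    reverse=True,
)
print([name for name, _ in ranked])  # ['x', 'y']
```

Here candidate x outranks y despite fewer predicted-relevant B-terms, which is the kind of priority a working scientist might want to encode.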
Gold Standards for One-Node Searches • Also, we can now envision preparing a series of gold standard searches, even automatically (cf. TREC 2006, 2007). • Use implicit assertions to reconstruct explicit knowledge. • Use review articles, • lists (e.g. in the virus study, the gold standard was a list of viruses that were thought to be at risk of being exploited for biological warfare), • or time slices. • Avoids the paradox that one node searches must predict things that have no experimental support!
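A minimal sketch of a time-slice gold standard, assuming each article is reduced to a year and a set of terms: terms that first co-occur with the starting term only after a cutoff year form the gold standard, and predictions made from the earlier slice are scored against it. The record format, function names, and toy data are assumptions for illustration.

```python
def time_slice_gold_standard(records, start_term, cutoff_year):
    """Terms that co-occur with start_term after the cutoff but never before it."""
    before, after = set(), set()
    for year, terms in records:
        if start_term in terms:
            target = before if year <= cutoff_year else after
            target.update(terms - {start_term})
    return after - before


def recall(predicted, gold):
    """Fraction of gold-standard terms that were predicted."""
    return len(predicted & gold) / len(gold) if gold else 0.0


# Toy records: (publication year, terms in the article). Illustrative only.
records = [
    (1990, {"disease x", "term p"}),
    (1995, {"disease x", "term q"}),
    (2003, {"disease x", "term q", "term r"}),
    (2005, {"disease x", "term s"}),
    (2004, {"term t", "term r"}),
]

gold = time_slice_gold_standard(records, "disease x", 2000)
print(sorted(gold))                            # ['term r', 'term s']
print(recall({"term r", "term z"}, gold))      # 0.5
```

Because the gold standard is reconstructed from later explicit knowledge, a prediction can count as correct without first lacking experimental support.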
Conclusions • LBD is (can be, will be) alive and well! • Need to incorporate the types of real-life priorities and strategies of working scientists • Re-conceptualize the one node search as a series of two-node searches • Use “interestingness” measures to supplement B-term measures.
Journal of Biomedical Discovery and Collaboration • Unique multi-disciplinary audience • People who engage in scientific discovery and collaboration • People who make tools that enhance scientific discovery and collaboration • People who study scientific discovery and collaboration • Hosted by BioMed Central • Fully peer-reviewed • RAPID review (<3 weeks is routine) • Open-access, indexed in PubMed Central et al. • Readership goes up 10-100-fold • Impact goes up too… • Article fee reduced or zeroed depending on institution
Acknowledgements • Don Swanson • Vetle Torvik • Wei Zhou (Clement Yu) • Marc Weeber
Ruminations • Should LBD analyses be user-friendly? Popular?? • Don’t they overlook true divergent discoveries? • Should LBD be run automatically as a program in the background, with alerts of possible discoveries? • Does LBD bypass, or reinforce, good old fashioned hypothesis driven science?