Towards Evidence-Based Discovery Catherine Blake School of Information and Library Science University of North Carolina at Chapel Hill http://www.ils.unc.edu/~cablake cablake@email.unc.edu
Motivation • Relentless increase in electronically available text • Life Sciences • 17 millionth entry added in April 2007 • 5,200 journals indexed • 12,000 new articles each week! • Chemistry: more than 110,000 articles in 1 year alone • Consequences: • Hundreds of thousands of relevant articles • Implicit connections between literature go unnoticed • Shift from Retrieval to Synthesis
Information Overload “One of the diseases of this age is the multiplicity of books; they doth so overcharge the world that it is not able to digest the abundance of idle matter that is every day hatched and brought forth into the world” - Barnaby Rich, 1613
Evidence-Based Discovery Goal: Facilitate Discovery from Text • To make easy or easier [1] • A productive insight [1] • [1] American Heritage Dictionary
Human-assisted Discovery and Synthesis Natural Language Processing Core Genomics Education Discovery Science Evidence-based Practice News Human Discovery and Synthesis Chemistry DocSouth Breast Cancer Synthesis and Discovery Work Practices Heterogeneous Literature
Outline • Motivation • Case Studies • METIS • Human synthesis • Natural language processing • Claim Jumping through Scientific Literature • Next Steps • Summary
Systematic Review Process • Formulate the problem • Locate and select studies • Assess quality of studies • Collect data • Analyze and present results • Interpret results • Improve and update review • Notes: 28 months from initial idea to publication; increased demand due to evidence-based medicine
Manual Synthesis • Select • Verify • Extract • Analyze • Guesswork guided by scientifically trained intuition (Rescher, 1978)
Context Information • Study Information • e.g. date, location, ... • Population Information • e.g. gender, age, ... • Risk Factor or Intervention • e.g. duration of exposure, confounders • Disease • e.g. stage, confounders [Scale labels: Loosely coupled to review focus; Tightly coupled to review focus]
Key: Estimate Missing Information • 1. What are people with Breast Cancer exposed to? • Source: studies with Breast Cancer patients • Facts for each study: number of patients, age of patients, geographic location, risk-factor exposure … • 2. What are people in a similar population exposed to? • Source: database of risk factors (BRFSS) • Codebook: question asked; age, gender; % responses • 3. Are these rates significantly different? • T. Tengs & N. D. Osgood (2001) “The link between smoking and Impotence: Two Decades of Evidence”, Preventive Medicine, 32:447-52
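The third question on this slide, whether the two exposure rates differ significantly, can be illustrated with a standard two-proportion z-test. This is a minimal sketch only: the slides do not say which test METIS uses, the function name is hypothetical, and the counts in the usage note are invented for illustration.

```python
import math


def two_proportion_z_test(exposed_1, total_1, exposed_2, total_2):
    """Two-sided z-test for the difference between two proportions.

    Assumption: a pooled-variance z-test stands in for whatever
    comparison the original system performed.
    Returns (z statistic, two-sided p-value).
    """
    p1 = exposed_1 / total_1
    p2 = exposed_2 / total_2
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (exposed_1 + exposed_2) / (total_1 + total_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_1 + 1 / total_2))
    z = (p1 - p2) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: 60 of 200 study patients exposed versus
# 200 of 1000 respondents in a comparable survey population.
z_stat, p_val = two_proportion_z_test(60, 200, 200, 1000)
```

With these invented counts the exposure rate is 30% in the study group and 20% in the comparison group; the function reports how unlikely a gap of that size would be if both groups shared one underlying rate.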
Information Synthesis: More than Automated Meta-Analysis • Traditional analysis • same study design • medicine = RCT • epidemiology = cohort • Information Synthesis • any study that includes required information • augment missing information [Diagram labels: Systematic Review; Key; Main topic; Entire study; Secondary Information; External database]
Natural Language Processing [section divider: research overview diagram repeated]
METIS Information Extractor • Semantic Grammar • Features: words, numbers, and semantic types in the Unified Medical Language System (UMLS) • Information extracted: • risk factor exposure (tobacco and alcohol) • gender • age (min, max, mean) • start and end dates • number of subjects with medical condition • geographical location • Example pattern: {term: 'age'} {term: 'of'} {number: 10<n1<110} {term: 'to'} {number: 10<n2<110} • Example sentence: “The age of breast cancer subjects ranged between 20 to 64 years old.” • Example semantic type: {semantic type: neoplastic process, or disease}
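The age-range pattern on this slide can be approximated with a regular expression plus a range check. This is a rough sketch, not the METIS grammar: the UMLS semantic-type lookup is replaced by a small hypothetical set (`DISEASE_TERMS`), the placement of the disease slot between "of" and the numbers is an assumption, and the function name is invented.

```python
import re

# Hypothetical stand-in for a UMLS semantic-type lookup
# ("neoplastic process, or disease"); a real system would query UMLS.
DISEASE_TERMS = {"breast cancer", "impotence"}

# term 'age', term 'of', a disease phrase, then two numbers joined by 'to'.
AGE_RANGE = re.compile(
    r"\bage\s+of\s+(?P<condition>[a-z ]+?)\s+(?:subjects|patients)\b"
    r".*?(?P<low>\d{1,3})\s+to\s+(?P<high>\d{1,3})",
    re.IGNORECASE,
)


def extract_age_range(sentence):
    """Return the condition and age bounds, or None if the pattern fails."""
    match = AGE_RANGE.search(sentence)
    if match is None:
        return None
    condition = match.group("condition").lower()
    low, high = int(match.group("low")), int(match.group("high"))
    # Semantic constraint: the phrase must be a known disease term.
    if condition not in DISEASE_TERMS:
        return None
    # Numeric constraint from the slide: 10 < n < 110 for both numbers.
    if not (10 < low < 110 and 10 < high < 110):
        return None
    return {"condition": condition, "age_min": low, "age_max": high}


result = extract_age_range(
    "The age of breast cancer subjects ranged between 20 to 64 years old."
)
```

For the example sentence from the slide, `result` holds the condition "breast cancer" with bounds 20 and 64. The numeric bounds reject values such as years or page numbers that happen to sit near the word "age".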
METIS Info Extractor – Evaluation • Diverse text corpus • epidemiology, surgery, biology, ... • cohort studies, case-control trials, ... • Evaluation • Metrics (precision, recall) • Annotators (developer, domain expert, expert annotator, novice) • Primary topic (breast cancer, impotence) • Secondary information (tobacco and alcohol consumption)
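Precision and recall, the metrics named above, compare the extracted facts with an annotator's gold standard. The sketch below assumes each fact is represented as a (field, value) pair and counts only exact matches; that representation and the sample values are illustrative assumptions, not the evaluation protocol used for METIS.

```python
def precision_recall(extracted, gold):
    """Compute precision and recall over sets of (field, value) facts.

    Assumption: a fact counts as correct only on an exact match.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: three extracted facts, four annotated facts.
system_facts = {("age_min", 20), ("age_max", 64), ("gender", "male")}
annotated_facts = {
    ("age_min", 20),
    ("age_max", 64),
    ("gender", "female"),
    ("n_subjects", 120),
}
precision, recall = precision_recall(system_facts, annotated_facts)
```

Here two of the three extracted facts are correct and two of the four annotated facts are found, so precision is 2/3 and recall is 1/2. Running the same comparison against each annotator type listed on the slide would show how much the scores depend on who built the gold standard.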
METIS Verifier • Verify information extracted [Screenshot labels: Converted Article; Electronic version of article]
METIS Analyzer • Meta-Analysis • Developed for agricultural application • Requires empirical studies with a quantitative outcome • Unit of study is an article, not a person • Result: a unitless metric called an effect size • Two common meta-analysis techniques • Fixed-effects model • Random-effects model • Evaluation: Compared generated effect size with examples in textbooks and published articles • Result: Same effect size
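The two techniques named on this slide can be sketched with inverse-variance weighting. The slides do not state which estimators METIS implements, so the random-effects version below uses the DerSimonian-Laird estimate of between-study variance as an assumption; the function names and the sample effect sizes are likewise illustrative.

```python
def fixed_effect(effects, variances):
    """Inverse-variance pooled effect and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, 1.0 / total


def random_effects(effects, variances):
    """Random-effects pooling (DerSimonian-Laird, assumed here).

    Returns (pooled effect, variance, between-study variance tau^2).
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled_fixed, _ = fixed_effect(effects, variances)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(weights, effects))
    c = total - sum(w * w for w in weights) / total
    degrees = len(effects) - 1
    tau2 = max(0.0, (q - degrees) / c) if c > 0 else 0.0
    # Each study's variance is inflated by the between-study variance.
    adjusted = [v + tau2 for v in variances]
    pooled, variance = fixed_effect(effects, adjusted)
    return pooled, variance, tau2


# Invented effect sizes and variances for two studies.
study_effects = [0.2, 0.4]
study_variances = [0.01, 0.01]
fixed_result = fixed_effect(study_effects, study_variances)
random_result = random_effects(study_effects, study_variances)
```

Both models give the same pooled effect for these two equally weighted studies, but the random-effects variance is larger because the studies disagree. When the studies are homogeneous, the estimated between-study variance is zero and the two models coincide.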
Synthetic Estimate Evaluation [Figure panels: Alcohol Consumption; Tobacco Consumption]
Outline • Motivation • Case Studies • METIS • Claim Jumping • Human discovery • Natural language processing • Human-assisted discovery and synthesis • Next Steps • Summary
Human Discovery and Synthesis [section divider: research overview diagram repeated]
Human Discovery • Day-to-day activities of scientists reflect • the complex socio-technical environments in which successful creativity tools will eventually be embedded • the human cognitive processing surrounding creativity • Unit of analysis: a paper or grant proposal • How do chemists arrive at their research question? • How do chemists transform an idea into a publication?
Approach • Recruitment • experienced scientists (7-45 yrs) • local chemists and chemical engineers • response rate 84% (21/25) • Semi-structured interviews • Critical incident technique • seminal paper in their field • recent paper authored by the participant • paper authored by the participant that they were particularly proud of
Interview Questions • Discovery Questions • What is your definition of discovery? • What evidence convinced you that the paper addressed the initial research questions? • What factors limited the adoption and deployment of the discovery? • How did you arrive at the research question? • What, if any, existing evidence prompted the study/experiment? • Were there any alternative explanations? • Information Usage Questions • Other than the scientific literature, what information resources do you draw from to aid in your research processes? • How many articles did you read last month that related to each of those projects? • Is that typical of how many articles you read in a month for research projects? • Do you read articles for another purpose? If so, what? • How many hours do you spend reading journal articles for research projects? • Which journals do you typically read and draw from? • How would you characterize the journals that you read: are they only within your domain, or do you read journals that would be considered non-traditional in your research? • If you only have a few minutes to read an article, what parts would you read? • What do you do with the article once you have read it?
Chemists and Chemical Engineers • Compared with other scientists, chemists and chemical engineers • read more (Brown, 1999) • have more personal subscriptions to journals (Noble & Coughlin, 1997) • spend more time reading (Tenopir & King, 2003) • visit the library more often (Brown, 1999) • Consequences • information is disseminated quickly • information has a relatively short lifespan
Human Discovery Findings • Discovery definition • Novelty • Build on existing ideas • Simplicity • Balance theory and experimentation • Practical application • Hypothesis generation • Discussion • Combine expertise • Previous experiments • Read literature • Hypothesis validation • Iterative • Tightly coupled
Natural Language Processing [section divider: research overview diagram repeated]
Causal Relationships • Newspaper genre • Causal relationships (Khoo, Chan, & Niu, 1998) • Biomedical genre • Causes and treats (Price & Delcambre, 2005) • Causal knowledge (Khoo, Chan, & Niu, 2000) • Universal Grammar • Causatives (Comrie, 1974, 1981) • Action verbs (Thomson, 1987)
Claim Definition • “To assert in the face of possible contradiction” • Example sentence reporting a claim • “This study showed that Tamoxifen reduces the breast cancer risk” • Example Claim Framework • Tamoxifen (agent) • reduces (change) • [breast cancer risk] (object)
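The three facets in the example can be held in a small record type. This is only an illustrative data structure: the class name, field names, and `as_text` helper are hypothetical, and the Claim Framework itself may carry more facets than the three shown on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One claim expressed as agent, change, and object facets."""

    agent: str
    change: str
    obj: str  # the 'object' facet; renamed to avoid shadowing a builtin

    def as_text(self):
        """Render the claim as a short agent-change-object phrase."""
        return f"{self.agent} {self.change} {self.obj}"


# The example sentence from the slide, broken into its facets.
tamoxifen_claim = Claim(
    agent="Tamoxifen",
    change="reduces",
    obj="breast cancer risk",
)
```

Making the record immutable lets claims be stored in sets or used as dictionary keys, which is convenient when the same claim is reported by several sentences or articles.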
The Claim Framework • Goal • go beyond genes and proteins • differentiate between different levels of confidence in the claim • consider claims made in the full text • Working hypothesis • literature will report findings using constructs within the Claim Framework • human annotators will agree on facets
Preliminary Results • 29 articles from TREC Genomics • Total number of sentences: 5535 • Sentences with >=1 claim: 1250 (22.6%) • Total number of claims: 3228 • Average claims per sentence: 2.51 • Claims that did not fit in the Framework: 31 • Per article • Average number of sentences: 191 • Average number of sentences with >=1 claim: 43
Inter-Annotator Agreement (Information Facet: Kappa, Agreement) • Agent: 0.71, substantial • Object: 0.77, substantial • Change: 0.57, moderate • Change+ChangeDir: 0.88, almost perfect
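A kappa statistic corrects raw agreement for the agreement expected by chance. The sketch below computes Cohen's kappa for two annotators and maps a value to the verbal bands of Landis and Koch (1977), which match the labels on this slide. Both choices are assumptions: the slides name neither the kappa variant nor the interpretation scale, and the sample labels are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's label proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def interpret(kappa):
    """Verbal band for a kappa value (Landis and Koch bands, assumed)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Invented labels: does each of four sentences contain an agent?
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "yes", "no", "yes"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

The two invented annotators agree on three of four items, but half of that agreement is expected by chance, so kappa is 0.5. Passing the values from the slide to `interpret` reproduces the reported labels, for example `interpret(0.57)` gives "moderate".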
Human-assisted Discovery and Synthesis [section divider: research overview diagram repeated]
User Study • Steven W. Matson, Ph.D., Professor and Chair, Department of Biology • Robert C. Millikan, DVM, PhD, Barbara Sorenson Hulka Distinguished Professor, Department of Epidemiology, School of Public Health • Dr. Rosa Perelmuter, PhD, Director, Moore Undergraduate Research Apprentice Program; Professor of Spanish and Assistant Dean, Academic Advising Program • Jan F. Prins, PhD, Professor of Computer Science and Chairman, Department of Computer Science • Alexander Tropsha, Ph.D., Professor and Chair; Director, Laboratory for Molecular Modeling • Suzanne West, PhD, Researcher, Health, Social and Economics Research, RTI International • Timothy S. Carey, MD, MPH, Sarah Graham Kenan Professor of Medicine; Director, Cecil G Sheps Center for Health Services Research • Ila Cote, PhD, DABT, Acting Division Director, US Environmental Protection Agency, National Center for Environmental Assessment • Michael T. Crimmins, PhD, Mary Ann Smith Distinguished Professor of Chemistry, UNC, and Department Chair, Department of Chemistry • Paul Jones, Clinical Associate Professor, School of Information and Library Science; Director of ibiblio.org • Rudy L. Juliano, PhD, Boshamer Distinguished Professor of Pharmacology; Principal Investigator, Carolina Center of Cancer Nanotechnology Excellence
Closing Comments • Accelerate synthesis • Breast cancer study without METIS would take >13 years • Without synthetic estimate = systematic review • Accelerate discovery • Connections between literature • Speculative and orthogonal views • Human discovery and synthesis • As important as automation, if not more so • “Tap the vast reservoir of human knowledge” (Louis Round Wilson, 1929)
Acknowledgements • Claim Jumping • Funded in part by • Faculty fellowship from the Renaissance Computing Institute • UNC Faculty Award • Thanks to collaborators • Nassib Nassar and Mats Rynge (RENCI) • Amol Bapat and Ryan Jones (SILS) • Chemists and Chemical Engineers Study • Funded in part by • NSF Center for Environmentally Responsible Solvents and Processes • METIS • Funded in part by • California Breast Cancer Research program • University of California, Irvine • Thanks to user groups • Particularly to Dr. Adams and Dr. Tengs • Academic mentoring • Primary Advisor: Dr. Wanda Pratt • Medical Mentor: Dr. Catherine Carpenter • Co-Advisors: Dr. Dennis Kibler and Dr. Michael Pazzani • Committee Member: Dr. Paul Dourish
Questions and Comments Welcome Catherine Blake cablake@email.unc.edu School of Information and Library Science University of North Carolina at Chapel Hill http://www.ils.unc.edu/~cablake
Publication Bias • Studies that find a correlation between a risk factor and disease are more likely to be published (Easterbrook et al., 1991; Ingelfinger et al., 1994) • METIS provides a new way to explore this bias • Bias introduced by authors, editors, funding, ...