Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Mark Wilkinson. World Research & Innovation Congress, Brussels, 2013. Isaac Peral Senior Researcher in Biological Informatics, Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain. Adjunct Professor of Medical Genetics, University of British Columbia, Vancouver, BC, Canada.

Making the Web a biomedical research platform from hypothesis through to publication

Motivation: 3 intersecting trends in the Life Sciences that are now, or soon will be, extremely problematic

TREND #1 Non-reproducible science & the failure of peer review

Trend #1 Multiple recent surveys of high-throughput biology reveal that upwards of 50% of published studies are not reproducible - Baggerly, 2009 - Ioannidis, 2009

Trend #1 Similar (if not worse!) in clinical studies - Begley & Ellis, Nature, 2012 - Booth, Forbes, 2012 - Huang & Gottardo, Briefings in Bioinformatics, 2012

Trend #1 “the most common errors are simple, the most simple errors are common” At least partially because the analytical methodology was inappropriate and/or not sufficiently described - Baggerly, 2009

Trend #1 These errors pass peer review. The researcher is (sometimes) unaware of the error. The process that led to the error is not recorded. Therefore it cannot be detected during peer review.

Agencies have Noticed! In March 2012, the US Institute of Medicine in effect said “Enough is enough!”

Agencies have Noticed! Institute of Medicine Recommendations For Conduct of High-Throughput Research:
- Rigorously-described, -annotated, and -followed data management and manipulation procedures
- “Lock down” the computational analysis pipeline once it has been selected
- Publish the analytical workflow in a formal manner, together with the full starting and result datasets
Source: Evolution of Translational Omics: Lessons Learned and the Path Forward. The Institute of Medicine of the National Academies, Report Brief, March 2012.

TREND #2 Bigger, cheaper data

Trend #2 High-throughput technologies are becoming cheaper and easier to use

Trend #2 High-throughput technologies are becoming cheaper and easier to use. But there are still very few experts trained in statistical analysis of high-throughput data

Trend #2 Therefore, even small, moderately-funded laboratories can now afford to produce more data than they can manage or interpret

TREND #3 “The singularity”

The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al, The Fourth Paradigm: Data-Intensive Scientific Discovery Tony Hey (Editor), 2009 Slide adapted with permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA June 22, 2012.

“The Singularity” The X-intercept is where, the moment a discovery is made, it is immediately put into practice The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al, The Fourth Paradigm: Data-Intensive Scientific Discovery Tony Hey (Editor), 2009 Slide Borrowed with Permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA June 22, 2012.

You Are Here Scientific research would have to be conducted within a medium that immediately interpreted and disseminated the results...

You Are Here ...in a form that immediately (actively!) affected the results of other researchers...

You Are Here ...without requiring them to be aware of these new discoveries.

3 intersecting and problematic trends: (1) Non-reproducible science that passes peer review. (2) Cheaper production of larger and more complex datasets that require specialized expertise to analyze properly. (3) Need to more rapidly disseminate and use new discoveries.

We Want More!

I don’t just want to reproduce your experiment...

I want to re-use your experiment

In my own laboratory... On MY DATA!

When I do my analysis I want to draw on the knowledge of global domain-experts like statisticians and pathologists... ...as if they were mentors sitting in the chair beside me.

Please don’t make me find all of the data and knowledge that I require to do my experiment ...it simply isn’t possible anymore... Image from: Mark Smiciklas Intersection Consulting, cc-nca

I want to support peer review(ers) so that I do better science. Image from AJ Cann, cc-by-a license

How do we get there from here?

To overcome these intersecting problems and to achieve the goals of transparent, reproducible research

We must learn how to do research IN the Web, not OVER the Web

How we use The Web today

The Web is not a pigeon!

Semantic Web Technologies

Design Pattern for Publishing Analytical Tools on the Semantic Web

Application that uses SADI to interpret globally-distributed expert knowledge in order to discover and execute the right tool, at the right time, for the right analysis

CHALLENGE: Reproduce a peer-reviewed scientific publication by semantically modelling the problem

The Publication: Discovering Protein Partners of a Human Tumor Suppressor Protein

Original Study Simplified: Using what is known about protein interactions in fly & yeast, predict new interactions with this Human Tumor Suppressor

Semantic Model of the Experiment: The Web Ontology Language (OWL) is the language approved by the W3C for representing knowledge in the Web

Semantic Model of the Experiment: Note that every word in this diagram is, in reality, a URL (it’s a Semantic Web model), i.e. it refers to the expertise of other researchers, distributed around the world on the Web (i.e. NanoPubs***) ***remember this word!! It will be important later!!

Set-up the Experimental Conditions: In a local data file, provide the protein we are interested in and the two species we wish to use in our comparison
taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly
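The four statements in the input file can be read as subject/predicate/object triples, where "a" is the usual shorthand for rdf:type. The Python sketch below is only an illustration of that reading: the helper name parse_simple_triples is hypothetical, it handles nothing beyond one-line "subject predicate object ." statements with trailing "#" comments, and it is not the parser that SHARE or SADI use.

```python
# Illustrative only: a tiny reader for the one-line statements shown on the slide.
# This is NOT a full N3/Turtle parser (no prefixes, literals, or multi-line forms).

INPUT_N3 = """\
taxon:9606 a i:OrganismOfInterest .      # human
uniprot:Q9UK53 a i:ProteinOfInterest .   # ING1
taxon:4932 a i:ModelOrganism1 .          # yeast
taxon:7227 a i:ModelOrganism2 .          # fly
"""


def parse_simple_triples(text):
    """Return (subject, predicate, object) tuples for simple 's p o .' lines."""
    triples = []
    for line in text.splitlines():
        # Drop the trailing comment. Safe here because the slide's statements
        # use prefixed names only, so '#' never appears inside a term.
        statement = line.split("#", 1)[0].strip()
        if not statement:
            continue
        if not statement.endswith("."):
            raise ValueError(f"statement does not end with '.': {line!r}")
        parts = statement[:-1].split()
        if len(parts) != 3:
            raise ValueError(f"expected exactly three terms: {line!r}")
        subject, predicate, obj = parts
        if predicate == "a":
            # 'a' abbreviates rdf:type in N3 and Turtle
            predicate = "rdf:type"
        triples.append((subject, predicate, obj))
    return triples


triples = parse_simple_triples(INPUT_N3)
for triple in triples:
    print(triple)
```

Each line of the file therefore does one thing: it assigns a role from the experiment's model (organism of interest, protein of interest, model organism) to a concrete identifier.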

Run the Experiment
SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE { ?protein a i:ProbableInteractor . }

Run the Experiment
SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE { ?protein a i:ProbableInteractor . }
This is the URL that leads our computer to the Semantic model of the problem

SHARE examines the semantic model of Probable Interactors. Retrieves third-party expertise from the Web. Discusses with SADI what analytical tools are necessary. Chooses the right tools for the problem. Solves the problem!
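One way to see why the semantic model matters: if the local file holds only the four input statements, a plain pattern match for members of i:ProbableInteractor finds nothing, because no statement asserts that type directly. The Python sketch below shows this with a hypothetical match helper that handles a single triple pattern; it assumes the file contains exactly the four statements from the slide and it does not attempt any of the reasoning or service discovery that SHARE performs.

```python
# Illustrative only: a single-pattern matcher over the four input statements.
# It does no reasoning; it just shows what a literal lookup would return.

LOCAL_TRIPLES = [
    ("taxon:9606", "rdf:type", "i:OrganismOfInterest"),     # human
    ("uniprot:Q9UK53", "rdf:type", "i:ProteinOfInterest"),  # ING1
    ("taxon:4932", "rdf:type", "i:ModelOrganism1"),         # yeast
    ("taxon:7227", "rdf:type", "i:ModelOrganism2"),         # fly
]


def match(triples, pattern):
    """Return one binding dict per triple that fits the pattern.

    Pattern terms starting with '?' are variables; all others must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# The pattern from the WHERE clause: ?protein a i:ProbableInteractor .
interactors = match(LOCAL_TRIPLES, ("?protein", "rdf:type", "i:ProbableInteractor"))

# For contrast, a type that IS asserted in the local file.
proteins_of_interest = match(LOCAL_TRIPLES, ("?protein", "rdf:type", "i:ProteinOfInterest"))

print(interactors)
print(proteins_of_interest)
```

The first lookup comes back empty and the second returns the ING1 identifier. Any answer to the original query therefore has to come from interpreting what i:ProbableInteractor means, not from the local data alone.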

SHARE derives (and executes) the following analysis automatically

SHARE is aware of the context of the specific question being asked

There are four very cool things about what you just saw...
