Strategies towards improving the utility of scientific big data Evan Bolton, PhD

Strategies towards improving the utility of scientific big data
Evan Bolton, PhD
National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH)
Sep. 4, 2014

U.S. National Library of Medicine http://www.nlm.nih.gov/

U.S. National Center for Biotechnology Information https://www.ncbi.nlm.nih.gov/

PubChem website https://pubchem.ncbi.nlm.nih.gov/

PubChem primary goal: to be an on-line resource providing comprehensive information on the biological activities of substances, where “substance” means any biologically testable entity: small molecules, RNAs, carbohydrates, peptides, plant extracts, etc.

PubChem data growth over ten years
Chart panels: Chemicals, Biological Assays, Contributors, Protein Targets, Tested Chemicals, Bioactivity Results
+280 substance contributors, +60 assay contributors, +150M substances, +50M compounds, +1.0M bioassays, +6.1T protein targets, +2.9M tested substances, +2.0M tested compounds, +225M bioactivity result sets
[M = millions, T = thousands, MLP = Molecular Libraries Program]

CAVEAT! All data has “errors”

Big data has “big errors”
Hypothetical: if your average data error rate is 1 in 1,000,000, you have 99.9999% data accuracy.
If you have one trillion facts (10^12), can you accept one million errors (10^6)?
Strategies to mitigate errors? Manual curation has its limits (accuracy, cost, time).
So … what do you do?
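
The arithmetic behind this hypothetical can be checked directly. The following is a minimal sketch; the function name `expected_errors` is illustrative and does not come from the talk.

```python
from fractions import Fraction

def expected_errors(num_facts: int, error_rate: Fraction) -> Fraction:
    """Expected number of erroneous facts at a given average error rate."""
    return num_facts * error_rate

rate = Fraction(1, 1_000_000)   # one error per million facts
facts = 10**12                  # one trillion facts

errors = expected_errors(facts, rate)
accuracy_percent = float((1 - rate) * 100)

print(errors)            # expected number of errors
print(accuracy_percent)  # accuracy as a percentage
```

Exact fractions are used so that the result does not depend on floating-point rounding.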

Error suppression strategies for scientific big data
• Identify quality {un}known known/unknowns: use to formulate an error suppression strategy
• Perform data normalization: improves utility by helping to refine identification
• “Trust but verify”: cross compare authoritative and curated data
• Consistency filtering: improves precision by removal of outliers
• Address error feedback loops: use “is”, “can be”, and, if all else fails, “is not” lists

Error suppression strategies for scientific big data
Identify quality {un}known known/unknowns: use to formulate an error suppression strategy
“… there are known knowns; there are things that we know that we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.” (Donald Rumsfeld, Feb. 2002 news briefing)
• Salt-form drawing variations are common
• Tautomers and resonance forms of the same chemical structure are prolific
• Chemical meaning of a substance may change upon context
Example shown: (+)-Iridodial, drawn in ring-open and ring-closed forms; description: “Defense chemicals from abdominal glands of 13 rove beetle species of subtribe Staphylinina”
Image credit: http://en.wikipedia.org/wiki/Donald_Rumsfeld

Error suppression strategies for scientific big data
Perform data normalization: improves utility by helping to refine identification
• Verify chemical content: atoms defined/real, implicit hydrogen, functional group, atom valence sanity
• Normalize representation: tautomer invariance, aromaticity detection, stereochemistry, explicit hydrogen
• Detect components: isolate covalent units, neutralize (+/- proton), reprocess, detect unique
• Calculate: coordinates, properties, descriptors
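
The “detect components” step can be illustrated with a toy example. This is only a sketch under stated assumptions: it relies on the SMILES convention that “.” separates units that are not covalently bonded, and it compares components as plain strings. The function names are hypothetical, and real normalization (valence checks, tautomer handling, neutralization, canonicalization) needs a cheminformatics toolkit.

```python
def split_components(smiles: str) -> list[str]:
    """Split a SMILES string into its disconnected (covalent) units.

    In SMILES, '.' separates components that are not covalently bonded.
    """
    return [part for part in smiles.split(".") if part]

def unique_components(records: list[str]) -> dict[str, int]:
    """Count how often each distinct component string occurs across records."""
    counts: dict[str, int] = {}
    for smiles in records:
        for unit in split_components(smiles):
            counts[unit] = counts.get(unit, 0) + 1
    return counts

# Illustrative records: sodium acetate, potassium acetate, acetic acid.
records = ["CC(=O)[O-].[Na+]", "CC(=O)[O-].[K+]", "CC(=O)O"]
counts = unique_components(records)
print(counts)
```

Note that acetate and acetic acid remain distinct strings here; recognizing them as the same parent would require the neutralization step, which this sketch does not attempt.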

Error suppression strategies for scientific big data
“Trust but verify”: cross compare authoritative and curated data
Доверяй, но проверяй (doveryai, no proveryai): Russian proverb used extensively by Ronald Reagan when discussing relations with the Soviet Union. John Kerry's more recent adaptation of the phrase when discussing Syria's chemical weapons disposal: “Verify and verify”.
Cross-reference overlaps between various disease resources: Human Disease Ontology (HDO), NCBI MedGen (MED), CTD MEDIC (CTD), KEGG Disease (KEG), NDF-RT (NDF), and OrphaNet (ORD), using NLM Medical Subject Headings (MeSH) as the basis of comparison.
Cross concept count %
        CTD     HDO     KEG     MED     NDF     ORD
CTD   100.0    14.3    79.1    40.7    49.7    35.8
HDO    26.0   100.0    38.7    52.4    48.3    26.2
KEG    24.8     6.7   100.0    10.7     6.4    25.2
MED    97.2    68.9    81.6   100.0    93.8    79.6
NDF    30.4    16.3    12.5    24.0   100.0    10.8
ORD    31.9    12.8    71.6    29.7    15.7   100.0
Image credit: http://en.wikipedia.org/wiki/Ronald_Reagan
Image credit: http://en.wikipedia.org/wiki/John_Kerry
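
A cross-comparison of this kind can be sketched with set operations. The example below is an illustration, not the method used to produce the slide's numbers: the resource names and identifiers are made up, and the direction of the percentage (share of the row resource's concepts also found in the column resource) is an assumption.

```python
def overlap_percent(a: set[str], b: set[str]) -> float:
    """Percentage of concepts in `a` that also appear in `b`."""
    if not a:
        return 0.0
    return round(100.0 * len(a & b) / len(a), 1)

# Hypothetical shared identifiers per resource (made-up data for illustration).
resources = {
    "A": {"D001", "D002", "D003", "D004"},
    "B": {"D002", "D003"},
    "C": {"D003", "D004", "D005"},
}

# Build the full row-versus-column overlap matrix.
matrix = {
    row: {col: overlap_percent(ids_row, resources[col]) for col in resources}
    for row, ids_row in resources.items()
}

for row, cols in matrix.items():
    print(row, cols)
```

The matrix is asymmetric because each percentage is relative to the size of the row resource, which is one reason a small resource can be almost fully covered by a large one while the reverse is not true.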

Error suppression strategies for scientific big data
Consistency filtering: improves precision by removal of outliers
Keep consensus, remove the rest
Image credit: http://withfriendship.com/images/c/11229/Accuracy-and-precision-picture.png
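
“Keep consensus, remove the rest” could look like the following majority-vote filter. This is a minimal sketch under assumptions: the talk does not specify a voting rule, so the strict-majority threshold, the function name `consensus`, and the sample data are all illustrative.

```python
from collections import Counter
from typing import Optional

def consensus(values: list[str], min_fraction: float = 0.5) -> Optional[str]:
    """Return the value reported by more than `min_fraction` of sources, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) > min_fraction else None

# Hypothetical molecular formulas reported by different sources for two names.
reported = {
    "aspirin": ["C9H8O4", "C9H8O4", "C9H8O4", "C9H8O3"],
    "unclear-name": ["C6H6", "C6H12"],
}

filtered = {name: consensus(values) for name, values in reported.items()}
print(filtered)
```

Names without a clear majority come back as `None`, so they can be held back rather than propagated.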

Error suppression strategies for scientific big data
Address error feedback loops: use “is”, “can be”, and, if all else fails, “is not” lists
Prevent error proliferation at the data source, when possible
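
One possible reading of the three lists is sketched below. The interpretation is an assumption, not something the slide spells out: “is” holds confirmed name-to-record pairs, “can be” holds plausible alternatives, and “is not” holds pairs known to be wrong. All names, identifiers, and the function `check_assertion` are hypothetical.

```python
def check_assertion(pair, is_list, can_be_list, is_not_list):
    """Classify an incoming (name, record_id) pair against the three lists.

    The 'is not' list is checked first so a known-bad pair is never
    re-admitted, even if another source keeps submitting it.
    """
    if pair in is_not_list:
        return "reject"
    if pair in is_list:
        return "accept"
    if pair in can_be_list:
        return "accept-as-alternative"
    return "unknown"

# Hypothetical curated lists of (name, record_id) pairs.
IS = {("name-1", "rec-100")}
CAN_BE = {("name-1", "rec-101")}
IS_NOT = {("name-1", "rec-999")}

incoming = [("name-1", "rec-100"), ("name-1", "rec-101"),
            ("name-1", "rec-999"), ("name-2", "rec-100")]
results = [check_assertion(p, IS, CAN_BE, IS_NOT) for p in incoming]
print(results)
```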

Error suppression strategies for scientific big data
• Identify quality {un}known known/unknowns: use to formulate an error suppression strategy
• Perform data normalization: improves utility by helping to refine identification
• “Trust but verify”: cross compare authoritative and curated data
• Consistency filtering: improves precision by removal of outliers
• Address error feedback loops: use “is”, “can be”, and, if all else fails, “is not” lists

Okay … now what? … you have cleaned up your data … but it is huge, unwieldy, unstructured. How can it be made more useful?

Data organization strategies for scientific big data
• Crosslink and annotate data: provides context and identifies associated concepts
• Establish similarity schemes: enables identification of related records
• Associate to concept hierarchies: improves navigation between related records
• Perform data reduction: suppresses “redundant” information
• Be succinct: simplifies presentation by hiding details

Data organization strategies for scientific big data
Crosslink and annotate data: provides context and identifies associated concepts
[Diagram of linked record types. Nodes: Substance, Compound, Patent, Protein, Gene, Pathway, Disease, Publication, Drug. Link labels: inhibit, cites, encode, participates, associates, ingredient, treat.]
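
Crosslinks of this kind can be held as labeled edges and indexed for lookup in both directions. In the sketch below the link labels come from the slide, but which record types each label connects is a guess made for illustration, since the diagram itself did not survive in this transcript.

```python
from collections import defaultdict

# Assumed edges: labels are from the slide, endpoints are illustrative guesses.
LINKS = [
    ("Compound", "inhibit", "Protein"),
    ("Gene", "encode", "Protein"),
    ("Protein", "participates", "Pathway"),
    ("Patent", "cites", "Compound"),
    ("Publication", "cites", "Compound"),
    ("Compound", "ingredient", "Drug"),
    ("Drug", "treat", "Disease"),
]

def build_index(links):
    """Index every link from both ends so any record type can be a starting point."""
    index = defaultdict(list)
    for subject, predicate, obj in links:
        index[subject].append((predicate, obj))
        index[obj].append((predicate + " (inverse)", subject))
    return index

index = build_index(LINKS)
print(index["Protein"])
```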

Data organization strategies for scientific big data
Establish similarity schemes: enables identification of related records
Example: Vioxx
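
The slide does not name a particular similarity measure. As one common choice, the sketch below uses the Tanimoto (Jaccard) coefficient over sets of fingerprint bits; the bit sets, record names, and threshold are made up for illustration.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

def similar_records(query: set[int], records: dict[str, set[int]],
                    threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, best match first."""
    scored = [(name, tanimoto(query, bits)) for name, bits in records.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Hypothetical fingerprints: positions of the bits that are set.
query_bits = {1, 4, 7, 9}
records = {"rec-1": {1, 4, 7, 8}, "rec-2": {2, 3}, "rec-3": {1, 4, 7, 9}}

hits = similar_records(query_bits, records)
print(hits)
```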

Data organization strategies for scientific big data
Associate to concept hierarchies: improves navigation between related records
Match to concept = chemical, protein, gene, patent, publication, pathway, …
[Diagram: organized records matched to an independent hierarchy]
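
Matching records to an independent hierarchy lets a user move from a broad concept down to every record beneath it. The following is a minimal sketch with a made-up classification and made-up record identifiers; the function names are hypothetical.

```python
# Hypothetical hierarchy, stored as child -> parent.
PARENT = {
    "statin": "lipid-lowering agent",
    "lipid-lowering agent": "drug",
    "NSAID": "anti-inflammatory agent",
    "anti-inflammatory agent": "drug",
}

# Hypothetical matches of records to hierarchy concepts.
RECORD_CONCEPT = {"rec-1": "statin", "rec-2": "statin", "rec-3": "NSAID"}

def ancestors(concept: str) -> list[str]:
    """Walk up the hierarchy from a concept to the root."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def records_under(concept: str) -> list[str]:
    """All records matched to the concept itself or to any of its descendants."""
    return sorted(
        record for record, matched in RECORD_CONCEPT.items()
        if matched == concept or concept in ancestors(matched)
    )

print(records_under("drug"))
print(records_under("lipid-lowering agent"))
```

Because the hierarchy is kept separate from the records, it can be replaced or extended without touching the record-to-concept matches.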

Data organization strategies for scientific big data
Perform data reduction: suppresses “redundant” information
Be succinct: simplifies presentation by hiding details
“subject-predicate-object” (subject, predicate, object), e.g. “atorvastatin may treat hypercholesterolemia”
Provenance information:
• Evidence citation (PMID)
• From whom? (Data Source)
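
Data reduction over subject-predicate-object statements can be sketched as collapsing repeated statements into one while keeping their provenance. The PMIDs and source names below are placeholders, and the names `Assertion` and `reduce_assertions` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

class Assertion(NamedTuple):
    subject: str
    predicate: str
    obj: str
    pmid: str      # evidence citation (placeholder values below)
    source: str    # who provided the statement

def reduce_assertions(assertions):
    """Collapse repeated triples, keeping the union of their provenance."""
    merged = defaultdict(lambda: {"pmids": set(), "sources": set()})
    for a in assertions:
        key = (a.subject, a.predicate, a.obj)
        merged[key]["pmids"].add(a.pmid)
        merged[key]["sources"].add(a.source)
    return dict(merged)

assertions = [
    Assertion("atorvastatin", "may treat", "hypercholesterolemia", "PMID-0001", "source-A"),
    Assertion("atorvastatin", "may treat", "hypercholesterolemia", "PMID-0002", "source-B"),
    Assertion("atorvastatin", "may treat", "hypercholesterolemia", "PMID-0001", "source-B"),
]

reduced = reduce_assertions(assertions)
for triple, provenance in reduced.items():
    print(" ".join(triple), provenance)
```

The succinct view shows one statement; the evidence and data sources stay available as details behind it.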

Data organization strategies for scientific big data
• Crosslink and annotate data: provides context and identifies associated concepts
• Establish similarity schemes: enables identification of related records
• Associate to concept hierarchies: improves navigation between related records
• Perform data reduction: suppresses “redundant” information
• Be succinct: simplifies presentation by hiding details

Concluding remarks
Scientific “big data” …
• … contains an amazing amount of information
• … provides opportunities to make discoveries
• … benefits from strategies to massage it
PubChem is doing its part …
• … making chemical substance data broadly accessible
• … cross-integrating it to key scientific resources
• … suppressing errors and their propagation
• … organizing the data and making it available
https://pubchem.ncbi.nlm.nih.gov

PubChem Crew …
Steve Bryant, Tiejun Chen, Gang Fu, Lewis Geer, Renata Geer, Asta Gindulyte, Volker Hahnke, Lianyi Han, Jane He, Siqian He, Sunghwan Kim, Ben Shoemaker, Paul Thiessen, Jiyao Wang, Yanli Wang, Bo Yu, Jian Zhang
Special thanks to the NCBI Help Desk, especially Rana Morris

Any questions? If you think of one later, email me: bolton@ncbi.nlm.nih.gov
