Data Driven Discovery in Science

Data Driven Discovery in Science Alex Szalay The Johns Hopkins University

Big Data in Science • Data growing exponentially, in all science • All science is becoming data-driven • This is happening very rapidly • Data becoming increasingly open • Non-incremental! • Convergence of physical and life sciences through Big Data (statistics and computing) • The “long tail” is important • A scientific revolution in how discovery takes place => a rare and unique opportunity

Non-Incremental Changes Astronomy has always been data-driven….now becoming more generally accepted • Multi-faceted challenges • New computational tools and strategies … not just statistics, not just computer science, not just astronomy, not just genomics… • Need new data intensive scalable architectures • Science is moving increasingly from hypothesis- driven to data-driven discoveries

Scientific Data Analysis Today • Scientific data is doubling every year, reaching PBs • Data is everywhere, never will be at a single location • Architectures increasingly CPU-heavy, IO-poor • Scientists spend 90% of time on data management • Scientists need special features (arrays, GPUs) • Most data analysis on small/midsize BeoWulf clusters • Universities hitting the “power wall” • Soon we cannot even store the incoming data stream • Not scalable, not maintainable…

Data Publishing in Science • Data should be open, science should be reproducible • Publishing data is very difficult today • Too many standards… • Big projects have a mandate, do it anyway • Data sets have a “long tail” • Problem at the tail: many small and intermediate files • Practical issues with business and organizational models • Cloud maybe a better solution for the “long tail” than for the Big Data • Need an “overlay journal” for data

Software in Science • Software increasingly important for science • Scientists spend >10 years to learn/develop tools • Then they are reluctant to switch • They are comfortable in using toolkits, but fewer are willing to write major packages • Most scientific software is based on aged technology • Fortran, little distributed computing • Advanced HPC users still a very small fraction (~1%) • Data-centric computing increasingly important • Much of scientific software is reinvented over and over again

Gray’s Laws of Data Engineering Jim Gray: • Scientific computing is revolving around data • Need scale-out solution for analysis • Take the analysis to the data! • Start with “20 queries” • Go from “working to working”

Sloan Digital Sky Survey • “The Cosmic Genome Project” • Two surveys in one • Photometric survey in 5 bands • Spectroscopic redshift survey • Data is public • 2.5 Terapixels of images => 5 Tpx • 10 TB of raw data => 120TB processed • 0.5 TB catalogs => 35TB in the end • Started in 1992, finished in 2008 • Extra data volume enabled by • Moore’s Law • Kryder’s Law

Skyserver • Prototype in 21st Century data access • 930 million web hits in 10 years • The world’s most used astronomy facility today • 1,000,000 distinct users vs. 10,000 astronomers • The emergence of the “Internet scientist” • GalaxyZoo (Lintott et al) • 40 million visual galaxy classifications by the public • Enormous publicity (CNN, Times, Washington Post, BBC) • 300,000 people participating, blogs, poems… • Amazing original discoveries (Voorwerp, Green Peas)

Survey Trends in Astronomy T.Tyson (2010)

Impact of Sky Surveys

Immersive Turbulence “… the last unsolved problem of classical physics…” Feynman • Understand the nature of turbulence • Consecutive snapshots of a large simulation of turbulence:now 30 Terabytes • Treat it as an experiment, play withthe database! • Shoot test particles (sensors) from your laptop into the simulation,like in the movie Twister • Next: 70TB MHD simulation • New paradigm for analyzing simulations! with C. Meneveau, S. Chen (Mech. E), G. Eyink (Applied Math), R. Burns (CS)

Increased Diversification One shoe does not fit all! • Diversity grows naturally, no matter what • Evolutionary pressures help • Large floating point calculations move to GPUs • Fast IO moves to high Amdahl number systems • Stream processing emerging • noSQLvs databases vs column store, SciDB, etc • Individual groups want subtle specializations At the same time • What remains in the middle? Data management…. • Boutique systems dead, commodity rules • What is the role of the cloud?

Everything is a Fractal • e-Science needs different tradeoffs from eCommerce • More extended processing, fewer small transactions • Impedance mismatch between monolithic large systems and lots of individual users • Pro: Solve problems that exceed group scale, collaborate • Con: Are we back to centralized research computing?Who do I trust with my data? • Larger systems are more efficient • Smaller systems have more agility • How to make it all play nicely together?

JHU Data-Scope • Funded by NSF MRI to build a new ‘instrument’ to look at data • Goal: 102 servers for $1M + about $200K switches+racks • Two-tier: performance (P) and storage (S) • Large (5PB) + cheap + fast (400+GBps), but …. ..a special purpose instrument

Proposed Projects within JHU 19 projects total proposed for the Data-Scope, more coming, data lifetimes between 3 mo and 3 yrs

Sociology • Broad sociological changes • Data collection in ever larger collaborations (VO) • Analysis decoupled, off archived data by smaller groups • Emergence of the citizen/internet scientist • The impact of power laws: • we need to look at problems in octaves (funding?) • the scientists may be the tail of our users • there is never a discrete end or a sharp edge (except for our funding)

Summary • Science is increasingly driven by large data sets • Large data sets are here, solutions are not • Changing sociology • From hypothesis-driven to data-driven science • We need new instruments: “microscopes” and “telescopes” for data • Need to start training the next generations • П-shaped vs I-shaped people • A new, Fourth Paradigm of Science is emerging… A convergence of statistics, computer science, physical and life sciences…..

Data Driven Discovery in Science

Data Driven Discovery in Science

Presentation Transcript

Unstructured information integration through data-driven similarity discovery

Hypothesis-Driven Science

Discovery Science ISP

Dewey + DDC-driven resource discovery

Discovery-Driven Graph Summarization

Workflow discovery in e-science

Data Discovery

Data-Driven Knowledge Discovery and Philosophy of Science

Discovery Science 2001

ODD-Genes: Accelerating data-driven scientific discovery

Topic Discovery in Text-Driven Social Science Research

Data Discovery

Dewey + DDC-driven resource discovery

Data Discovery

Digital Preservation in Data-Driven Science

Data Driven

Discovery Science

Data Science From Data to Discovery

Excel in Data Science: Your Gateway to Data-Driven Excellence