Data Driven Discovery in Science

Presentation Transcript


  1. Data Driven Discovery in Science. Alex Szalay, The Johns Hopkins University

  2. Big Data in Science
  • Data growing exponentially, in all science
  • All science is becoming data-driven
  • This is happening very rapidly
  • Data becoming increasingly open
  • Non-incremental!
  • Convergence of physical and life sciences through Big Data (statistics and computing)
  • The “long tail” is important
  • A scientific revolution in how discovery takes place => a rare and unique opportunity

  3. Non-Incremental Changes
  Astronomy has always been data-driven… now becoming more generally accepted
  • Multi-faceted challenges
  • New computational tools and strategies… not just statistics, not just computer science, not just astronomy, not just genomics…
  • Need new data-intensive scalable architectures
  • Science is moving increasingly from hypothesis-driven to data-driven discoveries

  4. Scientific Data Analysis Today
  • Scientific data is doubling every year, reaching PBs
  • Data is everywhere, never will be at a single location
  • Architectures increasingly CPU-heavy, IO-poor
  • Scientists spend 90% of time on data management
  • Scientists need special features (arrays, GPUs)
  • Most data analysis on small/midsize Beowulf clusters
  • Universities hitting the “power wall”
  • Soon we cannot even store the incoming data stream
  • Not scalable, not maintainable…

  5. Data Publishing in Science
  • Data should be open, science should be reproducible
  • Publishing data is very difficult today
  • Too many standards…
  • Big projects have a mandate, do it anyway
  • Data sets have a “long tail”
  • Problem at the tail: many small and intermediate files
  • Practical issues with business and organizational models
  • Cloud may be a better solution for the “long tail” than for the Big Data
  • Need an “overlay journal” for data

  6. Software in Science
  • Software increasingly important for science
  • Scientists spend >10 years to learn/develop tools
  • Then they are reluctant to switch
  • They are comfortable using toolkits, but fewer are willing to write major packages
  • Most scientific software is based on aged technology
  • Fortran, little distributed computing
  • Advanced HPC users still a very small fraction (~1%)
  • Data-centric computing increasingly important
  • Much of scientific software is reinvented over and over again

  7. Gray’s Laws of Data Engineering
  Jim Gray:
  • Scientific computing is revolving around data
  • Need scale-out solution for analysis
  • Take the analysis to the data! (see the sketch below)
  • Start with “20 queries”
  • Go from “working to working”
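
A minimal sketch of the “take the analysis to the data” point, using Python’s standard-library sqlite3 as a stand-in for a catalog database; the table and column names (photo_obj, r_mag) are hypothetical. The point is that the aggregation runs where the data lives and only a small summary crosses the wire.

    # Hedged sketch: push the analysis (a magnitude histogram) into the
    # database instead of pulling every row to the client. The table and
    # columns are hypothetical stand-ins for a survey catalog.
    import sqlite3

    conn = sqlite3.connect("catalog.db")   # assumed local copy of a catalog

    # Anti-pattern (commented out): ship all rows to the laptop first.
    # rows = conn.execute("SELECT r_mag FROM photo_obj").fetchall()

    # Data-local pattern: the database computes 0.5-mag bins and counts;
    # only a few dozen (bin, count) pairs leave the server.
    query = """
        SELECT CAST(r_mag * 2 AS INTEGER) / 2.0 AS mag_bin,
               COUNT(*) AS n
        FROM photo_obj
        WHERE r_mag BETWEEN 14 AND 22
        GROUP BY mag_bin
        ORDER BY mag_bin
    """
    for mag_bin, n in conn.execute(query):
        print(f"{mag_bin:5.1f}  {n}")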

  8. Sloan Digital Sky Survey
  • “The Cosmic Genome Project”
  • Two surveys in one
  • Photometric survey in 5 bands
  • Spectroscopic redshift survey
  • Data is public
  • 2.5 Terapixels of images => 5 Tpx
  • 10 TB of raw data => 120 TB processed
  • 0.5 TB catalogs => 35 TB in the end
  • Started in 1992, finished in 2008
  • Extra data volume enabled by Moore’s Law and Kryder’s Law (rough arithmetic below)
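
Rough back-of-the-envelope arithmetic for the last bullet; the doubling time below is an assumed Kryder’s Law figure, not a number from the talk.

    # Why Moore's/Kryder's Law "pays for" the growing data volume over the
    # life of the survey. The 1.5-year doubling time is an assumption.
    doubling_time_years = 1.5          # assumed storage price/performance doubling
    span_years = 2008 - 1992           # SDSS: started 1992, finished 2008

    growth = 2 ** (span_years / doubling_time_years)
    print(f"Storage per dollar grows roughly {growth:,.0f}x over {span_years} years")
    # -> on the order of a thousand-fold, which is why 10 TB of raw data
    #    turning into 120 TB of processed data stays affordable.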

  9. SkyServer
  • Prototype in 21st Century data access
  • 930 million web hits in 10 years
  • The world’s most used astronomy facility today
  • 1,000,000 distinct users vs. 10,000 astronomers
  • The emergence of the “Internet scientist”
  • GalaxyZoo (Lintott et al)
  • 40 million visual galaxy classifications by the public
  • Enormous publicity (CNN, Times, Washington Post, BBC)
  • 300,000 people participating, blogs, poems…
  • Amazing original discoveries (Voorwerp, Green Peas)

  10. Survey Trends in Astronomy (T. Tyson, 2010)

  11. Impact of Sky Surveys

  12. Immersive Turbulence
  “… the last unsolved problem of classical physics…” (Feynman)
  • Understand the nature of turbulence
  • Consecutive snapshots of a large simulation of turbulence: now 30 Terabytes
  • Treat it as an experiment, play with the database!
  • Shoot test particles (sensors) from your laptop into the simulation, like in the movie Twister (toy sketch below)
  • Next: 70 TB MHD simulation
  • New paradigm for analyzing simulations!
  With C. Meneveau, S. Chen (Mech. E), G. Eyink (Applied Math), R. Burns (CS)
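
A toy sketch of the “test particles” idea: tracer particles advected through a velocity field that is queried point by point, the way a remote turbulence database would be. get_velocity() is a hypothetical stand-in for the actual database call; here it returns a simple analytic vortex so the snippet runs on its own.

    # Toy sketch of shooting tracer "sensors" into a stored flow field.
    # get_velocity() is a hypothetical placeholder for a database query.
    import numpy as np

    def get_velocity(points, t):
        """Velocity at the given (x, y) points; stand-in for a remote query."""
        x, y = points[:, 0], points[:, 1]
        return np.stack([-y, x], axis=1)        # solid-body rotation toy field

    rng = np.random.default_rng(0)
    particles = rng.uniform(-1.0, 1.0, size=(100, 2))   # 100 tracer particles
    dt, n_steps = 0.01, 500

    for step in range(n_steps):
        v = get_velocity(particles, step * dt)
        particles = particles + dt * v           # forward-Euler advection step

    print("mean radius after advection:", np.linalg.norm(particles, axis=1).mean())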

  13. Increased Diversification
  One shoe does not fit all!
  • Diversity grows naturally, no matter what
  • Evolutionary pressures help
  • Large floating point calculations move to GPUs
  • Fast IO moves to high Amdahl number systems (see the worked ratio below)
  • Stream processing emerging
  • noSQL vs databases vs column store, SciDB, etc.
  • Individual groups want subtle specializations
  At the same time:
  • What remains in the middle? Data management…
  • Boutique systems dead, commodity rules
  • What is the role of the cloud?
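
The “Amdahl number” in the bullet above is, roughly, the ratio of sequential IO bandwidth (bits per second) to CPU speed (instructions per second); Amdahl’s balance rule wants it near 1. A small worked example with illustrative, not measured, node specs:

    # Amdahl number = bits of IO per second / instructions per second.
    # The node specs below are illustrative assumptions, not measurements.
    def amdahl_number(io_bytes_per_s, instructions_per_s):
        return (8 * io_bytes_per_s) / instructions_per_s

    # Compute-heavy HPC node: fast CPU, a single spinning disk.
    print(amdahl_number(io_bytes_per_s=100e6, instructions_per_s=50e9))   # ~0.016

    # Storage-dense, "high Amdahl number" node: modest CPU, many disks/SSDs.
    print(amdahl_number(io_bytes_per_s=5e9, instructions_per_s=50e9))     # ~0.8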

  14. Everything is a Fractal
  • e-Science needs different tradeoffs from eCommerce
  • More extended processing, fewer small transactions
  • Impedance mismatch between monolithic large systems and lots of individual users
  • Pro: solve problems that exceed group scale, collaborate
  • Con: are we back to centralized research computing? Who do I trust with my data?
  • Larger systems are more efficient
  • Smaller systems have more agility
  • How to make it all play nicely together?

  15. JHU Data-Scope
  • Funded by NSF MRI to build a new ‘instrument’ to look at data
  • Goal: 102 servers for $1M + about $200K switches+racks
  • Two-tier: performance (P) and storage (S)
  • Large (5 PB) + cheap + fast (400+ GBps), but… a special-purpose instrument (per-server arithmetic below)
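
The per-server numbers implied by the headline figures above, treating the machines as one uniform pool for simplicity (an assumption; the real design splits performance and storage tiers differently):

    # Rough per-server arithmetic from the aggregate Data-Scope figures.
    # Even split across servers is assumed; the two-tier design differs.
    servers = 102
    total_pb = 5.0            # aggregate capacity
    total_gb_per_s = 400.0    # aggregate sequential IO
    server_budget_usd = 1_000_000

    print(f"~{total_pb * 1000 / servers:.0f} TB per server")            # ~49 TB
    print(f"~{total_gb_per_s / servers:.1f} GB/s per server")           # ~3.9 GB/s
    print(f"~{server_budget_usd / servers:,.0f} USD per server")        # ~9,804 USD
    print(f"~{server_budget_usd / (total_pb * 1000):,.0f} USD per TB")  # ~200 USD/TB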

  16. Proposed Projects within JHU
  19 projects proposed for the Data-Scope so far, with more coming; data lifetimes range from 3 months to 3 years

  17. Sociology
  • Broad sociological changes
  • Data collection in ever larger collaborations (VO)
  • Analysis decoupled, off archived data by smaller groups
  • Emergence of the citizen/internet scientist
  • The impact of power laws:
  • we need to look at problems in octaves (funding?) (see the sketch below)
  • the scientists may be the tail of our users
  • there is never a discrete end or a sharp edge (except for our funding)
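
A small illustration of “looking at problems in octaves”: bin dataset sizes by factors of two, the natural scale for long-tailed distributions. The sizes are made up for the example.

    # Octave (factor-of-two) binning of dataset sizes; the sizes are invented.
    import math
    from collections import Counter

    dataset_sizes_gb = [0.3, 0.8, 2, 3, 6, 10, 25, 40, 90, 300, 1200, 5000]

    octaves = Counter(math.floor(math.log2(s)) for s in dataset_sizes_gb)
    for octave in sorted(octaves):
        low, high = 2 ** octave, 2 ** (octave + 1)
        print(f"{low:g}-{high:g} GB : {octaves[octave]} datasets")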

  18. Summary
  • Science is increasingly driven by large data sets
  • Large data sets are here, solutions are not
  • Changing sociology
  • From hypothesis-driven to data-driven science
  • We need new instruments: “microscopes” and “telescopes” for data
  • Need to start training the next generations
  • Π-shaped vs I-shaped people
  • A new, Fourth Paradigm of Science is emerging… a convergence of statistics, computer science, physical and life sciences…
