
Data- and Compute-Driven Transformation of Modern Science


Presentation Transcript


  1. Data- and Compute-Driven Transformation of Modern Science • Edward Seidel, Assistant Director, Mathematical and Physical Sciences, NSF (Director, Office of Cyberinfrastructure)

  2. Profound Transformation of Science: Gravitational Physics • Galileo, Newton usher in the birth of modern science: c. 1600 • Problem: a single “particle” (apple) in a gravitational field (the general 2-body problem was already too hard) • Methods • Data: notebooks (Kbytes) • Theory: driven by data • Computation: calculus by hand (1 Flop/s) • Collaboration: 1 brilliant scientist, 1-2 students

  3. Profound Transformation of Science: Collision of Two Black Holes • 1972 • Science result: the “Pair of Pants” • Team size: 1 person (S. Hawking) • Computation: ~Flop/s (by hand) • Data produced: ~Kbytes (text, hand-drawn sketch) • 1994 • Science result: the “Pair of Pants” • Team size: ~10 • Data produced: ~50 Mbytes • 400 years later…same!

  4. Move to 3D: 1000x More Data! • Science result: 3D collision • Year: 1998 • Team size: ~15 • Data produced: ~50 Gbytes

  5. Just Ahead: Complexity of the Universe (LHC, Gamma-ray Bursts!) • Gamma-ray bursts! • GR is now soluble: complex problems in relativistic astrophysics • All the energy the sun emits in its lifetime bursts out in a few seconds: what are they?! Colliding BH-NS binaries? Supernovae? • GR, hydrodynamics, nuclear physics, radiation transport, neutrinos, magnetic fields: a globally distributed collaboration! • Scalable algorithms, complex AMR codes, visualization, PFlops*week of computing, PB of output! • LHC: the Higgs particle? • ~10K scientists, 33+ countries, 25 PB • A planetary lab for scientific discovery!
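
To put the slide's "PFlops*week" figure in perspective, a quick back-of-envelope conversion (my arithmetic, not from the slides):

```python
# What does a petaflop/s-week of computing amount to?
PFLOPS = 10**15                    # floating-point operations per second
SECONDS_PER_WEEK = 7 * 24 * 3600   # 604,800 s

total_ops = PFLOPS * SECONDS_PER_WEEK
print(f"1 PFlop/s for one week = {total_ops:.2e} operations")  # ~6.05e20
# At "1 Flop/s by hand" (slide 2), the same calculation would take
# ~1.9e13 years, roughly 1,400 times the age of the universe.
print(f"By hand: {total_ops / (3600 * 24 * 365.25):.1e} years")
```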

  6. Grand Challenge Communities Combine It All... Where is it going to go? • The same CI is useful for black holes and hurricanes

  7. Framing the Question: Science is Radically Revolutionized by CI • Modern science • Data- and compute-intensive • Integrative • Multiscale • Collaborations for complexity • Individuals, groups • Teams, communities • Must transition the NSF CI approach to support this • Integrative, multiscale • 4 centuries of constancy, then 4 decades of 10^9-10^12 change! …But such radical change cannot be adequately addressed with (our current) incremental approach! We still think like this… Students take note!

  8. Data Crisis: Information Big Bang • Studies: NSF experts, PCAST digital data report, Wired, Nature, industry • Storage Networking Industry Association (SNIA) 100 Year Archive Requirements Survey Report: “there is a pending crisis in archiving… we have to create long-term methods for preserving information, for making it available for analysis in the future.” • 80% of respondents: retention >50 yrs; 68%: >100 yrs • Image: Scientific Computing and Imaging Institute, University of Utah

  9. Explosive Trends in Data Growth • Comparative metagenomics • DNA sequencing of entire families of organisms • Already hundreds of TB, thousands of users • HD collaborations and OptiPortals • Multichannel HD, gigapixel visualizations • Petascale-exascale simulation • Peta- to exabytes generated per simulation! • Square Kilometer Array • 3000 radio receivers, 1 km² area! • 19 countries! Possibly beginning in 201X, operational 202X • Data: an exabyte per week! Analysis: Exaflops!
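
To make "an exabyte per week" concrete as a sustained network and storage load, a minimal back-of-envelope sketch (my arithmetic, not from the slides):

```python
# Sustained data rate implied by "an exabyte per week"
EXABYTE = 10**18                   # bytes
SECONDS_PER_WEEK = 7 * 24 * 3600   # 604,800 s

rate = EXABYTE / SECONDS_PER_WEEK             # bytes per second
print(f"{rate / 10**12:.2f} TB/s sustained")  # ~1.65 TB/s
print(f"{8 * rate / 10**12:.1f} Tbit/s")      # ~13.2 Tbit/s of network capacity
```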

  10. Provenance in Science (source: Juliana Freire, U of Utah) • Provenance is as important as the result • Not a new issue • Lab notebooks have been used for a long time • What is new? • Large volumes of data • Complex analyses: computational processes • Writing notes by hand is no longer an option… • GC communities require open, sharable data, standards, metadata • [Figure: Lederberg's annotated lab notebook recording observed data on DNA recombination]
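
As one concrete illustration of the machine-captured provenance the slide calls for, here is a minimal, hypothetical sketch (the record fields and the helper name are my own, not any DataNet or community standard): it hashes the inputs, stamps the code version and platform, and stores the record beside the result so the analysis can be audited and reproduced later.

```python
import hashlib
import json
import platform
import subprocess
from datetime import datetime, timezone

def write_provenance(input_paths, params, result_path):
    """Record what produced result_path: input hashes, parameters,
    code version, platform, and a timestamp."""
    record = {
        "result": result_path,
        "created": datetime.now(timezone.utc).isoformat(),
        # Content hashes let a reader verify the exact inputs used
        "inputs": {p: hashlib.sha256(open(p, "rb").read()).hexdigest()
                   for p in input_paths},
        "parameters": params,
        # Git commit of the analysis code, if run inside a repository
        "code_version": subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True).stdout.strip(),
        "platform": platform.platform(),
    }
    with open(result_path + ".prov.json", "w") as f:
        json.dump(record, f, indent=2)
    return record
```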

  11. “Data Deluge” Drives Change at NSF • There is a major shift in science towards data-intensive methods. NSF is responding… • Data issues resonate the most across NSF units! • DataNet: a $100M investment in Sustainable Archive & Access Partners; development of a widely accessible network of interactive data archives, driven by today's grand challenges and integrating multiple disciplines • INTEROP: community-led interoperability; interdisciplinary, community approaches to combining and re-using data in ways not envisioned by their creators • Data-intensive computing: the SDSC Gordon facility • NSF Data Policy: the “Data Working Group” (an NSF-wide group of Program Directors) is working to assure that data are shared within and across disciplines

  12. NSF Vision and National CI Blueprint • [Diagram: Track 1 and Track 2 HPC centers, campuses and campus networks, and DataNet archives linked into a single national CI ecosystem] • Education • Software • Crisis: I need all of this to start to solve my problem! • Science is becoming unreproducible in this environment. Validation? Provenance? Reproducibility?

  13. The Shift Towards Data: Implications • All science is becoming data-dominated • Experiment, computation, theory • Totally new methodologies • Algorithms, mathematics • All disciplines, from science and engineering to the arts and humanities • End-to-end networking becomes a critical part of the CI ecosystem • Campuses, please note! • How do we train “data-intensive” scientists? • Data policy becomes critical!

  14. Recent NSF Activities on Data Policy and Implementation

  15. Fundamental Points on Data and Publication Policy • Publicly funded scientific data and publications should be available, and science benefits • There has to be a place to keep data, and a way to access it • There needs to be an affordable, sustainable cost model for this • Open questions: Who pays? The NSF? The institution? What is the cost model? What is reasonable? Where is it placed? Author web site? Library? NSF sites? What data must be made available? Raw data? Peer reviewed? When is it available? 6 months? 1 year? After publication? How long is it made available? How do we enforce it post-award? • There is great variability in requirements across science communities: peer review can help guide this process.

  16. Changes Coming for Data! • Long-standing NSF policy on data (Proposal & Award Policies & Procedures Guide): “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data... created or gathered in the course of work under NSF grants” • NSF will soon require a Data Management Plan (DMP), subject to peer review and a criterion for award • The DMP will take the form of a 2-page supplementary document to the proposal • It will not be possible to submit a proposal without one • Customization by discipline and program will be necessary

  17. Upcoming Implementation of NSF Data Policy • Directorate-specific issues & peer review • Many details are implemented and enforced via peer review and Program Officer discretion, including the embargo period, standards, etc. • The challenge at NSF is that no one size fits all, so each Directorate will be responsible for its own recommendations for DMP content, appropriate institutional repositories, etc. • This does not address Open Access as applied to publications

  18. Electronic Access to Scientific Publications

  19. Why is this Important? • Science requires it • Scientific progress is accelerated by making publications available and searchable • Results in one community need to propagate easily to another for multidisciplinary, complex problem solving • Search technologies can be brought to bear • Publications need to be associated with rich information: videos of simulations, supporting data, simulation and analysis codes… • Equality and broadening participation • Young scientists at smaller universities are at a needless disadvantage without it; they may lose journal access. This hurts science and puts talent at risk • US Administration focus on transparency and accountability

  20. Current Activities • We have begun serious discussions within NSF on these issues • A National Science Board Committee on Data has started • Goals similar to those for data • We have had numerous visits from funding agencies from around the world • The primary topic: what is NSF doing on OA? • Discussions with various publishers and libraries to explore options • The quality of science relies on the peer review systems of the best journals; we need a way to support OA

  21. On Working with Publishers • Quality of science, identification of talented scientists: we rely on the peer review systems of the best journals • NSF receives an assurance that the work done on a grant meets a standard • Universities use impact factors as part of their tenure and promotion processes • “I believe it is in the interests of science, and hence the public interest, to help journals find a viable OA business model.” (Bernard Schutz, presentation to NSF, May 2009)

  22. Final Remarks • Science is becoming collaborative and data-dominated • We are accelerating efforts to advance NSF in all aspects of data • Science requires that data be open and accessible; we are working to achieve this • All forms of data are important and must be more tightly connected in the future • Collections, software, publications • Time is of the essence
