XSEDE14 Reproducibility Workshop: Reproducibility in Large Scale Computing – Where do we stand

Mark R. Fahey, NICS
Robert McLay, TACC
XSEDE14 - Reproducibility Workshop

Reproducibility – what it means to me
• Full documentation of how an experiment (simulation) was conducted
  • Source code (unique versioning)
  • Input data
  • Computing environment
    • Hardware
    • Software (probably the most lacking component)
      • How often are the OS, compiler, and MPI versions documented fully enough that one knows how to reproduce the build environment?
      • Ever seen a list of all the libraries linked into a code and the version of each library?
  • Published results
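As a rough illustration of the items on this slide, a run record might bundle the source version, input data, and computing environment into one structure. This is only a sketch: the field names, the `environment_record` helper, and the example values are hypothetical, and a real record would also need compiler, MPI, and library versions that the Python standard library cannot see.

```python
import json
import platform

def environment_record(source_version, input_files):
    """Build a plain dict describing one run; all field names are illustrative."""
    return {
        "source_version": source_version,      # e.g. a version-control revision id
        "input_files": sorted(input_files),    # sorted so records compare reliably
        "hardware": {
            "machine": platform.machine(),
        },
        "software": {
            "os": platform.system(),
            "os_release": platform.release(),
            "python": platform.python_version(),
        },
    }

# Hypothetical revision id and input file names, for illustration only.
record = environment_record("a1b2c3d", ["mesh.dat", "config.in"])
print(json.dumps(record, indent=2))
```

Storing such a record next to the published results is one simple way to keep the documentation and the outputs together.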

Computing Center responsibilities
• Yale Report makes no mention of the role of computing centers
• I believe computing centers have an obligation to help solve some of the reproducibility issues
  • Namely documentation of the software environment
• Expecting a researcher to document all the system software in a complete way is asking too much
  • A researcher may not know what should be documented

What can/should be done
• Need an automatic way to collect the information on the software (and versions) used by the researcher
  • This is what the centers (national and campus level) should be providing
• A couple of prototypes exist that do this
  • For example, NICS and TACC provide two similar but slightly different prototypes (ALTD and Lariat, respectively) that capture the libraries and their versions for each code built and run
  • Solves part of the documentation problem; in fact, NERSC uses ALTD so that users can look up provenance data from old builds and rebuild their codes exactly as they did months or years before
• A new effort (called XALT) is under development to combine and extend these prototypes from NICS and TACC to capture even more information: everything mentioned above
• Every center should be doing this for a variety of reasons
  • Better user support; efficient use of staff resources
  • Provenance data collection
  • Security-related concerns
  • And of course documentation for reproducibility
• Collecting this information is very doable (as proven by the prototypes) and has proven to be very useful. It would greatly help researchers provide the information the Yale report recommends
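To make "capture the libraries for each code built and run" concrete, the sketch below parses the text that the Linux `ldd` command prints for an executable and returns the shared libraries it resolves. This is not how ALTD, Lariat, or XALT work internally; it only illustrates the kind of information they record. The `parse_ldd` helper and the sample text are assumptions made for the example.

```python
import re

# Matches lines such as "libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x...)".
# Lines without "=>" (the vDSO, the dynamic loader) and "not found" lines are skipped.
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")

def parse_ldd(text):
    """Return {library name: resolved path} from ldd-style output."""
    libs = {}
    for line in text.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            libs[match.group(1)] = match.group(2)
    return libs

# Sample text in the usual ldd layout; the paths and addresses are made up.
SAMPLE = """\
\tlinux-vdso.so.1 (0x00007ffd1a5f2000)
\tlibm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2c3c000000)
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2c3bc00000)
\t/lib64/ld-linux-x86-64.so.2 (0x00007f2c3c400000)
"""

for name, path in sorted(parse_ldd(SAMPLE).items()):
    print(name, "->", path)
```

On a real system the input text could come from running `ldd` on the executable. The result lists dynamically linked libraries only, so statically linked libraries would still need to be captured at link time.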

What can/should be done (2)
• Computing centers (university and national level) can also somewhat address repository and software versioning issues for researchers by providing snapshots of the OS and libraries and providing views into databases
• Centers could document each and every version of all of their software and how long it was the default on the machine
  • There are already efforts to capture most of this information at some centers
  • For example, at NICS, the programming environment software versioning is documented
  • Provide users a list of all the system defaults at any time from the past with “all-in-one” modules
• Centers could make RPM bundles of the system software and provide a test bed cluster with which one could “revert” to past system software installations to confirm reproducibility
  • Only for the life of the technology/award
  • Test bed clusters are sometimes not part of HPC deployments
  • Test bed clusters would likely be only a few nodes, unable to reproduce large simulations
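One way to picture "a list of all the system defaults at any time from the past" is a per-package history of the dates on which each version became the default, queried by date. The sketch below assumes such a history exists; the package names, versions, dates, and the `default_on` helper are all hypothetical and do not describe the NICS implementation.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history: for each package, (date a version became default, version),
# kept sorted by date.
HISTORY = {
    "example-compiler": [
        (date(2012, 1, 10), "1.0"),
        (date(2013, 6, 3), "1.1"),
        (date(2014, 2, 17), "2.0"),
    ],
    "example-mpi": [
        (date(2012, 1, 10), "5.4"),
        (date(2013, 11, 25), "6.0"),
    ],
}

def default_on(package, day):
    """Return the version that was the default on `day`, or None if none was yet."""
    entries = HISTORY[package]
    starts = [start for start, _ in entries]
    index = bisect_right(starts, day)  # number of entries with start date <= day
    if index == 0:
        return None
    return entries[index - 1][1]

def defaults_on(day):
    """Return every package's default version on `day`."""
    return {package: default_on(package, day) for package in HISTORY}

print(defaults_on(date(2013, 7, 1)))
```

A researcher who knows when a code was built could use a query like this to recover the defaults in effect at that time.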
