Data-Driven Computational Science and Future Architectures at the Pittsburgh Supercomputing Center Ralph Roskies Scientific Director, Pittsburgh Supercomputing Center Jan 30, 2009
NSF TeraGrid Cyberinfrastructure • Mission: Advancing scientific research capability through advanced IT • Resources: Computational, Data Storage, Instruments, Network
Now is a Resource-Rich Time • NSF has funded two very large distributed-memory machines available to the national research community • Track 2a (Texas): Ranger (62,976 cores, 579 teraflops, 123 TB memory) • Track 2b (Tennessee): Kraken (18,048 cores, 166 teraflops, 18 TB memory), growing to close to a petaflop • Track 2d: data-centric, experimental architecture; proposals in review • All part of TeraGrid. The largest single allocation this past September was 46M processor-hours. • In 2011, NCSA will field a 10 PF machine.
Increasing Importance of Data in Scientific Discovery • Large amounts of data from instruments and sensors: genomics, the Large Hadron Collider, huge astronomy databases (Sloan Digital Sky Survey, Pan-STARRS, Large Synoptic Survey Telescope) • Results of large simulations (CFD, MD, cosmology, …)
Insight by Volume: NIST Machine Translation Contest • In 2005, Google beat all the experts by exploiting 200 billion words of documents (Arabic-to-English, high-quality UN translations), looking at all 1-word, 2-word, …, 5-word phrases and estimating the best translation of each; it then applied those estimates to the test text (a sketch of the phrase-counting idea appears below). • No one on the Google team spoke Arabic or understood its syntax! • Results depend critically on the volume of text analyzed; 1 billion words would not have sufficed.
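The statistical core of this approach is exhaustive phrase counting. The following is a minimal sketch of that idea only, not Google's system; the toy corpus, the 5-word cap, and the fixed-size arrays are illustrative assumptions.

```c
/* Minimal sketch of n-gram (phrase) counting, the statistical core of the
 * volume-based translation idea described above. Toy corpus and sizes. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORDS  1024
#define MAX_NGRAMS 8192
#define MAX_N      5

static char *ngrams[MAX_NGRAMS];
static int n_ngrams = 0;

/* Join words[start..start+n-1] into one phrase string and store it. */
static void add_ngram(char **words, int start, int n)
{
    char buf[512] = "";
    for (int i = 0; i < n; i++) {
        if (i) strcat(buf, " ");
        strcat(buf, words[start + i]);
    }
    if (n_ngrams < MAX_NGRAMS)
        ngrams[n_ngrams++] = strdup(buf);
}

static int cmp(const void *a, const void *b)
{
    return strcmp(*(char * const *)a, *(char * const *)b);
}

int main(void)
{
    /* Toy "corpus"; a real system would stream billions of words. */
    char text[] = "the cat sat on the mat the cat sat on the rug";
    char *words[MAX_WORDS];
    int nw = 0;

    for (char *tok = strtok(text, " "); tok && nw < MAX_WORDS;
         tok = strtok(NULL, " "))
        words[nw++] = tok;

    /* Emit every 1-word .. 5-word phrase. */
    for (int n = 1; n <= MAX_N; n++)
        for (int s = 0; s + n <= nw; s++)
            add_ngram(words, s, n);

    /* Sort so identical phrases are adjacent, then count runs. */
    qsort(ngrams, n_ngrams, sizeof(char *), cmp);
    for (int i = 0; i < n_ngrams; ) {
        int j = i;
        while (j < n_ngrams && strcmp(ngrams[i], ngrams[j]) == 0) j++;
        if (j - i > 1)                      /* print repeated phrases only */
            printf("%4d  \"%s\"\n", j - i, ngrams[i]);
        i = j;
    }
    return 0;
}
```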
What computer architecture is best for data-intensive work? Based on discussions with many communities, we believe that a complementary architecture embodying large shared memory will be invaluable • Large graph algorithms (many fields, including web analysis, bioinformatics, …) • Rapid assessment of data-analysis ideas, using OpenMP rather than MPI, with access to large data (see the OpenMP sketch below)
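As a minimal sketch of the kind of rapid prototyping meant here, the following OpenMP loop computes simple statistics over a dataset held entirely in shared memory, with no domain decomposition or message passing. The array size and the statistic computed are illustrative assumptions, not taken from the slides.

```c
/* Sketch of rapid shared-memory analysis with OpenMP: one parallel loop
 * over a dataset that fits entirely in (large) shared memory, with no
 * explicit domain decomposition or message passing.                    */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main(void)
{
    const size_t n = 1u << 26;              /* illustrative size (~64M doubles) */
    double *x = malloc(n * sizeof *x);
    if (!x) return 1;

    for (size_t i = 0; i < n; i++)          /* stand-in for loading real data */
        x[i] = sin((double)i);

    double sum = 0.0, sumsq = 0.0;
    #pragma omp parallel for reduction(+:sum,sumsq)
    for (size_t i = 0; i < n; i++) {
        sum   += x[i];
        sumsq += x[i] * x[i];
    }

    double mean = sum / n;
    double var  = sumsq / n - mean * mean;
    printf("mean = %g  variance = %g  (threads = %d)\n",
           mean, var, omp_get_max_threads());
    free(x);
    return 0;
}
```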
PSC Facilities • XT3 (BigBen): 4,136 processors, 22 TFlop/s • Altix (Pople): 768 processors, 1.5 TB shared memory • Visualization nodes: NVIDIA Quadro4 980XGL • Storage cache nodes: 100 TB • Storage silos: 2 PB, DMF archive server
PSC Shared Memory Systems • Pople, introduced in March 2008 • SGI Altix 4700: 768 Intel cores, 1.5 TB coherent shared memory, NUMAlink interconnect • Highly oversubscribed • Has already stimulated work in new areas because of the perceived ease of programming in shared memory: game theory (poker), epidemiological modeling, social network analysis, economics of Internet connectivity, fMRI studies of cognition
Desiderata for New System • Powerful Performance • Programmability • Support for current applications • Support for a host of new applications and science communities.
Proposed Track 2 System at PSC • Combines next-generation Intel processors (Nehalem-EX) with SGI's next-generation interconnect technology (NUMAlink 5) • ~100,000 cores, ~100 TB memory, ~1 PF peak • At least 4 TB coherent shared-memory components, with fully globally addressable memory • Superb MPI and I/O performance
Accelerated Performance • MPI Offload Engine (MOE) • Frees the CPU from MPI activity • Faster reductions (2-3× compared to competing clusters/MPPs) • Order-of-magnitude faster barriers and random access (see the timing sketch below) • NUMAlink 5 advantage • 2-3× MPI latency improvement • 3× the bandwidth of InfiniBand QDR • Special support for block transfer and global operations • Massively memory-mapped I/O • Under user control • Big speedup for I/O-bound apps
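The MOE and NUMAlink 5 features above are vendor hardware/firmware capabilities; from the application's point of view, the operations they accelerate are ordinary MPI collectives. A minimal timing sketch of those collectives, with an arbitrarily chosen iteration count and no vendor-specific calls:

```c
/* Minimal timing sketch for the MPI collectives mentioned above
 * (reductions and barriers); no vendor-specific calls are used. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;                 /* arbitrary repetition count */
    double local = rank + 1.0, global = 0.0;

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t_red = (MPI_Wtime() - t0) / iters;

    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t_bar = (MPI_Wtime() - t0) / iters;

    if (rank == 0)
        printf("avg allreduce: %.3g s   avg barrier: %.3g s\n", t_red, t_bar);

    MPI_Finalize();
    return 0;
}
```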
Enhanced Productivity from Shared Memory • Easier shared-memory programming for rapid development/prototyping • Will allow large-scale generation of data and analysis on the same platform, without moving the data (a major problem for current Track 2 systems) • Mixed shared-memory/MPI programming between much larger blocks (e.g., Woodward's PPM code, or the example below)
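A minimal sketch of the mixed model: MPI between a few large shared-memory blocks, OpenMP threads within each block. The summation problem and the per-rank array size are illustrative assumptions, not taken from PPM or any PSC code.

```c
/* Sketch of hybrid programming between large shared-memory blocks:
 * MPI across blocks, OpenMP threads within each block.              */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const size_t n = 1u << 24;               /* per-rank slice, illustrative */
    double *x = malloc(n * sizeof *x);
    for (size_t i = 0; i < n; i++)
        x[i] = 1.0;

    /* Threads share the rank's slice directly; no intra-block messaging. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (size_t i = 0; i < n; i++)
        local += x[i];

    /* One MPI reduction between the (few, large) blocks. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks x %d threads, total = %g\n",
               nranks, omp_get_max_threads(), total);

    free(x);
    MPI_Finalize();
    return 0;
}
```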
High-Productivity, High-Performance Programming Models • The T2c system will support programming models for: extreme capability, algorithm expression, user productivity, workflows • The four supported model families: message passing (MPI, shmem; hybrid MPI/OpenMP, MPI/threaded; Charm++), PGAS (UPC, CAF), coherent shared memory (OpenMP, pthreads), and high-productivity environments (Star-P: parallel MATLAB, Python, R)
Programming Models: Petascale Capability Applications • Full-system applications will run in any of the four programming models listed above • Dual emphasis on performance and productivity • Existing codes • Optimization for multicore • New and rewritten applications
Programming Models: High-Productivity Supercomputing • Algorithm development • Rapid prototyping • Interactive simulation • Also: analysis and visualization, computational steering, workflows
Programming Models: New Research Communities • Multi-TB coherent shared memory • Global address space • Express algorithms not served by distributed systems • Complex, dynamic connectivity • Simplify load balancing
Enhanced Service for Current Power Users: Analyze Massive Data Where You Produce It • Combines superb MPI performance with shared memory and higher-level languages for rapid analysis prototyping
Analysis of Seismology Simulation Results • Validation across models (Quake: CMU; AWM: SCEC). 4D waveform output at 2 Hz (to address civil-engineering structures) for 200 s earthquake simulations will generate hundreds of TB of output. • Voxel-by-voxel comparison is not an appropriate comparison technique. PSC developed data-intensive statistical analysis tools to understand subtle differences in these vast spatiotemporal datasets. • This required having substantial windows of both datasets in memory to compare (see the sketch below).
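The following is not PSC's analysis tool; it is a minimal sketch of the window-based idea under stated assumptions: both simulations' waveforms are held in shared memory and compared window by window with a simple statistic (here, Pearson correlation), parallelized across windows with OpenMP. The arrays named quake and awm merely stand in for the two models' outputs; the sizes and the statistic are illustrative.

```c
/* Sketch of window-based comparison of two waveform datasets held in
 * shared memory (not PSC's actual tools; the correlation statistic and
 * window size are illustrative).                                       */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

/* Pearson correlation of two windows of length w. */
static double window_corr(const double *a, const double *b, size_t w)
{
    double sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
    for (size_t i = 0; i < w; i++) {
        sa += a[i]; sb += b[i];
        saa += a[i]*a[i]; sbb += b[i]*b[i]; sab += a[i]*b[i];
    }
    double cov = sab - sa*sb/w;
    double va  = saa - sa*sa/w, vb = sbb - sb*sb/w;
    return cov / sqrt(va*vb + 1e-30);
}

int main(void)
{
    const size_t n = 1u << 24, w = 4096;     /* illustrative sizes */
    double *quake = malloc(n * sizeof *quake);
    double *awm   = malloc(n * sizeof *awm);
    for (size_t i = 0; i < n; i++) {         /* stand-ins for real outputs */
        quake[i] = sin(0.001 * i);
        awm[i]   = sin(0.001 * i + 0.01);
    }

    size_t nwin = n / w;
    double worst = 1.0;
    #pragma omp parallel for reduction(min:worst)
    for (size_t k = 0; k < nwin; k++) {
        double c = window_corr(quake + k*w, awm + k*w, w);
        if (c < worst) worst = c;
    }
    printf("lowest window correlation: %f over %zu windows\n", worst, nwin);

    free(quake); free(awm);
    return 0;
}
```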
Design of LSST Detectors • Gravitational lensing can map the distribution of dark matter in the Universe and make estimates of dark energy content more accurate. • The measurements are very subtle. • High-quality modeling, with robust statistics, is needed for LSST detector design. • Must calculate ~10,000 light cones through each simulated universe. • Each universe is 30 TB. • Each light-cone calculation requires analyzing large chunks of the entire dataset.
Understanding the Processes that Drive Stress-Corrosion Cracking (SCC) • Stress-corrosion cracking affects the safe, reliable performance of buildings, dams, bridges, and vehicles. • Corrosion costs the U.S. economy about 3% of GDP annually. • Predicting the lifetime beyond which SCC may cause failure requires multiscale simulations that couple quantum, atomistic, and structural scales. • 100-300 nm, 1-10 million atoms, over 1-5 μs, 1 fs timestep • Efficient execution requires large SMP nodes to minimize surface-to-volume communication, large cache capacity, and high-bandwidth, low-latency communication. • The system is expected to achieve the ~1000 timesteps per second needed for realistic simulation of stress-corrosion cracking. Courtesy of Priya Vashishta, USC. A crack in the surface of a piece of metal grows from the activity of atoms at the point of cracking; quantum-level simulation (right panel) leads to modeling the consequences (left panel). From http://viterbi.usc.edu/news/news/2004/2004_10_08_corrosion.htm
Analyzing the Spread of Pandemics • Understanding the spread of infectious diseases is critical for effective response to disease outbreaks (e.g., avian flu). • EpiFast: a fast, reliable method for simulating pandemics, based on a combinatorial interpretation of percolation on directed networks (see the sketch below) • Madhav Marathe, Keith Bisset, et al., Network Dynamics and Simulation Science Laboratory (NDSSL) at Virginia Tech • Large shared memory is needed for efficient implementation of the graph-theoretic algorithms that simulate transmission networks modeling how disease spreads from one individual to the next. • 4 TB of shared memory will allow study of world-wide pandemics. From Karla Atkins et al., An Interaction Based Composable Architecture for Building Scalable Models of Large Social, Biological, Information and Technical Systems, CTWatch Quarterly, March 2008, http://www.ctwatch.org
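This is not the EpiFast algorithm itself; it is only a minimal sketch of percolation-style spread on a tiny directed contact network held in memory, under assumed values: each contact edge transmits with probability p, and a breadth-first search from the index case marks who becomes infected.

```c
/* Minimal sketch of percolation-style disease spread on a directed
 * contact network held in shared memory (not the EpiFast algorithm;
 * graph, probability, and sizes are illustrative).                  */
#include <stdio.h>
#include <stdlib.h>

#define N 6   /* people */

int main(void)
{
    /* Directed contact network in CSR form (illustrative). */
    int row[N + 1] = {0, 2, 4, 5, 6, 7, 7};
    int col[]      = {1, 2, 3, 4, 4, 5, 5};
    double p = 0.6;                          /* transmission probability */

    srand(42);
    int infected[N] = {0};
    int queue[N], head = 0, tail = 0;

    infected[0] = 1;                         /* index case */
    queue[tail++] = 0;

    /* BFS over the percolated graph: each contact edge "opens" with
     * probability p, and open edges transmit infection.             */
    while (head < tail) {
        int u = queue[head++];
        for (int e = row[u]; e < row[u + 1]; e++) {
            int v = col[e];
            if (!infected[v] && (double)rand() / RAND_MAX < p) {
                infected[v] = 1;
                queue[tail++] = v;
            }
        }
    }

    int total = 0;
    for (int i = 0; i < N; i++) total += infected[i];
    printf("outbreak size: %d of %d\n", total, N);
    return 0;
}
```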
Engaging New Communities: Memory-Intensive Graph Algorithms • Web analytics • Applications: fight spam, rank importance, cluster information, determine communities • Algorithms are notoriously hard to implement on distributed-memory machines. • Example: the web link graph: 10^10 pages, 10^11 links, 40 bytes/link → 4 TB (see the sketch below). Courtesy Guy Blelloch (CMU)
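A minimal sketch of why a single large address space helps here: with the whole link graph resident in shared memory, a ranking-style iteration (a simplified PageRank, chosen as an illustration, not named on the slide) is just a parallel loop over a CSR structure. The toy graph, damping factor, and iteration count are assumptions.

```c
/* Sketch of a PageRank-style iteration over a link graph kept entirely
 * in one shared address space (toy graph; sizes illustrative).         */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const int n = 5;                          /* pages (toy) */
    /* CSR of incoming links: page i is pointed to by col[row[i]..row[i+1]) */
    int row[]    = {0, 2, 4, 6, 7, 8};
    int col[]    = {1, 2, 0, 3, 0, 1, 2, 3};
    int outdeg[] = {2, 2, 2, 2, 0};           /* out-degree of each page */
    double d = 0.85;                          /* damping factor */

    double *rank = malloc(n * sizeof *rank);
    double *next = malloc(n * sizeof *next);
    for (int i = 0; i < n; i++) rank[i] = 1.0 / n;

    for (int iter = 0; iter < 20; iter++) {
        /* Every thread reads the whole rank[] array in place; no
         * partitioning or message passing of graph data is needed. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int e = row[i]; e < row[i + 1]; e++)
                s += rank[col[e]] / outdeg[col[e]];
            next[i] = (1.0 - d) / n + d * s;
        }
        double *tmp = rank; rank = next; next = tmp;
    }

    for (int i = 0; i < n; i++)
        printf("page %d: %.4f\n", i, rank[i]);
    free(rank); free(next);
    return 0;
}
```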
More Memory-Intensive Graph Algorithms • Biological pathways: protein-interaction graphs • Computer security: IP-packet session graphs • Analyzing buying habits: items appearing on a common receipt • Machine translation: word-adjacency graphs • Also: epidemiology, social networks, … Courtesy Guy Blelloch (CMU)
PSC T2c: Summary • PSC’s T2c system, when awarded, will leverage architectural innovations in the processor (Intel Nehalem-EX) and the platform (SGI Project Ultraviolet) to enable groundbreaking science and engineering simulations using both “traditional HPC” and emerging paradigms • Complement and dramatically extend existing NSF program capabilities • Usability features will be transformative • Unprecedented range of target communities • perennial computational scientists • algorithm developers, especially those tackling irregular problems • data-intensive and memory-intensive fields • highly dynamic workflows (modify code, run, modify code again, run again, …) • Reduced concept-to-results time transforming NSF user productivity
Integrated in National Cyberinfrastructure • Enabled and supported by PSC's advanced user support, application and system optimization, and middleware and infrastructure, leveraging national cyberinfrastructure
Predicting Mesoscale Atmospheric Phenomena • Accurate prediction of atmospheric phenomena at the 1-100 km scale is needed to reduce economic losses and injuries due to strong storms. • To achieve this, we require 20-member ensemble runs at 1 km resolution, covering the continental US, with dynamic data assimilation in quasi-real time. • Ming Xue, University of Oklahoma • Reaching 1.0-1.5 km resolution is critical. (In certain weather situations, fewer ensemble members may suffice.) • The system is expected to sustain 200 Tf/s for WRF, enabling prediction of atmospheric phenomena at the mesoscale. Fanyou Kong et al., Real-Time Storm-Scale Ensemble Forecast Experiment – Analysis of 2008 Spring Experiment Data, Preprints, 24th Conf. on Severe Local Storms, Amer. Meteor. Soc., 27-31 October 2008. http://twister.ou.edu/papers/Kong_24thSLS_extendedabs-2008.pdf
Reliability • Hardware-enabled fault detection, prevention, and containment • Enhanced monitoring and serviceability • NUMAlink automatic retry and various error-correcting mechanisms