Extreme Scalability Working Group (XS-WG): Status Update
Nick Nystrom
Director, Strategic Applications
Pittsburgh Supercomputing Center
October 22, 2009
Extreme Scalability Working Group (XS-WG): Purpose • Meet the challenges and opportunities of deploying extreme-scale resources into the TeraGrid, maximizing both scientific output and user productivity. • Aggregate, develop, and share wisdom • Identify and address needs that are common to multiple sites and projects • May require assembling teams and obtaining support for sustained effort • XS-WG benefits from active involvement of all Track 2 sites, Blue Waters, tool developers, and users. • The XS-WG leverages and combines RPs’ interests to deliver greater value to the computational science community.
XS-WG Participants • Amit Majumdar SDSC, TG AUS AD • Mahin Mahmoodi PSC, Tools lead • Allen Malony Univ. of Oregon (P) • David O'Neal PSC • Dmitry Pekurovsky SDSC • Wayne Pfeiffer SDSC • Raghu Reddy PSC, Scalability lead • Sergiu Sanielevici PSC • Sameer Shende Univ. of Oregon (P) • Ray Sheppard IU • Alan Snavely SDSC • Henry Tufo NCAR • George Turner IU • John Urbanic PSC • Joel Welling PSC • Nick Wright SDSC (P) • S. Levent Yilmaz* CSM, U. Pittsburgh (P) • Nick Nystrom PSC, XS-WG lead • Jay Alameda NCSA • Martin Berzins Univ. of Utah (U) • Paul Brown IU • Shawn Brown PSC • Lonnie Crosby NICS, IO/Workflows lead • Tim Dudek GIG EOT • Victor Eijkhout TACC • Jeff Gardner U. Washington (U) • Chris Hempel TACC • Ken Jansen RPI (U) • Shantenu Jha LONI • Nick Karonis NIU (G) • Dan Katz LONI • Ricky Kendall ORNL • Byoung-Do Kim TACC • Scott Lathrop GIG, EOT AD • Vickie Lynch ORNL • U: user; P: performance tool developer; G: grid infrastructure developer; *: joined XS-WG since last TG-ARCH update
Technical Challenge Area #1: Scalability and Architecture • Algorithms, numerics, multicore, etc. • Robust, scalable infrastructure (libraries, frameworks, languages) for supporting applications that scale to O(10^4–10^6) cores • Numerical stability and convergence issues that emerge at scale • Exploiting systems’ architectural strengths • Fault tolerance and resilience • Contributors • POC: Raghu Reddy (PSC) • Members: Reddy, Majumdar, Urbanic, Kim, Lynch, Jha, Nystrom • Current activities • Understanding performance tradeoffs in hierarchical architectures • e.g., partitioning between MPI and OpenMP for different node architectures, interconnects, and software stacks (a minimal initialization sketch follows below) • candidate codes for benchmarking: HOMB, WRF, perhaps others • Characterizing bandwidth-intensive communication performance
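As context for the MPI/OpenMP partitioning bullet above, the following is a minimal, hypothetical C sketch (not an XS-WG code): it requests MPI_THREAD_FUNNELED support and reports the MPI-task × OpenMP-thread decomposition, which must be matched to the node's core count.

```c
/* Illustrative sketch (not XS-WG code): query the MPI thread-support level
 * and report the MPI-rank / OpenMP-thread partitioning that a hybrid
 * application would use on a multicore node. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls,
     * the usual model for loop-level MPI+OpenMP hybridization. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    if (rank == 0 && provided < MPI_THREAD_FUNNELED)
        printf("Warning: MPI library provides thread level %d only\n", provided);

    /* Ranks per node times threads per rank should match cores per node. */
    if (rank == 0)
        printf("%d MPI tasks x %d OpenMP threads per task\n",
               nranks, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```

In practice, the balance is set at job launch, e.g., via OMP_NUM_THREADS and the batch system's tasks-per-node setting.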
Investigating the Effectiveness of Hybrid Programming (MPI+OpenMP) • Begun in XS-WG, extended through the AUS effort in collaboration with Amit Majumdar • Examples of applications with hybrid implementations: WRF, POP, ENZO • Threading offers clear benefits for exploiting more memory per task. • But what about performance? • Prior results are mixed; pure MPI often seems at least as good. • Historically, systems had fewer cores/socket and fewer cores/node than we have today, and far fewer than they will have in the future. • Have OpenMP versions been as carefully optimized? • Reasons to look into hybrid implementations now • Current Track 2 systems have 8–16 cores per node. • Are we at the tipping point where threading offers a win? If not, is there one, at what core count, and for which kinds of algorithms? • What is the potential for performance improvement?
Hybrid OpenMP-MPI Benchmark (HOMB) • Developed by Jordan Soyke while a student intern at PSC and subsequently enhanced by Raghu Reddy • Simple benchmark code • Permits systematic evaluation by • Varying the computation-communication ratio • Varying message sizes • Varying the MPI vs. OpenMP balance • Allows characterization of performance bounds • Characterizing the potential hybrid performance of an actual application is possible with adequate understanding of its algorithms and their implementations.
Characteristics of the Benchmark • Perfectly parallel in both MPI and OpenMP • Perfectly load balanced • Distinct computation and communication sections • Only nearest-neighbor communication • Currently no reduction operations • No overlap of computation and communication • Can easily vary the computation/communication ratio • Current tests use large messages • A minimal structural sketch appears below.
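The sketch below illustrates the structure described above; it is a hypothetical illustration in C, not the actual HOMB source. The constants N, NMSG, and NITER are placeholders one would vary to change the computation/communication ratio and the message size.

```c
/* Illustrative sketch of an HOMB-like kernel (hypothetical): a distinct
 * OpenMP compute phase followed by a nearest-neighbor MPI exchange, with
 * the local work size (N) and message size (NMSG) adjustable to vary the
 * computation/communication ratio. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define N     1000000   /* local work array size (assumed)   */
#define NMSG  100000    /* message size in doubles (assumed) */
#define NITER 100       /* outer iterations (assumed)        */

int main(int argc, char **argv)
{
    int rank, nranks, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double *work    = calloc(N, sizeof(double));
    double *sendbuf = calloc(NMSG, sizeof(double));
    double *recvbuf = calloc(NMSG, sizeof(double));
    int left  = (rank - 1 + nranks) % nranks;
    int right = (rank + 1) % nranks;

    for (int iter = 0; iter < NITER; iter++) {
        /* Computation phase: perfectly parallel across OpenMP threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            work[i] = work[i] * 0.5 + 1.0;

        /* Communication phase: nearest-neighbor exchange only, issued by
         * the master thread (funneled model); no overlap with compute. */
        MPI_Sendrecv(sendbuf, NMSG, MPI_DOUBLE, right, 0,
                     recvbuf, NMSG, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(work); free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```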
Preliminary Results on Kraken: MPI vs. MPI+OpenMP, 12 threads/node • The hybrid approach provides an increasing performance advantage as the communication fraction increases… • … for the current core count per node. • Non-threaded sections of an actual application would incur an Amdahl's Law penalty; these results constitute a best-case limit (see the worked example below). • Hybrid could be beneficial for other reasons: • The application has limited scalability because of its decomposition • The application needs more memory • The application has dynamic load imbalance
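A rough worked Amdahl's Law example shows why non-threaded sections cap hybrid gains (the 90% threaded fraction used here is an assumed value for illustration, not a measurement): if a fraction p of runtime is threadable and T OpenMP threads are used per MPI task, the per-task speedup is bounded by

$$S(T) = \frac{1}{(1-p) + p/T}$$

For p = 0.9 and T = 12 (one MPI task per Kraken node), S(12) = 1/(0.1 + 0.9/12) ≈ 5.7, well short of the ideal factor of 12, which is why the benchmark results above represent a best case.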
Technical Challenge Area #2:Tools • Performance tools, debuggers, compilers, etc. • Evaluate strengths and interactions; ensure adequate installations • Analyze/address gaps in programming environment infrastructure • Provide advanced guidance to RP consultants • Contributors • POC: Mahin Mahmoodi (PSC) • Members: Mahmoodi, Wright, Alameda, Shende, Sheppard, Brown, Nystrom • Current activities • Focus on testing debuggers and performance tools at large core counts • Ongoing, excellent collaboration between SDCI tool projects, plus consideration of complementary tools • Submission for a joint POINT/IPM tools tutorial to TG09 • Installing and evaluating strengths of tools as they apply to complex production applications
Collaborative Performance Engineering Tutorials • TG09: Using Tools to Understand Performance Issues on TeraGrid Machines: IPM and the POINT Project (June 22, 2009) • Karl Fuerlinger (UC Berkeley), David Skinner (NERSC/LBNL), Nick Wright (then SDSC), Rui Liu (NCSA), Allen Malony (Univ. of Oregon), Haihang You (UTK), Nick Nystrom (PSC) • Analysis and optimization of applications on the TeraGrid, focusing on Ranger and Kraken. • SC09: Productive Performance Engineering of Petascale Applications with POINT and VI-HPS (Nov. 16, 2009) • Allen Malony and Sameer Shende (Univ. of Oregon), Rick Kufrin (NCSA), Brian Wylie and Felix Wolf (JSC), Andreas Knuepfer and Wolfgang Nagel (TU Dresden), Shirley Moore (UTK), Nick Nystrom (PSC) • Addresses performance engineering of petascale scientific applications with TAU, PerfSuite, Scalasca, and Vampir. • Includes hands-on exercises using a Live-DVD containing all of the tools, helping to prepare participants to apply modern methods for locating and diagnosing typical performance bottlenecks in real-world parallel programs at scale.
Technical Challenge Area #3: Workflow, data transport, analysis, visualization, and storage • Coordinating massive simulations, analysis, and visualization • Data movement between RPs involved in complex simulation workflows; staging data from HSM systems across the TeraGrid • Technologies and techniques for in situ visualization and analysis • Contributors • POC: Lonnie Crosby (NICS) • Members: Crosby, Welling, Nystrom • Current activities • Focus on I/O profiling and determining platform-specific recommendations for obtaining good performance for common parallel I/O scenarios (a minimal MPI-IO sketch follows below)
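As an illustration of one common parallel I/O scenario this group profiles, the sketch below shows a collective MPI-IO write to a shared file. The file name output.dat and the per-rank size NLOCAL are placeholders; real platform-specific recommendations would also cover striping settings and MPI-IO hints, which are not shown here.

```c
/* Illustrative sketch (placeholder names/sizes): each rank writes a
 * contiguous block of a shared file with a collective MPI-IO call, a
 * common pattern when tuning parallel I/O on large parallel file systems. */
#include <mpi.h>
#include <stdlib.h>

#define NLOCAL 1048576   /* doubles per rank (assumed) */

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(NLOCAL * sizeof(double));
    for (int i = 0; i < NLOCAL; i++) buf[i] = (double)rank;

    /* Each rank targets its own offset in the shared file; the collective
     * write lets the MPI-IO layer aggregate requests across ranks. */
    MPI_Offset offset = (MPI_Offset)rank * NLOCAL * sizeof(double);
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, offset, buf, NLOCAL, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}
```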
Co-organized a Workshop on Enabling Data-Intensive Computing: from Systems to Applications • July 30–31, 2009, University of Pittsburgh, http://www.cs.pitt.edu/~mhh/workshop09/index.html • 2 days: presentations, breakout discussions • architectures • software frameworks and middleware • algorithms and applications • Speakers • John Abowd - Cornell University • David Andersen - Carnegie Mellon University • Magda Balazinska - University of Washington • Roger Barga - Microsoft Research • Scott Brandt - University of California, Santa Cruz • Mootaz Elnozahy - International Business Machines • Ian Foster - Argonne National Laboratory • Geoffrey Fox - Indiana University • Dave O'Hallaron - Intel Research • Michael Wood-Vasey - University of Pittsburgh • Mazin Yousif - The University of Arizona • Taieb Znati - The National Science Foundation • From R. Kouzes et al., The Changing Paradigm of Data-Intensive Computing, IEEE Computer, January 2009
Next TeraGrid/Blue Waters Extreme-Scale Computing Workshop • To focus on parallel I/O for petascale applications, addressing: • multiple levels of applications, middleware (HDF, MPI-IO, etc.), and systems • requirements for data transfers to/from archives and remote processing and management facilities • Tentatively scheduled for the week of March 22, 2010, in Austin • Builds on the preceding Petascale Application Workshops: • December 2007, Tempe: general issues of petascale applications • June 2008, Las Vegas: continued discussion of general petascale application issues • March 2009, Albuquerque: fault tolerance and resilience; included significant participation from NNSA, DOE, and DoD