Extreme Scalability Working Group (XS-WG): Status Update
Nick Nystrom
Director, Strategic Applications
Pittsburgh Supercomputing Center
May 20, 2010
Extreme Scalability Working Group (XS-WG): Purpose
• Meet the challenges and opportunities of deploying extreme-scale resources into the TeraGrid, maximizing both scientific output and user productivity.
• Aggregate, develop, and share wisdom.
• Identify and address needs that are common to multiple sites and projects.
  • May require assembling teams and obtaining support for sustained effort.
• XS-WG benefits from active involvement of all Track 2 sites, Blue Waters, tool developers, and users.
• The XS-WG leverages and combines RPs' interests to deliver greater value to the computational science community.
XS-WG Participants
• Nick Nystrom, PSC, XS-WG lead
• Jay Alameda, NCSA
• Martin Berzins, Univ. of Utah (U)
• Paul Brown, IU
• Lonnie Crosby, NICS, IO/Workflows lead
• Tim Dudek, GIG EOT
• Victor Eijkhout, TACC
• Jeff Gardner, U. Washington (U)
• Chris Hempel, TACC
• Ken Jansen, RPI (U)
• Shantenu Jha, LONI
• Nick Karonis, NIU (G)
• Dan Katz, U. of Chicago
• Ricky Kendall, ORNL
• Byoung-Do Kim, TACC
• Scott Lathrop, GIG, EOT AD
• Vickie Lynch, ORNL
• Amit Majumdar, SDSC, TG AUS AD
• Mahin Mahmoodi, PSC, Tools lead
• Allen Malony, Univ. of Oregon (P)
• David O'Neal, PSC
• Dmitry Pekurovsky, SDSC
• Wayne Pfeiffer, SDSC
• Raghu Reddy, PSC, Scalability lead
• Sergiu Sanielevici, PSC
• Sameer Shende, Univ. of Oregon (P)
• Ray Sheppard, IU
• Alan Snavely, SDSC
• Henry Tufo, NCAR
• George Turner, IU
• John Urbanic, PSC
• Joel Welling, PSC
• Nick Wright, NERSC (P)
• S. Levent Yilmaz, CSM, U. Pittsburgh (P)
U: user; P: performance tool developer; G: grid infrastructure developer; *: joined XS-WG since last TG-ARCH update
Technical Challenge Area #1: Scalability and Architecture
• Algorithms, numerical methods, multicore performance, etc.
  • Robust, scalable infrastructure (libraries, frameworks, languages) for supporting applications that scale to O(10^4–10^6) cores
  • Numerical stability and convergence issues that emerge at scale
  • Exploiting systems' architectural strengths
  • Fault tolerance and resilience
• Contributors
  • POC: Raghu Reddy (PSC)
• Recent and ongoing activities: hybrid MPI+OpenMP performance (see the sketch below)
  • Raghu submitted a technical paper to TG10 with Annick Pouquet
  • Synergy with AUS; work by Wayne Pfeiffer and Dmitry Pekurovsky
  • Emphasis on documenting and disseminating guidance
  • Raghu's work on the HOMB benchmark; Pfeiffer, Pekurovsky, and others
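As a concrete illustration of the hybrid MPI+OpenMP pattern this activity studies, here is a minimal sketch of a Jacobi-style stencil sweep. This is not the HOMB code itself; the array sizes, the kernel, and the omission of the halo exchange are illustrative assumptions.

```c
/* Minimal hybrid MPI+OpenMP sketch (Jacobi-like sweep); illustrative,
 * not HOMB itself. Halo exchange between ranks is omitted for brevity. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1024  /* local rows per MPI rank (assumed size) */
#define M 1024  /* columns (assumed size) */

static double u[N][M], unew[N][M];

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request threaded MPI; FUNNELED suffices when only the master
     * thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI distributes rows across nodes; OpenMP parallelizes the
     * on-node sweep. */
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < M - 1; j++)
            unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                 u[i][j-1] + u[i][j+1]);

    if (rank == 0)
        printf("sweep done with %d OpenMP threads per rank\n",
               omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```

Varying the balance of MPI ranks per node versus OpenMP threads per rank in a kernel like this is exactly the kind of experiment the hybrid-performance work above documents.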
Technical Challenge Area #2: Tools
• Performance tools, debuggers, compilers, etc.
  • Evaluate strengths and interactions; ensure adequate installations
  • Analyze and address gaps in programming environment infrastructure
  • Provide advanced guidance to RP consultants
• Contributors
  • POC: Mahin Mahmoodi (PSC)
• Recent and ongoing activities: reliable tool installations
  • Nick and Mahin visited NICS in December to give a seminar on performance engineering and tool use
  • Mahin and NICS staff developed efficient, sustainable procedures and policies for keeping tool installations up to date and functional
  • Ongoing application of performance tools at scale to complex applications, to verify that the tools function correctly and to identify and remove problems (a manual-instrumentation sketch follows below)
  • Nick, Sameer, Rui Liu, and Dave Cronk co-presented a performance engineering tutorial at LCI10 (March 8, 2010, Pittsburgh)
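For illustration, a minimal sketch of manual instrumentation with TAU's C macro API, assuming the program is built with TAU's compiler wrappers (e.g. tau_cc.sh) so the macros are active; the function and timer names are placeholders:

```c
/* Minimal sketch of manual TAU instrumentation (assumes TAU's
 * documented macro API; names are placeholders). */
#include <TAU.h>

void solver_step(void)
{
    TAU_START("solver_step");   /* begin a named timer */
    /* ... application work ... */
    TAU_STOP("solver_step");    /* end the timer */
}

int main(int argc, char **argv)
{
    TAU_PROFILE_INIT(argc, argv);
    TAU_PROFILE_SET_NODE(0);    /* single process; MPI ranks are handled by TAU's wrappers */
    solver_step();
    return 0;
}
```

A run instrumented this way writes per-process profile files that can be examined with TAU's paraprof viewer.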
Collaborative Performance Engineering Tutorials
• SC09: Productive Performance Engineering of Petascale Applications with POINT and VI-HPS (November 16, 2009)
  • Allen Malony and Sameer Shende (Univ. of Oregon), Rick Kufrin (NCSA), Brian Wylie and Felix Wolf (JSC), Andreas Knuepfer and Wolfgang Nagel (TU Dresden), Shirley Moore (UTK), Nick Nystrom (PSC)
  • Addressed performance engineering of petascale scientific applications with TAU, PerfSuite, Scalasca, and Vampir
  • Included hands-on exercises using a Live-DVD containing all of the tools, helping to prepare participants to apply modern methods for locating and diagnosing typical performance bottlenecks in real-world parallel programs at scale
• LCI10: Using POINT Performance Tools: TAU, PerfSuite, PAPI, Scalasca, and Vampir (March 8, 2010)
  • Sameer Shende (Univ. of Oregon), David Cronk (Univ. of Tennessee at Knoxville), Nick Nystrom (PSC), and Rui Liu (NCSA)
  • Targeted multicore performance issues (a hardware-counter sketch follows below)
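The tutorials cover hardware-counter measurement with PAPI, among other tools; the following is a minimal sketch using PAPI's low-level C API. The chosen events are illustrative, and event availability varies by platform.

```c
/* Minimal sketch of counter measurement with PAPI's low-level API;
 * event choices are illustrative. */
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);

    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);  /* total cycles */
    PAPI_add_event(eventset, PAPI_L2_DCM);   /* L2 data cache misses */

    PAPI_start(eventset);
    /* ... region of interest (placeholder work below) ... */
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++) x += i * 0.5;
    PAPI_stop(eventset, counts);

    printf("cycles: %lld, L2 data misses: %lld\n", counts[0], counts[1]);
    return 0;
}
```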
Technical Challenge Area #3: Workflow, Data Transport, Analysis, Visualization, and Storage
• Coordinating massive simulations, analysis, and visualization
  • Data movement between RPs involved in complex simulation workflows; staging data from HSM systems across the TeraGrid
  • Technologies and techniques for in situ visualization and analysis
• Contributors
  • POC: Lonnie Crosby (NICS)
• Current activities
  • Extreme Scale I/O and Data Analysis Workshop
Extreme Scale I/O and Data Analysis Workshop
• March 22–24, 2010, Austin
  • http://www.tacc.utexas.edu/petascale-workshop/
  • Sponsored by the Blue Waters Project, TeraGrid, and TACC
• Builds on preceding Petascale Application Workshops
  • December 2007, Tempe, and June 2008, Las Vegas: petascale applications
  • March 2009, Albuquerque: fault tolerance and resilience; included significant participation from NNSA, DOE, and DoD
• 48 participants from 30 institutions
• 2 days of presentations and lively discussion
  • Application requirements; filesystems; I/O libraries and middleware; large-scale data management
Extreme Scale I/O and Data Analysis Workshop: Some Observations & Findings
• Users are doing parallel I/O using a variety of means
  • Rolling their own, HDF, netCDF, MPI-IO, ADIOS, …: no one size fits all (a collective MPI-IO sketch follows below)
• Data volumes can exceed the capability of analysis resources
  • e.g., ~0.5–1.0 TB per wall-clock day for certain climate simulations
• The greatest complaint was large variability in I/O performance
  • 2–10× slowdowns cited as common; 300× observed
  • The causes are well understood; how to avoid them is not.
• Potential research direction: extending schedulers to combine file information from submitted jobs with detailed knowledge of parallel filesystem characteristics might enable I/O quality of service and effective workload optimization.
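For reference, a minimal sketch of one of the approaches users cited, collective MPI-IO; the file name, sizes, and layout are illustrative. Collective writes let the MPI-IO layer aggregate the ranks' requests, one common way to mitigate the variability noted above.

```c
/* Minimal collective MPI-IO sketch: each rank writes a contiguous
 * block at its own offset. File name and sizes are illustrative. */
#include <mpi.h>

#define LOCAL_N 1024  /* doubles written per rank (assumed) */

int main(int argc, char **argv)
{
    int rank;
    double buf[LOCAL_N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < LOCAL_N; i++) buf[i] = rank;  /* sample data */

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* The _all variant is collective, so the library can aggregate
     * requests across ranks before touching the filesystem. */
    MPI_Offset offset = (MPI_Offset)rank * LOCAL_N * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, LOCAL_N, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```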