Using XDMoD to Facilitate XSEDE Operations, Planning and Analysis.
Thomas R. Furlani (1), Barry I. Schneider (2), Matthew D. Jones (1), John Towns (3), David L. Hart (4), Steven M. Gallo (1), Robert L. DeLeon (1), Charng-Da Lu (1), Amin Ghadersohi (1), Ryan J. Gentner (1), Abani K. Patra (5), Gregor von Laszewski (6), Fugang Wang (6), Jeffrey T. Palmer (1), Nikolay Simakov (1)
(1) Center for Computational Research, University at Buffalo, SUNY; (2) CISE - Advanced Computing Infrastructure, National Science Foundation; (3) NCSA - University of Illinois; (4) National Center for Atmospheric Research; (5) Mech. & Aerospace Eng. Dept., University at Buffalo, SUNY; (6) Pervasive Technology Institute - Indiana University
Tom Furlani, PhD, Director - Center for Computational Research, University at Buffalo, SUNY • XSEDE13, July 22-25, 2013
Outline • Overview of Technology Audit Service (XDMoD) • XDMoD Case Studies • Data Driven CI Planning for XSEDE • System Operation and Maintenance • Interpreting XDMoD Data • Future XDMoD Functionality • SUPReMM (Lightning Talk – Wed, 3PM, Marina Ballroom F&G) • PEAK (NICS) (Optimizing Utilization Across XSEDE – Thurs, 8:30AM, Marina Ballroom G) • Scientific Impact and Open Source Version (XDMoD TAS BOF – Wed, 6PM, Palomar)
Co-Authors • Barry I. Schneider (NSF) • Matthew D. Jones (UB) • John Towns (NCSA) • David L. Hart (NCAR) • Steven M. Gallo (UB) • Robert L. DeLeon (UB) • Charng-Da Lu (UB) • Amin Ghadersohi (UB) • Ryan J. Gentner (UB) • Abani K. Patra (UB) • Gregor von Laszewski (Indiana) • Fugang Wang (Indiana) • Jeffrey T. Palmer (UB) • Nikolay Simakov (UB)
Motivation • Example: Log File Analysis Discovers Two Malfunctioning Nodes • Measuring utilization of cyberinfrastructure (CI) shows how resources are actually being used • HPC systems are a complex combination of software, processors, memory, networks, and storage systems, so it is difficult to know whether optimal performance is being realized, or even whether all subcomponents are functioning properly
XSEDE Technology Audit Service (TAS) • Provide Auditing and Quality of Service (QoS) Metrics • Primary components of TAS • XDMoD: XSEDE Metrics on Demand Portal • Analytics Framework for XSEDE • Display results of all metrics (utilization, wait time, etc.) • Easy to use • Application Kernel Framework • Measure performance of XSEDE infrastructure • Diagnostic set of tools for early identification of system problems • Broader Impact • Open source framework for academic HPC centers • Organizations • Buffalo, Indiana (Laszewski), Michigan (Finholt), UT-NICS (You)
XDMoD: XD Metrics on Demand Portal • Display metrics, Role Based, Custom Report Builder
XDMoD Case Studies • Data Driven CI Planning for XSEDE • System Operation and Maintenance • Interpreting XDMoD Data
Data Driven CI Planning for XSEDE • Largest, average and total SU allocations on XSEDE over time. Average and largest allocations have increased by more than a factor of 10 over the time period
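The three statistics named on this slide can be sketched as a per-year aggregation. This is only an illustration: the function name `allocation_stats_by_year`, the `(year, service_units)` record layout, and the sample values are assumptions made for the example, not XDMoD's schema or real XSEDE data.

```python
from collections import defaultdict

def allocation_stats_by_year(allocations):
    """Group (year, service_units) records by year and return the
    largest, average, and total allocation for each year."""
    by_year = defaultdict(list)
    for year, service_units in allocations:
        by_year[year].append(service_units)
    return {
        year: {
            "largest": max(values),
            "average": sum(values) / len(values),
            "total": sum(values),
        }
        for year, values in sorted(by_year.items())
    }

# Hypothetical allocation records, invented for illustration only.
sample_allocations = [
    (2008, 100_000), (2008, 300_000),
    (2012, 1_000_000), (2012, 5_000_000),
]
stats = allocation_stats_by_year(sample_allocations)
```

Comparing `stats[2012]` with `stats[2008]` gives the kind of growth factor the slide describes.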
Data Driven CI Planning for XSEDE • Total service unit usage by parent science: Molecular Bioscience usage has grown over time and now rivals that of Physics
Data Driven CI Planning for XSEDE • However, average core count varies widely across parent sciences: Molecular Bioscience jobs tend to use a relatively small number of processors
CI System Operation and Maintenance • Application kernels help detect user environment anomalies at CCR • Example: Performance variation of NWChem due to a bug in a commercial parallel file system that was subsequently fixed by the vendor
CI System Operation and Maintenance • Sudden decrease in file system performance on TACC Lonestar4 as measured by 3 different application kernels (IOR, MPI-Tile-IO, and IMB)
CI System Operation and Maintenance • Application kernel control process automatically detects underperforming application kernels. The red zone indicates an application kernel that is underperforming
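A minimal sketch of such a check, assuming a simple control-chart rule: a run is flagged when its measured performance falls more than a chosen number of standard deviations below the mean of a baseline period. The slides do not describe the actual XDMoD algorithm, so the rule, the `n_sigma` threshold, the function name, and the sample values are all hypothetical.

```python
from statistics import mean, stdev

def flag_underperforming(baseline, recent, n_sigma=3.0):
    """Return the indices of recent runs that fall below
    mean(baseline) - n_sigma * stdev(baseline).

    Assumes higher values mean better performance (for example,
    bandwidth), and that baseline holds at least two runs.
    """
    lower_limit = mean(baseline) - n_sigma * stdev(baseline)
    return [i for i, value in enumerate(recent) if value < lower_limit]

# Hypothetical application kernel results, invented for illustration.
baseline_runs = [100.0, 102.0, 98.0, 101.0, 99.0]
recent_runs = [100.0, 97.0, 60.0, 99.5]
flagged = flag_underperforming(baseline_runs, recent_runs)
```

Here only the third recent run (index 2) falls below the lower control limit, which corresponds to a point landing in the red zone.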
Interpreting XDMoD Data • As with any analysis system, care must be exercised in interpreting data from XDMoD • Ex. Distribution of job sizes for all parent science Physics jobs on XSEDE resources for the period 2008-2012
Interpreting XDMoD Data • Mean core count for Physics jobs on XSEDE resources for the period 2008-2012, including (blue line) and excluding (red line) serial runs • Figure panel: Number of Serial Physics Jobs by Resource • Figure annotation: High Throughput Jobs Start at Purdue
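The effect of serial jobs on the mean can be seen with a small sketch. The job list is invented for illustration, and `mean_core_count` is a hypothetical helper, not part of XDMoD.

```python
def mean_core_count(core_counts, include_serial=True):
    """Mean cores per job, optionally ignoring serial (1-core) jobs."""
    if include_serial:
        jobs = list(core_counts)
    else:
        jobs = [cores for cores in core_counts if cores > 1]
    if not jobs:
        return 0.0
    return sum(jobs) / len(jobs)

# Hypothetical workload: many serial high-throughput jobs, a few large parallel jobs.
job_cores = [1] * 8 + [256, 512]
mean_all = mean_core_count(job_cores)
mean_parallel = mean_core_count(job_cores, include_serial=False)
```

With this made-up workload, the mean over all jobs is far lower than the mean over parallel jobs only, even though the large jobs are unchanged. A surge of serial jobs on one resource can therefore pull the overall mean down without any change in how parallel jobs are run.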
Future XDMoD Functionality: SUPReMM • SUPReMM (Lightning Talk – Wed, 3PM) • Collaboration with TACC and U Texas at Austin • Comprehensive job level resource use measurement for large clusters • Will supply XDMoD with some missing job usage data – application run, memory, local I/O, network, file-system, and CPU usage • Sample application report for Lonestar4
Future XDMoD Functionality: PEAK • NICS – PEAK (Thursday, 8:30AM) • Optimizing Utilization Across XSEDE (Dr. Haihang You) • Performance Environment Autoconfiguration FrameworK • UT-NICS project to automatically tune key libraries and application kernels • Ex. Performance of Amber on Kraken – Amber built with PGI much faster
Future XDMoD Functionality: Open Source XDMoD & Scientific Impact • Open Source Version: (XDMoD BOF - Wed, 6PM) • XDMoD functionality for non-XSEDE HPC centers • Installation by system administrators • Programming not required • Guided textual installation process • Installation support provided by TAS Team • Pre-existing central database not required • Aggregate data from available sources • Resource manager log files or existing database • Currently recruiting for beta-testing program • Scientific Impact • Preliminary XSEDE-based H-Index
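The slide does not say how the XSEDE-based H-index is computed. As a point of reference, the standard h-index definition (the largest h such that h publications each have at least h citations) can be sketched as below; the function name and the citation counts are hypothetical.

```python
def h_index(citation_counts):
    """Standard h-index: the largest h such that at least h items
    have h or more citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for publications linked to one project.
project_citations = [10, 8, 5, 4, 3]
project_h = h_index(project_citations)
```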
Acknowledgement • This work was sponsored by NSF under grant number OCI 1025159 for the development of the Technology Audit Service for XSEDE. • Contact Info • furlani@buffalo.edu • XDMoD: https://xdmod.ccr.buffalo.edu/ • xdmod-support@ccr.buffalo.edu