The Case for Monitoring and Testing David Montoya CScADS July 15, 2013 LA-UR-13-25132
From a Production Computing Perspective Where do traditional performance analysis tools fit in the process and what is the usage model? • Low use; usage entry cost and skill required are barriers What is the usage model that will increase awareness and drive both application and environment efficiency? • Monitor health of both applications and system resources • Baseline and track • Proper balance of tools to track and probe
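To make "baseline and track" concrete, a minimal sketch under stated assumptions: establish per-metric baselines from historical samples, then flag later samples that drift. The metric names, sample values, and 3-sigma threshold are illustrative, not drawn from the slides.

```python
# Minimal sketch of "baseline and track": build a per-metric baseline from
# historical samples, then flag current samples that deviate from it.
# Metric names and the 3-sigma threshold are illustrative assumptions.
from statistics import mean, stdev

def build_baseline(history):
    """history: {metric: [samples]} -> {metric: (mean, stdev)}"""
    return {name: (mean(vals), stdev(vals)) for name, vals in history.items()}

def check_sample(baseline, sample, n_sigma=3.0):
    """Return metrics whose current value deviates from the baseline."""
    anomalies = {}
    for name, value in sample.items():
        mu, sigma = baseline[name]
        if sigma > 0 and abs(value - mu) > n_sigma * sigma:
            anomalies[name] = (value, mu)
    return anomalies

history = {"node_mem_used_gb": [12.1, 11.8, 12.4, 12.0],
           "ib_retransmits": [3, 5, 4, 2]}
baseline = build_baseline(history)
print(check_sample(baseline, {"node_mem_used_gb": 31.7, "ib_retransmits": 4}))
```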
Target Usage – Monitoring and Testing User • Understand how applications are utilizing platform resources • Diagnose problems • Adjust mapping of processes onto resources to optimize for: minimum resource use, minimum power consumption, shortest run-time System/Software Administrators • Diagnose problems / Discover root causes • Ensure health and balance of the system • Mitigate effects of errors • Develop better utilization policies for all resources System Architects • Develop a deep understanding of interactions between system components (hardware, firmware, system software, application) • Develop new architectural features to address current shortcomings
Current State of Affairs • It is no longer enough to analyze the performance of the application alone. A wide range of evolving node/processor architectures forces closer assessment of the environment. • With the increasing scale of resources and the compute environment, machine failure rates come to the forefront (MTTF / MTTI). • New resources such as burst buffers, file system architectures, IO approaches (PLFS), tools, programming models, etc. impact resource utilization and performance. • Issues such as power management are having a larger impact.
Moving toward tighter integration • As scale increases, the computing architecture becomes more integrated with sub-systems that provide services. Distributed approaches for those services are evolving. • Additional run-time systems that are more tightly integrated are evolving. • We have come full circle: compute environments are no longer individual components or loosely coupled systems but architected systems that need to behave in a more holistic manner. • The focus of HPC performance analysis capability needs to move from application performance to the application's ability to perform in a given computing environment, and to the environment's performance. • This is a move toward balance and resource utilization, and it targets application flexibility.
The current tool box and evolution • Typical monitoring systems target failure detection, uptime, and resource state/trend overview: • Information targeted to system administration • Collection intervals of minutes • Relatively high overhead (both compute node and aggregators) • Application profiling/debugging/tracing tools: • Sub-second (even sub-millisecond) collection intervals • Typically require linking (i.e., tools may perturb the application profile) • Limits on scale • Do not account for external applications competing for the same resource • Monitoring tool example: Lightweight Distributed Metric Service (LDMS): • Continuous data collection, transport, and storage as a system service • Targets system administrators, users, and applications • Enables collection of a reasonably large number of metrics, with collection periods that allow job-centric resource utilization analysis and run-time anomaly detection • Variable collection period (~seconds) • On-node interface to run-time data
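A minimal sketch of the continuous, node-local collection model described above. This is not the LDMS interface; the /proc fields, one-second period, and CSV output are assumptions for illustration (Linux only).

```python
# Sketch of a node-local sampler in the spirit of continuous monitoring:
# read a few /proc counters on a fixed period and emit timestamped records.
# Not the LDMS API; field choices and CSV output are assumptions.
import time

def read_meminfo():
    """Return MemTotal/MemFree (kB) from /proc/meminfo (Linux only)."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in ("MemTotal", "MemFree"):
                fields[key] = int(rest.split()[0])
    return fields

def sample_loop(period_s=1.0, count=5):
    """Collect a handful of samples; a real collector would run as a service."""
    for _ in range(count):
        mem = read_meminfo()
        load1 = float(open("/proc/loadavg").read().split()[0])
        print(f"{time.time():.0f},{load1},{mem['MemFree']},{mem['MemTotal']}")
        time.sleep(period_s)

if __name__ == "__main__":
    sample_loop()
```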
How do you move forward? Data and integration. • You need to understand the health of the system and where there is stress, and tie it back to application behavior. • This keeps aspects of traditional application analysis but adds system monitoring of all key subsystems, with the ability to assess the impact of application behavior and resource interaction. • Integration of the data provides assessment of the application and the various subsystems, and then the ability to apply solutions to better balance, enact efficiencies, and establish throughput. Monitoring and Testing • Collect system and subsystem data: network, file systems, compute nodes, resource manager data, etc. • Currently collaborating with monitoring tools development (SNL, others). Taking inventory via Monitoring and Testing Summit.
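One way to picture the integration step, tying node-level samples back to job behavior: join resource-manager job records with node metric samples to get per-job utilization. The job records, sample schema, and utilization calculation below are hypothetical.

```python
# Sketch: join resource-manager job records with node metric samples to get
# per-job CPU utilization over each job's time window. Schemas are assumed.
jobs = [
    {"jobid": "1001", "nodes": ["nid0001", "nid0002"], "start": 100, "end": 200},
]
samples = [  # (timestamp, node, cpu_busy_fraction)
    (110, "nid0001", 0.95), (110, "nid0002", 0.40),
    (150, "nid0001", 0.97), (150, "nid0002", 0.42),
]

def job_utilization(job, samples):
    """Average cpu_busy over samples inside the job's window and node set."""
    vals = [u for (t, node, u) in samples
            if job["start"] <= t <= job["end"] and node in job["nodes"]]
    return sum(vals) / len(vals) if vals else None

for job in jobs:
    print(job["jobid"], job_utilization(job, samples))
```

In practice the job window and node list would come from resource manager logs and the samples from the monitoring data store.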
LANL Monitoring and Testing Summit Monitoring / Testing Frameworks • Splunk • Zenoss • RabbitMQ • LDMS framework • Monitoring Infrastructure • OVIS – HPC system analysis • Gazebo Testing Framework • CTS Testing Framework Application: • MTT OpenMPI testing • Darshan IO analysis • EAP and LAP dashboards • ByFL Network: • IB Performance monitoring • IB Monitoring • IB Error monitoring • ibperf_seq, ibperf_ring, ibperf_agg, mpiring • IDS Project (security) • Network Monitoring in Splunk • DISCOM Testing • Trilab Data Transfer Cat 2 function/performance testing
LANL Monitoring and Testing Summit – cont. File Systems: • File Systems Monitoring in Splunk • New System Integration testing • FS tools, file system tools for the users • File system tree walk • File System Health Check • Splunk FTA monitoring • PLFS Regression and Performance Testing • Panfs Release File System Testing and Analysis • HPSS Monitoring Cluster/Node: • Baler, log file analysis tool • LDMS node collection • Automatic Library Tracking Database (ALTD) • General software usage tracking • Cielo DRAM and SRAM monitoring • HPCSTATs (reporting more than monitoring) • Moab Logs • CBTF-based GPU/Nvidia monitoring • GPU/Cluster Testing • SecMon / Security Monitoring via Zenoss • Splunk Cluster Testing • New System Integration • Post DST / Utilization testing • Software testing
Next Steps • Assess efforts, integrate • Assess data, integrate • Assess information view to target users, integrate • Start over