Department of Defense High Performance Computing Modernization Program
Making HPC System Acquisition Decisions Is an HPC Application
Larry P. Davis and Cray J. Henry, Department of Defense High Performance Computing Modernization Program Office
Roy L. Campbell, Jr., U.S. Army Research Laboratory
William Ward, U.S. Army Engineer Research and Development Center
Allan Snavely and Laura Carrington, University of California at San Diego
November 2004
Overview • Program Background • Acquisition Methodology • Process • Benchmarks • Performance and price/performance scoring • System selection and optimization of program workload • Uncertainty analysis • Performance Prediction and Analysis
An Exciting Time for Benchmarking and Performance Modeling! • DOE PERC program • DoD benchmarking and performance modeling activities • DARPA HPCS Productivity Team benchmarking activities • HEC Revitalization Task Force Report • Joint Federal Agency benchmarking and acquisitions • Federal large benchmarking study • Federal Agencies, HPC User Forum, IDC
Current User Base and Requirements (as of November 2004)
• 561 projects and 4,572 users at approximately 179 sites
• Requirements categorized in 10 Computational Technology Areas (CTAs)
• FY 2005 non-real-time requirements of 260 teraFLOPS-years
Users by CTA: CFD – 1,135; FMS – 889; IMT – 568; CSM – 507; SIP – 435; CEA – 304; CCM – 235; CWO – 231; EQM – 170; CEN – 38; 60 users are self-characterized as "other"
Technology Insertion (TI) HPC System Acquisition Process • Annual process to purchase high performance computing capability for major shared resource centers (MSRCs) and allocated distributed centers • Total funding of $35M–$60M (~$50M in FY 2005) • Two of the four major shared resource centers provisioned each year on a rotating basis • TI-04 process upgraded HPC capabilities at the Army Research Laboratory and the Naval Oceanographic Office MSRCs • TI-05 process will upgrade HPC capabilities at Aeronautical Systems Center and Engineer Research and Development Center MSRCs
Technology Insertion 2005 (TI-05) Acquisition Process • Assess computational requirements • Determine application benchmarks and their weights • Develop acquisition process and evaluation criteria using GSA as acquisition agent • Execute Phase I RFQ and evaluation • Identification of promising HPC systems • Execute Phase II RFQ and evaluation • Construct best solution sets of systems • Purchase best overall solution set through GSA
TI-05 Evaluation Criteria
• Performance (quantitative)
   • Price/Performance
   • Raw Performance
• Usability (qualitative)
   • User Criteria
   • Center Criteria
• Confidence/Past Performance (qualitative)
   • Benchmarks (Subset)
Types of Benchmark Codes • Synthetic codes • Basic hardware and system performance tests • Meant to determine expected future performance and serve as a surrogate for workload not represented by the application codes • Scalable, quantitative synthetic tests are used for evaluation by the Performance Team; others are used as system performance checks and for qualitative evaluation by the Usability Team • A subset of synthetic tests needed for performance modeling is required • Application codes • Actual application codes as determined by requirements and usage • Meant to indicate current performance • Each application code (except two) has two test cases: standard and large
TI-05 Synthetic Benchmark Codes • I/O Tests • Include a simplified streaming test • Include a scalable I/O test • Operating System Tests • Measure the performance of system calls, interprocessor communication, and TCP scalability (now includes IPv4 and IPv6) • Memory Tests • Measure memory hierarchy performance, such as memory bandwidth (now includes multiple memory performance curves based on fraction of random strides in memory access) • Network Tests • Are a set of five MPI tests (point-to-point, broadcast, allreduce) • CPU Tests • Exercise multiple fundamental computation kernels, BLAS routines, and ScaLapack routines • PMaC Machine Probes • Exercise basic system functions to use in performance predictions (included in memory tests, network tests, and streaming I/O test)
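To make the network tests above concrete, here is a minimal MPI ping-pong probe of the kind such tests rely on. It is purely illustrative, not one of the HPCMP synthetic codes; the message sizes and repetition count are arbitrary choices.

```python
# Minimal MPI point-to-point "ping-pong" probe, in the spirit of the network
# synthetic tests described above (illustrative only; not an HPCMP benchmark).
# Run with two ranks, e.g.: mpiexec -n 2 python pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
REPS = 100

for nbytes in [8, 1024, 1024 * 1024]:
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=1)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=1)
    elapsed = MPI.Wtime() - t0
    if rank == 0:
        # One-way latency is half the round-trip time per repetition
        one_way = elapsed / (2 * REPS)
        print(f"{nbytes:>8} B  latency {one_way * 1e6:8.2f} us  "
              f"bandwidth {nbytes / one_way / 1e6:8.1f} MB/s")
```

The reported latency and bandwidth per message size are the raw quantities that point-to-point network tests summarize.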
TI-05 Application Benchmark Codes • Aero – Aeroelasticity CFD code (Fortran, serial vector, 15,000 lines of code) • AVUS (Cobalt-60) – Turbulent flow CFD code (Fortran, MPI, 19,000 lines of code) • GAMESS – Quantum chemistry code (Fortran, MPI, 330,000 lines of code) • HYCOM – Ocean circulation modeling code (Fortran, MPI, 31,000 lines of code) • OOCore – Out-of-core solver (Fortran, MPI, 39,000 lines of code) • RFCTH2 – Shock physics code (~43% Fortran/~57% C, MPI, 436,000 lines of code) • WRF – Multi-Agency mesoscale atmospheric modeling code (Fortran and C, MPI, 100,000 lines of code) • Overflow-2 – CFD code originally developed by NASA (Fortran 90, MPI, 83,000 lines of code)
Basic Rules for Application Benchmarks: Emphasis on Performance • Establish a DoD standard benchmark time for each application benchmark case • NAVO IBM Regatta P4 chosen as standard DoD system • Benchmark timings (at least three on each test case) are requested for systems that meet or beat the DoD standard benchmark times by at least a factor of two (preferably four) • Benchmark timings may be extrapolated provided they are guaranteed, but at least one actual timing on the offered or closely related system must be provided
Benchmark Scoring • Two major components of benchmark scoring: application codes and synthetic codes • Not all application codes need to be run, but the more a vendor runs, the greater its opportunity to be part of the final mix • Quantitatively scored synthetic tests are evaluated in a manner consistent with the application tests • Vendors are required to run a load-mix test in response to the Phase II RFQ • The weight for application code scores is greater than the weight for synthetic code scores in determining the price/performance score • It is essential that results be provided on all required synthetic tests, and very important that they be provided on the other tests
Use of Benchmark Data to Score Systems and Construct Alternatives
• Determine workload percentages by CTA
• Consider all alternatives that meet total cost constraints
• Partition CTA percentages among benchmark test cases
• Using benchmark scores, maximize workload for each alternative, subject to the constraint of matching the required CTA percentages
• Determine the price/performance score for each alternative and rank order
HPCMP System Performance (Unclassified)
[Chart of relative performance across the HPCMP systems; n = number of application test cases not included (out of 13 total) for each system]
How the Optimizer Works: Problem Description
Known:
• Application score matrix (score s of each machine on each application test case)
• Prices ($ per machine)
• Budget limits ($LO and $HI)
• Overall desired workload distribution (fraction f for each application test case)
• Allowed distribution deviation
Unknown:
• Optimal quantity set (quantity N of each machine)
• Workload distribution matrix (fraction p of each application test case allocated to each machine)
Problem Description • Offered Systems • Quantity is variable • Workload allocation is variable • Existing systems • Quantity is fixed • Workload allocation is variable
Motivation • Primary Goal: Find solution set with optimal (minimum) price/performance (and solution sets w/ price/performance within X% of optimal price/performance) • Secondary Goal: Determine optimal allocation of work for each application test case per machine
Optimization Scheme • Fix quantity of each machine • Mark quantity combinations that fall within acquisition price range (viable options) • Score each viable option (via SIMPLEX optimization kernel) • Divide life-cycle cost (acquisition price, maintenance, power, any costs over and above normal operations) by total performance • Rank results in ascending order
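As a concrete (and much simplified) sketch of this scheme, the code below scores a single fixed quantity combination: a linear program stands in for the SIMPLEX kernel, allocating workload across machines to maximize total delivered performance while matching the desired workload distribution within the allowed deviation, after which life-cycle cost is divided by that performance. All scores, prices, quantities, and distribution fractions are invented, and the real optimizer's exact formulation may differ.

```python
# Workload-allocation step for ONE candidate quantity set (illustrative only).
import numpy as np
from scipy.optimize import linprog

scores = np.array([[3.0, 2.0, 4.0],    # machine 0: score on test cases 0..2
                   [1.5, 3.5, 2.5]])   # machine 1
quantity = np.array([2, 3])            # fixed quantity of each machine
lifecycle_cost = np.array([4.0e6, 2.5e6]) @ quantity   # $ per machine * qty
f = np.array([0.5, 0.3, 0.2])          # desired workload distribution
dev = 0.02                             # allowed distribution deviation

M, T = scores.shape
nvar = M * T + 1                       # variables: x[m, t], then total perf P

def idx(m, t):
    return m * T + t

c = np.zeros(nvar); c[-1] = -1.0       # maximize P  ==  minimize -P

A_ub, b_ub = [], []
for m in range(M):                     # capacity: sum_t x[m, t] <= quantity[m]
    row = np.zeros(nvar); row[[idx(m, t) for t in range(T)]] = 1.0
    A_ub.append(row); b_ub.append(quantity[m])
for t in range(T):                     # |delivered_t - f[t] * P| <= dev * P
    hi = np.zeros(nvar); lo = np.zeros(nvar)
    for m in range(M):
        hi[idx(m, t)] = scores[m, t]; lo[idx(m, t)] = -scores[m, t]
    hi[-1] = -(f[t] + dev); lo[-1] = f[t] - dev
    A_ub += [hi, lo]; b_ub += [0.0, 0.0]

A_eq = np.zeros((1, nvar))             # P equals total delivered performance
for m in range(M):
    for t in range(T):
        A_eq[0, idx(m, t)] = scores[m, t]
A_eq[0, -1] = -1.0

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              A_eq=A_eq, b_eq=[0.0], bounds=[(0, None)] * nvar)
assert res.success
P = res.x[-1]
print(f"total performance = {P:.2f}, "
      f"price/performance = {lifecycle_cost / P:.3e} dollars per unit")
```

In the full process this evaluation is repeated for every viable quantity combination, and the resulting price/performance values are ranked in ascending order.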
Performance Modeling Uncertainty Analysis • Assumption: Uncertainties in measured performance values can be treated as uncertainties in measurements of physical quantities • For small, random uncertainties in measured values x, y, z, …, the uncertainty in a calculated function q(x, y, z, …) can be expressed as:
δq = sqrt[ (∂q/∂x · δx)² + (∂q/∂y · δy)² + (∂q/∂z · δz)² + … ]
• Systematic errors need careful consideration since they cannot be calculated analytically
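A small numerical illustration of this propagation formula, applied to price/performance q = C/P (life-cycle cost divided by total performance). The 8% cost and 4% performance uncertainties are example values only; the 8% figure mirrors the life-cycle-cost uncertainty assumed on a later slide.

```python
# Propagate uncertainties in cost C and performance P into price/performance.
import math

def propagated_uncertainty(terms):
    """terms: list of (partial derivative dq/dx, uncertainty dx) pairs."""
    return math.sqrt(sum((dq_dx * dx) ** 2 for dq_dx, dx in terms))

C, dC = 50.0e6, 0.08 * 50.0e6   # life-cycle cost and its uncertainty (8%)
P, dP = 1200.0, 0.04 * 1200.0   # total performance and its uncertainty (4%)

q = C / P                        # price/performance
dq = propagated_uncertainty([(1.0 / P, dC),        # dq/dC = 1/P
                             (-C / P ** 2, dP)])   # dq/dP = -C/P^2
print(f"price/performance = {q:.0f} +/- {dq:.0f}  ({100 * dq / q:.1f}%)")
```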
Benchmarking and Performance Prediction Uncertainty Analysis • Overall goal: Understand and accurately estimate uncertainties in benchmarking and performance prediction calculations • Develop uncertainty equations from analytical expressions used to calculate performance and price/performance • Estimate uncertainties in quantities used for these calculations • Eventual goal: propagate uncertainties in performance predictions and benchmarking results to determine uncertainties in acquisition scoring
Propagation of Uncertainties in Benchmarking and Performance Modeling
Benchmark Times → Benchmark Performance (power law least-squares fit) → Benchmark Scores (benchmark weights) → Average Performance for Each System → Total Performance for Solution Set (optimizer) → Price/Performance for Solution Set → Rank Ordering and Histograms of Solution Sets (averaging price/performance over spans of solution sets)
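The analytic propagation along this chain can also be cross-checked by Monte Carlo: perturb the benchmark times by their estimated uncertainties, re-run the scoring, and examine the spread in price/performance. The sketch below does this for a single system with a deliberately simplified scoring rule (score proportional to the DoD standard time divided by the measured time, with fixed weights); all times, weights, uncertainties, and the cost are invented.

```python
# Monte Carlo cross-check of the uncertainty propagation chain (toy numbers).
import numpy as np

rng = np.random.default_rng(0)
std_times = np.array([900.0, 1500.0, 600.0])      # DoD standard times (s)
times = np.array([310.0, 520.0, 180.0])           # measured times (s)
rel_unc = np.array([0.04, 0.05, 0.02])            # per-benchmark uncertainty
weights = np.array([0.5, 0.3, 0.2])               # benchmark weights
cost = 48.0e6                                     # life-cycle cost ($)

samples = []
for _ in range(10_000):
    # Perturb each benchmark time by its relative uncertainty
    t = times * (1.0 + rel_unc * rng.standard_normal(times.size))
    score = np.sum(weights * std_times / t)       # performance vs standard
    samples.append(cost / score)                  # price/performance
samples = np.array(samples)
print(f"price/performance = {samples.mean():.3e} +/- {samples.std():.3e}"
      f"  ({100 * samples.std() / samples.mean():.1f}%)")
```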
Uncertainties in Benchmark Times and Performance
• Benchmark Times → Benchmark Performance
• Uncertainties are obtained from replicated measurements or from the analytical performance prediction equation
Uncertainties in Performance via Power Law Fit
[Plot of ln(performance) versus ln(number of processors n): measured data points, the power law least-squares fit, the standard performance level, and the number of processors required to reach the standard performance (nSTD)]
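A sketch of the fit itself, assuming the usual power-law form performance = a · n^b: a least-squares fit in log-log space, then an inversion to find the processor count nSTD at which the standard performance is reached. The data points and the standard performance value are invented.

```python
# Power-law least-squares fit of performance versus processor count.
import numpy as np

n = np.array([16, 32, 64, 128])                 # processor counts
perf = np.array([10.2, 19.5, 36.0, 63.0])       # measured performance
perf_std = 50.0                                 # DoD standard performance

# Fit ln(perf) = b * ln(n) + ln(a)
b, ln_a = np.polyfit(np.log(n), np.log(perf), 1)
a = np.exp(ln_a)
n_std = (perf_std / a) ** (1.0 / b)             # invert perf_std = a * n_std**b
print(f"fit: perf = {a:.2f} * n^{b:.3f},  n_STD = {n_std:.1f} processors")

# The fit residuals (or the covariance of b and ln(a)) give uncertainties that
# propagate into n_STD via the propagation formula shown earlier.
```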
Propagation of Uncertainties in Benchmarking and Performance Modeling (with estimated stage uncertainties)
Same chain as above: Benchmark Times → Benchmark Performance (power law least-squares fit) → Benchmark Scores (benchmark weights) → Average Performance for Each System → Total Performance for Solution Set (optimizer) → Price/Performance for Solution Set → Rank Ordering and Histograms of Solution Sets (averaging price/performance over spans of solution sets)
Estimated uncertainties at the individual stages range from about 2% to 5% (4–5%, 4–5%, 2–5%, ~4%, ~3%, 2–5%)
Architecture % Selection by Processor Quantity for Varying Percentages Off the Best Price/Performance (Example)
Uncertainties in Performance Scores for Various Uncertainties in Benchmark Times (Example)
¹Assigns an 8% uncertainty in life-cycle cost
Performance Modeling and Prediction Goals • Enable informed purchasing decisions in support of TI-XX • Develop an understanding of our key application codes for the purpose of guiding code developers and users toward more efficient applications (Where are the code/system bottlenecks?) • Replace the current application benchmark suite with a judicious choice of synthetic benchmarks that could be used to predict performance of any HPC architecture on the program’s key applications
Benchmarks
Today:
• Dedicated applications (larger weight): real codes, representative data sets
• Synthetic benchmarks (smaller weight): future look, focus on key machine features
Tomorrow:
• Synthetic benchmarks (100% weight): coordinated to application "signatures"; performance on real codes accurately predicted from synthetic benchmark results; supported by genuine "signature" databases
The next 1–2 years are key: we must prove that synthetic benchmarks and application "signatures" can be coordinated
Potential Future Impact of Performance Modeling and Prediction
Benchmarking has real impact: • Over $160M in decisions over the last 4 years • Hundreds of millions of dollars in decisions over the next decade
Coordinating synthetic benchmark performance with application signatures is the next huge step. Make it happen!
The Performance Prediction Framework • Parallel performance - two major factors: • Single processor performance • Interprocessor communications performance • Two major components of the framework: • Single processor model Model of application’s performance between communication events (floating point performance and memory access) • Communications model (Network simulator) Model of application’s communication events (Measures full MPI latency and bandwidth)
The Performance Prediction Framework • Both models based on simplicity and isolation: • Simplicity: start simple and only add complexity when needed to explain behavior • Isolation: Collect each piece of the performance framework in isolation, then combine pieces for performance prediction
Components of Performance Prediction Framework
• Machine Profile - characterizations of the rates at which a machine can (or is projected to) carry out fundamental operations, independent of any particular application
• Application Signature - detailed summaries of the fundamental operations to be carried out by the application, independent of any particular machine
Combine the Machine Profile and Application Signature using:
• Convolution Method - algebraic mapping of the application signature onto the machine profile to calculate a performance prediction
Components of Performance Prediction Framework
Parallel processor prediction combines a single-processor model and a communication model:
• Single-processor model
   • Machine Profile (Machine A): characterization of the memory performance capabilities of Machine A
   • Application Signature (Application B): characterization of the memory operations that Application B needs to perform
   • Convolution Method: maps the memory usage needs of Application B onto the capabilities of Machine A
• Communication model
   • Machine Profile (Machine A): characterization of the network performance capabilities of Machine A
   • Application Signature (Application B): characterization of the network operations that Application B needs to perform
   • Convolution Method: maps the network usage needs of Application B onto the capabilities of Machine A
Together these yield a performance prediction of Application B on Machine A
MAPS Data
MAPS is a memory bandwidth benchmark that measures memory rates (MB/s) for different levels of cache (L1, L2, L3, main memory) and different access patterns (stride-one and random).
[Plot regions: stride-one access in L1 cache; random access in L1/L2 cache; stride-one access in L1/L2 cache; stride-one access in L2 cache/main memory]
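For illustration only, a crude MAPS-style probe in Python/NumPy is sketched below: it times stride-one and random reads over arrays that span the cache hierarchy. It is not the MAPS benchmark itself (which exercises the memory system with far finer control); the gather-based access and the chosen array sizes are simplifications.

```python
# Crude MAPS-style memory bandwidth probe (illustrative only).
import time
import numpy as np

def bandwidth_mb_s(nbytes, random_access, reps=5):
    a = np.ones(nbytes // 8)                      # array of 8-byte doubles
    idx = np.arange(a.size)
    if random_access:
        np.random.default_rng(0).shuffle(idx)     # random access pattern
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        _ = a[idx].sum()                          # read every element once
        best = min(best, time.perf_counter() - t0)
    return nbytes / best / 1e6

for kb in [16, 256, 4096, 65536]:                 # spans L1/L2/L3/main memory
    print(f"{kb:>6} KB  stride-one {bandwidth_mb_s(kb * 1024, False):9.0f} MB/s"
          f"   random {bandwidth_mb_s(kb * 1024, True):9.0f} MB/s")
```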
Application Signature
• Trace of the operations performed on the processor by an application (memory and floating-point operations)
• The trace of the application is collected and processed by the MetaSim Tracer
Sample (showing cache hit rates for the PREDICTED MACHINE for each basic block of the application; this additional information requires "processing" by the MetaSim tracer, not just straight memory tracing, hence the combination of the application signature and convolution components):
BB#202: 2.0E9, load, 99%, 100%, stride-one
BB#202: 1.9E3, FP
BB#303: 2.2E10, load, 52%, 63%, random
BB#303: 1.1E2, FP
where the format is: basic-block #: # memory refs., type, hit rates, access stride
Convolutions
• MetaSim trace collected on Cobalt60 simulating the SC45 memory structure
Single-processor or per-processor performance:
• Machine profile for the processor (Machine A)
• Application signature for the application (App. #1)
• The relative "per-processor" performance of App. #1 on Machine A is represented as the convolution of the application signature with the machine profile: each basic block's memory and floating-point operation counts are divided by the machine's measured rates for the corresponding access pattern and cache level, and the resulting times are combined
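The toy sketch below illustrates the convolution idea: it parses application-signature lines in the format shown on the previous slide and charges each basic block's operations against a small table of MAPS-style rates. The rate table and the rule for picking a rate from the hit rates are assumptions for illustration, not the actual MetaSim convolution algorithm.

```python
# Toy convolution: map an application signature onto a machine profile.
signature = [
    "BB#202: 2.0E9, load, 99%, 100%, stride-one",
    "BB#202: 1.9E3, FP",
    "BB#303: 2.2E10, load, 52%, 63%, random",
    "BB#303: 1.1E2, FP",
]
# "Machine A" profile: memory rates in references/second by access pattern and
# level, plus a peak floating-point rate (all invented numbers).
mem_rate = {("stride-one", "cache"): 1.2e9, ("stride-one", "main"): 4.0e8,
            ("random", "cache"): 3.0e8,     ("random", "main"): 5.0e7}
fp_rate = 2.0e9

total_time = 0.0
for line in signature:
    bb, fields = line.split(":")
    parts = [p.strip() for p in fields.split(",")]
    count = float(parts[0])
    if parts[1] == "FP":
        total_time += count / fp_rate
    else:
        l1_hit = float(parts[2].rstrip("%")) / 100.0
        stride = parts[4]
        level = "cache" if l1_hit > 0.95 else "main"   # crude mapping rule
        total_time += count / mem_rate[(stride, level)]
print(f"predicted per-processor time: {total_time:.1f} s")
```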
Results-Predictions for AVUS (Cobalt60) AVUS TI-05 standard data set on 64 CPUs
Results-Predictions for AVUS (Cobalt60) AVUS TI-05 standard data set on 32 CPUs
Results-Predictions for AVUS (Cobalt60) AVUS TI-05 standard data set on 128 CPUs
Results-Predictions for HYCOM HYCOM TI-05 standard data set on 59 CPUs
Results-Predictions for HYCOM HYCOM TI-05 standard data set on 96 CPUs
Results-Predictions for HYCOM HYCOM TI-05 standard data set on 124 CPUs
Results: Sensitivity Study of HYCOM (Investigation of "Processor" Performance Effects)
• Base case is the performance of HABU (IBM PWR3)
• Four-fold improvement in floating-point performance (no impact on run time!)
• Two-fold improvement in memory bandwidth/latency (the increase in main-memory performance drives the improved performance!)
• HYCOM run on 59 CPUs
• TI-04 standard data set