PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers
K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi
Kyusyu University, ISIT, IST
Background
• “Peta” is tremendous!
  • Compared with “Giga or Tera” scale machines
  • (Cartoon dialogue) “How are you, Mr. Tera?” “I am fine! How about you, Mr. Peta?”
• To develop a “Peta-scale” supercomputer, we must:
  • Explore the design space of both the computation nodes and the interconnection network!
  • Verify the effective performance that can be achieved!
• So, we need a performance evaluation environment for peta-scale supercomputers!
Our Goal!
• Problem…
  • Simulations are three orders of magnitude slower than real machines!
  • “Peta-scale” is three orders of magnitude larger than “Tera-scale” (i.e. the machines available today)!
• How can we bridge the gap?
  • Develop an efficient performance evaluation environment: PSI-SIM
  • Separate compute-node simulation from network simulation!
  • Abstract the target application program to accelerate simulation!
Performance-Evaluation Flow of PSI-SIM
• Input: Parallelized Application (e.g. Peta-scale)
• Step 1: Generate a skeleton code
  • Tool: BSIM-Parser
  • Uses: DB for Processors (target machine)
  • Output: Skeleton Code
• Step 2: Execute on an existing machine
  • Tool: BSIM-Logger
  • Uses: Interconnect Arch. (target machine)
  • Output: Comm. Profile (w/o Latency)
• Step 3: Simulate the interconnection network
  • Tool: NSIM
  • Uses: Interconnect Configuration
  • Output: Comm. Profile (w/ Latency), Performance Info.
• Step 4: Visualize and analyze the results
  • Tool: ANA
  • Output: Visualization, Hints for Optimization
What is the Skeleton Code?

Original code:

    foo( ) {
        Inst. Block A
        for (i=0; i<n; i++) {
            Inst. Block B
            if (hoge) {
                Inst. Block C
            } else {
                Inst. Block D
            }
            Inst. Block E
        }
        MPI_Comm.
        Inst. Block F
        for (j=0; j<n; j++)
            for (k=0; k<n; k++)
                Func( );
    }

Skeleton code:

    foo( ) {
        BSIM_ADD_TIME(10ms)
        MPI_Comm.
        BSIM_ADD_TIME(1ms)
        BSIM_ADD_TIME(15s)
    }

• Computation blocks are replaced by “estimated” execution times!
• Other modifications (e.g. reducing required memory size)
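The skeleton idea can be illustrated with a small Python model. This is only a sketch: `VirtualClock`, `add_time`, and `foo_skeleton` are hypothetical names standing in for `BSIM_ADD_TIME` and the skeleton `foo( )`, and the mapping of each estimated time to a particular computation block is an assumption read off the slide, not something the slide states.

```python
class VirtualClock:
    """Accumulates estimated execution time instead of running real work."""

    def __init__(self):
        self.now = 0.0  # virtual seconds elapsed on this process

    def add_time(self, seconds):
        # Hypothetical stand-in for BSIM_ADD_TIME: advance the timer only.
        self.now += seconds


def foo_skeleton(clock, comm):
    clock.add_time(10e-3)  # assumed: code before the MPI call (10 ms)
    comm()                 # the communication call stays in the skeleton
    clock.add_time(1e-3)   # assumed: Inst. Block F (1 ms)
    clock.add_time(15.0)   # assumed: the j/k loop nest around Func() (15 s)


clock = VirtualClock()
calls = []
foo_skeleton(clock, lambda: calls.append("MPI_Comm"))
print(f"virtual time: {clock.now:.3f} s, comm calls: {len(calls)}")
```

The function returns almost instantly in real time, yet the virtual clock has advanced by the sum of the estimates, which is the source of the speed-up the slide claims.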
Generating Communication Profile
• BSIM-Logger
  • Executes the skeleton code on an existing machine
  • Emulates the behavior of the target machine
  • Generates a communication profile under the assumption of a ZERO-latency ideal network
• Why Fast?
  • Abstracted computation blocks are NOT executed (just update virtual timers)
  • Masks real communications, but generates accurate logs
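A zero-latency communication log of this kind might be modeled as follows. This is a minimal sketch, not BSIM-Logger itself: the class name, the event fields, and the blocking send/receive rule (a receive cannot complete before the matching send is issued) are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical log record; the real profile format is not given in the slides.
CommEvent = namedtuple("CommEvent", "rank op peer nbytes t_virtual")


class ZeroLatencyLogger:
    """Keeps one virtual timer per rank and logs communications at zero latency."""

    def __init__(self, n_ranks):
        self.clocks = [0.0] * n_ranks
        self.events = []

    def add_time(self, rank, seconds):
        # Abstracted computation: only the virtual timer moves.
        self.clocks[rank] += seconds

    def send_recv(self, src, dst, nbytes):
        # No payload is transferred; only the event and its timestamp are logged.
        t_send = self.clocks[src]
        self.events.append(CommEvent(src, "send", dst, nbytes, t_send))
        # Zero latency: the message is available at t_send. The receiver
        # still waits if it reached the receive earlier (assumed blocking).
        t_recv = max(self.clocks[dst], t_send)
        self.clocks[dst] = t_recv
        self.events.append(CommEvent(dst, "recv", src, nbytes, t_recv))


logger = ZeroLatencyLogger(n_ranks=2)
logger.add_time(0, 2.0)   # rank 0 "computes" for 2 s of virtual time
logger.add_time(1, 0.5)   # rank 1 "computes" for 0.5 s
logger.send_recv(0, 1, 1024)
for event in logger.events:
    print(event)
```

Rank 1 ends up waiting until virtual time 2.0 s, so the log already captures synchronization delays even though no network latency has been added yet.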
How Fast? How Accurate?
[Figures: for ERI (Electron Repulsion Integral) and NAS Parallel FT, charts comparing the original and skeleton codes on time for logging (s) and predicted execution time (s)]
Fast, Flexible Interconnection Network Simulator
• NSIM
  • Takes the communication profile and a network configuration file as input
  • Generates a communication profile with estimated interconnect latency
• Why Fast? Why Flexible?
  • Parallelized implementation
  • Supports a number of parameters
    • Topology, specs of routers/switches, buffer size, and so on
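To make the input/output relationship concrete, here is a deliberately simplified Python sketch of adding latency to a zero-latency profile. Everything in it is assumed for illustration: the ring topology, the configuration keys, and the cost model (per-hop delay plus size over bandwidth, with no contention or buffering). NSIM's actual model is not described in the slides.

```python
def ring_hops(src, dst, n_nodes):
    """Shortest hop count between two nodes on a bidirectional ring."""
    d = abs(src - dst)
    return min(d, n_nodes - d)


def add_latency(profile, config):
    """Annotate each (t_send, src, dst, nbytes) record with an arrival time."""
    annotated = []
    for t_send, src, dst, nbytes in profile:
        hops = ring_hops(src, dst, config["n_nodes"])
        latency = hops * config["hop_latency_s"] + nbytes / config["bandwidth_Bps"]
        annotated.append((t_send, src, dst, nbytes, t_send + latency))
    return annotated


# Hypothetical network configuration and zero-latency profile.
config = {"n_nodes": 8, "hop_latency_s": 1e-6, "bandwidth_Bps": 1e9}
profile = [(0.0, 0, 1, 1000), (0.5, 0, 6, 2000)]

result = add_latency(profile, config)
for record in result:
    print(record)
```

A real simulator would also propagate each delay to later events on the receiving process; this sketch annotates messages independently to keep the example short.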
Performance of BSIM + NSIM
• Performance prediction for HPL execution on a 16-node PC cluster
• < 120 s (problem size = 5,000) @ 8 CPUs
• About 9,000 MPI-Comm./s @ 8 CPUs
[Figure: measured vs. predicted execution time (s); Error = 5.3%; not skeleton execution]
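The slide does not define how the error is computed. Assuming the usual relative error against the measured time, with accuracy taken as its complement, the calculation would look like the sketch below. The two times are made-up illustrative values, not data from the slides.

```python
def relative_error(predicted, measured):
    """Relative prediction error with respect to the measured time."""
    return abs(predicted - measured) / measured


def accuracy(predicted, measured):
    """Accuracy taken as 1 minus the relative error (an assumed definition)."""
    return 1.0 - relative_error(predicted, measured)


# Illustrative numbers only: a 100 s measured run predicted as 94.7 s.
measured_s = 100.0
predicted_s = 94.7

err = relative_error(predicted_s, measured_s)
acc = accuracy(predicted_s, measured_s)
print(f"error = {err:.1%}, accuracy = {acc:.1%}")
```

Under this definition an error of 5.3% corresponds to an accuracy of 94.7%.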
ANA GroupWork Viewer
• Performance Indicator
  • Execution time after load-balance optimization
• Group Work
  • Indicates load balance
• Communication Indicator
  • Amount of communications per second
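The slides do not say how ANA computes these indicators. One plausible reading, sketched below in Python with hypothetical function names and made-up per-process times, is that load balance compares the busiest process with the average, and that perfect balancing would bring the execution time down to that average.

```python
def load_balance(busy_times):
    """Summarize load balance from per-process busy times in seconds."""
    mean = sum(busy_times) / len(busy_times)
    worst = max(busy_times)
    return {
        "current_time": worst,      # the slowest process sets the run time
        "balanced_time": mean,      # run time if work were spread evenly
        "imbalance": worst / mean,  # 1.0 means perfectly balanced
    }


def comm_rate(n_comms, elapsed_s):
    """Communications per second over an interval."""
    return n_comms / elapsed_s


# Illustrative data: four processes with uneven work.
busy = [4.0, 6.0, 5.0, 9.0]
summary = load_balance(busy)
rate = comm_rate(n_comms=18000, elapsed_s=2.0)
print(summary, rate)
```

Here the imbalance of 1.5 suggests that ideal rebalancing could cut the run time from 9 s to 6 s, which is the kind of optimization hint the viewer is meant to surface.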
Conclusions
• PSI-SIM
  • Performance evaluation environment for supercomputers
  • BSIM + NSIM + ANA
• Ongoing work: performance prediction for
  • a “Tera-scale” machine (1K CPU cores) by using a “Giga-scale” machine (e.g. 32 CPU cores)
  • a “Peta-scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine
Peta-scale Performance Prediction
• Assumptions
  • HPL problem size: 3 million
  • # of nodes: 4K (PSI-SIMD)
  • BSIM: uses 32 CPUs (3 GHz Xeon)
  • NSIM: 10,000 MPI-Comm./s @ 8 CPUs
• How long will it take?
  • BSIM: about 300 h (< 2 weeks)
  • NSIM: about ?? (still being estimated…)
Predicted Execution Time (FT)
Target machine?: rscc Used machine?: rscc
[Figure: Error -11.3%, Error -11.6%]
Communication Profile Time (FT)
Target machine?: rscc Used machine?: rscc
[Figure: 19% reduction, 86% reduction]
Predicted Execution Time (ERI)
Target machine?: rscc Used machine?: rscc
[Figure: Error -0.6%, Error 1.5%, Error -0.2%]
Communication Profile Generation Time (ERI)
Target machine?: rscc Used machine?: rscc
[Figure: 97% reduction, 96% reduction, 91% reduction]
Execution-Time Prediction Performance
• Communication latency
• Prediction accuracy: 94.7%
• Larger evaluated application ⇒ higher prediction accuracy
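Simulation Time (fixed problem size: 2000)
[Figure: 16 processes, 256 processes, 1,024 processes; portion attributable to recent results (speed-up)]
• More processes in the evaluated application ⇒ higher parallel processing efficiency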
Performance of NSIM
Target machine?: PSI-hexa Used machine?: PSI-hexa
[Figure: 7.92, 8.36, 8.04; Accuracy: 94.7%; 114 s]