A New Optimization Technique for the Inspector-Executor Method Daisuke Yokota† Shigeru Chiba‡ Kozo Itano† † University of Tsukuba ‡Tokyo Institute of Technology
Computer Simulation is Expensive
• Physicists run a parallel computer at our campus every day for simulation.
• Our target parallel computer costs $45,000 every month
• That is $1 per minute, about the price of an international phone call between Japan and Canada.
• The programs run for a very long time: a week or more.
Hardware for Fast Inter-node Communication
• Our computer, the SR2201, has such hardware for avoiding the communication bottleneck
• It should be used, but in reality it is not (at least at our computer center)
• It is not used by the compiler: generating optimized code for that hardware is difficult
• It is not used by programmers: they are physicists, not computer engineers
Our HPF Compiler
• Optimization for utilizing the hardware for inter-node communication
• Technique: the inspector-executor method plus static code optimization
• Compilation is executed in parallel
• Target: Hitachi SR2201
Optimizations
• Reducing the amount of exchanged data
• Our compiler allocates loop iterations to appropriate nodes to minimize communication
• Merging multiple messages
• Our target computer provides hardware support, and our compiler tries to use it
• Reusing the TCW
• Another hardware support, which reduces the setup time for each message send
Merging Multiple Messages
• Hardware support: block-stride communication
• Multiple messages are sent as a single message (the data must be stored at regular intervals), as sketched below
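To make the pattern concrete, here is a minimal Fortran sketch of data that qualifies; this is our illustration, not the compiler's output, and the packing loop merely stands in for the per-element sends that the hardware makes unnecessary.

    program stride_demo
      implicit none
      integer, parameter :: n = 64, m = 64
      real(8) :: a(n, m), buf(m)
      integer :: j
      a = 1.0d0
      ! Without block-stride hardware, the boundary row a(n,1:m) of this
      ! column-major array would be packed and sent as m separate
      ! messages, one element at a time.
      do j = 1, m
        buf(j) = a(n, j)
      end do
      ! With block-stride communication, the same data is one message:
      ! the elements lie at a fixed interval of n, so a single descriptor
      ! (base = a(n,1), stride = n, count = m) describes all of them.
      print *, 'elements covered by one block-stride message:', m
    end program stride_demo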
Reusing TCW
• TCW: Transfer Control Word
• Reuse the parameters handed to the communication hardware
• Before optimization: the TCW is set up inside the loop, before every send
• After optimization: the TCW is set up once, outside the loop, and every send reuses it (see the sketch below)
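A minimal sketch of the transformation; tcw_setup and tcw_send are hypothetical stand-ins for the SR2201 communication calls (the talk does not show the real interface), implemented as empty dummies so the example compiles.

    program tcw_demo
      implicit none
      integer :: i
      integer, parameter :: niter = 1000
      ! Before optimization: the transfer control word is rebuilt on
      ! every iteration, paying the setup cost niter times.
      do i = 1, niter
        call tcw_setup()
        call tcw_send()
      end do
      ! After optimization: the message parameters never change between
      ! iterations, so the TCW is set up once and every send reuses it.
      call tcw_setup()
      do i = 1, niter
        call tcw_send()
      end do
    contains
      subroutine tcw_setup()   ! hypothetical: program the hardware's TCW
      end subroutine tcw_setup
      subroutine tcw_send()    ! hypothetical: fire a send using the TCW
      end subroutine tcw_send
    end program tcw_demo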
Implementation: Original Inspector-Executor Method
• Goal: parallelize a loop by runtime analysis
• Inspector (runs at runtime): determines which array elements must be exchanged among nodes, and passes the resulting data of the analysis to the Executor
• Executor: exchanges array elements, executes the loop body in parallel, and exchanges array elements again
• A minimal sketch of this pattern follows
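The following Fortran sketch shows the shape of the pattern for an indirect access x(idx(i)); the names and the block distribution are our illustration, not the paper's code.

    program insp_exec
      implicit none
      integer, parameter :: n = 16, nprocs = 4, me = 0
      integer, parameter :: blk = n / nprocs
      integer :: idx(n), need(n), nneed, i, owner
      real(8) :: x(n), y(n)
      do i = 1, n
        idx(i) = mod(i * 7, n) + 1    ! some irregular index pattern
        x(i) = dble(i)
      end do
      ! Inspector (at runtime): execute only the subscript computation,
      ! recording which elements are owned by other nodes.
      nneed = 0
      do i = 1, n
        owner = (idx(i) - 1) / blk    ! owner under a block distribution
        if (owner /= me) then
          nneed = nneed + 1
          need(nneed) = idx(i)
        end if
      end do
      ! (A communication phase would fetch the nneed elements here.)
      ! Executor: the original loop body, now safe to run in parallel.
      do i = 1, n
        y(i) = x(idx(i))
      end do
      print *, 'remote elements to fetch:', nneed
    end program insp_exec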
Our Improved Inspector-Executor Method
• Inspector produces statically optimized code for the executor
• Inspector runs off-line: running it is part of the compilation process
• What the Inspector hands to the Executor is optimized executor code, not data!
Static Code Optimization
• The Inspector performs constant folding when generating the executor code
• Constant folding eliminates from the Executor:
• the table containing the result of the Inspector's analysis, saving memory space (the table is big!)
• the memory accesses for table lookups, giving better performance
• The effect is illustrated below
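An illustrative reconstruction of the effect (not the compiler's actual output): suppose the Inspector determined that this node's part of the loop touches local elements 3, 7, and 11.

    program fold_demo
      implicit none
      real(8) :: x(16), s
      integer :: tab(3), k
      x = 2.0d0
      ! Conventional executor: the Inspector's result lives in a runtime
      ! table, costing memory space and one lookup per access.
      tab = (/ 3, 7, 11 /)
      s = 0.0d0
      do k = 1, 3
        s = s + x(tab(k))
      end do
      ! Statically optimized executor: the same accesses after constant
      ! folding; the table and its lookups are gone.
      s = x(3) + x(7) + x(11)
      print *, s
    end program fold_demo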
OUTER Directive
• Specifies the range of analysis by the Inspector: the OUTER loop
• We assume the program structure fits that of typical simulation programs: an OUTER loop that repeats millions of times during the simulation, enclosing an INNER loop, the executor, which is what gets parallelized (see the sketch below)
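The expected shape, in Fortran; the spelling !HPF$ OUTER is our guess, since the talk does not show the directive's exact syntax.

    program outer_shape
      implicit none
      real(8) :: a(100), b(100)
      integer :: t, i
      a = 1.0d0
      b = 0.0d0
    !HPF$ OUTER                      ! guessed spelling of the directive
      do t = 1, 1000000              ! OUTER loop: the simulation's time steps
        do i = 2, 99                 ! INNER loop: the parallelized executor
          b(i) = 0.5d0 * (a(i-1) + a(i+1))
        end do
        a = b
      end do
      print *, a(50)
    end program outer_shape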
Restrictions
• Programmers must guarantee that every iteration of the OUTER loop exchanges the same set of array elements among nodes, since the Inspector analyzes only the first iteration (an example follows)
• The set of exchanged array elements is determined without executing inter-node communication: the Inspector skips the actual communication to reduce compilation time
• As a consequence, our compiler cannot compile the IS benchmark of the NAS Parallel Benchmarks
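An illustration of the restriction (the names are ours): the OUTER loop below is legal because idx is fixed before the loop, so every iteration exchanges the same elements; recomputing idx inside the loop would invalidate the Inspector's analysis of the first iteration.

    program restriction_demo
      implicit none
      integer :: idx(10), t, i
      real(8) :: x(10), y(10)
      do i = 1, 10
        idx(i) = mod(i * 3, 10) + 1   ! indirection fixed before the loop: OK
      end do
      x = 1.0d0
      y = 0.0d0
      do t = 1, 100                   ! OUTER loop
        do i = 1, 10
          y(i) = x(idx(i))            ! same exchange set in every iteration
        end do
        x = x + y
        ! Recomputing idx here would change the exchange set per
        ! iteration and violate the restriction.
      end do
      print *, y(1)
    end program restriction_demo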
Our Compiler Runs on a PC Cluster
• For executing the Inspector in parallel
• The Inspector must analyze a large amount of data
• In the original inspector-executor method, the inspector runs in parallel; since our inspector is part of the compiler, the compiler itself runs in parallel
Execution Flow of Our Compiler
• Source program
• Generate the Inspector (on each node)
• Run it to produce an Inspector log
• Analyze the log
• Exchange information about messages between nodes
• Code generation
• Translate into SPMD form
• Output: SPMD parallel code
Our Prototype Compiler
• Input: Fortran77 + HPF + the OUTER directive
• Output: SPMD Fortran code
• Target machines
• Compilation: Pentium III 733MHz x 16 nodes, RedHat 7.1, 100Base Ethernet
• Execution: Hitachi SR2201, PA-RISC-based 150MHz x 16 nodes
Experiments: Pde1 benchmark • Poisson Equation • Good for massively parallel computing • Regular array accesses • High scalability • Distributed array accesses are centralized in a small region of source code
Execution Time (pde1)
• [Chart: speedup on 1, 2, 4, 8, and 16 nodes for Ours and Hitachi HPF against linear speedup; annotated times: 249 sec (Ours), 137,100 sec]
• Hitachi's HPF compiler needs more directives for better performance
Effects of Static Code Optimization (pde1)
• [Chart: reduction of execution time vs. number of nodes]
Compilation Time (pde1)
• [Chart: compilation time in seconds on 2, 4, 8, and 16 nodes, broken down into backend Fortran, sequential, parallel, and data-exchange phases]
• The long compilation time is paid off if the OUTER loop iterates many times
Experiment: FT-a • 3D Fourier Transformation • Features • Irregular array accesses • Distributed array accesses are centralized in a small region of source code
Execution Time (FT-a)
• [Chart: speedup on 1, 2, 4, 8, and 16 nodes for Ours and Hitachi HPF against linear speedup; annotated times: 46 sec and 4,898 sec]
Compilation Time (FT-a)
• [Chart: compilation time in seconds on 2, 4, 8, and 16 nodes, broken down into backend, sequential, parallel, and data-exchange phases]
Experiments: BT-a • Block Tri-diagonal Solver • Features • A small number of irregular array accesses • Distributed array accesses are scattered all over the source code
Execution Time (BT-a)
• [Chart: speedup on 1, 2, 4, 8, and 16 nodes for Ours and Hitachi HPF against linear speedup; annotated times: 1,430 sec (Ours), 1,370,000 sec]
Compilation Time (BT-a)
• Our compiler cannot achieve good performance here
• [Chart: compilation time in seconds on 2, 4, 8, and 16 nodes, broken down into backend, sequential, parallel, and data-exchange phases; times reach tens of thousands of seconds]
• The Inspector must analyze a huge number of array accesses
Conclusion
• An HPF compiler that utilizes hardware for inter-node communication
• Inspector-executor method with static code optimization: the Inspector produces optimized executor code
• The compiler runs on a PC cluster
• Experiments show that the long compilation time is acceptable for simulation programs that run for a long time
Reducing Communication Volume (Optimization)
• Loop iterations are distributed so that the communication volume becomes small
• The data distribution itself is specified with HPF
• A preliminary run examines the communication volume that would occur
• [Diagram: loop iterations annotated with their required communication volume and the processor, PE1 or PE2, assigned to each]
• A sketch of the idea follows
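A sketch of the idea (our reconstruction, not the paper's algorithm): count, for each iteration, how many of its accesses each node owns under the distribution, and assign the iteration to the node that owns the most.

    program alloc_demo
      implicit none
      integer, parameter :: n = 12, nprocs = 2, blk = n / nprocs
      integer :: idx(n), cnt(0:nprocs-1), i, p, best
      do i = 1, n
        idx(i) = mod(i * 5, n) + 1    ! irregular read access x(idx(i))
      end do
      do i = 1, n
        cnt = 0
        cnt((i - 1) / blk) = cnt((i - 1) / blk) + 1           ! owner of y(i)
        cnt((idx(i) - 1) / blk) = cnt((idx(i) - 1) / blk) + 1 ! owner of x(idx(i))
        best = 0                      ! pick the node owning the most accesses
        do p = 1, nprocs - 1
          if (cnt(p) > cnt(best)) best = p
        end do
        print *, 'iteration', i, '-> node', best
      end do
    end program alloc_demo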
Merging Multiple Messages
• Our compiler collects several separate sends into a single message
• Messages in a loop with the INDEPENDENT directive can be merged
• This directive specifies that the result of the loop is independent of the execution order of its iterations (example below)
• Our compiler finds block-stride communication patterns by pattern matching, reducing the number of communication operations
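A minimal example of a mergeable loop; INDEPENDENT is a standard HPF directive, and the array accesses here are only a stand-in for ones that would generate messages.

    program indep_demo
      implicit none
      real(8) :: a(1000), b(1000)
      integer :: i
      b = 1.0d0
      ! INDEPENDENT asserts the result does not depend on the iteration
      ! order, so the sends the loop triggers may be reordered and
      ! coalesced; the compiler then looks for block-stride shapes.
    !HPF$ INDEPENDENT
      do i = 1, 1000
        a(i) = b(i) * 2.0d0
      end do
      print *, a(1)
    end program indep_demo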
Future Work
• Reduce the number of communication operations further
• Use block-stride communication more aggressively: even if redundant data must be sent, merging messages into a smaller number of communications can pay off
• Prevent the generated code from growing too long: if the data dependencies between processors are too complex, our compiler generates too many communication operations
• Improve the scalability of compilation time: the Inspector log for BT was huge
• Experiments with real simulations
CP-PACS/Pilot-3
• Distributed-memory machines at the Center for Computational Physics, University of Tsukuba
• 2048 PEs (CP-PACS), 128 PEs (Pilot-3)
• Hyper-crossbar network
• RDMA (remote direct memory access)
Our Optimizer to Solve the Problem
• Use of special communication devices
• Parallel machines sometimes have special hardware that reduces the time for inter-node communication
• Development of compilers for easy, well-known languages
• Fortran77, simple HPF (High Performance Fortran)
• Runtime analysis
• A profiler of communication behavior, run on a PC cluster