A New Optimization Technique for the Inspector-Executor Method Daisuke Yokota† Shigeru Chiba‡ Kozo Itano† † University of Tsukuba ‡Tokyo Institute of Technology
Computer Simulation is Expensive
• Physicists run a parallel computer at our campus every day for simulation.
• Our target parallel computer costs $45,000 every month
• That is $1 per minute, about the price of an international phone call between Japan and Canada.
• The programs run for a very long time: a week or more.
Hardware for Fast Inter-node Communication
• Our computer, the SR2201, has such hardware for avoiding the communication bottleneck
• It should be used, but in reality it is not (at least at our computer center)
• It is not used by the compiler: generating optimized code for that hardware is difficult
• It is not used by programmers: they are physicists, not computer engineers
Our HPF Compiler
• Optimization for utilizing the hardware for inter-node communication
• Technique: the inspector-executor method plus static code optimization
• Compilation is executed in parallel
• Target: Hitachi SR2201
Optimizations
• Reducing the amount of exchanged data
• Our compiler allocates loop iterations to appropriate nodes to minimize communication
• Merging multiple messages
• Our target computer provides hardware support, and our compiler tries to use it
• Reusing the TCW
• Another hardware support, which reduces the setup time for each message send
Merging Multiple Messages
• Hardware support: block-stride communication
• Multiple messages are sent as a single message (the data must be stored at regular intervals), as sketched below
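To make the pattern concrete, here is a minimal Fortran sketch of data that qualifies; this is our illustration, not the compiler's output, and the packing loop merely stands in for the per-element sends that the hardware makes unnecessary.

    program stride_demo
      implicit none
      integer, parameter :: n = 64, m = 64
      real(8) :: a(n, m), buf(m)
      integer :: j
      a = 1.0d0
      ! Without block-stride hardware, the boundary row a(n,1:m) of this
      ! column-major array would be packed and sent as m separate
      ! messages, one element at a time.
      do j = 1, m
        buf(j) = a(n, j)
      end do
      ! With block-stride communication, the same data is one message:
      ! the elements lie at a fixed interval of n, so a single descriptor
      ! (base = a(n,1), stride = n, count = m) describes all of them.
      print *, 'elements covered by one block-stride message:', m
    end program stride_demo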
Reusing TCW
• TCW: Transfer Control Word
• Reuse the parameters handed to the communication hardware
• Before optimization: the TCW is set up inside the loop, before every send
• After optimization: the TCW is set up once, outside the loop, and every send reuses it (see the sketch below)
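A minimal sketch of the transformation; tcw_setup and tcw_send are hypothetical stand-ins for the SR2201 communication calls (the talk does not show the real interface), implemented as empty dummies so the example compiles.

    program tcw_demo
      implicit none
      integer :: i
      integer, parameter :: niter = 1000
      ! Before optimization: the transfer control word is rebuilt on
      ! every iteration, paying the setup cost niter times.
      do i = 1, niter
        call tcw_setup()
        call tcw_send()
      end do
      ! After optimization: the message parameters never change between
      ! iterations, so the TCW is set up once and every send reuses it.
      call tcw_setup()
      do i = 1, niter
        call tcw_send()
      end do
    contains
      subroutine tcw_setup()   ! hypothetical: program the hardware's TCW
      end subroutine tcw_setup
      subroutine tcw_send()    ! hypothetical: fire a send using the TCW
      end subroutine tcw_send
    end program tcw_demo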
Implementation: Original Inspector-Executor Method
• Goal: parallelize a loop by runtime analysis
• Inspector (runs at runtime): determines which array elements must be exchanged among nodes, and passes the resulting data of the analysis to the Executor
• Executor: exchanges array elements, executes the loop body in parallel, and exchanges array elements again
• A minimal sketch of this pattern follows
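The following Fortran sketch shows the shape of the pattern for an indirect access x(idx(i)); the names and the block distribution are our illustration, not the paper's code.

    program insp_exec
      implicit none
      integer, parameter :: n = 16, nprocs = 4, me = 0
      integer, parameter :: blk = n / nprocs
      integer :: idx(n), need(n), nneed, i, owner
      real(8) :: x(n), y(n)
      do i = 1, n
        idx(i) = mod(i * 7, n) + 1    ! some irregular index pattern
        x(i) = dble(i)
      end do
      ! Inspector (at runtime): execute only the subscript computation,
      ! recording which elements are owned by other nodes.
      nneed = 0
      do i = 1, n
        owner = (idx(i) - 1) / blk    ! owner under a block distribution
        if (owner /= me) then
          nneed = nneed + 1
          need(nneed) = idx(i)
        end if
      end do
      ! (A communication phase would fetch the nneed elements here.)
      ! Executor: the original loop body, now safe to run in parallel.
      do i = 1, n
        y(i) = x(idx(i))
      end do
      print *, 'remote elements to fetch:', nneed
    end program insp_exec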
Our Improved Inspector-Executor Method
• Inspector produces statically optimized code for the executor
• Inspector runs off-line: running it is part of the compilation process
• What the Inspector hands to the Executor is optimized executor code, not data!
Static Code Optimization
• The Inspector performs constant folding when generating the executor code
• Constant folding eliminates from the Executor:
• the table containing the result of the Inspector's analysis, saving memory space (the table is big!)
• the memory accesses for table lookups, giving better performance
• The effect is illustrated below
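An illustrative reconstruction of the effect (not the compiler's actual output): suppose the Inspector determined that this node's part of the loop touches local elements 3, 7, and 11.

    program fold_demo
      implicit none
      real(8) :: x(16), s
      integer :: tab(3), k
      x = 2.0d0
      ! Conventional executor: the Inspector's result lives in a runtime
      ! table, costing memory space and one lookup per access.
      tab = (/ 3, 7, 11 /)
      s = 0.0d0
      do k = 1, 3
        s = s + x(tab(k))
      end do
      ! Statically optimized executor: the same accesses after constant
      ! folding; the table and its lookups are gone.
      s = x(3) + x(7) + x(11)
      print *, s
    end program fold_demo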
OUTER Directive
• Specifies the range of analysis by the Inspector: the OUTER loop
• We assume the program structure fits that of typical simulation programs: an OUTER loop that repeats millions of times during the simulation, enclosing an INNER loop, the executor, which is what gets parallelized (see the sketch below)
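The expected shape, in Fortran; the spelling !HPF$ OUTER is our guess, since the talk does not show the directive's exact syntax.

    program outer_shape
      implicit none
      real(8) :: a(100), b(100)
      integer :: t, i
      a = 1.0d0
      b = 0.0d0
    !HPF$ OUTER                      ! guessed spelling of the directive
      do t = 1, 1000000              ! OUTER loop: the simulation's time steps
        do i = 2, 99                 ! INNER loop: the parallelized executor
          b(i) = 0.5d0 * (a(i-1) + a(i+1))
        end do
        a = b
      end do
      print *, a(50)
    end program outer_shape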
Restrictions
• Programmers must guarantee that every iteration of the OUTER loop exchanges the same set of array elements among nodes, since the Inspector analyzes only the first iteration (an example follows)
• The set of exchanged array elements is determined without executing inter-node communication: the Inspector skips the actual communication to reduce compilation time
• As a consequence, our compiler cannot compile the IS benchmark of the NAS Parallel Benchmarks
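An illustration of the restriction (the names are ours): the OUTER loop below is legal because idx is fixed before the loop, so every iteration exchanges the same elements; recomputing idx inside the loop would invalidate the Inspector's analysis of the first iteration.

    program restriction_demo
      implicit none
      integer :: idx(10), t, i
      real(8) :: x(10), y(10)
      do i = 1, 10
        idx(i) = mod(i * 3, 10) + 1   ! indirection fixed before the loop: OK
      end do
      x = 1.0d0
      y = 0.0d0
      do t = 1, 100                   ! OUTER loop
        do i = 1, 10
          y(i) = x(idx(i))            ! same exchange set in every iteration
        end do
        x = x + y
        ! Recomputing idx here would change the exchange set per
        ! iteration and violate the restriction.
      end do
      print *, y(1)
    end program restriction_demo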
Our Compiler Runs on a PC Cluster
• For executing the Inspector in parallel
• The Inspector must analyze a large amount of data
• In the original inspector-executor method, the inspector runs in parallel; since our inspector is part of the compiler, the compiler itself runs in parallel
Execution Flow of Our Compiler
• Source program
• Generate the Inspector (on each node)
• Run it to produce an Inspector log
• Analyze the log
• Exchange information about messages between nodes
• Code generation
• Translate into SPMD form
• Output: SPMD parallel code
Our Prototype Compiler
• Input: Fortran77 + HPF + the OUTER directive
• Output: SPMD Fortran code
• Target machines
• Compilation: Pentium III 733MHz x 16 nodes, RedHat 7.1, 100Base Ethernet
• Execution: Hitachi SR2201, PA-RISC-based 150MHz x 16 nodes
Experiments: Pde1 benchmark • Poisson Equation • Good for massively parallel computing • Regular array accesses • High scalability • Distributed array accesses are centralized in a small region of source code
Execution Time (pde1)
• [Chart: speedup on 1, 2, 4, 8, and 16 nodes for Ours and Hitachi HPF against linear speedup; annotated times: 249 sec (Ours), 137,100 sec]
• Hitachi's HPF compiler needs more directives for better performance
Effects of Static Code Optimization (pde1)
• [Chart: reduction of execution time vs. number of nodes]
Compilation Time (pde1)
• [Chart: compilation time in seconds on 2, 4, 8, and 16 nodes, broken down into backend Fortran, sequential, parallel, and data-exchange phases]
• The long compilation time is paid off if the OUTER loop iterates many times
Experiment: FT-a • 3D Fourier Transformation • Features • Irregular array accesses • Distributed array accesses are centralized in a small region of source code
Execution Time (FT-a)
• [Chart: speedup on 1, 2, 4, 8, and 16 nodes for Ours and Hitachi HPF against linear speedup; annotated times: 46 sec and 4,898 sec]
Compilation Time (FT-a)
• [Chart: compilation time in seconds on 2, 4, 8, and 16 nodes, broken down into backend, sequential, parallel, and data-exchange phases]
Experiments: BT-a • Block Tri-diagonal Solver • Features • A small number of irregular array accesses • Distributed array accesses are scattered all over the source code
Execution Time (BT-a)
• [Chart: speedup on 1, 2, 4, 8, and 16 nodes for Ours and Hitachi HPF against linear speedup; annotated times: 1,430 sec (Ours), 1,370,000 sec]
Compilation Time (BT-a)
• Our compiler cannot achieve good performance here
• [Chart: compilation time in seconds on 2, 4, 8, and 16 nodes, broken down into backend, sequential, parallel, and data-exchange phases; times reach tens of thousands of seconds]
• The Inspector must analyze a huge number of array accesses
Conclusion
• An HPF compiler that utilizes hardware for inter-node communication
• Inspector-executor method with static code optimization: the Inspector produces optimized executor code
• The compiler runs on a PC cluster
• Experiments show that the long compilation time is acceptable for simulation programs that run for a long time
Reducing Communication Volume (Optimization)
• Loop iterations are distributed so that the communication volume becomes small
• The data distribution itself is specified with HPF
• A preliminary run examines the communication volume that would occur
• [Diagram: loop iterations annotated with their required communication volume and the processor, PE1 or PE2, assigned to each]
• A sketch of the idea follows
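A sketch of the idea (our reconstruction, not the paper's algorithm): count, for each iteration, how many of its accesses each node owns under the distribution, and assign the iteration to the node that owns the most.

    program alloc_demo
      implicit none
      integer, parameter :: n = 12, nprocs = 2, blk = n / nprocs
      integer :: idx(n), cnt(0:nprocs-1), i, p, best
      do i = 1, n
        idx(i) = mod(i * 5, n) + 1    ! irregular read access x(idx(i))
      end do
      do i = 1, n
        cnt = 0
        cnt((i - 1) / blk) = cnt((i - 1) / blk) + 1           ! owner of y(i)
        cnt((idx(i) - 1) / blk) = cnt((idx(i) - 1) / blk) + 1 ! owner of x(idx(i))
        best = 0                      ! pick the node owning the most accesses
        do p = 1, nprocs - 1
          if (cnt(p) > cnt(best)) best = p
        end do
        print *, 'iteration', i, '-> node', best
      end do
    end program alloc_demo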
Merging Multiple Messages
• Our compiler collects several separate sends into a single message
• Messages in a loop with the INDEPENDENT directive can be merged
• This directive specifies that the result of the loop is independent of the execution order of its iterations (example below)
• Our compiler finds block-stride communication patterns by pattern matching, reducing the number of communication operations
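A minimal example of a mergeable loop; INDEPENDENT is a standard HPF directive, and the array accesses here are only a stand-in for ones that would generate messages.

    program indep_demo
      implicit none
      real(8) :: a(1000), b(1000)
      integer :: i
      b = 1.0d0
      ! INDEPENDENT asserts the result does not depend on the iteration
      ! order, so the sends the loop triggers may be reordered and
      ! coalesced; the compiler then looks for block-stride shapes.
    !HPF$ INDEPENDENT
      do i = 1, 1000
        a(i) = b(i) * 2.0d0
      end do
      print *, a(1)
    end program indep_demo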
Future Work
• Reduce the number of communication operations further
• Use block-stride communication more aggressively: even if redundant data must be sent, merging messages into a smaller number of communications can pay off
• Prevent the generated code from growing too long: if the data dependencies between processors are too complex, our compiler generates too many communication operations
• Improve the scalability of compilation time: the Inspector log for BT was huge
• Experiments with real simulations
CP-PACS/Pilot-3
• Distributed-memory machines at the Center for Computational Physics, University of Tsukuba
• 2048 PEs (CP-PACS), 128 PEs (Pilot-3)
• Hyper-crossbar network
• RDMA (remote direct memory access)
Our Optimizer to Solve the Problem
• Use of special communication devices
• Parallel machines sometimes have special hardware that reduces the time for inter-node communication
• Development of compilers for easy, well-known languages
• Fortran77, simple HPF (High Performance Fortran)
• Runtime analysis
• A profiler of communication behavior, run on a PC cluster