1 / 26

Typical performance bottlenecks and how they can be found

Typical performance bottlenecks and how they can be found. Bert Wesarg ZIH , Technische Universität Dresden. Outline. Case I: Finding load imbalances in OpenMP codes Case II: Finding communication and computation imbalances in MPI codes. Case I: Sparse Matrix Vector Multiplication.

thom
Download Presentation

Typical performance bottlenecks and how they can be found

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Typical performance bottlenecks and how they can be found Bert Wesarg ZIH, TechnischeUniversitätDresden

  2. Outline • Case I: • Finding load imbalances in OpenMP codes • Case II: • Finding communication and computation imbalances in MPI codes

  3. Case I: Sparse Matrix Vector Multiplication • Matrix has significant more zero elements => sparse matrix • Only non-zero elements of are saved efficiently in memory • Algorithm: foreach row r in A y[r.x] = 0 foreach non-zero element e in row y[r.x] += e.value * x[e.y]

  4. Case I: Sparse Matrix Vector Multiplication • Naïve OpenMP Algorithm: • Distributes the rows of A evenly across the threads in the parallel region • The distribution of the non-zero elements may influence the load balance in the parallel application #pragma omp parallel for foreach row r in A y[r.x] = 0 foreach non-zero element e in row y[r.x] += e.value * x[e.y]

  5. Case I: Load imbalances in OpenMP codes • Measuring the static OpenMP application % cd ~/Bottlenecks/smxv % make PREP=scorep scorepgcc -fopenmp -DLITTLE_ENDIAN \ -DFUNCTION_INC='"y_Ax-omp.inc.c"' -DFUNCTION=y_Ax_omp \ -o smxv-ompsmxv.c -lm scorepgcc -fopenmp -DLITTLE_ENDIAN \ -DFUNCTION_INC='"y_Ax-omp-dynamic.inc.c"‘ \ -DFUNCTION=y_Ax_omp_dynamic -o smxv-omp-dynamic smxv.c -lm % OMP_NUM_THREADS=8 scan –t ./smxv-ompyax_large.bin

  6. Case I: Load imbalances in OpenMP codes: Profile • Two metrics which indicate load imbalances: • Time spent in OpenMP barriers • Computational imbalance • Open prepared measurement on the LiveDVD with Cube % cube ~/Bottlenecks/smxv/scorep_smxv-omp_large/trace.cubex [CUBE GUI showing trace analysis report]

  7. Case I: Time spent in OpenMP barriers These threads spent up to 20% of there running time in the barrier

  8. Case I: Computational imbalance Master thread does 66% of the work

  9. Case I: Sparse Matrix Vector Multiplication • Improved OpenMP Algorithm • Distributes the rows of A dynamically across the threads in the parallel region #pragma omp parallel for schedule(dynamic,1000) foreach row r in A y[r.x] = 0 foreach non-zero element e in row y[r.x] += e.value * x[e.y]

  10. Case I: Profile Analysis • Two metrics which indicate load imbalances: • Time spent in OpenMP barriers • Computational imbalance • Open prepared measurement on the LiveDVD with Cube: % cube ~/Bottlenecks/smxv/scorep_smxv-omp-dynamic_large/trace.cubex [CUBE GUI showing trace analysis report]

  11. Case I: Time spent in OpenMP barriers All threads spent similar time in the barrier

  12. Case I: Computational imbalance Threads do nearly equal work

  13. Case I: Trace Comparison • Open prepared measurement on the LiveDVD with Vampir: % vampir ~/Bottlenecks/smxv/scorep_smxv-omp_large/traces.otf2 [VampirGUI showing trace]

  14. Case I: Time spent in OpenMP barriers Improved runtime Less time in OpenMP barrier

  15. Case I: Computational imbalance Great imbalance for time spent in computational code Great imbalance for time spent in computational code

  16. Outline • Case I: • Finding load imbalances in OpenMP codes • Case II: • Finding communication and computation imbalances in MPI codes

  17. Case II: Heat Conduction Simulation • Calculating the heat conduction at each timestep • Discretized formula for space and time

  18. Case II: Heat Conduction Simulation • Application uses MPI for boundary exchange and OpenMP for computation • Simulation grid is distributed across MPI ranks

  19. Case II: Heat Conduction Simulation • Ranks need to exchange boundaries before next iteration step

  20. Case II: Profile Analysis • MPI Algorithm • Building and measuring the heat conduction application: • Open prepared measurement on the LiveDVD with Cube foreach step in [1:nsteps] exchangeBoundaries computeHeatConduction % cd ~/Bottlenecks/heat % make PREP=‘scorep --user’ [... make output ...] % scan –t mpirun –np 16 ./heat-MPI 3072 32 % cube ~/Bottlenecks/heat/scorep_heat-MPI_16/trace.cubex [CUBE GUI showing trace analysis report]

  21. Case II: Hide MPI communication with computation • Step 1: Compute heat in the area which is communicated to your neighbors

  22. Case II: Hide MPI communication with computation • Step 2: Start communicating boundaries with your neighbors

  23. Case II: Hide MPI communication with computation • Step 3: Compute heat in the interior area

  24. Case II: Profile Analysis • Improved MPI Algorithm • Measuring the improved heat conduction application: • Open prepared measurement on the LiveDVD with Cube foreach step in [1:nsteps] computeHeatConductionInBoundaries startBoundaryExchange computeHeatConductionInInterior waitForCompletionOfBoundaryExchange % scan –t mpirun –np 16 ./heat-MPI-overlap 3072 32 % cube ~/Bottlenecks/heat/scorep_heat-MPI-overlap_16/trace.cubex [CUBE GUI showing trace analysis report]

  25. Case II: Trace Comparison • Open prepared measurement on the LiveDVD with Vampir: % vampir ~/Bottlenecks/heat/scorep_heat-MPI_16/traces.otf2 [VampirGUI showing trace]

  26. Acknowledgments • Thanks to Dirk Schmidl, RWTH Aachen, for providing the sparse matrix vector multiplication code

More Related