
Controlling parallelization in the IBM XL Fortran and C/C++ parallelizing compilers

Priya Unnikrishnan, IBM Toronto Lab (priyau@ca.ibm.com), CASCON 2005. Overview: parallelization in IBM XL compilers, outlining, automatic parallelization, cost analysis, controlled parallelization, future work.

Presentation Transcript


  1. Controlling parallelization in the IBM XL Fortran and C/C++ parallelizing compilers. Priya Unnikrishnan, IBM Toronto Lab, priyau@ca.ibm.com, CASCON 2005

  2. Overview • Parallelization in IBM XL compilers • Outlining • Automatic parallelization • Cost analysis • Controlled parallelization • Future work

  3. Parallelization • IBM XL compilers support Fortran 77/90/95, C and C++ • They implement both OpenMP and auto-parallelization • Both target SMP (shared memory parallel) machines • Non-threadsafe code is generated by default • Use the _r invocations (xlf_r, xlc_r … ) to generate threadsafe code

  4. Parallelization options

  5. Outlining • Parallelization transformation

  6. Outlining. The original OpenMP loop
      int main() {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
          a[i] = const;
          ……
        }
      }
    is split into a runtime call plus an outlined subroutine.
    Runtime call:
      long main{}{
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
          _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0, @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
        endif
        return main;
      }
    Outlined routine:
      void main@OL@1(unsigned @LB, unsigned @UB){
        @CIV1 = 0;
        do {
          a[]0[(long)@LB + @CIV1] = const;
          ……
          @CIV1 = @CIV1 + 1;
        } while ((unsigned)@CIV1 < (@UB - @LB));
        return;
      }
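A hand-written C sketch of what outlining does at source level may help: the loop body moves into a function parameterized by its iteration bounds, and the serial program reduces to one call over the whole range. The names `main_outlined` and `run_loop` are hypothetical stand-ins for `main@OL@1` and the transformed `main`; the half-open range [lb, ub) is an assumption chosen to mirror the `@CIV1 < (@UB-@LB)` exit test, and no runtime library is involved.

```c
#include <stddef.h>

/* Hypothetical analogue of main@OL@1: the loop body, parameterized by a
   half-open iteration range [lb, ub) so that any sub-range can be run.
   The caller must ensure lb <= ub. */
static void main_outlined(size_t lb, size_t ub, int *a, int value)
{
    for (size_t civ = 0; civ < ub - lb; civ++) {  /* civ plays the role of @CIV1 */
        a[lb + civ] = value;
    }
}

/* Hypothetical analogue of the transformed main: instead of handing the
   outlined routine to a parallel runtime, this serial version invokes it
   once over the whole range [0, n). The n > 0 guard mirrors the listing. */
static void run_loop(int *a, size_t n, int value)
{
    if (n > 0) {
        main_outlined(0, n, a, value);
    }
}
```

For example, `run_loop(a, n, 7)` fills `a[0..n-1]` with 7 through a single call to the outlined body, which is what a one-thread execution of the parallel region amounts to.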

  7. SMP parallel runtime. The call _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n ..) dispatches the outlined routine over sub-ranges of the iteration space: main@OL@1(0,9), main@OL@1(10,19), main@OL@1(30,39), main@OL@1(20,29). Because the outlined function is parameterized, it can be invoked for different ranges in the iteration space.
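The range splitting performed by the runtime can be sketched with plain static block scheduling. This is an assumption for illustration: the slide does not say which schedule `_xlsmpParallelDoSetup_TPO` uses, and `chunk_bounds` and `show_invocations` are hypothetical helpers. Bounds are inclusive, as printed on the slide.

```c
#include <stdio.h>

/* Inclusive bounds of chunk `tid` when iterations 0..n-1 are split into
   `nchunks` contiguous blocks (static block scheduling, assumed here).
   The first n % nchunks chunks receive one extra iteration.
   If *ub < *lb on return, the chunk is empty. */
static void chunk_bounds(long n, long nchunks, long tid, long *lb, long *ub)
{
    long base = n / nchunks;
    long extra = n % nchunks;
    *lb = tid * base + (tid < extra ? tid : extra);
    *ub = *lb + base + (tid < extra ? 1 : 0) - 1;
}

/* Prints the invocations the runtime would make, one per non-empty chunk.
   This sketch runs them in order; real threads would run them concurrently. */
static void show_invocations(long n, long nchunks)
{
    for (long t = 0; t < nchunks; t++) {
        long lb, ub;
        chunk_bounds(n, nchunks, t, &lb, &ub);
        if (ub >= lb)
            printf("main@OL@1(%ld,%ld)\n", lb, ub);
    }
}
```

Calling `show_invocations(40, 4)` prints the four ranges from the slide, (0,9), (10,19), (20,29) and (30,39), in that order, because this sketch runs the chunks one after another rather than on separate threads.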

  8. Auto-parallelization • Integrated framework for OpenMP and auto-parallelization • Auto-parallelization is restricted to loops • Auto-parallelization is done in the link step when possible • This allows various interprocedural analyses and optimizations to be performed before automatic parallelization

  9. Auto-parallelization transformation. The original loop
      int main() {
        for (int i = 0; i < n; i++) {
          a[i] = const;
          ……
        }
      }
    is marked as parallelizable
      int main() {
        #auto-parallel-loop
        for (int i = 0; i < n; i++) {
          a[i] = const;
          ……
        }
      }
    and then outlined.

  10. OpenMP applications can also be auto-parallelized: user-parallel code is skipped, which is the desired behavior. The loop
      int main() {
        for (int i = 0; i < n; i++) {
          a[i] = const;
          ……
        }
        #pragma omp parallel for
        for (int j = 0; j < n; j++) {
          b[j] = a[j];
        }
      }
    becomes
      int main() {
        #auto-parallel-loop
        for (int i = 0; i < n; i++) {
          a[i] = const;
          ……
        }
        #pragma omp parallel for
        for (int j = 0; j < n; j++) {
          b[j] = a[j];
        }
      }
    and is then outlined.

  11. Pre-parallelization phase • Loop normalization (normalize countable loops) • Scalar privatization • Array privatization • Reduction variable analysis • Loop interchange (where it helps parallelization)
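Reduction variable analysis is what allows a loop that accumulates into a single variable to be parallelized. A minimal C sketch of the idea, assuming integer addition (which is associative, so regrouping is safe) and hypothetical function names: each chunk accumulates into a private partial sum, and the partials are combined once at the end. The chunks run sequentially here; in a parallel build each would run on its own thread.

```c
#include <stddef.h>

/* Serial reference: `sum` carries a dependence from one iteration to the next. */
static long sum_serial(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* After reduction recognition (sketch): each chunk uses its own private
   accumulator, so chunks no longer depend on each other; only the final
   combine step touches the shared total. Requires nchunks > 0. */
static long sum_by_chunks(const int *a, size_t n, size_t nchunks)
{
    long total = 0;
    for (size_t c = 0; c < nchunks; c++) {
        size_t lb = c * n / nchunks;          /* half-open chunk [lb, ub) */
        size_t ub = (c + 1) * n / nchunks;
        long partial = 0;                     /* privatized copy of the reduction variable */
        for (size_t i = lb; i < ub; i++)
            partial += a[i];
        total += partial;                     /* combine step */
    }
    return total;
}
```

For floating-point sums the same regrouping changes rounding, so results may differ slightly from the serial loop.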

  12. Cost Analysis • Automatic parallelization tests: • Dependence analysis: is it safe to parallelize? • Cost analysis: is it worthwhile to parallelize? • Cost analysis estimates the total workload of the loop: LoopCost = IterationCount * ExecTimeOfLoopBody • When the cost is known at compile time, the decision is trivial • Runtime cost analysis is more complex

  13. Conditional parallelization. For the original loop
      int main() {
        for (int i = 0; i < n; i++) {
          a[i] = const;
          ……
        }
      }
    the compiler generates a runtime check that selects the parallel or the serial path:
      long main{}{
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
          if (loop_cost > threshold) {
            _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0, @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
          } else
            main@OL@1(0, 0, (unsigned)n, 0)
          endif
        endif
        return main;
      }
    plus the outlined subroutine:
      void main@OL@1( ……
          @CIV1 = @CIV1 + 1;
        } while ((unsigned)@CIV1 < (@UB - @LB));
        return;
      }
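The decision made by the generated code can be mimicked in C. The per-iteration cost `BODY_COST` and the `PAR_THRESHOLD` value are hypothetical numbers, and the function only reports which path would be taken rather than launching threads; the cost follows the LoopCost = IterationCount * ExecTimeOfLoopBody formula from the cost analysis slide.

```c
/* Hypothetical compile-time estimate of one iteration's cost (abstract units). */
enum { BODY_COST = 4 };

/* Hypothetical threshold; in the compiler this is a tuned heuristic. */
enum { PAR_THRESHOLD = 10000 };

typedef enum { PATH_NONE, PATH_SERIAL, PATH_PARALLEL } path_t;

/* Mirrors the structure of the generated code: skip empty loops, then
   compare the runtime loop cost against the threshold. */
static path_t choose_path(long n)
{
    if (n <= 0)
        return PATH_NONE;                      /* the "if (n > 0)" guard */
    long loop_cost = n * BODY_COST;            /* IterationCount * ExecTimeOfLoopBody */
    return loop_cost > PAR_THRESHOLD ? PATH_PARALLEL : PATH_SERIAL;
}
```

With these numbers, a loop of 2500 iterations (cost exactly 10000) stays serial, while 5000 iterations (cost 20000) takes the parallel path. The multiplication can overflow for very large `n`, which is the problem raised on the next slide.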

  14. Runtime cost analysis challenges. Runtime checks should be: • Lightweight: they must not introduce large overhead in applications that are mostly serial • Free of overflow problems: an overflow in loopcost = ((( c1*n1 ) + (c2*n2) + const)*n3)* … leads to an incorrect decision, which is costly • Restricted to integer operations • Accurate • The implementation has to balance all of the above factors
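One way to address the overflow problem while staying within integer operations is saturating arithmetic: any intermediate result that would overflow is pinned to the maximum value, so a huge loop can never wrap around and look cheap. This is a sketch of that general technique, not necessarily what the XL compilers do; the term `const` from the slide is renamed `k` because `const` is a C keyword, and the function names are hypothetical.

```c
#include <stdint.h>

/* Saturating add: returns UINT64_MAX instead of wrapping around. */
static uint64_t sat_add(uint64_t a, uint64_t b)
{
    return (a > UINT64_MAX - b) ? UINT64_MAX : a + b;
}

/* Saturating multiply: detects overflow with a division check, using
   integer operations only. */
static uint64_t sat_mul(uint64_t a, uint64_t b)
{
    if (a != 0 && b > UINT64_MAX / a)
        return UINT64_MAX;
    return a * b;
}

/* loopcost = ((c1*n1) + (c2*n2) + k) * n3, evaluated so that an overflow
   anywhere yields UINT64_MAX rather than a small, misleading value. */
static uint64_t loop_cost(uint64_t c1, uint64_t n1, uint64_t c2, uint64_t n2,
                          uint64_t k, uint64_t n3)
{
    uint64_t inner = sat_add(sat_add(sat_mul(c1, n1), sat_mul(c2, n2)), k);
    return sat_mul(inner, n3);
}
```

For example, `loop_cost(2, 10, 3, 5, 1, 4)` is (20 + 15 + 1) * 4 = 144. Saturation errs toward "expensive"; a wrapped-around value could instead keep a huge loop serial.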

  15. Runtime dependence test. The runtime check can also include a dependence test. For the original loop
      int main() {
        for (int i = 0; i < n; i++) {
          a[i] = const;
          ……
        }
      }
    the generated code becomes:
      long main{}{
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
          if (<deptest> && loop_cost > threshold) {
            _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0, @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
          } else
            main@OL@1(0, 0, (unsigned)n, 0)
          endif
        endif
        return main;
      }
    plus the outlined subroutine:
      void main@OL@1( ……
          @CIV1 = @CIV1 + 1;
        } while ((unsigned)@CIV1 < (@UB - @LB));
        return;
      }
    Work by Peng Zhao
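The slide leaves `<deptest>` unspecified. One common kind of runtime dependence test, shown here as an assumption, checks whether the memory written and the memory read through two pointers overlap; if they do not, a loop such as `dst[i] = src[i] + 1` has no cross-iteration dependence. The test is conservative: overlapping ranges force the serial path even when the loop would in fact be safe. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero when the byte ranges [dst, dst + n*elem_size) and
   [src, src + n*elem_size) do not overlap. Addresses are compared as
   integers so that pointers into different objects can be tested. */
static int ranges_disjoint(const void *dst, const void *src, size_t n, size_t elem_size)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    uintptr_t bytes = (uintptr_t)(n * elem_size);
    return d + bytes <= s || s + bytes <= d;
}

/* Hypothetical <deptest> for a loop writing dst[0..n-1] and reading src[0..n-1]. */
static int loop_is_parallel_safe(int *dst, const int *src, size_t n)
{
    return ranges_disjoint(dst, src, n, sizeof *dst);
}
```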

  16. Controlled parallelization • Cost analysis selects big loops • Controlled parallelization: selection alone is not enough • Parallel performance depends on both the amount of work and the number of processors used • Using a large number of processors for a small loop causes huge degradations

  17. [Chart] Measured on a 64-way Power5 processor. Small is good!

  18. Controlled parallelization • Introduce another runtime parameter, IPT (minimum iterations per thread) • IPT is passed to the SMP runtime • The SMP runtime limits the number of threads working on the parallel loop based on IPT • IPT = function(loop_cost, mem access info ..)
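The slide gives IPT only as a function of the loop cost and memory access information. A hypothetical heuristic, for illustration only: choose the smallest iteration count whose cost reaches a fixed minimum amount of work per thread, that is, IPT = ceil(MIN_WORK_PER_THREAD / cost per iteration). Both the constant and the formula are assumptions, `compute_ipt` is a hypothetical name, and the memory access information is ignored.

```c
/* Hypothetical tuning constant: the least work (in the same abstract units
   as the cost estimate) that makes an extra thread worthwhile. */
#define MIN_WORK_PER_THREAD 20000L

/* Hypothetical IPT heuristic: smallest iteration count whose total cost
   reaches MIN_WORK_PER_THREAD, i.e. ceil(MIN_WORK_PER_THREAD / body_cost).
   Memory access information is not modeled in this sketch. */
static long compute_ipt(long body_cost)
{
    if (body_cost <= 0)
        body_cost = 1;                      /* guard against a bogus estimate */
    long ipt = (MIN_WORK_PER_THREAD + body_cost - 1) / body_cost;
    return ipt > 0 ? ipt : 1;               /* always at least one iteration */
}
```

A cheap body (cost 4) yields an IPT of 5000, while a body costlier than the minimum work yields an IPT of 1, so every iteration may get its own thread.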

  19. Controlled parallelization: IPT as a runtime parameter. For the original loop
      int main() {
        for (int i = 0; i < n; i++) {
          a[i] = const;
          ……
        }
      }
    the generated code computes IPT and passes it to the runtime:
      long main{}{
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
          if (loop_cost > threshold) {
            IPT = func(loop_cost)
            _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0, @_xlsmpEntry0, 0, 0, 0, 0, 0, IPT)
          } else
            main@OL@1(0, 0, (unsigned)n, 0)
          endif
        endif
        return main;
      }
    plus the outlined subroutine:
      void main@OL@1( ……
          @CIV1 = @CIV1 + 1;
        } while ((unsigned)@CIV1 < (@UB - @LB));
        return;
      }

  20. SMP parallel runtime:
      _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n ..IPT) {
        threadsUsed = IterCount / IPT
        if (threadsUsed > threadsAvailable)
          threadsUsed = threadsAvailable
        …..
        …..
      }
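The thread-limiting step in the runtime translates almost directly into C. The clamp to at least one thread is an addition to the slide's pseudocode: `IterCount / IPT` is 0 whenever the loop has fewer than IPT iterations, yet the loop still has to run. `threads_used` is a hypothetical name.

```c
/* One thread per IPT iterations, capped at the threads available,
   and never fewer than one thread. */
static long threads_used(long iter_count, long ipt, long threads_available)
{
    if (ipt < 1)
        ipt = 1;                            /* avoid division by zero */
    long t = iter_count / ipt;
    if (t > threads_available)
        t = threads_available;
    if (t < 1)
        t = 1;                              /* small loops still run, on one thread */
    return t;
}
```

With 40 iterations and an IPT of 10, `threads_used(40, 10, 64)` returns 4, giving four chunks like the ones on the SMP runtime slide, even though 64 threads are available.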

  21. Controlled parallelization for OpenMP • Improves performance and scalability • Allows fine-grained control at loop-level granularity • Can be applied to OpenMP loops as well • Adjusts the number of threads when the environment variable OMP_DYNAMIC is turned on • Issues with threadprivate data • Encouraging results on galgel
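Outside the compiler, a similar cap can be approximated in user code with the standard OpenMP `num_threads` clause and `omp_get_max_threads()`. This is a hypothetical user-level analogue, not how XL implements controlled parallelization for OpenMP loops; `scale_array` and `min_ipt` are illustrative names. Without OpenMP support the pragma is ignored and the loop runs serially.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Scales a[0..n-1] by `factor`, requesting at most one thread per
   `min_ipt` iterations (a user-level stand-in for IPT). */
static void scale_array(double *a, long n, double factor, long min_ipt)
{
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    long want = (min_ipt > 0) ? n / min_ipt : max_threads;
    if (want > max_threads)
        want = max_threads;
    if (want < 1)
        want = 1;                           /* num_threads must be positive */
    int nthreads = (int)want;
    (void)nthreads;                         /* unused when compiled without OpenMP */

    #pragma omp parallel for num_threads(nthreads)
    for (long i = 0; i < n; i++)
        a[i] *= factor;
}
```

Note that `num_threads` is a request: when dynamic adjustment is enabled (for instance through OMP_DYNAMIC), the runtime may provide fewer threads.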

  22. Measured on a 64-way Power5 processor

  23. Future work • Improve the cost analysis algorithm and fine-tune heuristics • Implement interprocedural cost analysis • Extend cost analysis and controlled parallelization to non-loop constructs in user-parallel code, for scalability • Implement interprocedural dependence analysis
