Controlling parallelization in the IBM XL Fortran and C/C++ parallelizing compilers

Priya Unnikrishnan
IBM Toronto Lab
priyau@ca.ibm.com
CASCON 2005
Overview

• Parallelization in IBM XL compilers
• Outlining
• Automatic parallelization
• Cost analysis
• Controlled parallelization
• Future work
Parallelization

• The IBM XL compilers support Fortran 77/90/95, C, and C++
• They implement both OpenMP and automatic parallelization
• Both target SMP (shared-memory parallel) machines
• Non-threadsafe code is generated by default
• Use the _r invocations (xlf_r, xlc_r, …) to generate threadsafe code (example below)
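As an illustration (option spellings as documented for XL compilers of that era; check your release), the _r invocations are typically combined with the -qsmp option family to turn parallelization on:

    xlc_r -qsmp=omp  app.c    # honour OpenMP directives, link threadsafe libraries
    xlf_r -qsmp=auto app.f    # additionally enable automatic parallelization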
Outlining

• The parallelization transformation
Outlining

    int main() {
      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

becomes a runtime call plus an outlined routine:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0) then
        _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                  @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
      endif
      return main;
    }

    void main@OL@1(unsigned @LB, unsigned @UB) {
      @CIV1 = 0;
      do {
        a[(long)@LB + @CIV1] = const;
        ......
        @CIV1 = @CIV1 + 1;
      } while ((unsigned)@CIV1 < (@UB - @LB));
      return;
    }

(Pseudocode from the compiler's intermediate listing; the if/then/endif syntax and @-prefixed names are artifacts of the IR.)
SMP parallel runtime

    _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n, ...)
        main@OL@1(0, 9)
        main@OL@1(10, 19)
        main@OL@1(20, 29)
        main@OL@1(30, 39)

• The outlined function is parameterized – it can be invoked for different ranges of the iteration space
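A minimal sketch of the dispatch idea, assuming a fixed thread count and contiguous chunks (illustrative POSIX-threads code, not the actual XL SMP runtime, whose interface is far richer):

    #include <pthread.h>

    enum { NTHREADS = 4, N = 40 };
    static int a[N];

    /* stand-in for the compiler-generated outlined routine */
    static void outlined(unsigned lb, unsigned ub) {
        for (unsigned i = lb; i < ub; i++)
            a[i] = 42;
    }

    typedef struct { unsigned lb, ub; } chunk_t;

    static void *worker(void *arg) {
        chunk_t *c = arg;
        outlined(c->lb, c->ub);      /* each thread handles its own range */
        return 0;
    }

    /* split [0, n) into NTHREADS contiguous chunks, one per thread */
    static void parallel_do(unsigned n) {
        pthread_t tid[NTHREADS];
        chunk_t chunk[NTHREADS];
        unsigned per = (n + NTHREADS - 1) / NTHREADS;
        for (unsigned t = 0; t < NTHREADS; t++) {
            chunk[t].lb = t * per < n ? t * per : n;
            chunk[t].ub = (t + 1) * per < n ? (t + 1) * per : n;
            pthread_create(&tid[t], 0, worker, &chunk[t]);
        }
        for (unsigned t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], 0);
    }

Note that the sketch uses half-open [lb, ub) ranges, whereas the bounds shown on the slide above are inclusive.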
Auto-parallelization

• Integrated framework for OpenMP and auto-parallelization
• Auto-parallelization is restricted to loops
• Auto-parallelization is done in the link step when possible
• This allows various interprocedural analyses and optimizations to run before automatic parallelization
Auto-parallelization transformation

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

becomes

    int main() {
      #auto-parallel-loop
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

    + Outlining
We can auto-parallelize OpenMP applications, skipping user-parallel code – a good thing!!

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
      #pragma omp parallel for
      for (int j = 0; j < n; j++) {
        b[j] = a[j];
      }
    }

becomes

    int main() {
      #auto-parallel-loop
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
      #pragma omp parallel for
      for (int j = 0; j < n; j++) {
        b[j] = a[j];
      }
    }

    + Outlining
Pre-parallelization phase

• Loop normalization (normalize countable loops)
• Scalar privatization (example below)
• Array privatization
• Reduction variable analysis
• Loop interchange (when it helps parallelization)
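As a small illustration of scalar privatization (generic code, not compiler output): the temporary t carries a false dependence between iterations, and giving each thread a private copy removes it.

    /* before: t is shared, so every iteration appears to depend
       on the previous one through t */
    void before(const double *a, double *b, int n) {
        double t;
        for (int i = 0; i < n; i++) {
            t = a[i] * 2.0;
            b[i] = t + 1.0;
        }
    }

    /* after privatization: each thread gets its own copy of t,
       so the iterations are independent */
    void after(const double *a, double *b, int n) {
        double t;
        #pragma omp parallel for private(t)
        for (int i = 0; i < n; i++) {
            t = a[i] * 2.0;
            b[i] = t + 1.0;
        }
    }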
Cost Analysis

• Automatic parallelization applies two tests:
    • Dependence analysis: is it safe to parallelize?
    • Cost analysis: is it worthwhile to parallelize?
• Cost analysis estimates the total workload of the loop:
    LoopCost = IterationCount * ExecTimeOfLoopBody
• If the cost is known at compile time, the decision is trivial
• Runtime cost analysis is more complex
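A worked example with made-up numbers: a loop with IterationCount = 1000 and an estimated body cost of 20 cycles gets LoopCost = 1000 * 20 = 20000 cycles; if the parallelization threshold were 10000 cycles, the loop would be judged worth parallelizing. When n (and hence IterationCount) is unknown at compile time, the same comparison must be emitted as a runtime check, as the next slide shows.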
Conditional Parallelization

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

becomes, with a runtime check:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0) then
        if (loop_cost > threshold) {
          _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                    @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
        } else
          main@OL@1(0, 0, (unsigned)n, 0)   /* run serially */
      endif
      return main;
    }

    void main@OL@1( ......
        @CIV1 = @CIV1 + 1;
      } while ((unsigned)@CIV1 < (@UB - @LB));
      return;
    }
Runtime cost analysis challenges

• Runtime checks should be:
    • Lightweight: they must not add large overhead to applications that are mostly serial
    • Free of overflow problems: an overflow leads to an incorrect decision – costly!!
        loopcost = (((c1*n1) + (c2*n2) + const) * n3) * …
    • Restricted to integer operations
    • Accurate
• The challenge is balancing all of the above (see the sketch below)
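One way to keep the integer cost expression from overflowing is to saturate at the threshold, since any cost at or above it yields the same decision. A minimal sketch, assuming a hypothetical THRESHOLD constant (this is not the compiler's actual scheme):

    #define THRESHOLD 10000u   /* hypothetical parallelization threshold */

    /* multiply, saturating at THRESHOLD; the guard avoids ever
       executing the overflowing multiply */
    static unsigned sat_mul(unsigned x, unsigned y) {
        if (y != 0 && x > THRESHOLD / y)
            return THRESHOLD;
        unsigned p = x * y;
        return p > THRESHOLD ? THRESHOLD : p;
    }

    static unsigned sat_add(unsigned x, unsigned y) {
        unsigned s = x + y;
        return (s < x || s > THRESHOLD) ? THRESHOLD : s;
    }

    /* loopcost = (((c1*n1) + (c2*n2) + k) * n3), clamped to THRESHOLD */
    unsigned loop_cost(unsigned c1, unsigned n1, unsigned c2,
                       unsigned n2, unsigned k, unsigned n3) {
        unsigned body = sat_add(sat_add(sat_mul(c1, n1),
                                        sat_mul(c2, n2)), k);
        return sat_mul(body, n3);
    }

The runtime test then reduces to a handful of integer operations, e.g. loop_cost(...) >= THRESHOLD.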
Runtime dependence test

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

becomes, with a combined runtime dependence and cost check:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0) then
        if (<deptest> && loop_cost > threshold) {
          _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                    @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
        } else
          main@OL@1(0, 0, (unsigned)n, 0)
      endif
      return main;
    }

    void main@OL@1( ......
        @CIV1 = @CIV1 + 1;
      } while ((unsigned)@CIV1 < (@UB - @LB));
      return;
    }

Work by Peng Zhao
Controlled parallelization

• Cost analysis selects big loops – but selection alone is not enough
• Parallel performance depends on both the amount of work and the number of processors used
• Using a large number of processors on a small loop causes huge degradations!!
[Chart: measured on a 64-way Power5 machine. Small is good!!!]
Controlled parallelization

• Introduce another runtime parameter, IPT (minimum iterations per thread)
• The IPT is passed to the SMP runtime
• The SMP runtime limits the number of threads working on the parallel loop based on IPT
• IPT = function(loop_cost, memory access info, …) – a sketch follows below
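A hedged sketch of what such a function could look like (the compiler's real heuristic is not spelled out here; MIN_USEFUL_WORK is an invented constant): give each thread at least enough iterations to amortize the cost of dispatching it.

    /* hypothetical IPT heuristic: each thread should get enough
       iterations to cover MIN_USEFUL_WORK cycles of loop-body cost */
    #define MIN_USEFUL_WORK 5000u

    unsigned compute_ipt(unsigned body_cost /* est. cycles per iteration */) {
        unsigned ipt = MIN_USEFUL_WORK / (body_cost ? body_cost : 1u);
        return ipt ? ipt : 1u;   /* at least one iteration per thread */
    }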
Controlled Parallelization

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

becomes, with the runtime parameter IPT:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0) then
        if (loop_cost > threshold) {
          IPT = func(loop_cost)
          _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                    @_xlsmpEntry0, 0, 0, 0, 0, 0, IPT)
        } else
          main@OL@1(0, 0, (unsigned)n, 0)
      endif
      return main;
    }

    void main@OL@1( ......
        @CIV1 = @CIV1 + 1;
      } while ((unsigned)@CIV1 < (@UB - @LB));
      return;
    }
SMP parallel runtime

    _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n, ..., IPT) {
        threadsUsed = IterCount / IPT;
        if (threadsUsed > threadsAvailable)
            threadsUsed = threadsAvailable;
        .....
    }
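Worked with made-up numbers: for IterCount = 1000 and IPT = 100, threadsUsed = 1000 / 100 = 10, so even on the 64-way machine only 10 threads are dispatched; for IterCount = 64 with the same IPT the loop effectively runs on a single thread (the runtime presumably clamps threadsUsed to at least 1).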
Controlled parallelization for OpenMP

• Improves performance and scalability
• Allows fine-grained control at loop-level granularity
• Can be applied to OpenMP loops as well
• Adjusts the number of threads when the ENV variable OMP_DYNAMIC is turned on
• There are issues with threadprivate data
• Encouraging results on galgel
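For context, OMP_DYNAMIC is the standard OpenMP control that lets the implementation shrink a thread team, and the same permission can be granted from code. A minimal OpenMP/C illustration (the loop and its bounds are arbitrary):

    #include <omp.h>

    int main(void) {
        /* equivalent to setting OMP_DYNAMIC=true in the environment:
           the runtime may use fewer threads than requested */
        omp_set_dynamic(1);

        enum { N = 100 };
        static double b[N];
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = i * 0.5;
        return 0;
    }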
Future work

• Improve the cost analysis algorithm and fine-tune heuristics
• Implement interprocedural cost analysis
• Extend cost analysis and controlled parallelization to non-loop constructs in user-parallel code – for scalability
• Implement interprocedural dependence analysis