Controlling parallelization in the IBM XL Fortran and C/C++ parallelizing compilers

Priya Unnikrishnan
IBM Toronto Lab
priyau@ca.ibm.com
CASCON 2005
Overview

• Parallelization in IBM XL compilers
• Outlining
• Automatic parallelization
• Cost analysis
• Controlled parallelization
• Future work
Parallelization

• The IBM XL compilers support Fortran 77/90/95, C, and C++
• They implement both OpenMP and automatic parallelization
• Both target SMP (shared-memory parallel) machines
• Non-threadsafe code is generated by default
• Use the _r invocations (xlf_r, xlc_r, …) to generate threadsafe code (example below)
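As an illustration (option spellings as documented for XL compilers of that era; check your release), the _r invocations are typically combined with the -qsmp option family to turn parallelization on:

    xlc_r -qsmp=omp  app.c    # honour OpenMP directives, link threadsafe libraries
    xlf_r -qsmp=auto app.f    # additionally enable automatic parallelization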
Outlining

• The parallelization transformation
Outlining

    int main() {
      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

becomes a runtime call plus an outlined routine:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0) then
        _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                  @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
      endif
      return main;
    }

    void main@OL@1(unsigned @LB, unsigned @UB) {
      @CIV1 = 0;
      do {
        a[(long)@LB + @CIV1] = const;
        ......
        @CIV1 = @CIV1 + 1;
      } while ((unsigned)@CIV1 < (@UB - @LB));
      return;
    }

(Pseudocode from the compiler's intermediate listing; the if/then/endif syntax and @-prefixed names are artifacts of the IR.)
SMP parallel runtime

    _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n, ...)
        main@OL@1(0, 9)
        main@OL@1(10, 19)
        main@OL@1(20, 29)
        main@OL@1(30, 39)

• The outlined function is parameterized – it can be invoked for different ranges of the iteration space
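A minimal sketch of the dispatch idea, assuming a fixed thread count and contiguous chunks (illustrative POSIX-threads code, not the actual XL SMP runtime, whose interface is far richer):

    #include <pthread.h>

    enum { NTHREADS = 4, N = 40 };
    static int a[N];

    /* stand-in for the compiler-generated outlined routine */
    static void outlined(unsigned lb, unsigned ub) {
        for (unsigned i = lb; i < ub; i++)
            a[i] = 42;
    }

    typedef struct { unsigned lb, ub; } chunk_t;

    static void *worker(void *arg) {
        chunk_t *c = arg;
        outlined(c->lb, c->ub);      /* each thread handles its own range */
        return 0;
    }

    /* split [0, n) into NTHREADS contiguous chunks, one per thread */
    static void parallel_do(unsigned n) {
        pthread_t tid[NTHREADS];
        chunk_t chunk[NTHREADS];
        unsigned per = (n + NTHREADS - 1) / NTHREADS;
        for (unsigned t = 0; t < NTHREADS; t++) {
            chunk[t].lb = t * per < n ? t * per : n;
            chunk[t].ub = (t + 1) * per < n ? (t + 1) * per : n;
            pthread_create(&tid[t], 0, worker, &chunk[t]);
        }
        for (unsigned t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], 0);
    }

Note that the sketch uses half-open [lb, ub) ranges, whereas the bounds shown on the slide above are inclusive.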
Auto-parallelization

• Integrated framework for OpenMP and auto-parallelization
• Auto-parallelization is restricted to loops
• Auto-parallelization is done in the link step when possible
• This allows various interprocedural analyses and optimizations to run before automatic parallelization
Auto-parallelization transformation

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

becomes

    int main() {
      #auto-parallel-loop
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

    + Outlining
We can auto-parallelize OpenMP applications, skipping user-parallel code – a good thing!!

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
      #pragma omp parallel for
      for (int j = 0; j < n; j++) {
        b[j] = a[j];
      }
    }

becomes

    int main() {
      #auto-parallel-loop
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
      #pragma omp parallel for
      for (int j = 0; j < n; j++) {
        b[j] = a[j];
      }
    }

    + Outlining
Pre-parallelization phase

• Loop normalization (normalize countable loops)
• Scalar privatization (example below)
• Array privatization
• Reduction variable analysis
• Loop interchange (when it helps parallelization)
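As a small illustration of scalar privatization (generic code, not compiler output): the temporary t carries a false dependence between iterations, and giving each thread a private copy removes it.

    /* before: t is shared, so every iteration appears to depend
       on the previous one through t */
    void before(const double *a, double *b, int n) {
        double t;
        for (int i = 0; i < n; i++) {
            t = a[i] * 2.0;
            b[i] = t + 1.0;
        }
    }

    /* after privatization: each thread gets its own copy of t,
       so the iterations are independent */
    void after(const double *a, double *b, int n) {
        double t;
        #pragma omp parallel for private(t)
        for (int i = 0; i < n; i++) {
            t = a[i] * 2.0;
            b[i] = t + 1.0;
        }
    }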
Cost Analysis

• Automatic parallelization applies two tests:
    • Dependence analysis: is it safe to parallelize?
    • Cost analysis: is it worthwhile to parallelize?
• Cost analysis estimates the total workload of the loop:
    LoopCost = IterationCount * ExecTimeOfLoopBody
• If the cost is known at compile time, the decision is trivial
• Runtime cost analysis is more complex
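A worked example with made-up numbers: a loop with IterationCount = 1000 and an estimated body cost of 20 cycles gets LoopCost = 1000 * 20 = 20000 cycles; if the parallelization threshold were 10000 cycles, the loop would be judged worth parallelizing. When n (and hence IterationCount) is unknown at compile time, the same comparison must be emitted as a runtime check, as the next slide shows.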
Conditional Parallelization

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

becomes, with a runtime check:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0) then
        if (loop_cost > threshold) {
          _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                    @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
        } else
          main@OL@1(0, 0, (unsigned)n, 0)   /* run serially */
      endif
      return main;
    }

    void main@OL@1( ......
        @CIV1 = @CIV1 + 1;
      } while ((unsigned)@CIV1 < (@UB - @LB));
      return;
    }
Runtime cost analysis challenges

• Runtime checks should be:
    • Lightweight: they must not add large overhead to applications that are mostly serial
    • Free of overflow problems: an overflow leads to an incorrect decision – costly!!
        loopcost = (((c1*n1) + (c2*n2) + const) * n3) * …
    • Restricted to integer operations
    • Accurate
• The challenge is balancing all of the above (see the sketch below)
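One way to keep the integer cost expression from overflowing is to saturate at the threshold, since any cost at or above it yields the same decision. A minimal sketch, assuming a hypothetical THRESHOLD constant (this is not the compiler's actual scheme):

    #define THRESHOLD 10000u   /* hypothetical parallelization threshold */

    /* multiply, saturating at THRESHOLD; the guard avoids ever
       executing the overflowing multiply */
    static unsigned sat_mul(unsigned x, unsigned y) {
        if (y != 0 && x > THRESHOLD / y)
            return THRESHOLD;
        unsigned p = x * y;
        return p > THRESHOLD ? THRESHOLD : p;
    }

    static unsigned sat_add(unsigned x, unsigned y) {
        unsigned s = x + y;
        return (s < x || s > THRESHOLD) ? THRESHOLD : s;
    }

    /* loopcost = (((c1*n1) + (c2*n2) + k) * n3), clamped to THRESHOLD */
    unsigned loop_cost(unsigned c1, unsigned n1, unsigned c2,
                       unsigned n2, unsigned k, unsigned n3) {
        unsigned body = sat_add(sat_add(sat_mul(c1, n1),
                                        sat_mul(c2, n2)), k);
        return sat_mul(body, n3);
    }

The runtime test then reduces to a handful of integer operations, e.g. loop_cost(...) >= THRESHOLD.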
Runtime dependence test

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

becomes, with a combined runtime dependence and cost check:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0) then
        if (<deptest> && loop_cost > threshold) {
          _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                    @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
        } else
          main@OL@1(0, 0, (unsigned)n, 0)
      endif
      return main;
    }

    void main@OL@1( ......
        @CIV1 = @CIV1 + 1;
      } while ((unsigned)@CIV1 < (@UB - @LB));
      return;
    }

Work by Peng Zhao
Controlled parallelization

• Cost analysis selects big loops – but selection alone is not enough
• Parallel performance depends on both the amount of work and the number of processors used
• Using a large number of processors on a small loop causes huge degradations!!
[Chart: measured on a 64-way Power5 machine. Small is good!!!]
Controlled parallelization

• Introduce another runtime parameter, IPT (minimum iterations per thread)
• The IPT is passed to the SMP runtime
• The SMP runtime limits the number of threads working on the parallel loop based on IPT
• IPT = function(loop_cost, memory access info, …) – a sketch follows below
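A hedged sketch of what such a function could look like (the compiler's real heuristic is not spelled out here; MIN_USEFUL_WORK is an invented constant): give each thread at least enough iterations to amortize the cost of dispatching it.

    /* hypothetical IPT heuristic: each thread should get enough
       iterations to cover MIN_USEFUL_WORK cycles of loop-body cost */
    #define MIN_USEFUL_WORK 5000u

    unsigned compute_ipt(unsigned body_cost /* est. cycles per iteration */) {
        unsigned ipt = MIN_USEFUL_WORK / (body_cost ? body_cost : 1u);
        return ipt ? ipt : 1u;   /* at least one iteration per thread */
    }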
Controlled Parallelization

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ......
      }
    }

becomes, with the runtime parameter IPT:

    long main() {
      @_xlsmpEntry0 = _xlsmpInitializeRTE();
      if (n > 0) then
        if (loop_cost > threshold) {
          IPT = func(loop_cost)
          _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                    @_xlsmpEntry0, 0, 0, 0, 0, 0, IPT)
        } else
          main@OL@1(0, 0, (unsigned)n, 0)
      endif
      return main;
    }

    void main@OL@1( ......
        @CIV1 = @CIV1 + 1;
      } while ((unsigned)@CIV1 < (@UB - @LB));
      return;
    }
SMP parallel runtime

    _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n, ..., IPT) {
        threadsUsed = IterCount / IPT;
        if (threadsUsed > threadsAvailable)
            threadsUsed = threadsAvailable;
        .....
    }
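Worked with made-up numbers: for IterCount = 1000 and IPT = 100, threadsUsed = 1000 / 100 = 10, so even on the 64-way machine only 10 threads are dispatched; for IterCount = 64 with the same IPT the loop effectively runs on a single thread (the runtime presumably clamps threadsUsed to at least 1).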
Controlled parallelization for OpenMP

• Improves performance and scalability
• Allows fine-grained control at loop-level granularity
• Can be applied to OpenMP loops as well
• Adjusts the number of threads when the ENV variable OMP_DYNAMIC is turned on
• There are issues with threadprivate data
• Encouraging results on galgel
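For context, OMP_DYNAMIC is the standard OpenMP control that lets the implementation shrink a thread team, and the same permission can be granted from code. A minimal OpenMP/C illustration (the loop and its bounds are arbitrary):

    #include <omp.h>

    int main(void) {
        /* equivalent to setting OMP_DYNAMIC=true in the environment:
           the runtime may use fewer threads than requested */
        omp_set_dynamic(1);

        enum { N = 100 };
        static double b[N];
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = i * 0.5;
        return 0;
    }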
Future work

• Improve the cost analysis algorithm and fine-tune heuristics
• Implement interprocedural cost analysis
• Extend cost analysis and controlled parallelization to non-loop constructs in user-parallel code – for scalability
• Implement interprocedural dependence analysis