
Optimizing Array Accesses in High Productivity Languages Mackale Joyner Rice University


Presentation Transcript


  1. Optimizing Array Accesses in High Productivity Languages Mackale Joyner Rice University Zoran Budimlić Vivek Sarkar Rice University

  2. Introduction
  • Defense Advanced Research Projects Agency (DARPA) challenge: increase development productivity by a factor of 10 by 2010
  • Need new parallel languages: Chapel, Fortress, X10
  • Language abstractions needed to improve productivity
    • Data-parallel, task-parallel, data locality, high-level iteration, high-level arrays with general distributions, global address space, implicit processor communication and synchronization, etc.
  • Compiler optimizations needed to minimize performance penalties
  • Broad acceptance by the scientific community

  3. Motivation
  • Existing languages impose a tradeoff between high productivity and high performance
  • MPI with Fortran/C/C++
    • De facto standard in high-performance computing
    • Programmer must manage low-level details
  • HPF
    • Program sequentially with array distribution annotations
    • Challenging to tune performance; huge compiler effort required
  • Java
    • Object-oriented programming style
    • Not designed with high-performance computing in mind

  4. Why is This Work Relevant?
  • What is the scope?
    • Developing and evaluating productive, efficient high-level array computation inside loops
    • High-productivity parallel programming languages
  • Why is it important?
    • The scientific community has recognized the need for substantial improvements in development productivity
  • Why is it hard?
    • Challenging to demonstrate the benefits of work in new high-level languages
  • Why is it novel?
    • First to evaluate the performance impact of high-level array computations inside loops in high-productivity parallel language environments

  5. Contributions
  • Addresses performance issues for high-level loop iteration
  • Primarily targets sequential execution
    • Good sequential performance enables better parallel performance
  • Transforms points with object inlining
    • Points are frequently used in loops
    • Up to 5x improvement over non-optimized points
  • Work done in X10, applicable to other high-productivity languages
  • Demonstrates the benefit of dependent types

  6. Related Work
  • Object Inlining
    • For object-oriented languages: C++ (Dolby '97), Java (Budimlić '97)
    • Requires escape analysis (Cooper '85) and concrete type inference (Agesen and Hölzle '95)
    • Our object inlining for points (value-type objects) is less general but more effective: it can inline all of them
  • Semantic Inlining (Wu et al. '99)
    • Similar to object inlining, but less general
    • Incorporates knowledge about library semantics (e.g., complex numbers)
  • Scalar Replacement (Fink et al. '00)
    • Replaces accesses to memory with scalar temporaries
  • Point Inlining
    • Titanium inlines points (Yelick et al. '98), but the rank of points must be specified, and inlining happens only within loops
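
The scalar replacement technique cited above (Fink et al. '00) can be illustrated with a minimal sketch; the class and method names below are hypothetical stand-ins, not code from the paper:

```java
// Hypothetical illustration of scalar replacement: repeated loads and
// stores of cell.value inside the loop are replaced by a scalar temporary.
public class ScalarReplacementDemo {
    static class Cell { double value; }

    // Before: each iteration loads and stores the field through memory.
    static double sumBefore(Cell cell, double[] xs) {
        for (double x : xs) {
            cell.value = cell.value + x;   // load + store every iteration
        }
        return cell.value;
    }

    // After: the field lives in a scalar for the loop's duration.
    static double sumAfter(Cell cell, double[] xs) {
        double t = cell.value;             // one load
        for (double x : xs) {
            t += x;                        // scalar arithmetic only
        }
        cell.value = t;                    // one store
        return t;
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 2.0, 3.0};
        double r1 = sumBefore(new Cell(), xs);
        double r2 = sumAfter(new Cell(), xs);
        assert r1 == r2;
        System.out.println(r1); // both variants compute the same sum
    }
}
```

The transformed version touches memory twice instead of on every iteration, which is the same cost the paper's point inlining seeks to eliminate for point field accesses.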

  7. Outline
  • Introduction
  • Motivation
  • Related Work
  • Object Inlining of Points
  • Dependent Types
  • Experimental Results
  • Conclusion

  8. X10 Compiler Frontend
  [Figure: compiler phase diagram. Phases connected by AST edges include Parsing and Type Checking, X10 Canonical Form, No Bounds Check Analysis, X10 Array Rank Analysis, Point Rank Analysis, Transform X10 Array, Point Inlining, and Code Gen; dependent types are hand-coded.]

  9. X10 Point Language Abstraction
  • Points and regions are first-class value types
    • High-level array access
  • Points support an object-oriented iteration approach
    • Iteration is performed over general regions (similar to Chapel sequence iteration)
    • Scientists may develop rank-independent computations
    • Challenges the compiler to generate rank-specific code
  • Points introduce performance challenges
    • Creation and use
  • Compiler transformations can mitigate the performance challenge
    • Object inlining replaces the object with its fields, eliminating object memory and indirect field access costs

  10. X10 Iteration Comparison

  // Java version: code example from the Java Grande sor benchmark
  double[][] G = new double[M][N];
  …
  int Mm1 = M-1;
  int Nm1 = N-1;
  for (int p=0; p<num_iterations; p++) {
    for (int i=1; i<Mm1; i++) {
      double[] Gi = G[i];
      double[] Gim1 = G[i-1];
      double[] Gip1 = G[i+1];
      for (int j=1; j<Nm1; j++) {
        Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j-1] + Gi[j+1])
              + one_minus_omega * Gi[j];
      }
    }
  }

  // X10 version:
  region R = [0:M-1,0:N-1];
  double[.] G = new double[R];
  …
  region R_inner = [1:M-2,1:N-2]; // R_inner is a subregion of R
  region stencil = ... ;          // Set of points in stencil
  for (int p=0; p<num_iterations; p++) {
    for (point t : R_inner) {
      double sum = one_minus_omega * G[t];
      for (point s : stencil)
        sum += omega_factor * G[t+s];
      G[t] = sum;
    }
  }

  11. Iteration with Points Code Generation

  // Code example from the Java Grande sparsematmult benchmark, written in X10
  region arrayRegion1 = [0:datasizes_nz[size]-1];
  …
  // X10 for loop; code is rank independent
  for (point p : arrayRegion1) {
    row[p] = rowt[p];
    …
    col[p] = colt[p];
    …
    val[p] = valt[p];
  }

  12. Point Code Generation Without Inlining

  // Code example from the Java Grande sparsematmult benchmark, written in X10
  // X10 for loop body translated to Java
  for … {
    … // Includes code to allocate a new point object for p
    (row).set(((rowt).get(p)), p);
    …
    (col).set(((colt).get(p)), p);
    …
    (val).set(((valt).get(p)), p);
    …
  }

  13. Optimizing X10 Points
  • X10 points are value-type objects
    • Fields are immutable after initialization
  • Value types help compiler transformations such as object inlining
    • Eliminate alias analysis
    • Safely inline all points
  • Must perform type analysis to inline points
    • Programmer may omit rank (dimensionality)
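
The effect of object inlining on a loop over points can be sketched in plain Java; the Point class below is a hypothetical stand-in for X10's immutable value-type point, not the actual runtime class:

```java
// Hypothetical sketch of object inlining for a rank-2 point.
public class PointInliningDemo {
    // Stand-in for X10's value-type point: all fields immutable.
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Before inlining: one Point allocation per iteration, plus field
    // loads through the object reference.
    static long sumBefore(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, i + 1);
            sum += p.x + p.y;
        }
        return sum;
    }

    // After inlining: the object is replaced by its scalar fields, so no
    // allocation and no indirect field accesses remain. Because value-type
    // fields cannot change after initialization, the compiler can do this
    // for every point without alias analysis.
    static long sumAfter(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            int p_x = i;
            int p_y = i + 1;
            sum += p_x + p_y;
        }
        return sum;
    }

    public static void main(String[] args) {
        assert sumBefore(100) == sumAfter(100);
        System.out.println(sumAfter(100)); // → 10000
    }
}
```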

  14. Point Inlining Algorithm

  // init pass
  for each region r
    r's rank = TOP
  for each point p
    p's rank = TOP

  // gather rank information
  for each AST node n
    case (assignment)
      if (n.lhs is a point or region)
        n.lhs's rank = merge(n.lhs's rank, n.rhs's rank)
    case (X10 loop)
      point p = n.formal()
      region r = n.domain()
      p's rank = merge(p's rank, r's rank)

  // inline points
  for each AST node n
    if (get_rank(n) == CONSTANT) // inlineable point found
      switch (n)
        case (point declaration)     inline(n)
        case (point use)             inline(n)
        case (method call argument)  reconstruct_point(n)
        case (loop with formal point) convert_loop(n)

  // merge rank using a lattice
  merge(rank l, rank r) {
    return l ^ r, where:
      TOP ^ r = r
      BOTTOM ^ r = BOTTOM
      c1 ^ c2 = c1 if c1 == c2, else BOTTOM
  }
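
The merge lattice in the algorithm above can be made concrete with a small Java sketch; the integer encoding (TOP and BOTTOM as sentinels, any non-negative value as a known constant rank) is an assumption for illustration:

```java
// Sketch of the rank-merge lattice from the point inlining algorithm.
// Hypothetical encoding: TOP = rank not yet seen, BOTTOM = conflicting
// ranks, any non-negative int = a known constant rank.
public class RankLattice {
    static final int TOP = -1;
    static final int BOTTOM = -2;

    // merge(l, r) = l ^ r, following the slide's lattice rules.
    static int merge(int l, int r) {
        if (l == TOP) return r;                        // TOP ^ r = r
        if (r == TOP) return l;                        // symmetric case
        if (l == BOTTOM || r == BOTTOM) return BOTTOM; // BOTTOM absorbs
        return (l == r) ? l : BOTTOM;                  // c1 ^ c2
    }

    public static void main(String[] args) {
        assert merge(TOP, 2) == 2;        // first binding fixes the rank
        assert merge(2, 2) == 2;          // consistent uses stay constant
        assert merge(2, 3) == BOTTOM;     // conflict: not inlineable
        assert merge(BOTTOM, 2) == BOTTOM;
        System.out.println("lattice ok");
    }
}
```

A point whose rank reaches a constant in this lattice is exactly the "inlineable point found" case in the pass above; BOTTOM means the compiler must keep the point object.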

  15. X10 Iteration with Points Example

  // Code example from the Java Grande sparsematmult benchmark, written in X10
  region arrayRegion1 = [0:datasizes_nz[size]-1];
  // X10 for loop
  for (point p : arrayRegion1) {
    row[p] = rowt[p];
    …
    col[p] = colt[p];
    …
    val[p] = valt[p];
  }

  16. Point Code Generation With Inlining

  // Code example from the Java Grande sparsematmult benchmark, written in X10
  // X10 for loop body translated to Java; no point needs to be allocated
  for (int i=0; i <= datasizes_nz[size]-1; i+=1) {
    (row).set(((rowt).get(i)), i);
    …
    (col).set(((colt).get(i)), i);
    …
    (val).set(((valt).get(i)), i);
    …
  }

  17. Outline
  • Introduction
  • Motivation
  • Related Work
  • Object Inlining of Points
  • Dependent Types
  • Experimental Results
  • Conclusion

  18. Dependent Types for Efficient Array Access
  • Overhead still exists due to get/set methods
    • Needed because X10 does not specify the rank of the underlying array
    • This is what allows rank-independent code
  • Inlined point example (sparsematmult)
    • All arrays are one-dimensional
    • Generate code with direct array access
  • Requirements for efficient access in the one-dimensional case
    • rank == 1: array is one-dimensional
    • rect: array's region is dense (rectangular)
    • zeroBased: lower bound of the array's region starts at 0
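
What the dependent-type constraints buy can be sketched in plain Java; the GeneralArray class below is a hypothetical stand-in for a general X10 array view, not the actual DoubleArray_c runtime class:

```java
// Hypothetical sketch: the general get/set path versus the direct access
// that rank==1 && rect && zeroBased makes legal.
public class DependentTypeDemo {
    // General 1-D array view: an offset and stride support regions that
    // are neither dense nor zero-based.
    static class GeneralArray {
        final double[] backing;
        final int lo, stride;
        GeneralArray(double[] b, int lo, int stride) {
            this.backing = b; this.lo = lo; this.stride = stride;
        }
        double get(int i) { return backing[lo + i * stride]; } // indirect
        void set(double v, int i) { backing[lo + i * stride] = v; }
    }

    // General path: every element access goes through a method call and
    // an index computation.
    static void copyGeneral(GeneralArray dst, GeneralArray src, int n) {
        for (int i = 0; i < n; i++) dst.set(src.get(i), i);
    }

    // With rank==1 && rect && zeroBased, lo == 0 and stride == 1 are
    // known statically, so the compiler can index the backing array
    // directly and skip the get/set calls entirely.
    static void copyDirect(double[] dst, double[] src) {
        for (int i = 0; i < src.length; i++) dst[i] = src[i];
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3}, b = new double[3], c = new double[3];
        copyGeneral(new GeneralArray(b, 0, 1), new GeneralArray(a, 0, 1), 3);
        copyDirect(c, a);
        assert java.util.Arrays.equals(b, c); // same result, cheaper path
        System.out.println(java.util.Arrays.toString(c));
    }
}
```

This mirrors the transformation on the next slide, where the generic set/get calls collapse to `arr_[i]` accesses once the dependent type guarantees all three properties.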

  19. X10 Iteration with Points Example

  // Code example from the Java Grande sparsematmult benchmark, written in X10
  region arrayRegion1 = [0:datasizes_nz[size]-1];
  // X10 for loop
  for (point p : arrayRegion1) {
    row[p] = rowt[p];
    …
    col[p] = colt[p];
    …
    val[p] = valt[p];
  }

  20. X10 Point Inlining with Dependent Types

  // Code example from the Java Grande sparsematmult benchmark, written in X10
  // X10 array declarations with dependent type information
  double[:rank==1 && rect && zeroBased] row = …;

  // X10 for loop body translated to Java
  for (int i=0; i <= datasizes_nz[size]-1; i+=1) {
    ((DoubleArray_c) row).arr_[i] = ((DoubleArray_c) rowt).arr_[i];
    ((DoubleArray_c) col).arr_[i] = ((DoubleArray_c) colt).arr_[i];
    ((DoubleArray_c) val).arr_[i] = ((DoubleArray_c) valt).arr_[i];
  }

  21. Experimental Results
  • All experiments were run on a 1.25 GHz PowerPC G4 with 1.5 GB of memory
    • Sun Java HotSpot VM (build 1.5.0_07-87)
  • Java Grande benchmarks written in X10
    • Class A (smallest size) versions of the benchmarks
  • Three versions of each benchmark:
    • Original (no X10 arrays or points)
    • Original + X10 arrays + points
    • Original + X10 arrays + points + point inlining + dependent types

  22. Experimental Results
  [Chart: runtime performance in seconds, optimizing X10 points]

  23. Experimental Results
  [Chart: runtime performance in seconds, comparison of optimized points vs. original]

  24. Conclusion
  • This work addresses performance issues related to high-level array access inside loops
    • Efficient array accesses for high-level general arrays
    • Point inlining with type analysis
    • Dependent types
  • Focus: developing and evaluating highly productive yet efficient high-level array computation inside loops for high-productivity parallel languages
