Optimizing Loop Performance for Clustered VLIW Architectures

by Yi Qian (Texas Instruments)
Co-authors: Steve Carr (Michigan Technological University), Phil Sweany (Texas Instruments)

Clustered VLIW Architecture

Motivation
• Clustered VLIW architectures have been adopted to improve ILP while keeping the port requirements of the register files low.
• The compiler must
  • expose maximal parallelism,
  • maintain minimal communication overhead.
• High-level optimizations can improve loop performance on clustered VLIW machines.

Background
• Software pipelining (modulo scheduling)
  • Achieves ILP by overlapping the execution of different loop iterations.
• Initiation Interval (II)
  • ResII: constraints from the machine resources.
  • RecII: constraints from the dependence recurrences.
  • MinII = max(ResII, RecII)

Loop Transformations
• Scalar Replacement
  • replaces array references with scalar variables
  • improves register usage

Original loop:

    for (i=0; i<n; ++i)
      for (j=0; j<n; ++j)
        a[i] = a[i] + b[j] * x[i][j];

After scalar replacement:

    for (i=0; i<n; ++i) {
      t = a[i];
      for (j=0; j<n; ++j)
        t = t + b[j] * x[i][j];
      a[i] = t;
    }

Loop Transformations
• Unrolling
  • reduces inter-iteration overhead
  • enlarges the loop body
• Unroll-and-jam
  • balances the computation and memory-access requirements
  • improves uMinII (MinII / unrollAmount)

Example machine: 1 computational unit, 1 memory unit.

Original loop (uMinII = 4):

    for (i=1; i<=2*n; ++i)
      for (j=1; j<=n; ++j)
        a[i][j] = a[i][j] + b[j] * c[j];

Unroll-and-jammed loop (uMinII = 3):

    for (i=1; i<=2*n; i+=2)
      for (j=1; j<=n; ++j) {
        a[i][j] = a[i][j] + b[j] * c[j];
        a[i+1][j] = a[i+1][j] + b[j] * c[j];
      }

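Loop Transformations
• Unroll-and-jam/unrolling
  • generates intercluster parallelism

Without a loop-carried dependence, the unrolled copies run independently on separate clusters:

    for (i=0; i<2*n; ++i)
      a[i] = a[i] + 1;

    for (i=0; i<2*n; i+=2) {
      /* cluster 0 */ a[i] = a[i] + 1;
      /* cluster 1 */ a[i+1] = a[i+1] + 1;
    }

With a loop-carried dependence, each cluster reads a value produced on the other cluster:

    for (i=0; i<2*n; ++i)
      a[i] = a[i-1] + 1;

    for (i=0; i<2*n; i+=2) {
      /* cluster 0 */ a[i] = a[i-1] + 1;
      /* cluster 1 */ a[i+1] = a[i] + 1;
    }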

Loop Transformations
• Loop Alignment
  • removes loop-carried dependences
• Alignment conflicts
  • used to determine intercluster communication cost

Original loop:

    for (i=1; i<n; ++i) {
      a[i] = b[i] + c[i];
      x[i] = a[i-1] * q;
    }

Aligned loop:

    x[1] = a[0] * q;
    for (i=1; i<n-1; ++i) {
      a[i] = b[i] + c[i];
      x[i+1] = a[i] * q;
    }
    a[n-1] = b[n-1] + c[n-1];

Loops with alignment conflicts (dependence distances marked <2> and <1> on the slide):

    for (i=1; i<n; ++i) {
      a[i] = b[i] + q;
      c[i] = a[i-1] + a[i-2];
    }

    for (i=1; i<n; ++i)
      a[i] = a[i-1] + b[i];

Related Work
• Partitioning Problem
  • Ellis -- BUG
  • Capitanio et al. -- LC-VLIW
  • Nystrom et al. -- cluster assignment & software pipelining
  • Ozer et al. -- UAS
  • Sanchez et al. -- unified method
  • Hiser et al. -- RCG
  • Aleta et al. -- pseudo-scheduler

Related Work
• Loop Transformations
  • Unrolling/Unroll-and-jam
    • Callahan et al. -- pipelined architectures
    • Carr, Kennedy -- ILP
    • Carr, Guan -- linear algebra
    • Carr -- cache, software pipelining
    • Sarkar -- ILP, IC
    • Sanchez et al. -- clustered machines
    • Huang et al. -- clustered machines
    • Shin et al. -- superword register files
  • Scalar Replacement
    • Callahan et al. -- pipelined architectures
    • Carr, Kennedy -- general algorithm
    • Duesterwald -- data flow framework
  • Loop Alignment
    • Allen et al. -- shared-memory machines

Optimization Strategy
Source Code -> Unroll-and-jam/Unrolling -> Scalar Replacement -> Intermediate Code Generator -> Data-flow Optimization -> Value Cloning -> Register Partitioning -> Software Pipelining -> Assembly Code Generator -> Target Code

Our Method
• Picking loops to unroll
• Computing uMinII
• Computing register pressure (see paper)
• Determining unroll amounts

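Computing uMinII
• uRecII does not increase.
• uResII: [formula and its "where" clause not preserved in the transcript]

Picking Loops to Unroll
• [symbol lost]: the loop that carries the most dependences that are amenable to scalar replacement.
• [symbol lost]: the loop that contains the fewest alignment conflicts.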

Computing Communication Cost for Unrolled Loops
Intercluster copies
• single loop
  • invariant dep.
  • variant dep.
• multiple loops (see paper)
  • innermost loop is unrolled / innermost loop is not unrolled
  • invariant dep.
  • variant dep.

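Unrolling a Single Loop
• Variant Dep.
  • Figure: the dependence before unrolling and after unrolling, with the copies of the loop body spread over Cluster 1 ... Cluster ?
  • sinks of the new dependences: [expression lost]
  • copies per cluster: = # of e where [condition lost]
  • total costs: [expression lost]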

Unrolling a Single Loop
• Variant Dep.
• Special Cases
  • if [condition lost], then [result lost]
  • if [condition lost], then [result lost]

2 clusters:

    for (i=0; i<6*n; i+=6) {
      a[i] = a[i-2];
      a[i+1] = a[i-1];
      a[i+2] = a[i];
      a[i+3] = a[i+1];
      a[i+4] = a[i+2];
      a[i+5] = a[i+3];
    }

4 clusters:

    for (i=0; i<4*n; i+=4) {
      a[i] = a[i-4];
      a[i+1] = a[i-3];
      a[i+2] = a[i-2];
      a[i+3] = a[i-1];
    }

Unrolling a Single Loop
• Invariant Dep.
  • [count lost] references can be eliminated by scalar replacement.
  • [count lost] clusters need a copy operation.

Original loop:

    for (j=1; j<=4*n; ++j)
      for (i=1; i<=m; ++i)
        a[j][i] = a[j][i-1] + b[i];

After unroll-and-jam and scalar replacement:

    for (j=1; j<=4*n; j+=4)
      for (i=1; i<=m; ++i) {
        t = b[i];
        a[j][i] = a[j][i-1] + t;
        a[j+1][i] = a[j+1][i-1] + t;
        a[j+2][i] = a[j+2][i-1] + t;
        a[j+3][i] = a[j+3][i-1] + t;
      }

Determining Unroll Amounts
• Integer optimization problem
• Exhaustive search
• Heuristic method

Experimental Results
• Benchmarks
  • 119 DSP loops from TI's benchmark suite
  • DSP applications: FIR filter, correlation, Reed-Solomon decoding, lattice filter, LMS filter, etc.
• Architectures
  • URM, a simulated architecture
    • 8 functional units: 2 clusters, 4 clusters (1 copy unit)
    • 16 functional units: 2 clusters, 4 clusters (2 copy units)
  • TMS320C64x

URM Speedups: Transformed vs. Original • Unroll-and-jam/unrolling is applicable to 71 loops.

Our Algorithm vs. Fixed Unroll Amounts Using a fixed unroll amount may cause performance degradation when communication costs are dominant.

TMS320C64x Speedups: Unrolled vs. Original • C64x Results

Accuracy of Communication Cost Model
• Compare the number of predicted data transfers against the actual number of intercluster dependences found in the transformed loops.
• 2-cluster: 66 exact predictions
• 4-cluster: 64 exact predictions

Conclusion
• Proposed a communication cost model and an integer-optimization problem for predicting the performance of unrolled loops.
• 70%-90% of the 71 loops improve, with speedups of 1.4-1.7.
• High-level transformations should be an integral part of compilation for clustered VLIW machines.
