Optimizing Loop Performance for Clustered VLIW Architectures
by Yi Qian (Texas Instruments). Co-authors: Steve Carr (Michigan Technological University), Phil Sweany (Texas Instruments)
Motivation
• Clustered VLIW architectures have been adopted to improve ILP and keep the port requirement of the register files low.
• The compiler must
  • expose maximal parallelism,
  • maintain minimal communication overhead.
• High-level optimizations can improve loop performance on clustered VLIW machines.
Background
• Software Pipelining -- modulo scheduling
  • Achieve ILP by overlapping the execution of different loop iterations.
• Initiation Interval (II)
  • ResII -- constraint from the machine resources.
  • RecII -- constraint from the dependence recurrences.
  • MinII = max(ResII, RecII)
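The MinII bound can be made concrete with a small sketch. It assumes fully pipelined functional units (each operation occupies one unit for one cycle) and that the recurrence cycles have already been enumerated. The names `res_ii`, `rec_ii`, and `min_ii` are illustrative and do not come from the talk.

```c
#include <stddef.h>

/* Ceiling of a / b for positive b. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* ResII: for each resource class k, ops[k] operations compete for units[k]
 * units; the busiest class bounds the initiation interval. */
int res_ii(const int *ops, const int *units, size_t nclasses) {
    int ii = 0;
    for (size_t k = 0; k < nclasses; ++k) {
        int need = ceil_div(ops[k], units[k]);
        if (need > ii) ii = need;
    }
    return ii;
}

/* RecII: for each dependence cycle k, the total latency around the cycle
 * divided by its total iteration distance (rounded up); take the maximum. */
int rec_ii(const int *latency, const int *distance, size_t ncycles) {
    int ii = 0;
    for (size_t k = 0; k < ncycles; ++k) {
        int need = ceil_div(latency[k], distance[k]);
        if (need > ii) ii = need;
    }
    return ii;
}

/* MinII = max(ResII, RecII). */
int min_ii(int res, int rec) { return res > rec ? res : rec; }
```

For example, 2 computational operations on one computational unit and 4 memory operations on one memory unit give ResII = 4; a single recurrence with latency 3 and distance 1 gives RecII = 3, so MinII = 4.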
Loop Transformations
• Scalar Replacement
  • replace array references with scalar variables.
  • improve register usage

original loop:
    for (i=0; i<n; ++i)
        for (j=0; j<n; ++j)
            a[i] = a[i] + b[j] * x[i][j];

after scalar replacement:
    for (i=0; i<n; ++i) {
        t = a[i];
        for (j=0; j<n; ++j)
            t = t + b[j] * x[i][j];
        a[i] = t;
    }
Loop Transformations
• Unrolling
  • reduce inter-iteration overhead
  • enlarge loop body size
• Unroll-and-jam
  • balance the computation and memory-access requirements
  • improve uMinII (MinII / unrollAmount)

Example (1 computational unit, 1 memory unit)

original loop (uMinII = 4):
    for (i=1; i<=2*n; ++i)
        for (j=1; j<=n; ++j)
            a[i][j] = a[i][j] + b[j] * c[j];

unroll-and-jammed loop (uMinII = 3):
    for (i=1; i<=2*n; i+=2)
        for (j=1; j<=n; ++j) {
            a[i][j] = a[i][j] + b[j] * c[j];
            a[i+1][j] = a[i+1][j] + b[j] * c[j];
        }
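The uMinII figures for this example can be reproduced with a short sketch. The operation counts are an assumption, not stated on the slide: each unrolled copy loads and stores its own element of `a`, while `b[j]`, `c[j]`, and their product are shared by all copies. The helper name `u_min_ii` is hypothetical.

```c
/* Ceiling of a / b for positive b. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* uMinII (MinII / unroll amount) for
 *     a[i][j] = a[i][j] + b[j] * c[j];
 * when the outer i loop is unrolled by u and jammed.
 * Assumed op counts: u loads and u stores of a, one load each of b[j] and
 * c[j], u additions, and one multiplication shared by all copies. */
double u_min_ii(int u, int mem_units, int comp_units) {
    int mem_ops  = 2 * u + 2;
    int comp_ops = u + 1;
    int res_mem  = ceil_div(mem_ops, mem_units);
    int res_comp = ceil_div(comp_ops, comp_units);
    int res = res_mem > res_comp ? res_mem : res_comp;
    /* The inner loop carries no recurrence, so MinII equals ResII here. */
    return (double)res / u;
}
```

With one unit of each kind, `u_min_ii(1, 1, 1)` is 4 and `u_min_ii(2, 1, 1)` is 3, matching the slide; under the same assumptions an unroll amount of 4 would give 2.5.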
Loop Transformations
• Unroll-and-jam/unrolling
  • generate intercluster parallelism

original loop:
    for (i=0; i<2*n; ++i)
        a[i] = a[i] + 1;

unrolled loop:
    for (i=0; i<2*n; i+=2) {
        /* cluster 0 */ a[i] = a[i] + 1;
        /* cluster 1 */ a[i+1] = a[i+1] + 1;
    }

original loop:
    for (i=0; i<2*n; ++i)
        a[i] = a[i-1] + 1;

unrolled loop:
    for (i=0; i<2*n; i+=2) {
        /* cluster 0 */ a[i] = a[i-1] + 1;
        /* cluster 1 */ a[i+1] = a[i] + 1;
    }
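The difference between the two unrolled loops is whether a value produced on one cluster is consumed on another. A minimal sketch that counts such crossings is given here, assuming unrolled copy `k` is placed on cluster `k % clusters` (a round-robin placement chosen for illustration; the talk's own cost formulas are not reproduced). The function name `cross_cluster_deps` is hypothetical.

```c
/* Count the dependences that cross clusters after unrolling
 *     a[i] = a[i-d] + 1;      (d >= 0, u > 0)
 * by u, with unrolled copy k assigned to cluster k % clusters.
 * Copy k reads the element written by copy (k - d) mod u. */
int cross_cluster_deps(int u, int d, int clusters) {
    int crossings = 0;
    for (int k = 0; k < u; ++k) {
        int producer = ((k - d) % u + u) % u;  /* non-negative modulo */
        if (producer % clusters != k % clusters)
            ++crossings;
    }
    return crossings;
}
```

For the first loop (distance 0, unrolled by 2 on 2 clusters) the count is 0; for the second (distance 1) both copies read a value produced on the other cluster, so the count is 2.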
Loop Transformations
• Loop Alignment
  • Remove loop-carried dependences

original loop:
    for (i=1; i<n; ++i) {
        a[i] = b[i] + c[i];
        x[i] = a[i-1] * q;
    }

aligned loop:
    x[1] = a[0] * q;
    for (i=1; i<n-1; ++i) {
        a[i] = b[i] + c[i];
        x[i+1] = a[i] * q;
    }
    a[n-1] = b[n-1] + c[n-1];

• Alignment conflicts
  • Used to determine intercluster communication cost

    for (i=1; i<n; ++i) {
        a[i] = b[i] + q;
        c[i] = a[i-1] + a[i-2];
    }

    for (i=1; i<n; ++i)
        a[i] = a[i-1] + b[i];

(dependence labels from the slide figure: <2> <1>)
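The alignment example can be checked by running both versions on the same data. The sketch below wraps the two loops from the slide in functions (the function names and the use of `int` arrays are choices made here for illustration) and assumes `n >= 2`, since the aligned version peels the first and last iterations.

```c
#include <assert.h>

/* Original loop: x[i] uses a[i-1], written one iteration earlier. */
void loop_original(int n, int *a, const int *b, const int *c, int *x, int q) {
    for (int i = 1; i < n; ++i) {
        a[i] = b[i] + c[i];
        x[i] = a[i-1] * q;
    }
}

/* Aligned loop: the use of a[i] is shifted into the iteration that
 * defines it, so the dependence is no longer loop-carried. */
void loop_aligned(int n, int *a, const int *b, const int *c, int *x, int q) {
    assert(n >= 2);                 /* peeling needs at least one iteration */
    x[1] = a[0] * q;                /* peeled first use */
    for (int i = 1; i < n - 1; ++i) {
        a[i] = b[i] + c[i];
        x[i+1] = a[i] * q;
    }
    a[n-1] = b[n-1] + c[n-1];       /* peeled last definition */
}
```

Calling both functions on identical copies of `a` and `x` should leave the arrays equal element by element.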
Related Work
• Partitioning Problem
  • Ellis -- BUG
  • Capitanio et al. -- LC-VLIW
  • Nystrom et al. -- cluster assignment & software pipelining
  • Ozer et al. -- UAS
  • Sanchez et al. -- unified method
  • Hiser et al. -- RCG
  • Aleta et al. -- pseudo-scheduler
• Loop Transformations
  • Scalar Replacement
    • Callahan et al. -- pipelined architectures
    • Carr, Kennedy -- general algorithm
    • Duesterwald -- data flow framework
  • Loop Alignment
    • Allen et al. -- shared-memory machines
  • Unrolling/Unroll-and-jam
    • Callahan et al. -- pipelined architectures
    • Carr, Kennedy -- ILP
    • Carr, Guan -- linear algebra
    • Carr -- cache, software pipelining
    • Sarkar -- ILP, IC
    • Sanchez et al. -- clustered machines
    • Huang et al. -- clustered machines
    • Shin et al. -- superword register files
Optimization Strategy
Source Code -> Unroll-and-jam/Unrolling -> Scalar Replacement -> Intermediate Code Generator -> Data-flow Optimization -> Value Cloning -> Register Partitioning -> Software Pipelining -> Assembly Code Generator -> Target Code
Our Method
• Picking loops to unroll
• Computing uMinII
• Computing register pressure (see paper)
• Determining unroll amounts
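Computing uMinII
• uRecII does not increase.
• uResII: [formula and the definitions after "where" are missing from the source]

Picking Loops to Unroll
• [loop symbol missing]: carries the most dependences that are amenable to scalar replacement.
• [loop symbol missing]: contains the fewest alignment conflicts.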
Computing Communication Cost for Unrolled Loops
Intercluster Copies
• single loop: invariant dep., variant dep.
• multiple loops (see paper): invariant dep., variant dep.
• cases: innermost loop is unrolled, innermost loop is not unrolled
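Unrolling a Single Loop
• Variant Dep.
  • sinks of the new dependences: [formula missing from the source]
  • copies per cluster: [formula missing from the source] = # of e where [condition missing from the source]
  • total costs: [formula missing from the source]
[Figure: dependences before unrolling and after unrolling, spread over Cluster 1 ... Cluster ?]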
Unrolling a Single Loop
• Variant Dep.
  • Special Cases
    • if [condition missing from the source], then [result missing from the source]
    • if [condition missing from the source], then [result missing from the source]

2 clusters:
    for (i=0; i<6*n; i+=6) {
        a[i] = a[i-2];
        a[i+1] = a[i-1];
        a[i+2] = a[i];
        a[i+3] = a[i+1];
        a[i+4] = a[i+2];
        a[i+5] = a[i+3];
    }

4 clusters:
    for (i=0; i<4*n; i+=4) {
        a[i] = a[i-4];
        a[i+1] = a[i-3];
        a[i+2] = a[i-2];
        a[i+3] = a[i-1];
    }
Unrolling a Single Loop
• Invariant Dep.
  • [count missing from the source] references can be eliminated by scalar replacement.
  • [count missing from the source] clusters need a copy operation.

original loop:
    for (j=1; j<=4*n; ++j)
        for (i=1; i<=m; ++i)
            a[j][i] = a[j][i-1] + b[i];

unroll-and-jammed loop, after scalar replacement of b[i]:
    for (j=1; j<=4*n; j+=4)
        for (i=1; i<=m; ++i) {
            t = b[i];
            a[j][i] = a[j][i-1] + t;
            a[j+1][i] = a[j+1][i-1] + t;
            a[j+2][i] = a[j+2][i-1] + t;
            a[j+3][i] = a[j+3][i-1] + t;
        }
Determining Unroll Amounts
• Integer optimization problem
• Exhaustive search
• Heuristic method
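One plausible shape for the exhaustive search is sketched below for a single loop: try every unroll amount up to a limit, discard those whose estimated register pressure exceeds the register file, and keep the one with the lowest estimated cost per original iteration. Everything here is an assumption made for illustration: the callback types, `best_unroll`, and the toy models `toy_cost` and `toy_regs` are hypothetical, and the toy cost ignores communication, which the talk's model does account for.

```c
/* Hypothetical model callbacks, both functions of the unroll amount u. */
typedef double (*cost_fn)(int u);  /* estimated cycles per original iteration */
typedef int (*regs_fn)(int u);     /* estimated register pressure */

/* Exhaustive search over u = 1..max_u; u = 1 (no unrolling) is the fallback. */
int best_unroll(int max_u, int max_regs, cost_fn cost, regs_fn regs) {
    int best = 1;
    double best_cost = cost(1);
    for (int u = 2; u <= max_u; ++u) {
        if (regs(u) > max_regs)
            continue;              /* too many live values: reject */
        double c = cost(u);
        if (c < best_cost) {
            best_cost = c;
            best = u;
        }
    }
    return best;
}

/* Toy models for demonstration only (not the talk's formulas). */
double toy_cost(int u) { return (2.0 * u + 2.0) / u; }
int toy_regs(int u) { return 2 * u + 2; }
```

With the toy models, `best_unroll(8, 10, toy_cost, toy_regs)` selects 4, because larger unroll amounts exceed the assumed 10-register budget. A heuristic method would replace the full scan with a cheaper selection rule.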
Experimental Results
• Benchmarks
  • 119 DSP loops from TI's benchmark suite
  • DSP applications: FIR filter, correlation, Reed-Solomon decoding, lattice filter, LMS filter, etc.
• Architectures
  • URM, a simulated architecture
    • 8 functional units -- 2 clusters, 4 clusters (1 copy unit)
    • 16 functional units -- 2 clusters, 4 clusters (2 copy units)
  • TMS320C64x
URM Speedups: Transformed vs. Original • Unroll-and-jam/unrolling is applicable to 71 loops.
Our Algorithm vs. Fixed Unroll Amounts Using a fixed unroll amount may cause performance degradation when communication costs are dominant.
TMS320C64x Speedups: Unrolled vs. Original • C64x Results
Accuracy of Communication Cost Model
• Compare the number of predicted data transfers against the actual number of intercluster dependences found in the transformed loops.
• 2-cluster: 66 exact predictions
• 4-cluster: 64 exact predictions
Conclusion
• Proposed a communication cost model and an integer-optimization problem for predicting the performance of unrolled loops.
• 70%-90% of the 71 applicable loops can be improved, with speedups of 1.4 to 1.7.
• High-level transformations should be an integral part of compilation for clustered VLIW machines.