
Optimizing Loop Performance for Clustered VLIW Architectures

by Yi Qian (Texas Instruments). Co-authors: Steve Carr (Michigan Technological University), Phil Sweany (Texas Instruments).


Presentation Transcript


  1. by Yi Qian (Texas Instruments) Co-authors: Steve Carr (Michigan Technological University) Phil Sweany (Texas Instruments) Optimizing Loop Performance for Clustered VLIW Architectures

  2. Clustered VLIW Architecture

  3. Motivation • Clustered VLIW architectures have been adopted to improve ILP and keep the port requirement of the register files low. • The compiler must • Expose maximal parallelism, • Maintain minimal communication overhead. • High-level optimizations can improve loop performance on clustered VLIW machines.

  4. Background • Software Pipelining -- modulo scheduling • Achieve ILP by overlapping execution of different loop iterations. • Initiation Interval (II) • ResII -- constraints from the machine resources. • RecII -- constraints from the dependence recurrences. • MinII = max(ResII, RecII)
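
The MinII bound on this slide can be sketched in a few lines of C. This is only an illustration under simplifying assumptions that are not in the slides: every functional unit is fully pipelined (one operation per unit per cycle), resources are grouped into classes, and each dependence recurrence is summarized by a total latency and an iteration distance. The names `res_ii`, `rec_ii`, and `min_ii` are hypothetical.

```c
#include <assert.h>

/* Ceiling of a / b for a >= 0 and b > 0. */
static int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* ResII: the busiest resource class bounds the II from below.
   ops[k]   = operations per iteration that need resource class k
   units[k] = number of units of class k (assumed fully pipelined) */
int res_ii(const int *ops, const int *units, int nclasses) {
    int ii = 1;
    for (int k = 0; k < nclasses; ++k) {
        int need = ceil_div(ops[k], units[k]);
        if (need > ii) ii = need;
    }
    return ii;
}

/* RecII: a recurrence with total latency lat[c] that spans dist[c]
   iterations requires II >= ceil(lat[c] / dist[c]). */
int rec_ii(const int *lat, const int *dist, int ncycles) {
    int ii = 1;
    for (int c = 0; c < ncycles; ++c) {
        int need = ceil_div(lat[c], dist[c]);
        if (need > ii) ii = need;
    }
    return ii;
}

/* MinII = max(ResII, RecII), as stated on the slide. */
int min_ii(int res, int rec) {
    return res > rec ? res : rec;
}
```

For example, a loop body with 4 memory operations and 2 arithmetic operations on a machine with one unit of each class has ResII = 4; if its only recurrence has latency 3 and distance 1, RecII = 3 and MinII = 4.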

  5. Loop Transformations • Scalar Replacement • replace array references with scalar variables. • improve register usage • original loop: for (i=0; i<n; ++i) for (j=0; j<n; ++j) a[i] = a[i] + b[j] * x[i][j]; • after scalar replacement: for (i=0; i<n; ++i) { t = a[i]; for (j=0; j<n; ++j) t = t + b[j] * x[i][j]; a[i] = t; }
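
The two loop nests on this slide can be written as runnable C to check that scalar replacement preserves the result. The sketch below is an illustration, not the authors' implementation: the function names are hypothetical, `x` is stored as a flat row-major array, and integer data is used so the comparison is exact. It also assumes `a` does not overlap `b` or `x`, which is what makes keeping `a[i]` in a scalar legal.

```c
#include <assert.h>

/* Original form: a[i] is loaded and stored on every inner iteration. */
void accumulate_original(int n, int *a, const int *b, const int *x) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            a[i] = a[i] + b[j] * x[i * n + j];
}

/* Scalar-replaced form: a[i] lives in t for the whole inner loop,
   so the inner loop performs no memory traffic on a. */
void accumulate_scalar_replaced(int n, int *a, const int *b, const int *x) {
    /* Cheap sanity check only; it does not detect partial overlap. */
    assert(a != b && a != x);
    for (int i = 0; i < n; ++i) {
        int t = a[i];
        for (int j = 0; j < n; ++j)
            t = t + b[j] * x[i * n + j];
        a[i] = t;
    }
}
```

Running both functions on identical inputs should leave identical contents in `a`; the second version removes one load and one store of `a[i]` from each inner iteration.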

  6. Loop Transformations • Unrolling • reduce inter-iteration overhead • enlarge loop body size • Unroll-and-jam • balance the computation and memory-access requirements • improve uMinII (MinII / unrollAmount) • Example (1 computational unit, 1 memory unit) • original loop (uMinII = 4): for (i=1; i<=2*n; ++i) for (j=1; j<=n; ++j) a[i][j] = a[i][j] + b[j] * c[j]; • unroll-and-jammed loop (uMinII = 3): for (i=1; i<=2*n; i+=2) for (j=1; j<=n; ++j) { a[i][j] = a[i][j] + b[j] * c[j]; a[i+1][j] = a[i+1][j] + b[j] * c[j]; }
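
The two uMinII values on this slide (4 and 3) can be reproduced by counting operations in each loop body. The sketch below is a rough illustration under assumptions not stated in the slides: only resource constraints are considered (these loops have no recurrence that limits the II), units are fully pipelined, and in the unroll-and-jammed body `b[j]`, `c[j]`, and their product are computed once and shared by both statements. The type `body_ops` and the function `u_min_ii` are hypothetical names.

```c
#include <assert.h>

/* Operation counts for one iteration of a (possibly unrolled) loop body. */
struct body_ops {
    int mem_ops;  /* loads + stores */
    int alu_ops;  /* arithmetic operations */
};

static int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* uMinII = MinII / unrollAmount, using the resource bound only. */
double u_min_ii(struct body_ops ops, int mem_units, int alu_units,
                int unroll_amount) {
    assert(unroll_amount >= 1);
    int mem = ceil_div(ops.mem_ops, mem_units);
    int alu = ceil_div(ops.alu_ops, alu_units);
    int min_ii = mem > alu ? mem : alu;
    return (double)min_ii / unroll_amount;
}
```

Under these assumptions the original body has 3 loads, 1 store, and 2 arithmetic operations, giving `u_min_ii((struct body_ops){4, 2}, 1, 1, 1)` equal to 4. The unroll-and-jammed body has 4 loads, 2 stores, and 3 arithmetic operations, giving `u_min_ii((struct body_ops){6, 3}, 1, 1, 2)` equal to 3.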

  7. Loop Transformations • Unroll-and-jam/unrolling • generate intercluster parallelism for (i=0; i<2*n; ++i) a[i] = a[i] + 1; for (i=0; i<2*n; i+=2) { /* cluster 0 */ a[i] = a[i] + 1; /* cluster 1 */ a[i+1] = a[i+1] + 1; } for (i=0; i<2*n; ++i) a[i] = a[i-1] + 1; for (i=0; i<2*n; i+=2) { /* cluster 0 */ a[i] = a[i-1] + 1; /* cluster 1 */ a[i+1] = a[i] + 1; }

  8. Loop Transformations • Loop Alignment • Remove loop-carried dependences • original loop: for (i=1; i<n; ++i) { a[i] = b[i] + c[i]; x[i] = a[i-1] * q; } • aligned loop: x[1] = a[0] * q; for (i=1; i<n-1; ++i) { a[i] = b[i] + c[i]; x[i+1] = a[i] * q; } a[n-1] = b[n-1] + c[n-1]; • Alignment conflicts • Used to determine intercluster communication cost • for (i=1; i<n; ++i) { a[i] = b[i] + q; c[i] = a[i-1] + a[i-2]; } (dependence distances <1> and <2>) • for (i=1; i<n; ++i) a[i] = a[i-1] + b[i];
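
The first pair of loops on this slide (the one that writes `x`) can be run side by side to confirm that alignment keeps the results unchanged. The sketch below is a direct transcription into C with hypothetical function names; the only addition is a guard that the trip count is large enough for the peeled statements to stay in bounds.

```c
#include <assert.h>

/* Original loop: x[i] reads a[i-1], which the previous iteration wrote
   (a loop-carried dependence of distance 1). */
void loop_original(int n, int *a, int *x, const int *b, const int *c, int q) {
    for (int i = 1; i < n; ++i) {
        a[i] = b[i] + c[i];
        x[i] = a[i - 1] * q;
    }
}

/* Aligned loop: the x statement is shifted by one iteration, so it now
   reads the a[i] produced in the same iteration. The first x statement
   and the last a statement are peeled off. */
void loop_aligned(int n, int *a, int *x, const int *b, const int *c, int q) {
    assert(n >= 2);  /* the peeled statements touch x[1] and a[n-1] */
    x[1] = a[0] * q;
    for (int i = 1; i < n - 1; ++i) {
        a[i] = b[i] + c[i];
        x[i + 1] = a[i] * q;
    }
    a[n - 1] = b[n - 1] + c[n - 1];
}
```

After alignment the dependence from the `a` statement to the `x` statement is loop-independent, so separate iterations no longer need to exchange that value.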

  9. Related Work • Partitioning Problem • Ellis -- BUG • Capitanio et al. -- LC-VLIW • Nystrom et al. -- cluster assignment & software pipelining • Ozer et al. -- UAS • Sanchez et al. -- unified method • Hiser et al. -- RCG • Aleta et al. -- pseudo-scheduler

  10. Loop Transformations • Scalar Replacement • Callahan et al. -- pipelined architectures • Carr, Kennedy -- general algorithm • Duesterwald -- data-flow framework • Loop Alignment • Allen et al. -- shared-memory machines • Unrolling/Unroll-and-jam • Callahan et al. -- pipelined architectures • Carr, Kennedy -- ILP • Carr, Guan -- linear algebra • Carr -- cache, software pipelining • Sarkar -- ILP, IC • Sanchez et al. -- clustered machines • Huang et al. -- clustered machines • Shin et al. -- superword register files

  11. Optimization Strategy • Source Code -> Unroll-and-jam/Unrolling -> Scalar Replacement -> Intermediate Code Generator -> Data-flow Optimization -> Value Cloning -> Register Partitioning -> Software Pipelining -> Assembly Code Generator -> Target Code

  12. Our Method • Picking loops to unroll • Computing uMinII • Computing register pressure (see paper) • Determining unroll amounts

  13. Computing uMinII • uRecII does not increase • uResII = [formula missing from transcript] • Picking Loops to Unroll • [symbol missing]: carries the most dependences that are amenable to scalar replacement. • [symbol missing]: contains the fewest alignment conflicts.

  14. Computing Communication Cost for Unrolled Loops • Intercluster copies, by case: single loop vs. multiple loops (see paper); invariant dep. vs. variant dep.; innermost loop is unrolled vs. innermost loop is not unrolled

  15. Unrolling a Single Loop • Variant Dep. • [Figure: dependences before unrolling and after unrolling, with the unrolled copies assigned to clusters] • sinks of the new dependences, copies per cluster, total costs: [formulas missing from transcript]

  16. Unrolling a Single Loop • Variant Dep. • Special Cases if , then if , then 2 clusters: 4 clusters: for (i=0; i<6*n; i+=6) { a[i] = a[i-2]; a[i+1] = a[i-1]; a[i+2] = a[i]; a[i+3] = a[i+1]; a[i+4] = a[i+2]; a[i+5] = a[i+3]; } for (i=0; i<4*n; i+=4) { a[i] = a[i-4]; a[i+1] = a[i-3]; a[i+2] = a[i-2]; a[i+3] = a[i-1]; }
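
To get a feel for why the two unrolled loops on this slide behave differently, one can count how many unrolled statements read a value produced on another cluster. The sketch below is not the paper's cost formula (the formulas did not survive in this transcript); it is a hypothetical illustration. It assumes a single dependence of distance `d`, an unroll amount `u` that is a multiple of the cluster count, and one of two made-up assignment policies: round-robin (copy k goes to cluster k mod clusters) or block (consecutive copies share a cluster).

```c
#include <assert.h>

/* Hypothetical policies for placing unrolled copies on clusters. */
enum assignment { ROUND_ROBIN, BLOCK };

/* Cluster that holds unrolled copy k (0 <= k < u). */
static int cluster_of(int k, int u, int clusters, enum assignment policy) {
    if (policy == ROUND_ROBIN)
        return k % clusters;
    return k / (u / clusters);  /* BLOCK: u must be a multiple of clusters */
}

/* Count the copies whose input is produced on a different cluster.
   Copy k reads the value written d original iterations earlier,
   which is written by copy (k - d) mod u. */
int cross_cluster_deps(int u, int d, int clusters, enum assignment policy) {
    assert(u > 0 && d >= 0 && clusters > 0 && u % clusters == 0);
    int count = 0;
    for (int k = 0; k < u; ++k) {
        int src = ((k - d) % u + u) % u;
        if (cluster_of(k, u, clusters, policy) !=
            cluster_of(src, u, clusters, policy))
            ++count;
    }
    return count;
}
```

With distance 4 and unroll amount 4 (the second loop on the slide), every copy depends on itself in the previous unrolled iteration, so the count is 0 under either policy. With distance 2 and unroll amount 6 on 2 clusters, round-robin placement gives 0 crossings while block placement gives 4, which suggests how strongly the cost depends on the interaction between distance, unroll amount, and placement.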

  17. Unrolling a Single Loop • Invariant Dep. • references can be eliminated by scalar replacement. • clusters need a copy operation. • original loop: for (j=1; j<=4*n; ++j) for (i=1; i<=m; ++i) a[j][i] = a[j][i-1] + b[i]; • unroll-and-jammed by 4 with scalar replacement: for (j=1; j<=4*n; j+=4) for (i=1; i<=m; ++i) { t = b[i]; a[j][i] = a[j][i-1] + t; a[j+1][i] = a[j+1][i-1] + t; a[j+2][i] = a[j+2][i-1] + t; a[j+3][i] = a[j+3][i-1] + t; }

  18. Determining Unroll Amounts • Integer optimization problem • Exhaustive search • Heuristic method
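
The exhaustive-search option on this slide can be sketched as a loop over candidate unroll amounts. Everything specific below is an assumption made for illustration: the slides do not give the objective function, so the sketch simply minimizes estimated uMinII plus an estimated per-iteration copy cost, and rejects candidates whose estimated register pressure exceeds a limit. `struct estimate`, `best_unroll_amount`, and the toy model `toy_estimate` are hypothetical.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical per-candidate estimates supplied by the compiler's models. */
struct estimate {
    double u_min_ii;   /* MinII / unrollAmount */
    double copy_cost;  /* intercluster copy cost per original iteration */
    int registers;     /* estimated register pressure */
};

typedef struct estimate (*estimator_fn)(int unroll_amount);

/* Try every unroll amount in 1..max_unroll and keep the cheapest one
   that fits in the register budget. Returns 1 if nothing fits. */
int best_unroll_amount(estimator_fn estimate, int max_unroll,
                       int register_limit) {
    assert(estimate != 0 && max_unroll >= 1);
    int best = 1;
    double best_cost = DBL_MAX;
    for (int u = 1; u <= max_unroll; ++u) {
        struct estimate e = estimate(u);
        if (e.registers > register_limit)
            continue;  /* would spill: not a candidate */
        double cost = e.u_min_ii + e.copy_cost;
        if (cost < best_cost) {
            best_cost = cost;
            best = u;
        }
    }
    return best;
}

/* Toy model, invented for this example: uMinII shrinks with unrolling
   while copy cost and register pressure grow. */
struct estimate toy_estimate(int u) {
    struct estimate e;
    e.u_min_ii = 2.0 + 4.0 / u;
    e.copy_cost = 0.5 * (u - 1);
    e.registers = 4 * u;
    return e;
}
```

With the toy model, `best_unroll_amount(toy_estimate, 8, 64)` picks 3, and tightening the register limit to 8 forces the choice down to 2. A heuristic method, the other option on the slide, would avoid evaluating every candidate.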

  19. Experimental Results • Benchmarks • 119 DSP loops from TI's benchmark suite • DSP applications: FIR filter, correlation, Reed-Solomon decoding, lattice filter, LMS filter, etc. • Architectures • URM, a simulated architecture • 8 functional units - 2 clusters, 4 clusters (1 copy unit) • 16 functional units - 2 clusters, 4 clusters (2 copy units) • TMS320C64x

  20. URM Speedups: Transformed vs. Original • Unroll-and-jam/unrolling is applicable to 71 loops.

  21. Our Algorithm vs. Fixed Unroll Amounts Using a fixed unroll amount may cause performance degradation when communication costs are dominant.

  22. TMS320C64x Speedups: Unrolled vs. Original • C64x Results

  23. Accuracy of Communication Cost Model • Compare the number of predicted data transfers against the actual number of intercluster dependences found in the transformed loops • 2-cluster: 66 exact predictions • 4-cluster: 64 exact predictions

  24. Conclusion • Proposed a communication cost model and an integer-optimization problem for predicting the performance of unrolled loops. • 70%-90% of the 71 loops are improved, with speedups of 1.4-1.7. • High-level transformations should be an integral part of compilation for clustered VLIW machines.
