Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators

Manjunath Kudlur, Kevin Fan, Michael Chu, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan

Motivation
• Custom application accelerators (ASICs/ASIPs) require careful data memory system design
• Large volumes of data accessed at high bandwidth
• Distributed local memories (scratchpads)
  • Achieve high bandwidth through parallel access
  • Achieve low latency by placing data near computation
• Custom memory design is complex
  • Multiple considerations: bandwidth, size requirements, data distribution
  • Decentralized datapath: another monkey wrench

Background – Our System
• Synthesis of non-programmable accelerators
• System similar to PICO (Program-In Chip-Out)
• Input is a “hot” loop nest expressed in C
• Throughput-directed synthesis
  • Required throughput expressed as II (initiation interval)
  • Innermost loop is modulo scheduled
  • Datapath derived directly from the schedule
  • FU allocation to meet II
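The "FU allocation to meet II" step can be illustrated with the usual resource bound from modulo scheduling: if a fully pipelined unit issues one operation per cycle, a loop body with n operations of one kind needs at least ⌈n / II⌉ units of that kind. The sketch below assumes that model; the operation mix and the function name are hypothetical and not taken from the slides.

```python
import math

def min_units_for_ii(op_counts, ii):
    """Lower bound on FUs per kind so all ops can issue within II cycles.

    Assumes every FU is fully pipelined and issues one op per cycle.
    """
    if ii < 1:
        raise ValueError("II must be a positive integer")
    return {kind: math.ceil(count / ii) for kind, count in op_counts.items()}

# Hypothetical loop body: 7 adds, 3 multiplies, 4 memory operations.
ops = {"ADD": 7, "MUL": 3, "MEM": 4}
print(min_units_for_ii(ops, ii=2))  # {'ADD': 4, 'MUL': 2, 'MEM': 2}
```

Relaxing the target II lowers the bound, which is the cost/throughput trade-off that throughput-directed synthesis exposes.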

Background – Multicluster Datapath
• FUs divided into clusters
• Intercluster communication through global bus
• Reduced wire lengths, reduced porting on register file structures
• Increased compiler complexity
[Figure: a C program mapped to Cluster 1 and Cluster 2, joined by an interconnection network; each cluster contains register FIFOs, FUs, MEM units, and local memories]

Background – Local Memories
• SRAMs connected to MEM units in clusters
• Data structures assigned to a single SRAM
  • Can be whole arrays or parts of an array
  • Currently only whole arrays are considered
• Multiple arrays can be combined in a single SRAM
[Figure: Cluster 1 with register FIFOs, FUs, MEM units, and local memories]

Problem Statement and Approach
• “Given a set of arrays, their sizes and bitwidths, the corresponding loop nest, the number of clusters and the target II, find an allocation of arrays to SRAMs and allocation of SRAMs to clusters such that overall cost is minimized”
• Phase-ordered approach which handles two subproblems separately
  • Memory synthesis
  • Operation partitioning

Combining Arrays
• Combining arrays into a single SRAM reduces hardware cost (row decoders, sense amps)
• Issues with combining:
  • Consider two arrays with (Bitwidth, Size) = (B1, S1) and (B2, S2)
  • Suppose A1 and A2 are the numbers of static accesses in the loop
  • Combined SRAM: bitwidth = MAX(B1, B2), size = S1 + S2, number of ports = ⌈(A1 + A2) / II⌉
[Figure: arrays X (B1 wide, S1 deep) and Y (B2 wide, S2 deep) stacked in one SRAM of depth S1 + S2]
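A small worked example may help with the combining rule. The sketch assumes the combined SRAM is as wide as the widest array, as deep as the sum of the sizes, and needs enough ports to serve all static accesses within II cycles (rounded up). The `ArrayInfo` type and the numbers are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class ArrayInfo:
    name: str
    bitwidth: int   # bits per element
    size: int       # number of elements
    accesses: int   # static accesses in the loop body

def combined_sram(arrays, ii):
    """Return (bitwidth, size, ports) of one SRAM holding all given arrays."""
    bitwidth = max(a.bitwidth for a in arrays)               # widest element wins
    size = sum(a.size for a in arrays)                       # arrays are stacked
    ports = math.ceil(sum(a.accesses for a in arrays) / ii)  # accesses share II cycles
    return bitwidth, size, ports

a = ArrayInfo("a", bitwidth=8, size=256, accesses=2)
b = ArrayInfo("b", bitwidth=16, size=128, accesses=1)
print(combined_sram([a, b], ii=2))  # (16, 384, 2)
```

In this example every element of `a` occupies a 16-bit word, so combining arrays of unequal width trades shared peripheral logic against padded storage.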

Combining Arrays
• Multicluster issues
  • Can cause imbalance in operation distribution
    • All load/store operations for the combined arrays must be assigned to the same cluster
  • Can increase intercluster traffic
    • Address calculations and load-uses would cause extra intercluster moves
[Figure: an address add (R1 + R2), a load (LD) and its use (USE), with an intercluster (IC) move inserted]

Solution 1
• Formulate the problem as an integer program
• A binary decision variable X(i,j,k,l) denotes assignment of array ‘i’ to local memory ‘j’ with ‘k’ ports on cluster ‘l’
• Constraints ensure that intercluster move bandwidth is not violated
• Perform operation partitioning and modulo scheduling after memory synthesis
[Figure: flow from input arrays (A, B, C, D) and target II through memory synthesis, operation partitioning, and modulo scheduling; arrays placed on Cluster 1 and Cluster 2]
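To make the decision variable concrete, the sketch below decodes a candidate solution given as the set of index tuples (i, j, k, l) with X(i,j,k,l) = 1. It checks two consistency conditions that are assumed here rather than quoted from the slides: every array is assigned exactly once, and all arrays sharing a local memory agree on its port count and cluster. It does not model the intercluster bandwidth constraints, and the function name is hypothetical.

```python
def decode_assignment(ones, arrays):
    """Turn the set of (i, j, k, l) with X(i,j,k,l) = 1 into a placement map.

    Returns {array: (memory, ports, cluster)} or raises ValueError if the
    candidate solution is inconsistent.
    """
    placement = {}
    for i, j, k, l in ones:
        if i in placement:
            raise ValueError(f"array {i!r} is assigned more than once")
        placement[i] = (j, k, l)

    missing = set(arrays) - set(placement)
    if missing:
        raise ValueError(f"unassigned arrays: {sorted(missing)}")

    # Assumed constraint: a memory has one port count and sits on one cluster.
    config = {}
    for i, (j, k, l) in placement.items():
        if config.setdefault(j, (k, l)) != (k, l):
            raise ValueError(f"memory {j!r} has conflicting ports or cluster")
    return placement

solution = {("A", 0, 1, 1), ("B", 0, 1, 1), ("C", 1, 2, 2)}
print(decode_assignment(solution, ["A", "B", "C"]))
```

Here arrays A and B share memory 0 (one port, cluster 1), while C gets memory 1 (two ports, cluster 2).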

Experiments
• System implemented in the Trimaran framework
• Memory costs obtained from ARTISAN SRAM generator scripts
• lp_solve used to solve the integer programs
• A set of DSP kernels evaluated
  • Loop oriented
  • Many arrays accessed in the loops

Results for Solution 1
[Figure: four plots, one per benchmark (huffman, channel, LU, lyapunov); x-axis: target initiation interval (II)]

Achieved II in Solution 1
• Solution 1 eagerly combines arrays
• Potential increase in intercluster moves due to imbalance in distribution of LD/ST ops
• Achieved II is poor due to IC moves in recurrence cycles
[Figure: best II achieved]

Solution 2
• Phase-ordered approach
• Two highly intertwined decisions: allocation of local memories and partitioning of operations
• Three phases:
  • Pre-partitioning
  • Memory synthesis
  • Operation partitioning

Pre-Partitioning
• Performance-oriented operation partitioning
• Memory operations accessing the same arrays are bound to the same cluster
• Consequently, arrays are bound to clusters
[Figure: operations 1 to 14 pre-partitioned across Cluster 1 and Cluster 2, binding arrays A, B, C, D, E to clusters]
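The consequence stated above (arrays become bound to clusters) can be sketched as a simple derivation from a partitioning result. The operation ids, the two input maps, and the function name are hypothetical; the partitioner itself is not shown.

```python
def bind_arrays_to_clusters(mem_op_array, op_cluster):
    """Derive the array-to-cluster binding implied by an operation partition.

    mem_op_array: maps each memory operation id to the array it accesses.
    op_cluster:   maps each operation id to its cluster (assumed to come
                  from the pre-partitioning step).
    """
    binding = {}
    for op, array in mem_op_array.items():
        cluster = op_cluster[op]
        # The first op seen for an array fixes its cluster; later ops must agree.
        if binding.setdefault(array, cluster) != cluster:
            raise ValueError(
                f"ops on array {array!r} span clusters {binding[array]} and {cluster}"
            )
    return binding

mem_op_array = {1: "A", 4: "A", 6: "B", 9: "C", 12: "B"}
op_cluster = {1: 1, 4: 1, 6: 2, 9: 1, 12: 2}
print(bind_arrays_to_clusters(mem_op_array, op_cluster))  # {'A': 1, 'B': 2, 'C': 1}
```

A partition that splits the accesses to one array across clusters is rejected, which mirrors the binding rule on this slide.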

Memory Synthesis
• ILP used to optimally combine arrays within clusters
• Pre-partitioning effectively disables combining of arrays that cause operation imbalance
[Figure: arrays A, B, C, D, E on Cluster 1 and Cluster 2 before and after memory synthesis combines them into SRAMs]
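For a handful of arrays, the within-cluster combining decision can be explored by brute force instead of an ILP. The sketch below enumerates every grouping of one cluster's arrays and keeps the cheapest under a toy cost model (a fixed per-SRAM overhead plus bitwidth × size × ports). Both the cost model and the example numbers are assumptions for illustration; they are not the ARTISAN-derived costs or the ILP used in the talk.

```python
import math

def partitions(items):
    """Yield every way to split a list into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing group in turn...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or give it a group of its own.
        yield [[first]] + part

def sram_cost(group, ii, overhead=2000):
    """Toy cost of one SRAM holding `group` of (name, bitwidth, size, accesses)."""
    bitwidth = max(b for _, b, _, _ in group)
    size = sum(s for _, _, s, _ in group)
    ports = math.ceil(sum(a for _, _, _, a in group) / ii)
    return overhead + bitwidth * size * ports

def best_combination(arrays, ii):
    """Return the grouping with the lowest total toy cost."""
    return min(partitions(arrays), key=lambda p: sum(sram_cost(g, ii) for g in p))

arrays = [("A", 8, 256, 1), ("B", 8, 256, 1), ("C", 32, 64, 2)]
for group in best_combination(arrays, ii=2):
    print(sorted(name for name, _, _, _ in group))
```

With these numbers the two narrow arrays A and B are merged and the wide array C stays separate. The number of groupings grows very quickly with the array count, so exhaustive search is only practical for tiny inputs.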

Results for Solution 2
[Figure: four plots, one per benchmark (channel, huffman, LU, lyapunov); x-axis: target initiation interval (II)]

Achieved II for Solution 2
• Cost of synthesized memory not substantially different
• But achieved II is 36% better with pre-partitioning
[Figure: best II achieved]

Conclusion
• An approach for synthesizing custom local memories
  • ILP-based optimal solution
  • Works for clustered datapaths
• Pre-partitioning to improve achieved throughput, with minimal impact on cost
• For more information
  • http://cccp.eecs.umich.edu

