Memory Organization and Data Layout for Instruction Set Extensions with Architecturally Visible Storage
Panagiotis Athanasopoulos (EPFL), Philip Brisk (UCR), Yusuf Leblebici (EPFL), Paolo Ienne (EPFL)
École Polytechnique Fédérale de Lausanne (EPFL) and University of California, Riverside (UCR)
First_name.Second_name@{epfl.ch|ucr.edu}
Motivation
• Classic challenge
  • Increase performance while meeting area/cost constraints
• Typical solutions
  • Customizable and extensible processors
  • Instruction set extensions (ISE)
  • Custom functional units (CFU)
  • Architecturally visible storage (AVS)
Typical embedded application extract
• 2D DCT on an 8x8 matrix
Pseudo:
dct {
    for (int i = 0; i < num_of_rows; i++) {
        ... 1D DCT slice ...
    }
    for (int j = 0; j < num_of_columns; j++) {
        ... 1D DCT slice ...
    }
}
Typical embedded application extract
• 2D DCT on an 8x8 matrix
for (int i = 0; i < num_of_rows; i++) {
    ... 1D DCT slice ...
}
• Row accesses: one 1D DCT slice processes all elements of row i (data accessed at row i, column j, for j = 0..7)
Typical embedded application extract
• 2D DCT on an 8x8 matrix
for (int j = 0; j < num_of_columns; j++) {
    ... 1D DCT slice ...
}
• Column accesses: one 1D DCT slice processes all elements of column j (data accessed at row i, column j, for i = 0..7)
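As a point of reference for the access pattern above, here is a minimal C sketch of the separable 2D transform: a row pass followed by a column pass over an 8x8 matrix. The function dct_1d_slice and its stride-based interface are illustrative assumptions, not the implementation behind the slides.

    #include <stddef.h>

    #define N 8

    /* Apply one 1D DCT slice to the 8 elements at base[0], base[stride],
     * ..., base[7*stride]. Body omitted: an 8-point DCT would go here. */
    static void dct_1d_slice(int *base, size_t stride)
    {
        (void)base;
        (void)stride;
    }

    void dct_2d(int m[N][N])
    {
        /* Row pass: row i occupies m[i][0..7], i.e. consecutive words (stride 1). */
        for (int i = 0; i < N; i++)
            dct_1d_slice(&m[i][0], 1);

        /* Column pass: column j occupies m[0..7][j], i.e. every N-th word (stride N). */
        for (int j = 0; j < N; j++)
            dct_1d_slice(&m[0][j], N);
    }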
Speeding up the execution
• ISE
  • Extend the basic processor instruction set with a new instruction: DCT_instr
• CFU
  • Assign the execution of the new instruction to a dedicated unit
Reasonable ISE/CFU implementation
Pseudo:
dct {
    DCT_instr(0, 1, 2, ..., 7)
    DCT_instr(8, 9, 10, ..., 15)
    ...
    DCT_instr(56, 57, 58, ..., 63)
    DCT_instr(0, 8, 16, ..., 56)
    DCT_instr(1, 9, 17, ..., 57)
    ...
    DCT_instr(7, 15, 23, ..., 63)
}
16 executions
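For clarity, a small C program (my own illustration, not from the slides) that enumerates the operand indices of the 16 DCT_instr invocations above, using the row-major element index 8*i + j:

    #include <stdio.h>

    int main(void)
    {
        /* 8 row invocations: row i uses elements 8*i + 0 ... 8*i + 7. */
        for (int i = 0; i < 8; i++) {
            printf("DCT_instr(");
            for (int j = 0; j < 8; j++)
                printf(j ? ",%d" : "%d", 8 * i + j);
            printf(")\n");
        }
        /* 8 column invocations: column j uses elements j, j+8, ..., j+56. */
        for (int j = 0; j < 8; j++) {
            printf("DCT_instr(");
            for (int i = 0; i < 8; i++)
                printf(i ? ",%d" : "%d", 8 * i + j);
            printf(")\n");
        }
        return 0;
    }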
Speeding up the execution
• Memory bandwidth
  • Usually limited to 2 read/write ports
• Caches, scratchpads, architecturally visible storage
  • Area grows roughly quadratically with the number of ports [ref]
  • Increased latency: the new instruction cannot execute until all of its data is available
Speeding up the execution
• Ideally
  • 8 read and 8 write ports
  • Minimum area
  • Full bandwidth utilization
• Can we achieve this?
Speeding up the execution
• Minimum area
  • What is the minimum memory organization for 64 elements with 8 read and 8 write ports?
  • 8 individual single-port memory arrays of 8 words each (flip-flop based)
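A sketch of how such an organization could be modeled in C: 8 single-port banks of 8 words each, where at most one access per bank is allowed per cycle. The names (avs_t, avs_read, ...) are assumptions for illustration, not the hardware description from the paper.

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_BANKS      8
    #define WORDS_PER_BANK 8

    typedef struct {
        int32_t word[WORDS_PER_BANK];
        bool    port_used;          /* single port: at most one access per cycle */
    } bank_t;

    typedef struct {
        bank_t bank[NUM_BANKS];
    } avs_t;

    /* Try to read word 'addr' of bank 'b' in the current cycle.
     * Returns false on a bank conflict (port already used this cycle). */
    static bool avs_read(avs_t *m, int b, int addr, int32_t *out)
    {
        assert(b >= 0 && b < NUM_BANKS && addr >= 0 && addr < WORDS_PER_BANK);
        if (m->bank[b].port_used)
            return false;
        m->bank[b].port_used = true;
        *out = m->bank[b].word[addr];
        return true;
    }

    /* Release all ports at the start of a new cycle. */
    static void avs_new_cycle(avs_t *m)
    {
        for (int b = 0; b < NUM_BANKS; b++)
            m->bank[b].port_used = false;
    }

    int main(void)
    {
        avs_t m = {0};
        int32_t x;
        avs_new_cycle(&m);
        bool first  = avs_read(&m, 0, 3, &x);   /* succeeds             */
        bool second = avs_read(&m, 0, 5, &x);   /* conflict: same bank  */
        return (first && !second) ? 0 : 1;
    }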
Speeding up the execution
• Full bandwidth utilization
  • Row-major order: good for row accesses, bad for column accesses
[Figure: 8x8 matrix in row-major order across the banks, with the 1D DCT slices marked]
Speeding up the execution
• Full bandwidth utilization
  • Column-major order: good for column accesses, bad for row accesses
[Figure: 8x8 matrix in column-major order across the banks, with the 1D DCT slices marked]
Speeding up the execution
• Full bandwidth utilization
  • Does a data layout exist that allows row and column accesses with the same latency?
  • Not with the existing organization
  • What if we relax the requirements and tolerate misaligned data?
  • Introduce alignment layers
  • A cheap form of register clustering [RWTH ICCAD'07]
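One well-known way to realize this relaxation (an illustrative assumption here, not necessarily the exact mapping used in the paper) is a skewed layout that stores element (i, j) in bank (i + j) mod 8: every row and every column then touches each bank exactly once, and the data merely arrives rotated, which is what the alignment layers undo. A minimal C check of that conflict-freedom claim:

    #include <stdio.h>

    #define N 8

    /* Skewed placement: element (i, j) is stored in bank (i + j) % N. */
    static int bank_of(int i, int j)
    {
        return (i + j) % N;
    }

    int main(void)
    {
        int conflicts = 0;

        /* Row accesses: for a fixed i, the banks (i + j) % N are all distinct. */
        for (int i = 0; i < N; i++) {
            int used[N] = {0};
            for (int j = 0; j < N; j++)
                used[bank_of(i, j)]++;
            for (int b = 0; b < N; b++)
                if (used[b] != 1)
                    conflicts++;
        }

        /* Column accesses: for a fixed j, the banks (i + j) % N are all distinct. */
        for (int j = 0; j < N; j++) {
            int used[N] = {0};
            for (int i = 0; i < N; i++)
                used[bank_of(i, j)]++;
            for (int b = 0; b < N; b++)
                if (used[b] != 1)
                    conflicts++;
        }

        printf("bank conflicts for rows and columns: %d\n", conflicts);
        return conflicts ? 1 : 0;
    }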
Memory Area Comparison
[Figure: area in mm² of the candidate memory organizations]
Methodology
• Optimizing the memory system
• Enumerate memories
• Memory organization
• Cost estimation
• Data layout
• Limitedly Improper Constrained Color Assignment (LICCA)
• Alignment layer
LICCA Formulation
• Input:
  • Graph G = (V, E, I)
  • Vertices V = {v0, ..., vn-1}
  • Edges E = {e0, ..., em-1}
  • Set of sets of vertices I = {I0, ..., IL-1}
• Where:
  • E = {(vx, vy) | ∃ Ij ∈ I such that vx ∈ Ij and vy ∈ Ij}
LICCA Formulation
• Solution:
  • An assignment of colors to vertices, i.e. a function f: V → {0, ..., k-1}
  • At most ni vertices can receive color i, 0 ≤ i ≤ k-1; that is, |{v ∈ V | f(v) = i}| ≤ ni
  • For each set Ij ∈ I, at most ai vertices can receive color i
• Any instance of the k-colorability problem can be reduced to an instance of LICCA by setting I = {{vx, vy} | (vx, vy) ∈ E} and, for 0 ≤ i ≤ k-1: ni = |V| and ai = 1
LICCA: relation to the problem
• An edge e = (vx, vy) indicates that vx and vy are read in the same cycle
• Each set Ij ∈ I is a set of vertices that are read in parallel
• k is the number of memories
• ni is the capacity of the i-th memory
• ai is the number of read/write ports of the i-th memory
LICCA Example
• V = {v0, v1, v2, v3, v4, v5}
• I0 = {v0, v1, v2}, I1 = {v3, v4, v5}, I2 = {v0, v2, v5}
• E = {(v0,v1), (v0,v2), (v0,v5), (v1,v2), (v2,v5), (v3,v4), (v3,v5), (v4,v5)}
• Legal k-coloring? Legal LICCA coloring?
[Figure: the graph G on vertices v0 ... v5 induced by I0, I1, I2]
LICCA Example
[Figure: assignment of the vertices to two memories]
• M0 (n0 = 4, a0 = 2): v1, v2, v3, v5
• M1 (n1 = 2, a1 = 1): v0, v4
• Parallel-access sets: I0 = {v0, v1, v2}, I1 = {v3, v4, v5}, I2 = {v0, v2, v5}
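To make the example concrete, here is a small C program (names and structure are mine, only the data comes from the slides) that checks the assignment above against the LICCA constraints: at most ni vertices per memory overall, and at most ai vertices per memory within each parallel-access set Ij. Note that this is not a proper 2-coloring of G, since each set places two adjacent vertices into M0, which LICCA's limited impropriety allows.

    #include <stdio.h>

    #define NUM_V 6   /* vertices v0..v5          */
    #define NUM_M 2   /* memories (colors) M0, M1 */
    #define NUM_I 3   /* parallel-access sets     */

    int main(void)
    {
        /* Assignment from the example: M0 = {v1, v2, v3, v5}, M1 = {v0, v4}. */
        const int color[NUM_V] = { 1, 0, 0, 0, 1, 0 };

        const int n[NUM_M] = { 4, 2 };   /* capacities       */
        const int a[NUM_M] = { 2, 1 };   /* ports per memory */

        /* Sets of vertices read in parallel: I0, I1, I2. */
        const int sets[NUM_I][3] = {
            { 0, 1, 2 },
            { 3, 4, 5 },
            { 0, 2, 5 },
        };

        int legal = 1;

        /* Capacity constraint: at most n[i] vertices receive color i. */
        for (int c = 0; c < NUM_M; c++) {
            int count = 0;
            for (int v = 0; v < NUM_V; v++)
                if (color[v] == c)
                    count++;
            if (count > n[c])
                legal = 0;
        }

        /* Port constraint: within each set Ij, at most a[i] vertices get color i. */
        for (int s = 0; s < NUM_I; s++) {
            int per_color[NUM_M] = { 0 };
            for (int k = 0; k < 3; k++)
                per_color[color[sets[s][k]]]++;
            for (int c = 0; c < NUM_M; c++)
                if (per_color[c] > a[c])
                    legal = 0;
        }

        printf("LICCA coloring is %s\n", legal ? "legal" : "illegal");
        return legal ? 0 : 1;
    }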
Comparison Example
• AVS as a single/dual-port memory or an 8x8 non-clustered register file
[Figure: block diagram with main memory, memory decoder, DMA ports, the baseline processor with its RF, and the ISE logic attached to the AVS]
Comparison Example
• AVS as an 8x8 clustered register file
[Figure: block diagram as above, with alignment layers and alignment-layer decoders between the clustered RF and the ISE logic]
Comparison Example
• 2D DCT on an 8x8 matrix
  • DCT row/column-slice ISE vs. 2-point ISE
  • 8x8 clustered RF vs. single-port memory
  • 150 MHz
• 2D FFT on an 8x8 matrix
  • 12-butterfly ISE vs. 1-butterfly ISE
  • 8x8 clustered RF vs. single-port memory
  • 150 MHz
Comparison Example
• 2D DCT on an 8x8 matrix
[Figure: results chart, annotated with 3x and 8x improvements]
Comparison Example
• 2D FFT on an 8x8 matrix
[Figure: results chart, annotated with 2.5x and 12x improvements]
Conclusion
• A methodology to efficiently increase bandwidth to AVS-enhanced ISEs
  • LICCA
  • Memory system optimization
• Future work
  • Commutativity
  • LICCA extension for multiple ISEs and shift registers
Thank you! Questions?