mRNA: Efficient Mapping Space Exploration for a Reconfigurable Neural Accelerator. http://synergy.ece.gatech.edu. Zhongyuan Zhao&, Hyoukjun Kwon*, Sachit Kuhar#, Weiguang Sheng&, Zhigang Mao&, and Tushar Krishna*. & Shanghai Jiao Tong University
mRNA: Efficient Mapping Space Exploration for a Reconfigurable Neural Accelerator
http://synergy.ece.gatech.edu
Zhongyuan Zhao&, Hyoukjun Kwon*, Sachit Kuhar#, Weiguang Sheng&, Zhigang Mao&, and Tushar Krishna*
& Shanghai Jiao Tong University * Synergy Lab, Georgia Institute of Technology # Indian Institute of Technology Guwahati
ISPASS 2019, March 26
Deep Learning Applications Deep Neural Network (DNN): • Object Detection • Image Segmentation • Medical Imaging • Speech Recognition • Recommendations • Games • Text to Speech
Challenges of DNN Computing • Millions of Parameters (i.e., weights) and Inputs • Billions of computations • Heavy data movement • Need lots of parallel compute: this makes CPUs inefficient • Need to reduce energy: this makes GPUs inefficient
Spatial DNN Accelerators • Spread computations across hundreds of ALUs • Reuse data within the array via local storage (Register/FIFO/SRAM) and direct communication • Examples: MIT Eyeriss, Google TPU, Xilinx xDNN [Figure: array of ALUs, each with control logic and local Register/FIFO/SRAM, connected to the memory hierarchy]
Overall Efficiency • Accelerator Microarchitecture • DNN Model • Mapping (Dataflow): focus of this talk
Outline • Motivation • Target DNN Accelerator: MAERI • mRNA: Mapping explorer for MAERI • Mapping examples • mRNA framework • Evaluation • Conclusion
Deep Learning Landscape is Diverse • DNN Topologies • Layer size / shape • Layer types: Convolution / Pool / FC / LSTM • New sub-structure: e.g., Inception in GoogLeNet • Compiler/Mapper • Loop scheduling • Reordering • Tiling • Unrolling • Mapping • Output/Weight/Input/Row Stationary • Algorithmic Optimization • Weight pruning: Sparse workload
Efficiently Mapping Diverse Dataflows A DNN Accelerator with Flexible Interconnects Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects: ASPLOS 2018, IEEE Micro Top Picks 2019 Honorable Mention
The MAERI Implementation • Distribution Network • Spatial Reuse via Multicasts • High Bandwidth via fat links Distribute Switch (1x2 Switch) • Linear Local Network • Forwarding of weights/inputs • Spatial-Temporal Reuse • Reduction Network • High Bandwidth via fat links • Provably Non-blocking Reductions via forwarding links
MAERI Operation example • Controller configures switches • Steps: 1. Input/Weight Fetch 2. Distribute 3. Multiply 4-6. Reduce 7. Collect • Virtual neuron (VN): VN size = 5, Num of VN = 3
mRNA functionality • Search for efficient DNN mappings • Evaluate the impact of different mappings • Generate interconnection configuration [Figure: mRNA supplies dataflow configs to MAERI, which exchanges weights, inputs, and outputs with SRAM and DRAM; ~100% utilization]
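CNN Layer representation [Figure: weight filters (K filters, C channels, R and S kernel dimensions), input feature maps (N inputs, C channels, X and Y dimensions), and output activations (N outputs, K channels, X’ and Y’ dimensions)]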
Loop optimizations for CNN
• Loop fission: decides the reconfiguration format of MAERI
• Loop tiling: decides the tile shape that is mapped onto MAERI
• Loop unrolling: unrolls all the operations inside the kernel onto MAERI
• Loop interchanging: decides the execution order of the tiles
Original loop nest (convolution followed by ReLU, stride 1):
for(n=0; n<N; n++) {
 for(k=0; k<K; k++) {
  for(x=0; x<X’; x++) {
   for(y=0; y<Y’; y++) {
    for(c=0; c<C; c++) {
     for(j=0; j<R; j++) {
      for(i=0; i<S; i++) {
       O[n][k][x][y] += W[k][c][i][j] * I[n][c][x+i][y+j];
      }}}
    O[n][k][x][y] = ReLU(O[n][k][x][y]);
 }}}}
After loop fission, ReLU runs in its own loop nest once the convolution nest has finished:
for(n=0; n<N; n++) {
 for(k=0; k<K; k++) {
  for(x=0; x<X’; x++) {
   for(y=0; y<Y’; y++) {
    O[n][k][x][y] = ReLU(O[n][k][x][y]);
 }}}}
After loop tiling/blocking of the convolution nest (the outer loops step from tile to tile and can be interchanged; the inner "tile" loops are unrolled onto MAERI):
for(n=0; n<N; n=n+T_N) {
 for(c=0; c<C; c=c+T_C) {
  for(x=0; x<X’; x=x+T_X’) {
   for(y=0; y<Y’; y=y+T_Y’) {
    for(k=0; k<K; k=k+T_K) {
     for(j=0; j<R; j=j+T_R) {
      for(i=0; i<S; i=i+T_S) {
       // tile
       for(tn=n; tn<n+T_N; tn++) {
        for(tk=k; tk<k+T_K; tk++) {
         for(tx=x; tx<x+T_X’; tx++) {
          for(ty=y; ty<y+T_Y’; ty++) {
           for(tc=c; tc<c+T_C; tc++) {
            for(tj=j; tj<j+T_R; tj++) {
             for(ti=i; ti<i+T_S; ti++) {
              O[tn][tk][tx][ty] += W[tk][tc][ti][tj] * I[tn][tc][tx+ti][ty+tj];
       }}}}}}}
 }}}}}}}
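A quick way to check that tiling only reorders the work is to run the tiled and untiled nests on the same data. The sketch below does this in plain Python for a stride-1 convolution; the helper names (`conv_direct`, `conv_tiled`, `make_example`), the test data, and the use of `min()` to clip tiles at the layer boundary are assumptions made for illustration, not part of mRNA.

```python
import itertools

def zeros(N, K, Xo, Yo):
    return [[[[0] * Yo for _ in range(Xo)] for _ in range(K)] for _ in range(N)]

def conv_direct(I, W, dims):
    """Untiled nest: O[n][k][x][y] += W[k][c][i][j] * I[n][c][x+i][y+j]."""
    N, K, C, Xo, Yo, R, S = dims
    O = zeros(N, K, Xo, Yo)
    for n, k, x, y, c, j, i in itertools.product(
            range(N), range(K), range(Xo), range(Yo),
            range(C), range(R), range(S)):
        O[n][k][x][y] += W[k][c][i][j] * I[n][c][x + i][y + j]
    return O

def conv_tiled(I, W, dims, tile):
    """Tiled nest: outer loops pick a tile, inner loops cover that tile."""
    N, K, C, Xo, Yo, R, S = dims
    T_N, T_K, T_C, T_X, T_Y, T_R, T_S = tile
    O = zeros(N, K, Xo, Yo)
    for n0, c0, x0, y0, k0, j0, i0 in itertools.product(
            range(0, N, T_N), range(0, C, T_C), range(0, Xo, T_X),
            range(0, Yo, T_Y), range(0, K, T_K), range(0, R, T_R),
            range(0, S, T_S)):
        # min() clips the last tile when a tile size does not divide a dimension.
        for n, k, x, y, c, j, i in itertools.product(
                range(n0, min(n0 + T_N, N)), range(k0, min(k0 + T_K, K)),
                range(x0, min(x0 + T_X, Xo)), range(y0, min(y0 + T_Y, Yo)),
                range(c0, min(c0 + T_C, C)), range(j0, min(j0 + T_R, R)),
                range(i0, min(i0 + T_S, S))):
            O[n][k][x][y] += W[k][c][i][j] * I[n][c][x + i][y + j]
    return O

def make_example(dims):
    """Deterministic integer inputs and weights (arbitrary test data)."""
    N, K, C, Xo, Yo, R, S = dims
    X, Y = Xo + S - 1, Yo + R - 1  # input size for a stride-1 convolution
    I = [[[[(7 * n + 5 * c + 3 * x + y) % 11 for y in range(Y)]
           for x in range(X)] for c in range(C)] for n in range(N)]
    W = [[[[(3 * k + 2 * c + i + 5 * j) % 7 - 3 for j in range(R)]
           for i in range(S)] for c in range(C)] for k in range(K)]
    return I, W

DIMS = (2, 2, 4, 3, 3, 2, 2)   # N, K, C, X', Y', R, S
TILE = (1, 1, 2, 1, 1, 2, 2)   # T_N, T_K, T_C, T_X', T_Y', T_R, T_S
I, W = make_example(DIMS)
print(conv_tiled(I, W, DIMS, TILE) == conv_direct(I, W, DIMS))
```

Because the accumulation is over integers, the two results compare equal exactly for any tile shape, which is why the mapper is free to choose tiles purely on cost.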
Mapping strategy 1 (16 MSes) • Layer: N=2, K=2, C=4, X=4, Y=4, R=2, S=2, X’=3, Y’=3, Str=1 • Tile: T_R=2, T_S=2, T_C=4, T_K=1, T_N=1, T_X’=1, T_Y’=1 • o0=ReLU(p0), o1=ReLU(p1), …, o35=ReLU(p35) [Figure: inputs x0 to x127 (Input 0, Input 1) convolved with weights w0 to w31 (Filter 0, Filter 1) give partial sums p0 to p35 (Partial sum 0, Partial sum 1) and outputs o0 to o35 (Output 0, Output 1)]
MAERI behavior [Figure: the Distribute Network (DN) delivers weights w0 to w15 and the matching inputs to MS0 through MS15, which form a single virtual neuron (VN 0); the Reduce Network (RN) produces p0, which yields o0]
Mapping strategy 2 • Tile: T_R=2, T_S=2, T_C=2, T_K=2, T_N=1, T_X’=1, T_Y’=1 • o0=ReLU(p0+p18), o1=ReLU(p1+p19), …, o35=ReLU(p53+p71) [Figure: inputs x0 to x127 (Input 0, Input 1) convolved with weights w0 to w31 (Filter 0, Filter 1) give partial sums p0 to p71 (Partial sum 0, Partial sum 1) and outputs o0 to o35 (Output 0, Output 1)]
Computing phase 1: MAERI behavior [Figure: the Distribute Network (DN) delivers weights w0 to w7 and w16 to w23 with their inputs to MS0 through MS15, configured as two virtual neurons (VN 0 and VN 1); the Reduce Network (RN) produces partial sums p0 and p9]
Computing phase 2: MAERI behavior [Figure: the Distribute Network (DN) delivers partial sums (p0 through p33 shown) to MS0 through MS15, configured as eight virtual neurons (VN 0 to VN 7); the Reduce Network (RN) produces p’0 to p’7, which yield o0 to o7]
Mapping strategy 3 • Tile: T_R=2, T_S=2, T_C=2, T_K=1, T_N=2, T_X’=1, T_Y’=1 • o0=ReLU(p0+p18), o1=ReLU(p1+p19), …, o35=ReLU(p53+p71) [Figure: inputs x0 to x127 (Input 0, Input 1) convolved with weights w0 to w31 (Filter 0, Filter 1) give partial sums p0 to p71 (Partial sum 0) and outputs o0 to o35 (Output 0, Output 1)]
Mapping strategy 4 (8 channels) • Tile: T_R=2, T_S=2, T_C=1, T_K=2, T_N=2, T_X’=1, T_Y’=1 • o0=ReLU(p0+p18+p36+p54), o1=ReLU(p1+p19+p37+p55), …, o35=ReLU(p89+p107+p125+p143) [Figure: inputs (Input 0, Input 1) convolved with weights (Filter 0, Filter 1) give partial sums p0 to p143 (Partial sum 0, Partial sum 1) and outputs o0 to o35 (Output 0, Output 1)]
Mapping strategy 5 (8 channels) • Tile: T_R=2, T_S=2, T_C=1, T_K=1, T_N=1, T_X’=2, T_Y’=2 • o0=ReLU(p0+p18+p36+p54), o1=ReLU(p1+p19+p37+p55), …, o35=ReLU(p89+p107+p125+p143) [Figure: inputs (Input 0, Input 1) convolved with weights (Filter 0, Filter 1) give partial sums p0 to p143 (Partial sum 0, Partial sum 1) and outputs o0 to o35 (Output 0, Output 1)]
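Comparing the tiles with the MAERI behavior slides suggests how a tile turns into virtual neurons: strategy 1 (T_C=4) occupies all 16 MSes with one VN, while strategy 2 (T_C=2, T_K=2) forms two VNs. The sketch below assumes that VN size = T_R × T_S × T_C and that the number of VNs is the product of the remaining tile dimensions. This relation is inferred from the examples rather than stated on the slides, and the names `Tile`, `vn_size`, and `num_vns` are hypothetical.

```python
from collections import namedtuple

Tile = namedtuple("Tile", "T_R T_S T_C T_K T_N T_X T_Y")

def vn_size(t):
    # Assumed: one VN reduces a T_R x T_S x T_C block of products into one sum.
    return t.T_R * t.T_S * t.T_C

def num_vns(t):
    # Assumed: each (filter, input, output position) in the tile gets its own VN.
    return t.T_K * t.T_N * t.T_X * t.T_Y

# Tile parameters of the five mapping strategies, as listed on the slides.
STRATEGIES = {
    1: Tile(2, 2, 4, 1, 1, 1, 1),
    2: Tile(2, 2, 2, 2, 1, 1, 1),
    3: Tile(2, 2, 2, 1, 2, 1, 1),
    4: Tile(2, 2, 1, 2, 2, 1, 1),
    5: Tile(2, 2, 1, 1, 1, 2, 2),
}

for name, t in STRATEGIES.items():
    used = vn_size(t) * num_vns(t)
    print(f"strategy {name}: VN size {vn_size(t)}, {num_vns(t)} VN(s), {used} MSes")
```

Under this assumption all five strategies occupy exactly 16 multiplier switches, so they differ in how the work is grouped and reduced, not in how many multipliers are busy.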
MAERI Efficiency Comparison • Best performance • Least energy consumption within single tile • Sensitive to DN bandwidth • Low utilization rate
Summary • Mapping a DNN onto MAERI means making trade-offs among: • Utilization rate • Network bandwidth • Reconfiguration time • Network communications • Data reuse pattern
Tool flow of mRNA • DNN Model → Front-end parser → Layers representation (Layer 1, Layer 2, …, Layer n) • Analyzer → Profile result of each layer • Mapping candidates generation (input: MAERI configuration) → per-layer candidates (Mapping Strategy 1, Mapping Strategy 2, …) • Candidates evaluation (input: Energy factors) • Candidate selection (input: Optimization options) → Configuration of each layer (Configuration 1, Configuration 2, …) • MAERI configuration generator
Challenges: Huge search space
Permutations and combinations of tile parameters:
for(T_N=1; T_N<=N; T_N++) {
 for(T_K=1; T_K<=K; T_K++) {
  for(T_X’=1; T_X’<=X’; T_X’++) {
   for(T_Y’=1; T_Y’<=Y’; T_Y’++) {
    for(T_C=1; T_C<=C; T_C++) {
     for(T_R=1; T_R<=R; T_R++) {
      for(T_S=1; T_S<=S; T_S++) {
       Mapping_Evaluation();
}}}}}}}
For VGGNet Conv 2: I(224, 224, 64, 32), W(3, 3, 64, 64), O(224, 224, 64, 32): • MAERI with 256 MSes: 71107 x ! • MAERI with 1024 MSes: 531517 x !! • Pruning is needed !!
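The exhaustive sweep above can be sketched as a generator that walks every tile shape and keeps those that fit on the array. The fit test used here (the product of all tile parameters must not exceed the number of multiplier switches) is an assumption, as is the name `tile_candidates`; the sketch does not attempt to reproduce the candidate counts quoted on the slide.

```python
import itertools
from math import prod

def tile_candidates(dims, num_ms):
    """Yield every tile (one size per dimension, from 1 up to the dimension)
    whose multiplications fit on num_ms multiplier switches (assumed test)."""
    for tile in itertools.product(*(range(1, d + 1) for d in dims)):
        if prod(tile) <= num_ms:
            yield tile

# Small layer from the mapping examples: N, K, X', Y', C, R, S.
dims = (2, 2, 3, 3, 4, 2, 2)
total = prod(dims)                      # tiles visited by the raw 7-deep sweep
fitting = list(tile_candidates(dims, 16))
print(total, len(fitting))
```

Even for this toy layer the raw sweep visits 576 tile shapes, and the count grows with the product of all seven layer dimensions, which is what motivates the pruning heuristics.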
Pruning Heuristic 1: Utilization • Prune candidates with a low average utilization rate (UR) • Two cases: low peak UR and low filter UR • Example with T_R=2, T_S=2: R=S=3: (3x3)/(4x4)=0.56; R=S=5: (5x5)/(6x6)=0.69; R=S=7: (7x7)/(8x8)=0.77
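The ratios on this slide follow from rounding each filter dimension up to a whole number of tiles. A minimal sketch of that calculation, with `filter_utilization` as a hypothetical helper name:

```python
import math

def filter_utilization(R, S, T_R, T_S):
    """Useful filter elements divided by the elements covered by whole tiles."""
    covered = (math.ceil(R / T_R) * T_R) * (math.ceil(S / T_S) * T_S)
    return (R * S) / covered

for size in (3, 5, 7):
    print(size, round(filter_utilization(size, size, 2, 2), 4))
```

A tile size that divides the filter dimension exactly, such as T_R=T_S=3 for a 3x3 filter, gives a ratio of 1.0, so a candidate like that would survive this heuristic.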
Pruning Heuristic 2: DRAM traffic • Prune candidates that generate a large volume of partial sums • For VGGNet Conv 2: I(224, 224, 64, 1), W(3, 3, 64, 64), O(224, 224, 64, 1): • Tile(T_S=3, T_R=3, T_C=28, T_K=1, T_N=1, T_X’=1, T_Y’=1): 9 MB (8-bit) • Tile(T_S=1, T_R=1, T_C=1, T_K=28, T_N=3, T_X’=3, T_Y’=1): 1.5 GB !!
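A rough way to see why the second tile is so much worse is to count how many partial results each output needs before it is complete. The model below is a simplification assumed for illustration (one stored value per output per pass over the C, R, and S dimensions, 8-bit values); it is not mRNA's traffic model and does not reproduce the slide's figures exactly. The name `partial_sum_bytes` and the dictionary keys are hypothetical.

```python
import math

def partial_sum_bytes(layer, tile, bytes_per_value=1):
    """Assumed model: every output is written once per pass over C, R and S."""
    n_outputs = layer["N"] * layer["K"] * layer["Xo"] * layer["Yo"]
    passes = (math.ceil(layer["C"] / tile["T_C"])
              * math.ceil(layer["R"] / tile["T_R"])
              * math.ceil(layer["S"] / tile["T_S"]))
    return n_outputs * passes * bytes_per_value

vgg_conv2 = {"N": 1, "K": 64, "Xo": 224, "Yo": 224, "C": 64, "R": 3, "S": 3}
wide = {"T_S": 3, "T_R": 3, "T_C": 28}    # first tile on the slide
narrow = {"T_S": 1, "T_R": 1, "T_C": 1}   # second tile on the slide

print(partial_sum_bytes(vgg_conv2, wide))
print(partial_sum_bytes(vgg_conv2, narrow))
```

Under this model the narrow tile produces roughly two hundred times more partial-sum data than the wide one, which is the kind of gap the heuristic is meant to catch.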
Pruning Heuristic 3: Dimension Reduction • Search space reduced N times • Prune the tile parameters based on realistic scenarios • Inference / low-latency real-time inference • Fully-connected layer, RNN and LSTM
Candidates evaluation • Cycle-level runtime calculation • Bandwidth influence of DN, RN • Reconfiguration influence • Number of accesses at each level of the memory hierarchy • Between DSes • Between RSes • Between MSes • Local register buffer • On-chip Data Memory read/write • DRAM • Number of operations performed • Multiply • Reduce (add or compare) • Multiply the access and operation counts by energy factors
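The energy part of the evaluation amounts to a weighted sum of event counts. A minimal sketch follows; the event names and the energy factor values are hypothetical placeholders, not the factors used by mRNA.

```python
# Hypothetical relative energy cost per event (placeholder values).
ENERGY_FACTORS = {
    "multiply": 1.0,
    "reduce": 0.5,
    "local_buffer": 1.0,
    "sram": 6.0,
    "dram": 200.0,
}

def estimate_energy(counts, factors=ENERGY_FACTORS):
    """Sum over event types of (number of events) x (energy per event)."""
    return sum(n * factors[event] for event, n in counts.items())

counts = {"multiply": 1000, "reduce": 900, "local_buffer": 2000,
          "sram": 300, "dram": 20}
print(estimate_energy(counts))
```

With factors of this shape, a few DRAM accesses can outweigh many multiplies, which is why the access counts per memory level are tracked separately.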
Candidate selection • Sort all the evaluated candidates according to the optimization option • -energy: minimum energy consumption • -performance: minimum runtime • -energy-efficiency: maximum performance per energy • Print out the profiling result of the top n mappings specified in the command • Select the best mapping and send it to the configuration generator
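The selection step can be sketched as a sort with a key chosen by the optimization option. The candidate records and the `rank` helper are hypothetical, and treating performance as the inverse of runtime (so that maximum performance per energy means the smallest energy × runtime product) is an assumption about how the metric is defined.

```python
def rank(candidates, option, top_n=None):
    """Return candidates sorted best-first for the given optimization option."""
    keys = {
        "energy": lambda c: c["energy"],
        "performance": lambda c: c["runtime"],
        # Assumed: performance = 1 / runtime, so maximizing performance per
        # energy is the same as minimizing energy * runtime.
        "energy-efficiency": lambda c: c["energy"] * c["runtime"],
    }
    ordered = sorted(candidates, key=keys[option])
    return ordered if top_n is None else ordered[:top_n]

candidates = [
    {"name": "A", "energy": 10, "runtime": 100},
    {"name": "B", "energy": 30, "runtime": 20},
    {"name": "C", "energy": 15, "runtime": 30},
]
for option in ("energy", "performance", "energy-efficiency"):
    print(option, [c["name"] for c in rank(candidates, option)])
```

The three options can disagree on the winner, as they do for this made-up data, so the option passed on the command line directly determines which mapping reaches the configuration generator.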
Methodology • MAERI Configurations • Tested Layers
Performance and Energy • MAERI Config: Number of MSes = 512, DN_BW=512, RN_BW=512 • Mapping: Tile(T_S, T_R, T_C, T_K, T_N, T_X’, T_Y’) • Layer: ResNet CB3a_2
Bandwidth and Scale influence • Layer: ResNet CB3a_2 • Bandwidth: Number of MSes = 512; Blue Bar DN_BW = 16, Green Bar DN_BW = 128 • Scale: DN_BW = 16; Number of MSes = 64, 128, 256, 512 • Tiles compared: 3,3,1,K,1,1,1 (Baseline); 3,3,1,7,1,1,1; 3,3,2,7,1,1,1; 3,3,4,7,1,1,1; 3,3,4,14,1,1,1
Conclusion • Searching the DNN mapping space is a high-dimensional optimization problem • mRNA: DNN Mapping Space Exploration tool for MAERI • Deep-learning domain-specific and MAERI-specific heuristics reduce the search space • Maximize the energy efficiency and/or minimize runtime • Code base (HPCA 2019 Tutorial) • http://synergy.ece.gatech.edu/tools/mrna • Future extensions: • Extend to mapping complex DNN structures: e.g., Inception in GoogLeNet • Extend to mapping sparse DNNs Thank you! Open for questions.
Backups • How to set the threshold? • Loop ordering?