ADVANCED COMPUTER ARCHITECTURE: ML Accelerators
Samira Khan, University of Virginia, Feb 11, 2019
The content and concept of this course are adapted from CMU ECE 740
AGENDA • Logistics • Review from last lecture • ML accelerators • Branch prediction basics
LOGISTICS • Most students talked to me about the project • Good job! Many interesting directions • Project Proposal Due on Feb 11, 2019 • Do you need an extension? Feb 14, 2019 • Project Proposal Presentations: Feb 13, 2019 • Unfortunately, no extension for the presentation
NN Accelerator • Convolution is the major bottleneck • Characterized by high parallelism and high reuse • Exploit high parallelism --> a large number of PEs • Exploit data reuse --> • Maximize reuse in local PEs • Next, maximize reuse in neighboring PEs • Minimize accessing the global buffer • Exploit data types, sizes, access patterns, etc.
Spatial Architecture for DNN
• DRAM above a Global Buffer (100 – 500 kB), feeding a grid of Processing Elements (PEs), each with an ALU
• Local Memory Hierarchy
• Global Buffer
• Direct inter-PE network
• PE-local memory (RF): 0.5 – 1.0 kB RegFile plus control per PE
Dataflow Taxonomy
• Weight Stationary (WS)
• + Good design if weight reuse dominates: filter weights stay resident in the PE's RF
• − Has to broadcast activations (ifmaps) and move psums through the array
• Output Stationary (OS)
• + Reuses partial sums; activations are passed through each PE, eliminating psum memory reads/writes
• − Weights need to be broadcast!
• No Local Reuse (NLR)
• + Partial sums pass through PEs; no PE-local storage leaves area for a larger global buffer
• − No local reuse; accessing a large global buffer is expensive!
• − Needs to perform multicast operations
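The WS/OS trade-off above is ultimately a loop-ordering choice. A minimal 1D sketch (hypothetical function names, no claim to match any real chip) makes the distinction concrete: the "stationary" operand is the one held fixed by the outer loop while the inner loop streams past it.

```python
# Illustrative 1D schedules for weight-stationary vs. output-stationary.

def weight_stationary(weights, acts):
    """Each weight is fetched once (outer loop) and applied to every
    activation it touches; psums are updated on every MAC."""
    E = len(acts) - len(weights) + 1
    psums = [0] * E
    for r, w in enumerate(weights):      # weight stays resident
        for x in range(E):
            psums[x] += w * acts[x + r]  # psum traffic on every step
    return psums

def output_stationary(weights, acts):
    """Each psum stays resident until complete; weights are re-fetched
    (broadcast) for every output position."""
    E = len(acts) - len(weights) + 1
    out = []
    for x in range(E):                   # psum stays resident
        psum = 0
        for r, w in enumerate(weights):  # weight traffic on every step
            psum += w * acts[x + r]
        out.append(psum)
    return out

w, a = [1, 2, 3], [1, 1, 1, 1, 1]
print(weight_stationary(w, a))  # [6, 6, 6]
print(output_stationary(w, a))  # [6, 6, 6]
```

Both schedules compute the same result; they differ only in which operand generates traffic in the inner loop.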
Energy Efficiency Comparison
• Same total area • AlexNet CONV layers • 256 PEs • Batch size = 16
• [Chart: normalized energy/MAC, 0 – 2, across the CNN dataflows WS, OSA, OSB, OSC (variants of OS), and NLR; a follow-up build adds Row Stationary to the comparison]
[Chen et al., ISCA 2016]
Energy-Efficient Dataflow: Row Stationary (RS)
• Maximize reuse and accumulation at the RF
• Optimize for overall energy efficiency instead of for only a certain data type
[Chen et al., ISCA 2016]
Goals
• 1. The number of MAC operations is significant: want to maximize reuse of psums
• 2. At the same time, want to maximize reuse of the weights and activations that are used to calculate the psums
Row Stationary: Energy-efficient Dataflow
• Input Fmap * Filter = Output Fmap
1D Row Convolution in PE
• Input Fmap * Filter = Partial Sums
• The PE's RegFile holds the filter row (weights a, b, c) and a sliding window of the ifmap row (a, b, c, d, e)
• Window (a, b, c) produces psum a; the window slides to (b, c, d) for psum b, then to (c, d, e) for psum c
1D Row Convolution in PE
• Maximize row convolutional reuse in RF
• − Keep a filter row and an fmap sliding window in RF
• Maximize row psum accumulation in RF
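The 1D row convolution above can be sketched in a few lines of Python. This is a behavioral model only (the function name and data layout are illustrative, not Eyeriss internals): a filter row stays resident in the "register file" while a sliding window of the ifmap row streams through, and each psum is accumulated locally.

```python
# Behavioral sketch of 1D row convolution inside one PE.

def pe_1d_row_conv(filter_row, ifmap_row):
    """Return the valid 1D convolution of one filter row over one ifmap row."""
    R = len(filter_row)            # filter width
    W = len(ifmap_row)             # ifmap width
    psums = []
    for x in range(W - R + 1):     # one output psum per position
        window = ifmap_row[x:x + R]    # sliding window kept in the RF
        psum = 0
        for w, a in zip(filter_row, window):
            psum += w * a              # MAC, accumulated in the RF
        psums.append(psum)
    return psums

# Filter row of 3 weights over an ifmap row of 5 activations -> 3 psums
print(pe_1d_row_conv([1, 2, 3], [1, 1, 1, 1, 1]))  # [6, 6, 6]
```

Note that every activation (except the edges) is read once from the buffer but reused R times inside the RF, which is exactly the reuse the bullet points describe.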
2D Convolution (CONV) Layer
• An H×W input fmap is convolved with an R×S filter (weights) to produce an E×F output fmap
• Each output activation multiplies and accumulates the whole filter against a window of the input
• What would that look like in a 2D row-stationary dataflow?
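To fix the dimensions used on these slides, here is a plain reference 2D convolution (a sketch assuming stride 1 and no padding, so E = H − R + 1 and F = W − S + 1):

```python
# Reference 2D valid convolution: H×W ifmap, R×S filter, E×F ofmap.

def conv2d(ifmap, filt):
    H, W = len(ifmap), len(ifmap[0])
    R, S = len(filt), len(filt[0])
    E, F = H - R + 1, W - S + 1          # ofmap dimensions
    ofmap = [[0] * F for _ in range(E)]
    for e in range(E):
        for f in range(F):
            for r in range(R):           # accumulate the whole filter
                for s in range(S):
                    ofmap[e][f] += filt[r][s] * ifmap[e + r][f + s]
    return ofmap

ifmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # H = W = 3
filt = [[1, 0], [0, 1]]                     # R = S = 2
print(conv2d(ifmap, filt))                  # [[6, 8], [12, 14]]
```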
2D Row Convolution in PE
• PE1: Filter Row 1 * Fmap Row 1
• One row of filter, ifmap, and ofmap is mapped to one PE
2D Row Convolution in PE
• PE1: Filter Row 1 * Fmap Row 1 • PE2: Filter Row 2 * Fmap Row 2 • PE3: Filter Row 3 * Fmap Row 3
• Different rows of the filter and ifmap are mapped to different PEs
2D Row Convolution in PE
• PE1–PE3 all still accumulate psums for the same ofmap row
• Need to move psums vertically
2D Row Convolution in PE
• Then the same filter is multiplied with a sliding window of activations to calculate the next row of the ofmap
2D Row Convolution in PE
• Exploit the spatial architecture and map them onto other PEs!
• A second PE column (PE4–PE6) convolves Filter Rows 1–3 with Fmap Rows 2–4; a third column (PE7–PE9) uses Fmap Rows 3–5
Convolutional Reuse Maximized
• Filter Row 1 is held by PE1, PE4, PE7; Filter Row 2 by PE2, PE5, PE8; Filter Row 3 by PE3, PE6, PE9
• Filter rows are reused across PEs horizontally
Convolutional Reuse Maximized
• Fmap rows are reused across PEs diagonally (e.g., Fmap Row 3 feeds PE3, PE5, and PE7)
Maximize 2D Accumulation in PE Array
• Partial sums accumulate across PEs vertically: each PE column sums its three row psums into one ofmap row
2D Row Convolution in PE
• Filter rows are reused across PEs horizontally
• Fmap rows are reused across PEs diagonally
• Partial sums accumulate across PEs vertically
• Pros
• 2D row convolution avoids reading/writing psums to the global buffer; psums pass directly to the next PE
• Filter and fmap rows are also passed along to neighboring PEs
• Cons
• How to orchestrate the psums, activations, and weights?
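The 2D row-stationary mapping above can be modeled directly: PE at position (i, j) convolves filter row i with fmap row i + j, and summing a PE column yields ofmap row j. This is a behavioral sketch with an assumed grid indexing, not hardware RTL:

```python
# Behavioral model of the row-stationary PE grid for one 2D convolution.

def row_conv(frow, irow):
    """1D valid convolution of one filter row over one fmap row (one PE)."""
    R = len(frow)
    return [sum(w * irow[x + k] for k, w in enumerate(frow))
            for x in range(len(irow) - R + 1)]

def rs_conv2d(ifmap, filt):
    R = len(filt)                   # number of filter rows = PEs per column
    E = len(ifmap) - R + 1          # number of ofmap rows = PE columns
    ofmap = []
    for j in range(E):              # column j computes ofmap row j
        # PE (i, j) holds filter row i and fmap row i + j (diagonal reuse)
        col_psums = [row_conv(filt[i], ifmap[i + j]) for i in range(R)]
        # vertical psum accumulation down the PE column
        ofmap.append([sum(p) for p in zip(*col_psums)])
    return ofmap

ifmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
filt = [[1, 0], [0, 1]]
print(rs_conv2d(ifmap, filt))       # [[6, 8], [12, 14]]
```

The result matches a plain 2D convolution; the point of the restructuring is that each filter row, fmap row, and psum stream moves only between neighboring PEs.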
Convolution (CONV) Layer
• Many input fmaps (N) and many output fmaps (N): M filters, each C×R×S, convolve C×H×W input fmaps to produce M×E×F output fmaps
• Our convolution is 4D!
Multiple layers and channels
• 1. Multiple Fmaps → Reuse: filter weights
• 2. Multiple Filters → Reuse: activations
• 3. Multiple Channels → Reuse: partial sums
Dimensions Beyond 2D Convolution
• 1. Multiple Fmaps • 2. Multiple Filters • 3. Multiple Channels
Filter Reuse in PE (1. Multiple Fmaps)
• Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1 (Channel 1)
• Filter 1 Row 1 * Fmap 2 Row 1 = Psum 2 Row 1 (Channel 1)
• Both share the same filter row
• Processing in PE: concatenate fmap rows
• Filter 1 Row 1 * Fmap 1 & 2 Row 1 = Psum 1 & 2 Row 1
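As a sketch of this filter reuse (hypothetical helper names, for illustration only): the filter row is fetched once and stays in the RF while the corresponding rows of two different fmaps are processed back to back.

```python
# Sketch: one filter row reused across rows of two different fmaps.

def row_conv(frow, irow):
    R = len(frow)
    return [sum(w * irow[x + k] for k, w in enumerate(frow))
            for x in range(len(irow) - R + 1)]

def filter_reuse(frow, fmap1_row, fmap2_row):
    # The filter row stays resident in the RF across both fmap rows,
    # so it is read from the buffer once instead of twice.
    return row_conv(frow, fmap1_row), row_conv(frow, fmap2_row)

p1, p2 = filter_reuse([1, 2], [1, 1, 1], [2, 2, 2])
print(p1, p2)   # [3, 3] [6, 6]
```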
Fmap Reuse in PE (2. Multiple Filters)
• Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1 (Channel 1)
• Filter 2 Row 1 * Fmap 1 Row 1 = Psum 2 Row 1 (Channel 1)
• Both share the same fmap row
• Processing in PE: interleave filter rows
• Filter 1 & 2 Row 1 * Fmap 1 Row 1 = Psum 1 & 2 Row 1
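The mirror-image sketch for fmap reuse (again with illustrative names and scheduling): the fmap row is read into the RF once and reused while two filters' rows take turns, producing psums for two output channels.

```python
# Sketch: one fmap row reused across rows of two different filters.

def row_conv(frow, irow):
    R = len(frow)
    return [sum(w * irow[x + k] for k, w in enumerate(frow))
            for x in range(len(irow) - R + 1)]

def fmap_reuse(f1_row, f2_row, fmap_row):
    # The fmap row stays resident in the RF; the two filter rows are
    # interleaved against it, one output channel each.
    return row_conv(f1_row, fmap_row), row_conv(f2_row, fmap_row)

p1, p2 = fmap_reuse([1, 0], [0, 1], [1, 2, 3])
print(p1, p2)   # [1, 2] [2, 3]
```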
Channel Accumulation in PE (3. Multiple Channels)
• Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1 (Channel 1)
• Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1 (Channel 2)
• Accumulate psums: Psum 1 Row 1 (Channel 1) + Psum 1 Row 1 (Channel 2) = Psum Row 1
• Processing in PE: interleave channels
• Filter 1 Row 1 * Fmap 1 Row 1 (Channels 1 & 2) = Psum Row 1
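Channel accumulation can be sketched the same way (illustrative names, assumed scheduling): the per-channel psum rows are added inside the PE, so only the accumulated row ever leaves the RF.

```python
# Sketch: psums from multiple channels accumulated inside one PE.

def row_conv(frow, irow):
    R = len(frow)
    return [sum(w * irow[x + k] for k, w in enumerate(frow))
            for x in range(len(irow) - R + 1)]

def channel_accumulate(filter_rows, ifmap_rows):
    """filter_rows[c] and ifmap_rows[c] are the rows for channel c."""
    psum = None
    for frow, irow in zip(filter_rows, ifmap_rows):
        partial = row_conv(frow, irow)          # this channel's psum row
        psum = partial if psum is None else \
            [a + b for a, b in zip(psum, partial)]  # accumulate in the RF
    return psum

# Two channels of a 2-wide filter over 3-wide fmap rows
print(channel_accumulate([[1, 0], [0, 1]], [[1, 2, 3], [4, 5, 6]]))  # [6, 8]
```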
DNN Processing – The Full Picture
• Multiple fmaps: Filter 1 * Fmap 1 & 2 = Psum 1 & 2
• Multiple filters: Filter 1 & 2 * Fmap 1 = Psum 1 & 2
• Multiple channels: Filter 1 * Fmap 1 = Psum
• Map rows from multiple fmaps, filters, and channels to the same PE to exploit other forms of reuse and local accumulation
Optimal Mapping in Row Stationary
• CNN configurations (layer shapes and sizes) and the hardware resources (PE array, global buffer) feed an optimization compiler (mapper)
• The mapper produces the row-stationary mapping of filter, fmap, and psum rows onto the PEs
[Chen et al., ISCA 2016]
Computer Architecture Analogy
• DNN shape and size (the program) goes through compilation to produce a mapping (the binary)
• Execution of that mapping on input data yields the processed data
[Chen et al., Micro Top-Picks 2017]
Evaluate Reuse in Different Dataflows
• Weight Stationary: minimize movement of filter weights
• Output Stationary: minimize movement of partial sums
• No Local Reuse: no PE-local storage; maximize global buffer size
• Row Stationary
• Evaluation setup: same total area, 256 PEs, AlexNet, batch size = 16
• Normalized energy cost per access (relative to one ALU op): ALU 1× (reference), RF 1×, inter-PE 2×, Buffer 6×, DRAM 200×
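The cost table above implies a simple energy model: weight each level's access count by its cost relative to a MAC. A sketch with made-up illustrative access counts (not AlexNet measurements) shows how a dataflow's energy/MAC would be tallied:

```python
# Sketch of the normalized energy model implied by the access-cost table.

# Energy per access, normalized to one ALU (MAC) operation.
COST = {"ALU": 1, "RF": 1, "PE": 2, "buffer": 6, "DRAM": 200}

def normalized_energy_per_mac(accesses, num_macs):
    """accesses maps each storage level to its total access count."""
    total = sum(COST[level] * count for level, count in accesses.items())
    return total / num_macs

# Hypothetical dataflow: every MAC hits the RF, 1 in 10 MACs crosses the
# inter-PE network, 1 in 100 reaches the buffer, 1 in 1000 reaches DRAM.
accesses = {"ALU": 1000, "RF": 1000, "PE": 100, "buffer": 10, "DRAM": 1}
print(normalized_energy_per_mac(accesses, 1000))  # 2.46
```

The 200× DRAM cost is why every dataflow's first priority is keeping traffic out of DRAM; the differences between WS, OS, NLR, and RS show up in the RF/PE/buffer terms.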
Dataflow Comparison: CONV Layers
• [Chart: normalized energy/MAC, 0 – 2, for WS, OSA, OSB, OSC, NLR, and RS, broken down by data type: psums, weights, activations]
• RS optimizes for the best overall energy efficiency
[Chen et al., ISCA 2016]
Dataflow Comparison: CONV Layers
• [Chart: normalized energy/MAC, 0 – 2, for WS, OSA, OSB, OSC, NLR, and RS, broken down by storage level: ALU, RF, NoC, buffer, DRAM]
• RS uses 1.4× – 2.5× lower energy than other dataflows
[Chen et al., ISCA 2016]
Hardware Architecture for RS Dataflow
[Chen et al., ISSCC 2016]