Advanced Architecture for ML Accelerators - University of Virginia

ADVANCED COMPUTER ARCHITECTURE ML Accelerators Samira Khan University of Virginia Feb 11, 2019 The content and concept of this course are adapted from CMU ECE 740

AGENDA • Logistics • Review from last lecture • ML accelerators • Branch prediction basics

LOGISTICS • Most students talked to me about the project • Good job! Many interesting directions • Project Proposal Due on Feb 11, 2019 • Do you need an extension? Feb 14, 2019 • Project Proposal Presentations: Feb 13, 2019 • Unfortunately, no extension for the presentation

NN Accelerator • Convolution is the major bottleneck • Characterized by high parallelism and high reuse • Exploit high parallelism --> a large number of PEs • Exploit data reuse --> • Maximize reuse in local PEs • Next, maximize reuse in neighboring PEs • Minimize accessing the global buffer • Exploit data types, sizes, access patterns, etc.

Spatial Architecture for DNN • Local memory hierarchy • Global buffer • Direct inter-PE network • PE-local memory (RF) [Figure: DRAM feeds a global buffer (100 – 500 kB) and an array of ALU-based processing elements (PEs); each PE has a 0.5 – 1.0 kB Reg File and control logic]

Dataflow Taxonomy • Weight Stationary (WS) • + Good design if weights are significant • + Reuses partial sums (ofmaps) • - Has to broadcast activations (ifmaps) and move psums • Output Stationary (OS) • + Reuses partial sums; activations are passed through each PE, which eliminates memory reads • - Weights need to be broadcast! • No Local Reuse (NLR) • + Partial sums pass through PEs • - No local reuse; a large global buffer is expensive! • - Need to perform multicast operations
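The difference between WS and OS can be loosely pictured as a choice of loop order. The sketch below is a software analogy only, assuming a 1D convolution with stride 1 and no padding; the function names and the fetch/write counters are hypothetical and do not model the hardware in the slides.

```python
def conv1d_weight_stationary(ifmap, weights):
    """Outer loop over weights: each weight is fetched once and held in place."""
    n_out = len(ifmap) - len(weights) + 1
    psums = [0] * n_out
    weight_fetches = 0
    for k, w in enumerate(weights):       # the weight stays put
        weight_fetches += 1
        for x in range(n_out):            # activations stream past, psums are updated
            psums[x] += w * ifmap[x + k]
    return psums, weight_fetches


def conv1d_output_stationary(ifmap, weights):
    """Outer loop over outputs: each psum is accumulated locally, written once."""
    n_out = len(ifmap) - len(weights) + 1
    psums = []
    psum_writes = 0
    for x in range(n_out):                # the psum stays put
        acc = 0
        for k, w in enumerate(weights):   # weights stream past
            acc += w * ifmap[x + k]
        psums.append(acc)
        psum_writes += 1
    return psums, psum_writes


ws_out, ws_fetches = conv1d_weight_stationary([1, 2, 3, 4, 5], [1, 0, -1])
os_out, os_writes = conv1d_output_stationary([1, 2, 3, 4, 5], [1, 0, -1])
```

Both orderings compute the same result; what changes is which operand stays in place and which ones have to move.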

Energy Efficiency Comparison • Same total area • AlexNet CONV layers • 256 PEs • Batch size = 16 [Figure: bar chart of normalized energy/MAC (axis 0 to 2) across CNN dataflows WS, OSA, OSB, OSC (variants of OS), and NLR; a second build adds Row Stationary] [Chen et al., ISCA 2016]

Energy-Efficient Dataflow: Row Stationary (RS) • Maximize reuse and accumulation at RF • Optimize for overall energy efficiency instead of for only a certain data type [Chen et al., ISCA 2016]

Goals • 1. The number of MAC operations is significant: want to maximize reuse of psums • 2. At the same time, want to maximize reuse of the weights and activations that are used to calculate the psums

Row Stationary: Energy-efficient Dataflow [Figure: Input Fmap * Filter = Output Fmap]

1D Row Convolution in PE [Figure, four animation steps: Input Fmap row (a b c d e) * Filter row (a b c) = Partial Sums row (a b c); the PE's Reg File holds the filter row and a sliding window of the fmap row while the PE produces psums a, b, and c one at a time]

1D Row Convolution in PE • Maximize row convolutional reuse in RF • − Keep a filter row and fmap sliding window in RF • Maximize row psum accumulation in RF
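The 1D row convolution inside one PE can be sketched as follows. This is a minimal illustration, assuming stride 1 and no padding; `pe_row_conv` and its read counter are hypothetical names, and the `deque` stands in for the sliding window kept in the register file.

```python
from collections import deque


def pe_row_conv(filter_row, ifmap_row):
    """Convolve one filter row with one ifmap row, as a single PE would."""
    rf_filter = list(filter_row)              # filter row stays in the RF
    s = len(rf_filter)
    rf_window = deque(maxlen=s)               # sliding window of the fmap row
    psum_row = []
    ifmap_reads = 0
    for activation in ifmap_row:
        rf_window.append(activation)          # each activation enters the RF once
        ifmap_reads += 1
        if len(rf_window) == s:
            psum = 0
            for w, a in zip(rf_filter, rf_window):
                psum += w * a                 # psum accumulates locally
            psum_row.append(psum)
    return psum_row, ifmap_reads


psum_row, ifmap_reads = pe_row_conv([1, 2, 3], [1, 2, 3, 4, 5])
```

Each activation is read into the window once but used in up to three multiplications, which is the row-level reuse the slide refers to.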

2D Convolution (CONV) Layer [Figure: an input fmap (H × W) is convolved with a filter (weights, R × S) to give an output fmap (E × F); each output activation multiplies and accumulates the whole filter] How would that look in a 2D row stationary dataflow?

2D Row Convolution in PE [Figure: PE1 computes filter Row 1 * ifmap Row 1] One row of the filter, ifmap, and ofmap is mapped to one PE

2D Row Convolution in PE [Figure: PE1: filter Row 1 * ifmap Row 1; PE2: filter Row 2 * ifmap Row 2; PE3: filter Row 3 * ifmap Row 3] Different rows of the filter and ifmap are mapped to different PEs

2D Row Convolution in PE [Figure: PE1, PE2, and PE3 all contribute to ofmap Row 1] They all still accumulate psums for the same row • Need to move psums vertically

2D Row Convolution in PE [Figure: the same three filter rows applied to the next ifmap rows to produce a second ofmap row] Then the same filter is multiplied with a sliding window of activations to calculate the next row of the ofmap

2D Row Convolution in PE [Figure: a second PE column (PE4, PE5, PE6) computes filter Rows 1 to 3 * ifmap Rows 2 to 4 for ofmap Row 2] Exploit the spatial architecture and map them to other PEs!

2D Row Convolution in PE [Figure: a third PE column (PE7, PE8, PE9) computes filter Rows 1 to 3 * ifmap Rows 3 to 5 for ofmap Row 3] Exploit the spatial architecture and map them to other PEs!

Convolutional Reuse Maximized [Figure: 3 × 3 PE array; columns PE1 to PE3, PE4 to PE6, and PE7 to PE9 produce ofmap Rows 1, 2, and 3; PE1, PE4, PE7 hold filter Row 1; PE2, PE5, PE8 hold filter Row 2; PE3, PE6, PE9 hold filter Row 3] Filter rows are reused across PEs horizontally

Convolutional Reuse Maximized [Figure: same 3 × 3 PE array, showing ifmap rows; PE1, PE4, PE7 use ifmap Rows 1, 2, 3; PE2, PE5, PE8 use Rows 2, 3, 4; PE3, PE6, PE9 use Rows 3, 4, 5] Fmap rows are reused across PEs diagonally

Maximize 2D Accumulation in PE Array [Figure: same 3 × 3 PE array; within each column the products filter Row 1 * ifmap Row j, filter Row 2 * ifmap Row j+1, and filter Row 3 * ifmap Row j+2 are summed into ofmap Row j] Partial sums accumulate across PEs vertically

2D Row Convolution in PE • Filter rows are reused across PEs horizontally • Fmap rows are reused across PEs diagonally • Partial sums accumulate across PEs vertically • Pros • 2D row convolution avoids reading/writing psums to the global buffer and passes them directly to the next PE, where they are accumulated • Also passes filter and fmap rows along to the next PEs • Cons • How to orchestrate the psums, activations, and weights?
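The three reuse patterns above can be checked with a small software model of the PE array. This is a sketch under stated assumptions (stride 1, no padding, one filter, one channel); `row_stationary_conv2d` and `direct_conv2d` are hypothetical helper names, and the nested loops only mimic which PE sees which rows, not the timing or the network.

```python
def row_conv(filter_row, ifmap_row):
    """1D convolution of one filter row with one ifmap row (work of one PE)."""
    s = len(filter_row)
    return [sum(filter_row[k] * ifmap_row[x + k] for k in range(s))
            for x in range(len(ifmap_row) - s + 1)]


def row_stationary_conv2d(filt, ifmap):
    """PE at array row i, column j computes filter row i * ifmap row i + j."""
    num_filter_rows = len(filt)
    num_out_rows = len(ifmap) - num_filter_rows + 1
    ofmap = []
    for j in range(num_out_rows):             # one PE column per ofmap row
        column_psum = None
        for i in range(num_filter_rows):      # filter row i is shared along array row i
            pe_out = row_conv(filt[i], ifmap[i + j])   # ifmap row i + j: diagonal reuse
            if column_psum is None:
                column_psum = pe_out
            else:                             # psums accumulate vertically
                column_psum = [a + b for a, b in zip(column_psum, pe_out)]
        ofmap.append(column_psum)
    return ofmap


def direct_conv2d(filt, ifmap):
    """Reference 2D convolution used only to check the mapping."""
    r, s = len(filt), len(filt[0])
    e, f = len(ifmap) - r + 1, len(ifmap[0]) - s + 1
    return [[sum(filt[a][b] * ifmap[y + a][x + b]
                 for a in range(r) for b in range(s))
             for x in range(f)] for y in range(e)]


filt = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
ifmap = [[row * 5 + col for col in range(5)] for row in range(5)]
rs_ofmap = row_stationary_conv2d(filt, ifmap)
```

With a 3 × 3 filter and a 5 × 5 ifmap this reproduces the 3 × 3 PE array of the slides: nine row convolutions, three per ofmap row.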

Convolution (CONV) Layer [Figure: many input fmaps (N), each C × H × W; M filters, each C × R × S; many output fmaps (N), each M × E × F] Our convolution is 4D!
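Written as plain loops, the full CONV layer looks roughly like the sketch below. It assumes stride 1, no padding, and no bias; the dimension names follow the slide (N, M, C, H, W, R, S, E, F), while `conv_layer` and the MAC counter are hypothetical additions for illustration.

```python
def conv_layer(ifmaps, filters):
    """ifmaps[N][C][H][W] and filters[M][C][R][S] give ofmaps[N][M][E][F]."""
    N, C = len(ifmaps), len(ifmaps[0])
    H, W = len(ifmaps[0][0]), len(ifmaps[0][0][0])
    M = len(filters)
    R, S = len(filters[0][0]), len(filters[0][0][0])
    E, F = H - R + 1, W - S + 1
    ofmaps = [[[[0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]
    macs = 0
    for n in range(N):                        # input fmaps (batch)
        for m in range(M):                    # filters / output channels
            for c in range(C):                # input channels
                for e in range(E):            # output rows
                    for f in range(F):        # output columns
                        for r in range(R):    # filter rows
                            for s in range(S):    # filter columns
                                ofmaps[n][m][e][f] += (
                                    filters[m][c][r][s] * ifmaps[n][c][e + r][f + s])
                                macs += 1
    return ofmaps, macs


# Tiny all-ones example: N=2, C=2, H=W=3, M=3, R=S=2.
ifmaps = [[[[1] * 3 for _ in range(3)] for _ in range(2)] for _ in range(2)]
filters = [[[[1] * 2 for _ in range(2)] for _ in range(2)] for _ in range(3)]
ofmaps, macs = conv_layer(ifmaps, filters)
```

The MAC count is N · M · C · E · F · R · S, which is why both the parallelism and the reuse opportunities are large.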

Multiple layers and channels • 1. Multiple fmaps • Reuse: filter weights • 2. Multiple filters • Reuse: activations • 3. Multiple channels • Reuse: partial sums [Figure: the filter, input fmap, and output fmap tensors involved in each case, labeled with dimensions C, M, R, H, E, F]

Dimensions Beyond 2D Convolution • 1. Multiple fmaps • 2. Multiple filters • 3. Multiple channels

Filter Reuse in PE (1. Multiple fmaps) • Channel 1: Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1 • Channel 1: Filter 1 Row 1 * Fmap 2 Row 1 = Psum 2 Row 1 • Both share the same filter row • Processing in PE: concatenate fmap rows • Channel 1: Filter 1 Row 1 * Fmap 1 & 2 (Row 1, Row 1) = Psum 1 & 2 (Row 1, Row 1)
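A small sketch of filter reuse across fmaps: the filter row is loaded once and then applied to the same row of each fmap, back to back. The helper `filter_reuse_pe` and the fetch counter are hypothetical; stride 1 and no padding are assumed, and the rows are processed one after another rather than literally joined, so no window straddles two fmaps.

```python
def row_conv(filter_row, fmap_row):
    """1D convolution of one filter row with one fmap row."""
    s = len(filter_row)
    return [sum(filter_row[k] * fmap_row[x + k] for k in range(s))
            for x in range(len(fmap_row) - s + 1)]


def filter_reuse_pe(filter_row, fmap_rows):
    """Apply one resident filter row to the same row of several fmaps."""
    filter_fetches = 1                        # the filter row is loaded once
    psum_rows = [row_conv(filter_row, fmap_row) for fmap_row in fmap_rows]
    return psum_rows, filter_fetches


psum_rows, filter_fetches = filter_reuse_pe([1, 1], [[1, 2, 3], [4, 5, 6]])
```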

Fmap Reuse in PE (2. Multiple filters) • Channel 1: Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1 • Channel 1: Filter 2 Row 1 * Fmap 1 Row 1 = Psum 2 Row 1 • Both share the same fmap row • Processing in PE: interleave filter rows • Channel 1: Filter 1 & 2 Row 1 * Fmap 1 Row 1 = Psum 1 & 2 Row 1
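Fmap reuse across filters can be sketched the same way: each sliding window of the fmap row is fetched once and used by every filter row before the window moves on. `fmap_reuse_pe` is a hypothetical name, and stride 1 with no padding is assumed.

```python
def fmap_reuse_pe(filter_rows, fmap_row):
    """Apply rows from several filters to one shared fmap row, interleaved."""
    s = len(filter_rows[0])
    psum_rows = [[] for _ in filter_rows]
    for x in range(len(fmap_row) - s + 1):
        window = fmap_row[x:x + s]            # fetched once per position
        for m, filter_row in enumerate(filter_rows):   # interleave the filters
            psum_rows[m].append(sum(w * a for w, a in zip(filter_row, window)))
    return psum_rows


psums_per_filter = fmap_reuse_pe([[1, 0], [0, 1]], [1, 2, 3])
```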

Channel Accumulation in PE (3. Multiple channels) • Channel 1: Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1 • Channel 2: Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1 • Accumulate psums: Row 1 + Row 1 = Row 1 • Processing in PE: interleave channels • Channel 1 & 2: Filter 1 Row 1 * Fmap 1 Row 1 = Psum Row 1
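Channel accumulation can be illustrated as one more inner loop: for each output position, the PE steps through the channels and adds every product into the same psum. `channel_accumulate_pe` is a hypothetical name; stride 1 and no padding are assumed.

```python
def channel_accumulate_pe(filter_rows_by_channel, fmap_rows_by_channel):
    """Sum the row convolutions of all channels into a single psum row."""
    s = len(filter_rows_by_channel[0])
    n_out = len(fmap_rows_by_channel[0]) - s + 1
    psum_row = [0] * n_out
    for x in range(n_out):
        # Interleave channels: every channel adds into the same local psum.
        for filter_row, fmap_row in zip(filter_rows_by_channel, fmap_rows_by_channel):
            for k in range(s):
                psum_row[x] += filter_row[k] * fmap_row[x + k]
    return psum_row


psum_row = channel_accumulate_pe([[1, 1], [2, 2]], [[1, 2, 3], [1, 1, 1]])
```

Channel 1 alone would give [3, 5] and channel 2 alone [4, 4]; accumulating locally yields their sum without writing either intermediate row out of the PE.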

DNN Processing – The Full Picture • Multiple fmaps: Filter 1 * Fmap 1 & 2 = Psum 1 & 2 • Multiple filters: Filter 1 & 2 * Fmap 1 = Psum 1 & 2 • Multiple channels: Filter 1 * Fmap 1 = Psum • Map rows from multiple fmaps, filters, and channels to the same PE to exploit other forms of reuse and local accumulation

Optimal Mapping in Row Stationary [Figure: an optimization compiler (mapper) takes the CNN configurations (layer shape: N, M, C, H, R, E) and the hardware resources (global buffer, PE array of ALUs) and produces the row stationary mapping: the 3 × 3 assignment of filter rows and fmap rows to PEs, plus the multiple-fmap, multiple-filter, and multiple-channel processing within each PE] [Chen et al., ISCA 2016]

Computer Architecture Analogy [Figure: Compilation turns the DNN shape and size (the program) into a mapping (the binary); Execution applies the mapping to the input data and produces processed data] [Chen et al., Micro Top-Picks 2017]

Dataflow Simulation Results

Evaluate Reuse in Different Dataflows • Weight Stationary • Minimize movement of filter weights • Output Stationary • Minimize movement of partial sums • No Local Reuse • No PE local storage. Maximize global buffer size. • Row Stationary • Evaluation Setup • Same total area • 256 PEs • AlexNet • Batch size = 16 • Normalized Energy Cost* • ALU: 1× (reference) • RF --> ALU: 1× • PE --> ALU: 2× • Buffer --> ALU: 6× • DRAM --> ALU: 200×
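The normalized costs above can be turned into a simple energy estimate by weighting access counts per level. In the sketch below the cost table comes from the slide, but `normalized_energy` and the two sets of access counts are made up purely for illustration; they are not measurements from the cited paper.

```python
# Normalized energy per access, relative to one ALU operation (from the slide).
ENERGY_COST = {"ALU": 1, "RF": 1, "PE": 2, "Buffer": 6, "DRAM": 200}


def normalized_energy(access_counts):
    """Total energy as the sum of (accesses at a level) x (cost of that level)."""
    return sum(ENERGY_COST[level] * count for level, count in access_counts.items())


# Hypothetical access counts for the same work under two data placements.
mostly_local = {"RF": 100, "Buffer": 10, "DRAM": 1}
mostly_buffer = {"RF": 10, "Buffer": 100, "DRAM": 1}

local_energy = normalized_energy(mostly_local)
buffer_energy = normalized_energy(mostly_buffer)
```

Even with the same DRAM traffic, shifting accesses from the global buffer to the register file lowers the total, which is the motivation for maximizing reuse as close to the ALU as possible.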

Variants of Output Stationary

Dataflow Comparison: CONV Layers [Figure: normalized energy/MAC (axis 0 to 2) for CNN dataflows WS, OSA, OSB, OSC, NLR, and RS, broken down by data type: psums, weights, activations] RS optimizes for the best overall energy efficiency [Chen et al., ISCA 2016]

Dataflow Comparison: CONV Layers [Figure: normalized energy/MAC (axis 0 to 2) for CNN dataflows WS, OSA, OSB, OSC, NLR, and RS, broken down by hierarchy level: ALU, RF, NoC, buffer, DRAM] RS uses 1.4× – 2.5× lower energy than other dataflows [Chen et al., ISCA 2016]

Hardware Architecture for RS Dataflow [Chen et al., ISSCC 2016]
