

  1. ADVANCED COMPUTER ARCHITECTURE ML Accelerators Samira Khan University of Virginia Feb 11, 2019 The content and concept of this course are adapted from CMU ECE 740

  2. AGENDA • Logistics • Review from last lecture • ML accelerators • Branch prediction basics

  3. LOGISTICS • Most students have talked to me about the project • Good job! Many interesting directions • Project proposal due on Feb 11, 2019 • Need an extension? The deadline can move to Feb 14, 2019 • Project proposal presentations: Feb 13, 2019 • Unfortunately, no extension for the presentation

  4. NN Accelerator • Convolution is the major bottleneck • Characterized by high parallelism and high reuse • Exploit high parallelism --> a large number of PEs • Exploit data reuse --> • Maximize reuse in local PEs • Next, maximize reuse in neighboring PEs • Minimize accessing the global buffer • Exploit data types, sizes, access patterns, etc.
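The loop nest below is a minimal, purely illustrative Python/NumPy sketch (not from the slides) of a single-channel convolution. It makes the two properties above concrete: every MAC in the output loops is independent (parallelism for many PEs), and each weight is reused E*F times while each activation is reused up to R*S times (reuse to exploit in the memory hierarchy). Dimension names follow the slides (R x S filter, H x W ifmap, E x F ofmap); the function name is my own.

```python
import numpy as np

def conv2d_single_channel(ifmap, weights):
    """Naive 2D convolution for one channel: every MAC in the two inner
    loops is independent (high parallelism), each weight is reused E*F
    times, and each activation up to R*S times (high reuse)."""
    H, W = ifmap.shape            # input fmap
    R, S = weights.shape          # filter
    E, F = H - R + 1, W - S + 1   # output fmap (stride 1, no padding)
    ofmap = np.zeros((E, F))
    for e in range(E):
        for f in range(F):
            for r in range(R):
                for s in range(S):
                    ofmap[e, f] += ifmap[e + r, f + s] * weights[r, s]
    return ofmap

# Example: a 5x5 ifmap with a 3x3 filter yields a 3x3 ofmap.
print(conv2d_single_channel(np.arange(25.0).reshape(5, 5), np.ones((3, 3))))
```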

  5. Spatial Architecture for DNN • Local memory hierarchy: DRAM → Global Buffer (100 – 500 kB) → direct inter-PE network → PE-local memory (RegFile, 0.5 – 1.0 kB per PE) • [Figure: a grid of Processing Elements (PEs), each with an ALU, a small RegFile, and control logic, fed by the shared Global Buffer]

  6. Dataflow Taxonomy • Weight Stationary (WS) • + Good design if weights are significant • + Reuses partial sums (ofmaps) • - Has to broadcast activations (ifmaps) and move psums • Output Stationary (OS) • + Reuses partial sums; activations are passed through each PE, eliminating memory reads • - Weights need to be broadcast! • No Local Reuse (NLR) • + Partial sums pass through PEs • - No local reuse; a large global buffer is expensive! • - Needs to perform multicast operations
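As a rough illustration of this taxonomy (my own sketch, not from the lecture), the two functions below differ only in loop order, i.e. in which operand sits still in the PE register while the others stream past it: weight stationary pins a weight and revisits psums, while output stationary pins a psum and streams weights and activations.

```python
def weight_stationary(ifmap, weights, ofmap):
    """Weight stationary: each weight is loaded once into a PE register and
    reused across all output positions; partial sums (ofmap entries) must be
    read and written for every weight."""
    R, S = len(weights), len(weights[0])
    E, F = len(ofmap), len(ofmap[0])
    for r in range(R):
        for s in range(S):
            w = weights[r][s]                  # stays put in the PE
            for e in range(E):
                for f in range(F):
                    ofmap[e][f] += ifmap[e + r][f + s] * w

def output_stationary(ifmap, weights, ofmap):
    """Output stationary: each partial sum stays in the PE until it is fully
    accumulated; weights and activations stream past it instead."""
    R, S = len(weights), len(weights[0])
    E, F = len(ofmap), len(ofmap[0])
    for e in range(E):
        for f in range(F):
            psum = 0.0                         # stays put in the PE
            for r in range(R):
                for s in range(S):
                    psum += ifmap[e + r][f + s] * weights[r][s]
            ofmap[e][f] += psum
```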

  7. Energy Efficiency Comparison • Same total area • AlexNet CONV layers • 256 PEs • Batch size = 16 • [Chart: normalized energy/MAC (0 – 2) for the WS, OSA, OSB, OSC (variants of OS), and NLR dataflows] [Chen et al., ISCA 2016]

  8. Energy Efficiency Comparison • Same total area • AlexNet CONV layers • 256 PEs • Batch size = 16 • [Chart: the same comparison with Row Stationary added alongside WS, OSA, OSB, OSC, and NLR] [Chen et al., ISCA 2016]

  9. Energy-Efficient Dataflow: Row Stationary (RS) • Maximize reuse and accumulation at the RF • Optimize for overall energy efficiency instead of for only a certain data type [Chen et al., ISCA 2016]

  10. Goals • 1. The number of MAC operations is significant, so we want to maximize reuse of psums • 2. At the same time, we want to maximize reuse of the weights and activations used to calculate the psums

  11. Row Stationary: Energy-Efficient Dataflow • [Figure: Filter * Input Fmap = Output Fmap]

  12. 1D Row Convolution in PE • [Figure: a filter row is convolved with an ifmap row to produce a row of partial sums; the PE's register file holds the filter row and a sliding window of the ifmap row]

  13. 1D Row Convolution in PE • [Figure: step 1 of the animation: the RF holds the filter row and the first ifmap window, producing the first psum]

  14. 1D Row Convolution in PE • [Figure: step 2: the ifmap window slides by one element and the second psum is accumulated in the RF]

  15. 1D Row Convolution in PE • [Figure: step 3: the window slides again and the third psum is accumulated, completing the output row]

  16. 1D Row Convolution in PE • Maximize row convolutional reuse in the RF • - Keep a filter row and an fmap sliding window in the RF • Maximize row psum accumulation in the RF
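Here is a small, hypothetical sketch of the 1D step animated on slides 12-16: the filter row stays resident in the PE's register file, the ifmap row slides past it one element at a time, and each psum is accumulated locally before it is written out. Function and variable names are illustrative only.

```python
def pe_1d_row_conv(filter_row, ifmap_row):
    """One PE computing a 1D row convolution, row-stationary style: the
    filter row is pinned in the register file for the whole computation,
    a sliding window of the ifmap row sits alongside it, and each psum
    is accumulated locally before being written out."""
    S = len(filter_row)                          # filter row length
    rf_filter = list(filter_row)                 # filter row pinned in the RF
    psums = []
    for start in range(len(ifmap_row) - S + 1):
        rf_window = ifmap_row[start:start + S]   # sliding window in the RF
        psum = 0
        for w, x in zip(rf_filter, rf_window):
            psum += w * x                        # MAC accumulates locally
        psums.append(psum)
    return psums

# A 3-element filter row over a 5-element ifmap row produces 3 psums.
print(pe_1d_row_conv([1, 2, 3], [1, 2, 3, 4, 5]))
```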

  17. 2D Convolution (CONV) Layer • [Figure: a filter (R × S weights) slides over an input fmap (H × W activations) to produce an output fmap (E × F); each output activation multiplies and accumulates the whole filter] • How would that look in a 2D row-stationary dataflow?

  18. 2D Row Convolution in PE • [Figure: PE1 computes filter Row 1 * ifmap Row 1 = ofmap Row 1] • One row of the filter, ifmap, and ofmap is mapped to one PE

  19. 2D Row Convolution in PE • [Figure: PE1: filter Row 1 * ifmap Row 1; PE2: filter Row 2 * ifmap Row 2; PE3: filter Row 3 * ifmap Row 3] • Different rows of the filter and ifmap are mapped to different PEs

  20. 2D Row Convolution in PE • [Figure: PE1–PE3 as before, all contributing to ofmap Row 1] • They all still accumulate psums for the same ofmap row • Need to move psums vertically

  21. 2D Row Convolution in PE • [Figure: the same three filter rows applied to ifmap Rows 2–4] • Then the same filter is multiplied with a sliding window of activations to calculate the next row of the ofmap

  22. 2D Row Convolution in PE • [Figure: a second column of PEs (PE4–PE6) computes filter Rows 1–3 * ifmap Rows 2–4 for ofmap Row 2] • Exploit the spatial architecture and map them onto other PEs!

  23. 2D Row Convolution in PE • [Figure: a 3 × 3 grid of PEs (PE1–PE9); the column for ofmap Row j computes filter Rows 1–3 * ifmap Rows j to j+2] • Exploit the spatial architecture and map them onto other PEs!

  24. Convolutional Reuse Maximized • [Figure: in the 3 × 3 PE grid, every PE in the same grid row holds the same filter row] • Filter rows are reused across PEs horizontally

  25. Convolutional Reuse Maximized • [Figure: the same ifmap row appears along each diagonal of the PE grid] • Fmap rows are reused across PEs diagonally

  26. Maximize 2D Accumulation in PE Array • [Figure: psums flow down each PE column and are summed into one ofmap row] • Partial sums accumulate across PEs vertically

  27. 2D Row Convolution in PE • Filter rows are reused across PEs horizontally • Fmap rows are reused across PEs diagonally • Partial sums accumulate across PEs vertically • Pros • 2D row convolution avoids reading/writing psums to the global buffer; each psum is passed directly to the next PE • Filter rows and fmap rows are also passed along to neighboring PEs • Cons • How do we orchestrate the psums, activations, and weights?
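A minimal simulation of this 2D mapping (my own sketch, assuming stride 1 and no padding): PE(i, j) holds filter row i and ifmap row i + j, so filter rows repeat across each grid row, ifmap rows repeat along the diagonals, and psums accumulate down each column into one ofmap row. It produces the same result as the naive convolution, which is the point: row stationary changes where data lives and moves, not what is computed.

```python
import numpy as np

def pe_row_conv(filter_row, ifmap_row):
    """1D convolution of one filter row over one ifmap row inside one PE."""
    S = len(filter_row)
    return np.array([np.dot(filter_row, ifmap_row[f:f + S])
                     for f in range(len(ifmap_row) - S + 1)])

def row_stationary_2d(weights, ifmap):
    """2D row-stationary mapping on an R x E grid of PEs:
    - PE(i, j) holds filter row i (reused horizontally across grid row i)
    - and ifmap row i + j (the same ifmap row appears along a diagonal)
    - psums accumulate vertically down column j to produce ofmap row j."""
    R, S = weights.shape
    H, W = ifmap.shape
    E, F = H - R + 1, W - S + 1
    ofmap = np.zeros((E, F))
    for j in range(E):                   # one PE column per ofmap row
        for i in range(R):               # one PE per filter row
            ofmap[j] += pe_row_conv(weights[i], ifmap[i + j])
    return ofmap

# Matches the naive convolution of a 3x3 filter over a 5x5 ifmap.
w, x = np.ones((3, 3)), np.arange(25.0).reshape(5, 5)
print(row_stationary_2d(w, x))
```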

  28. Convolution (CONV) Layer • Many input fmaps (N), many output fmaps (N) • M filters with C channels each • [Figure: filters are C × R × S, input fmaps are C × H × W, output fmaps are M × E × F] • Our convolution is 4D!
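For reference, here is a naive Python/NumPy loop nest for the full layer (illustrative only, nothing like the accelerator's actual implementation). All seven loops are freely reorderable and tileable; the dataflows in this lecture are essentially different choices of which loops are unrolled in space across PEs and which operands stay stationary.

```python
import numpy as np

def conv_layer(ifmaps, filters):
    """Full CONV layer: N input fmaps with C channels (H x W), M filters
    with C channels (R x S), producing N output fmaps with M channels
    (E x F). Seven nested loops over N, M, C, E, F, R, S."""
    N, C, H, W = ifmaps.shape
    M, _, R, S = filters.shape
    E, F = H - R + 1, W - S + 1
    ofmaps = np.zeros((N, M, E, F))
    for n in range(N):                  # batch (input fmaps)
        for m in range(M):              # output channels (filters)
            for c in range(C):          # input channels
                for e in range(E):
                    for f in range(F):
                        for r in range(R):
                            for s in range(S):
                                ofmaps[n, m, e, f] += (
                                    ifmaps[n, c, e + r, f + s]
                                    * filters[m, c, r, s])
    return ofmaps
```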

  29. Multiple Layers and Channels • Multiple Fmaps (1) • [Figure: two input fmaps are processed with the same M filters] • Reuse: filter weights

  30. Multiple Layers and Channels • Multiple Fmaps (1), Multiple Filters (2) • [Figure: multiple filters are applied to the same input fmap] • Reuse: filter weights • Reuse: activations

  31. Multiple Layers and Channels • Multiple Fmaps (1), Multiple Filters (2), Multiple Channels (3) • [Figure: the C channels of a filter and an input fmap contribute to the same output] • Reuse: filter weights • Reuse: activations • Reuse: partial sums

  32. Dimensions Beyond 2D Convolution • 1. Multiple Fmaps • 2. Multiple Filters • 3. Multiple Channels

  33. Filter Reuse in PE • Multiple Fmaps (1) • [Figure: Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1, and Filter 1 Row 1 * Fmap 2 Row 1 = Psum 2 Row 1, both on Channel 1]

  34. Filter Reuse in PE • Multiple Fmaps (1) • [Figure: the two computations from the previous slide] • They share the same filter row

  35. Filter Reuse in PE • Multiple Fmaps (1) • They share the same filter row • Processing in PE: concatenate fmap rows • Filter 1 Row 1 * (Fmap 1 Row 1, Fmap 2 Row 1) = (Psum 1 Row 1, Psum 2 Row 1), all on Channel 1

  36. Fmap Reuse in PE • Multiple Filters (2) • [Figure: Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1, and Filter 2 Row 1 * Fmap 1 Row 1 = Psum 2 Row 1, both on Channel 1]

  37. Fmap Reuse in PE • Multiple Filters (2) • [Figure: the two computations from the previous slide] • They share the same fmap row

  38. Fmap Reuse in PE • Multiple Filters (2) • They share the same fmap row • Processing in PE: interleave filter rows • (Filter 1 Row 1, Filter 2 Row 1) * Fmap 1 Row 1 = (Psum 1 Row 1, Psum 2 Row 1), all on Channel 1

  39. Channel Accumulation in PE • Multiple Channels (3) • [Figure: Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1 on Channel 1, and Filter 1 Row 1 * Fmap 1 Row 1 = Psum 1 Row 1 on Channel 2]

  40. Channel Accumulation in PE • Multiple Channels (3) • Accumulate psums: Psum 1 Row 1 (Channel 1) + Psum 1 Row 1 (Channel 2) = Psum Row 1

  41. Channel Accumulation in PE • Multiple Channels (3) • Accumulate psums across channels • Processing in PE: interleave channels • Filter 1 Row 1 * Fmap 1 Row 1 over Channels 1 & 2 = Psum Row 1

  42. DNN Processing – The Full Picture • Multiple fmaps: Filter 1 * (Fmap 1 & 2) = Psum 1 & 2 • Multiple filters: (Filter 1 & 2) * Fmap 1 = Psum 1 & 2 • Multiple channels: Filter 1 * Fmap 1 = Psum (accumulated over channels) • Map rows from multiple fmaps, filters, and channels to the same PE to exploit other forms of reuse and local accumulation
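The sketch below (hypothetical names, building on the 1D PE example earlier) shows the three row-ordering tricks in miniature: running several fmap rows back to back so one filter row stays resident, interleaving filter rows so one fmap row stays resident, and interleaving channels so psums are summed before they ever leave the register file.

```python
import numpy as np

def pe_row_conv(filter_row, fmap_row):
    """1D row convolution inside one PE (the filter row stays in the RF)."""
    S = len(filter_row)
    return np.array([np.dot(filter_row, fmap_row[i:i + S])
                     for i in range(len(fmap_row) - S + 1)])

def multiple_fmaps(filter_row, fmap_rows):
    """Multiple fmaps: process the fmap rows back to back (the slides call
    this concatenating them) so the single filter row is reused from the RF."""
    return [pe_row_conv(filter_row, row) for row in fmap_rows]

def multiple_filters(filter_rows, fmap_row):
    """Multiple filters: interleave the filter rows so the single fmap row
    is reused from the RF."""
    return [pe_row_conv(f, fmap_row) for f in filter_rows]

def multiple_channels(filter_rows, fmap_rows):
    """Multiple channels: interleave the channels and accumulate their psums
    inside the PE, so only the final sum leaves the register file."""
    return sum(pe_row_conv(f, x) for f, x in zip(filter_rows, fmap_rows))
```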

  43. Optimal Mapping in Row Stationary • The CNN configuration (shape and size of each layer) and the hardware resources (PE array, global buffer) are fed to an optimization compiler (mapper) • The mapper produces the row-stationary mapping: which filter, fmap, and psum rows go to which PEs, and how fmaps, filters, and channels are concatenated or interleaved within each PE [Chen et al., ISCA 2016]

  44. Computer Architecture Analogy • Compilation: DNN shape and size (the program) → mapping (the binary) • Execution: input data → processed data, run under that mapping [Chen et al., Micro Top Picks 2017]

  45. Dataflow Simulation Results

  46. Evaluate Reuse in Different Dataflows • Weight Stationary: minimize movement of filter weights • Output Stationary: minimize movement of partial sums • No Local Reuse: no PE-local storage; maximize global buffer size • Row Stationary • Evaluation setup: same total area, 256 PEs, AlexNet, batch size = 16 • Normalized energy cost per access (ALU op = 1× reference): RF → ALU 1×, PE → PE 2×, Buffer → ALU 6×, DRAM → ALU 200×
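A toy cost model using the normalized access costs on this slide makes the comparison mechanical: weight each memory level's accesses-per-MAC by its cost and add them up. The access profiles below are made-up numbers purely for illustration, not measured data.

```python
# Normalized energy cost per access, relative to one ALU operation (from the slide).
ENERGY_COST = {"RF": 1, "PE": 2, "buffer": 6, "DRAM": 200}

def normalized_energy_per_mac(accesses_per_mac):
    """Weight each level's accesses-per-MAC by its normalized cost. A dataflow
    that turns DRAM/buffer traffic into RF traffic wins, even if it performs
    a few extra local reads."""
    return sum(ENERGY_COST[level] * count
               for level, count in accesses_per_mac.items())

# Hypothetical access profiles (illustrative numbers only):
reuse_heavy = {"RF": 6.0, "PE": 1.0, "buffer": 0.2, "DRAM": 0.01}
reuse_poor  = {"RF": 2.0, "PE": 0.0, "buffer": 1.0, "DRAM": 0.10}
print(normalized_energy_per_mac(reuse_heavy))   # 11.2
print(normalized_energy_per_mac(reuse_poor))    # 28.0
```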

  47. Variants of Output Stationary

  48. Dataflow Comparison: CONV Layers • [Chart: normalized energy/MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, and RS, broken down into psums, weights, and activations] • RS optimizes for the best overall energy efficiency [Chen et al., ISCA 2016]

  49. Dataflow Comparison: CONV Layers • [Chart: the same comparison broken down by storage level: ALU, RF, NoC, buffer, DRAM] • RS uses 1.4× – 2.5× lower energy than the other dataflows [Chen et al., ISCA 2016]

  50. Hardware Architecture for the RS Dataflow [Chen et al., ISSCC 2016]
