Advanced Computer Architecture ML Accelerators: Why?

ADVANCED COMPUTER ARCHITECTURE ML Accelerators: Why? • Samira Khan • University of Virginia • Feb 4, 2019 • The content and concept of this course are adapted from CMU ECE 740

AGENDA • Review from last lecture • Single core->multi core->accelerator • ML accelerators: why?

LOGISTICS • Project list • Posted in Piazza • Be prepared to spend time on the project • Sample project proposals from many different years • Posted in Piazza • Project Proposal Due on Feb 11, 2019 • Project Proposal Presentations: Feb 13, 2019 • You can present using your own laptop • Groups: 1 or 2 students

Project Proposal • Problem: Clearly define the problem you are trying to solve • Novelty: Did any other work try to solve the problem? How did they solve it? What are the shortcomings? • Key Idea: What is the initial idea? Why do you think it will work? How is your approach different from the prior work? • Methodology: How will you test and evaluate your idea? What tools or simulators will you use? What experiments do you need to do to prove/disprove your idea? • Plan: Describe the steps to finish your project. What will you accomplish at each milestone? What are the things you must finish? Can you do more? If you finish it, can you submit it to a conference? Which conference do you think is a better fit for the work?

LITERATURE SURVEY • Goal: Critically analyze related work to your project • Pick 2-3 papers related to your project • Use the same format as the reviews • What is the problem the paper is solving • What is the key insight • What are the advantages and disadvantages • How can you do better • Will become the related work in your proposal

FLYNN’S TAXONOMY OF COMPUTERS • Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966 • SISD: Single instruction operates on single data element • SIMD: Single instruction operates on multiple data elements • Array processor • Vector processor • MISD: Multiple instructions operate on single data element • Closest form: systolic array processor, streaming processor • MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams) • Multiprocessor • Multithreaded processor

WHY SYSTOLIC ARCHITECTURES? • Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory • Similar to an assembly line of processing elements • Different people work on the same car • Many cars are assembled simultaneously • Why? Special purpose accelerators/architectures need • Simple, regular design (keep # unique parts small and regular) • High concurrency -> high performance • Balanced computation and I/O (memory) bandwidth
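The data flow described above can be sketched as a cycle-by-cycle simulation. This is a minimal illustration, assuming an output-stationary 2D array that multiplies two matrices; the slide does not specify an organization, and the function name `systolic_matmul` is hypothetical. Each processing element (PE) only multiplies, accumulates, and forwards its operands to its right and lower neighbors.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m systolic array.

    Assumed organization (output-stationary): PE(i, j) accumulates c[i][j].
    Rows of a enter from the left edge, columns of b enter from the top edge,
    each skewed by one cycle per row/column so matching operands meet.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # accumulator held in each PE
    a_reg = [[0] * m for _ in range(n)]  # a operand currently in each PE
    b_reg = [[0] * m for _ in range(n)]  # b operand currently in each PE

    # The last product (row n-1, column m-1, index k-1) occurs at cycle n+m+k-3.
    for t in range(n + m + k_dim - 2):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    # Left edge: row i of a is fed with a delay of i cycles.
                    k = t - i
                    new_a[i][j] = a[i][k] if 0 <= k < k_dim else 0
                else:
                    # Interior: take the operand from the left neighbor.
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:
                    # Top edge: column j of b is fed with a delay of j cycles.
                    k = t - j
                    new_b[i][j] = b[k][j] if 0 <= k < k_dim else 0
                else:
                    # Interior: take the operand from the upper neighbor.
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each operand is read from memory once and then reused by every PE it passes through, which is why the array can keep many multiply-accumulate units busy with modest memory bandwidth.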

SYSTOLIC ARRAYS: PROS AND CONS • Advantage: • Specialized (computation needs to fit PE organization/functions) -> improved efficiency, simple design, high concurrency/performance -> good to do more with less memory bandwidth requirement • Downside: • Specialized -> not generally applicable because computation needs to fit the PE functions/organization

MULTI-CORE • Idea: Put multiple processors on the same die. • Technology scaling (Moore’s Law) enables more transistors to be placed on the same die area • What else could you do with the die area you dedicate to multiple processors? • Have a bigger, more powerful core • Have larger caches in the memory hierarchy • Integrate platform components on chip (e.g., network interface, memory controllers)

WHY MULTI-CORE? • Alternative: Bigger, more powerful single core • Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc + Improves single-thread performance transparently to programmer, compiler - Very difficult to design (Scalable algorithms for improving single-thread performance elusive) - Power hungry – many out-of-order execution structures consume significant power/area when scaled. Why? - Diminishing returns on performance - Does not significantly help memory-bound application performance (Scalable algorithms for this elusive)

MULTI-CORE VS. LARGE SUPERSCALAR • Multi-core advantages + Simpler cores -> more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures) + Higher system throughput on multiprogrammed workloads -> reduced context switches + Higher system throughput in parallel applications • Multi-core disadvantages - Requires parallel tasks/threads to improve performance (parallel programming) - Resource sharing can reduce single-thread performance - Shared hardware resources need to be managed - Number of pins limits data supply for increased demand

WHY MULTI-CORE? • Alternative: Bigger caches + Improves single-thread performance transparently to programmer, compiler + Simple to design - Diminishing single-thread performance returns from cache size. Why? - Multiple levels complicate memory hierarchy

CACHE VS. CORE

WHY MULTI-CORE? • Alternative: Integrate platform components on chip instead + Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller) - Not all applications benefit (e.g., CPU intensive code sections)

Multicore Decade? • We have relied on multicore scaling for over five years. • Figure: timeline from 2000 to 2015 marking Pentium Extreme (dual-core), Core 2 (quad-core), and i7 980x (hex-core), followed by a question mark • How much longer will it be our primary performance scaling technique?

Finding Optimal Multicore Designs For next 5 technology generations, find the best performing multicore from a comprehensive design space search for each of the PARSEC benchmarks Comprehensive design space: • Fixed area budget • Fixed power budget • Two sets of CMOS scaling projections • Optimal core and diverse multicore organizations • Parallel benchmarks

Symmetric Multicore Projections • 18x vs. 3.4x in 10 years • Symmetric multicores alone will not sustain the multicore era.

Multicore Solutions Asymmetric Topologies 3.5x

Multicore Solutions Dynamic Topologies 3.5x [Chakraborty (2008), Suleman et al (2009)]

Multicore Solutions Composed/Fused Topologies 3.7x [Ipek et al (2007), Kim et al (2007)]

Multicore Solutions 2.7x

Multicore Era Projections • 18x vs. 3.7x • The best designs speed up 14% per year rather than the recent trend of 34% per year
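The two multipliers and the two annual rates on this slide are consistent with each other if they are read as compound growth over ten years. That reading is an assumption, not something the slide states; the sketch below only checks the arithmetic, and `cumulative_speedup` is a hypothetical helper name.

```python
def cumulative_speedup(annual_rate: float, years: int) -> float:
    """Total speedup after compounding a fixed annual improvement rate."""
    return (1.0 + annual_rate) ** years


# Assumption: both figures on the slide cover a ten-year window.
trend = cumulative_speedup(0.34, 10)      # recent trend, 34% per year
projected = cumulative_speedup(0.14, 10)  # best projected designs, 14% per year

if __name__ == "__main__":
    print(f"34% per year for 10 years: {trend:.1f}x")
    print(f"14% per year for 10 years: {projected:.1f}x")
```

Under that assumption, 34% per year compounds to roughly 18.7x and 14% per year to roughly 3.7x, matching the slide's 18x and 3.7x.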

WITH MULTIPLE CORES ON CHIP • What we want: • N times the performance with N times the cores when we parallelize an application on N cores • What we get: • Amdahl’s Law (serial bottleneck) • Bottlenecks in the parallel portion

CAVEATS OF PARALLELISM • Amdahl’s Law: $\text{Speedup} = \dfrac{1}{(1 - f) + \dfrac{f}{N}}$ • f: Parallelizable fraction of a program • N: Number of processors • Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967. • Maximum speedup limited by serial portion: Serial bottleneck • Parallel portion is usually not perfectly parallel • Synchronization overhead (e.g., updates to shared data) • Load imbalance overhead (imperfect parallelization) • Resource sharing overhead (contention among N processors)
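A small sketch can make the serial bottleneck concrete. It evaluates Amdahl's Law with f as the parallelizable fraction and N as the number of processors, as defined on the slide; the function names are hypothetical, and the model ignores the synchronization, load imbalance, and resource sharing overheads listed above.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Speedup on n processors when a fraction f of the program is parallelizable."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Serial part (1 - f) is unchanged; parallel part f is divided across n.
    return 1.0 / ((1.0 - f) + f / n)


def max_speedup(f: float) -> float:
    """Upper bound as n grows without limit: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


if __name__ == "__main__":
    # Even with 95% of the program parallelizable, speedup saturates quickly.
    for n in (1, 2, 8, 64, 1024):
        print(f"N={n:5d}  speedup={amdahl_speedup(0.95, n):.2f}")
    print(f"limit for f=0.95: {max_speedup(0.95):.1f}")
```

With f = 0.95 the bound is 20x no matter how many cores are added, which is the sense in which the serial portion limits the maximum speedup.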

THE PROBLEM: SERIALIZED CODE SECTIONS • Many parallel programs cannot be parallelized completely • Causes of serialized code sections • Sequential portions (Amdahl’s “serial part”) • Critical sections • Barriers • Serialized code sections • Reduce performance • Limit scalability • Waste energy

Why Diminishing Returns? Transistor area is still scaling Voltage and capacitance scaling have slowed Result: designs are power, not area, limited

Dark Silicon • Figure: dark silicon at 22 nm and at 8 nm (percentages shown on the slide: 71%, 51%, 26%, 17%) • Sources of Dark Silicon: Power + Limited Parallelism

Conclusions • Figure: Unicore Era, Multicore Era, then “?” • Multicore performance gains are limited • Need at least 18%-40% per generation from architecture alone without additional power

Efficiency Innovation Specialization

NN Accelerators

How Does the Brain Work? • The basic computational unit of the brain is a neuron • 86B neurons in the brain • Neurons are connected with nearly 10^14 – 10^15 synapses • Neurons receive input signals from dendrites and produce output signals along the axon, which interact with the dendrites of other neurons via synaptic weights • Synaptic weights – learnable & control influence strength Image Source: Stanford

Neural Networks: Weighted Sum Image Source: Stanford

Many Weighted Sums Image Source: Stanford

What is Deep Learning? • Image -> “Volvo XC90” Image Source: [Lee et al., Comm. ACM 2011]

Why is Deep Learning Hot Now? • Big Data Availability • 350M images uploaded per day • 2.5 Petabytes of customer data hourly • 300 hours of video uploaded every minute • GPU Acceleration • New ML Techniques

ImageNet Challenge • Image Classification Task: 1.2M training images • 1000 object categories • Object Detection Task: 456k training images • 200 object categories

ImageNet: Image Classification Task • Figure: Top 5 Classification Error (%) by year, 2010 to 2015, for hand-crafted feature-based designs and deep CNN-based designs, with human error for comparison • Large error rate reduction due to Deep CNN [Russakovsky et al., IJCV 2015]

GPU Usage for ImageNet Challenge

Established Applications • Image • Classification: image to object class • Recognition: same as classification (except for faces) • Detection: assigning bounding boxes to objects • Segmentation: assigning object class to every pixel • Speech & Language • Speech Recognition: audio to text • Translation • Natural Language Processing: text to meaning • Audio Generation: text to audio • Games

Deep Learning on Games • Google DeepMind AlphaGo

Emerging Applications • Medical (Cancer Detection, Pre-Natal) • Finance (Trading, Energy Forecasting, Risk) • Infrastructure (Structure Safety and Traffic) • Weather Forecasting and Event Detection http://www.nextplatform.com/2016/09/14/next-wave-deep-learning-applications/

Deep Learning for Self-driving Cars

DNN Terminology 101 • Neurons Image Source: Stanford

DNN Terminology 101 • Synapses Image Source: Stanford

DNN Terminology 101 • Each synapse has a weight for neuron activation • $Y_j = \mathrm{activation}\left(\sum_{i=1}^{3} W_{ij} \times X_i\right)$ • Figure: inputs X1, X2, X3 connected to outputs Y1, Y2, Y3, Y4 through weights W11 ... W34 Image Source: Stanford
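The weighted sum on this slide can be written out directly. The sketch below assumes the slide's shape (3 inputs X, 4 outputs Y, weight W[i][j] on the synapse from input i to output j). The choice of a sigmoid as the activation function and the name `layer_outputs` are assumptions made for illustration; the slide does not name a specific activation.

```python
import math


def sigmoid(z: float) -> float:
    """One common activation function (an assumption; the slide names none)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_outputs(x, w, activation=sigmoid):
    """Compute Y_j = activation(sum_i W[i][j] * X[i]) for every output j.

    x: list of input activations, length = number of inputs
    w: w[i][j] is the weight of the synapse from input i to output j
    """
    n_in, n_out = len(x), len(w[0])
    return [
        activation(sum(w[i][j] * x[i] for i in range(n_in)))
        for j in range(n_out)
    ]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]                      # X1..X3
    w = [[0.1, 0.2, 0.3, 0.4],               # W11..W14
         [0.0, -0.1, 0.2, 0.1],              # W21..W24
         [0.5, 0.0, -0.3, 0.2]]              # W31..W34
    print(layer_outputs(x, w))               # Y1..Y4
```

Every output needs one multiply-accumulate per input, so a layer with 3 inputs and 4 outputs performs 12 multiply-accumulates; this is the operation that NN accelerators are built around.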

DNN Terminology 101 • Weight Sharing: multiple synapses use the same weight value • $Y_j = \mathrm{activation}\left(\sum_{i=1}^{3} W_{ij} \times X_i\right)$ • Figure: inputs X1, X2, X3 connected to outputs Y1, Y2, Y3, Y4 through weights W11 ... W34 Image Source: Stanford
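One familiar instance of weight sharing is a 1-D convolution, where the same small set of weights is applied at every position of the input. The slide does not name convolution, so treat this as an illustrative assumption; `conv1d_shared` is a hypothetical helper, and the activation is left out to keep the focus on the shared weights.

```python
def conv1d_shared(x, kernel):
    """Slide one shared kernel across x (no padding, stride 1).

    Every output position reuses the same len(kernel) weights, instead of
    having its own private weight per synapse as in a fully connected layer.
    """
    k = len(kernel)
    return [
        sum(kernel[i] * x[p + i] for i in range(k))
        for p in range(len(x) - k + 1)
    ]


def weight_counts(n_inputs: int, kernel_size: int):
    """Compare stored weights: shared kernel vs. one weight per synapse."""
    n_outputs = n_inputs - kernel_size + 1
    shared = kernel_size
    unshared = n_outputs * kernel_size
    return shared, unshared


if __name__ == "__main__":
    print(conv1d_shared([1, 2, 3, 4], [1, -1]))
    print(weight_counts(n_inputs=4, kernel_size=2))
```

Sharing reduces the number of distinct weights that must be stored and fetched, which matters for accelerators because the same weight can be reused across many multiply-accumulates.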

DNN Terminology 101 • Layer 1 • L1 Neuron inputs, e.g. image pixels • L1 Neuron outputs, a.k.a. Activations Image Source: Stanford

DNN Terminology 101 • Layer 2 • L2 Input Activations • L2 Output Activations Image Source: Stanford

DNN Terminology 101 • Fully-Connected: all input neurons connected to all output neurons • Sparsely-Connected Image Source: Stanford

DNN Terminology 101 • Feed Forward • Feedback Image Source: Stanford
