Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency. Elad Alon, Krste Asanovic (Director), Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson, Koushik Sen, John Wawrzynek
Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency Elad Alon, Krste Asanovic (Director), Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson, Koushik Sen, John Wawrzynek krste@eecs.berkeley.edu http://aspire.eecs.berkeley.edu
Future Application Drivers Augmented Reality Pervasive Speech Robotics BIG DATA Social Networks Environment Personalized Medicine
Compute Energy “Iron Law”: Performance (Tasks/Second) = Power (Joules/Second) × Energy Efficiency (Tasks/Joule) • When power is constrained, need better energy efficiency for more performance • Where performance is constrained (real-time), want better energy efficiency to lower power • Improving energy efficiency is a critical goal for all future systems and workloads
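The relation on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the numeric values (watts, tasks per joule, target rate) are assumptions chosen for the example, not figures from the talk.

```python
def performance(power_watts: float, efficiency_tasks_per_joule: float) -> float:
    """Tasks/second = (joules/second) * (tasks/joule)."""
    return power_watts * efficiency_tasks_per_joule


def power_needed(target_tasks_per_second: float,
                 efficiency_tasks_per_joule: float) -> float:
    """Rearranged form: joules/second needed to sustain a fixed task rate."""
    return target_tasks_per_second / efficiency_tasks_per_joule


# Power-constrained case (hypothetical 2 W budget):
# doubling efficiency doubles performance.
base_perf = performance(2.0, 50.0)
better_perf = performance(2.0, 100.0)

# Performance-constrained case (hypothetical 60 tasks/s real-time target):
# doubling efficiency halves the power draw.
base_power = power_needed(60.0, 50.0)
better_power = power_needed(60.0, 100.0)
```

Both bullets on the slide fall out of the same equation: hold power fixed and performance scales with efficiency, or hold performance fixed and power scales inversely with it.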
Good News: Moore’s Law Continues Cheaper! More Transistors/Chip “Cramming more components onto integrated circuits”, Gordon E. Moore, Electronics, 1965
Bad News: Dennard (Voltage) Scaling Over [Moore, ISSCC Keynote, 2003] [Figure regions: Dennard Scaling, Post-Dennard Scaling]
1st Impact of End of Scaling: End of Sequential Processor Era
Parallelism: A one-time gain • Use more, slower cores for better energy efficiency. Either simpler cores, or run cores at lower Vdd/frequency • Even simpler general-purpose microarchitectures? • Limited by smallest sensible core • Even lower Vdd/frequency? • Limited by Vdd/Vt scaling, errors • Now what?
2nd Impact of End of Scaling: “Dark Silicon” Cannot switch all transistors at full frequency! [Muller, ARM CTO, 2009] • No savior device technology on horizon. • Future energy-efficiency innovations must be above transistor level.
The End of General-Purpose Processors? • Most computing happens in specialized, heterogeneous processors • Can be 100-1000X more efficient than general-purpose processor • Challenges: • Hardware design costs • Software development costs NVIDIA Tegra2
The Real Scaling Challenge: Communication • As transistors become smaller and cheaper, communication dominates performance and energy • All scales: • Across chip • Up and down memory hierarchy • Chip-to-chip • Board-to-board • Rack-to-rack
ASPIRE: From Better to Best Specialize and optimize communication and computation across whole stack from applications to hardware • What is the best we can do? • For a fixed target technology (e.g., 7nm) • Can we prove a bound? • Can we design implementation approaching bound? Provably Optimal Implementations
Communication-Avoiding Algorithms: Algorithm Cost Measures • Arithmetic (FLOPS) • Communication: moving data between levels of a memory hierarchy (sequential case) or between processors over a network (parallel case) [Figure: CPU, cache, and DRAM blocks illustrating the sequential and parallel cases]
A few examples of speedups • Matrix multiplication • Up to 12x on IBM BG/P for n=8K on 64K cores; 95% less communication • QR decomposition (used in least squares, data mining, …) • Up to 8x on 8-core dual-socket Intel Clovertown, for 10M x 10 • Up to 6.7x on 16-proc. Pentium III cluster, for 100K x 200 • Up to 13x on Tesla C2050 / Fermi, for 110k x 100 • Up to 4x on Grid of 4 cities (Dongarra, Langou et al) • “infinite speedup” for out-of-core on PowerPC laptop • LAPACK thrashed virtual memory, didn’t finish • Eigenvalues of band symmetric matrices • Up to 17x on Intel Gainestown, 8 core, vs MKL 10.0 (up to 1.9x sequential) • Iterative sparse linear equations solvers (GMRES) • Up to 4.3x on Intel Clovertown, 8 core • N-body (direct particle interactions with cutoff distance) • Up to 10x on Cray XT-4 (Hopper), 24K particles on 6K procs.
Early Result: Perfect Strong Scaling in Time and Energy • Every time you add a processor, use its memory $M$ too • Start with minimal number of procs: $PM = 3n^2$ • Increase $P$ by factor $c$, so total memory increases by factor $c$ • Notation for timing model: $\gamma_t$, $\beta_t$, $\alpha_t$ = secs per flop, per word moved, per message of size $m$ • $T(cP) = \frac{n^3}{cP}\left[\gamma_t + \frac{\beta_t}{M^{1/2}} + \frac{\alpha_t}{m M^{1/2}}\right] = T(P)/c$ • Notation for energy model: $\gamma_e$, $\beta_e$, $\alpha_e$ = Joules for same operations • $\delta_e$ = Joules per word of memory used per sec • $\epsilon_e$ = Joules per sec for leakage, etc. • $E(cP) = cP\left\{\frac{n^3}{cP}\left[\gamma_e + \frac{\beta_e}{M^{1/2}} + \frac{\alpha_e}{m M^{1/2}}\right] + \delta_e M\,T(cP) + \epsilon_e T(cP)\right\} = E(P)$ • Perfect scaling extends to n-body, Strassen, … [IPDPS, 2013]
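The timing and energy models on this slide can be evaluated numerically to see the scaling claim hold. The following is a minimal sketch: the class and function names are made up for illustration, and every cost coefficient is a placeholder value, not a measurement from the talk.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    gamma: float  # cost per flop
    beta: float   # cost per word moved
    alpha: float  # cost per message


def per_flop_cost(c: Costs, M: float, m: float) -> float:
    """Bracketed term: gamma + beta / M^(1/2) + alpha / (m * M^(1/2))."""
    root_m = math.sqrt(M)
    return c.gamma + c.beta / root_m + c.alpha / (m * root_m)


def run_time(n: int, P: int, M: float, m: float, t: Costs) -> float:
    """T(P) = n^3 / P * [gamma_t + beta_t / M^(1/2) + alpha_t / (m M^(1/2))]."""
    return n**3 / P * per_flop_cost(t, M, m)


def energy(n: int, P: int, M: float, m: float, t: Costs, e: Costs,
           delta_e: float, eps_e: float) -> float:
    """E(P) = P * { n^3 / P * [...] + delta_e * M * T(P) + eps_e * T(P) }."""
    T = run_time(n, P, M, m, t)
    return P * (n**3 / P * per_flop_cost(e, M, m) + delta_e * M * T + eps_e * T)


# Placeholder machine parameters (assumptions, not data).
time_costs = Costs(gamma=1e-9, beta=1e-8, alpha=1e-6)
energy_costs = Costs(gamma=1e-10, beta=1e-9, alpha=1e-7)
delta_e, eps_e = 1e-12, 0.5

n, P0, m = 1024, 48, 64.0
M = 3 * n**2 / P0   # minimal processor count: P * M = 3 n^2
c = 4               # add processors, each bringing its own memory M

t_base = run_time(n, P0, M, m, time_costs)
t_scaled = run_time(n, c * P0, M, m, time_costs)
e_base = energy(n, P0, M, m, time_costs, energy_costs, delta_e, eps_e)
e_scaled = energy(n, c * P0, M, m, time_costs, energy_costs, delta_e, eps_e)
```

With per-processor memory `M` held fixed, `t_scaled` comes out as `t_base / c` while `e_scaled` matches `e_base`: run time drops by the factor `c` and total energy stays the same.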
C-A Algorithms Not Just for HPC • In ASPIRE, apply to other key application areas: machine vision, databases, speech recognition, software-defined radio, … • Initial results on lower bounds of database join algorithms
From C-A Algorithms to Provably Optimal Systems? • 1) Prove lower bounds on communication for a computation • 2) Develop algorithm that achieves lower bound on a system • 3) Find that communication time/energy cost is >90% of resulting implementation • 4) We know we’re within 10% of optimal! • Supporting technique: Optimizing software stack and compute engines to reduce compute costs and expose unavoidable communication costs
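The argument in steps 1 to 4 can be made explicit. If the implementation's communication cost meets a proven lower bound, no implementation can cost less than that amount, so the ratio of measured total cost to communication cost bounds how far the implementation is from optimal. The sketch below assumes this reading of the slide; the function name and the example numbers are hypothetical.

```python
def optimality_ratio_bound(comm_cost: float, total_cost: float) -> float:
    """Upper bound on total_cost / optimal_cost.

    Assumes comm_cost equals a proven communication lower bound, so that
    optimal_cost >= comm_cost for every possible implementation.
    """
    if not 0 < comm_cost <= total_cost:
        raise ValueError("need 0 < comm_cost <= total_cost")
    return total_cost / comm_cost


# Hypothetical measurement: communication is 90% of the total cost.
bound = optimality_ratio_bound(comm_cost=90.0, total_cost=100.0)
optimal_share = 1.0 / bound  # optimal cost is at least this fraction of total
```

Under these assumptions the optimal cost is at least 90% of the measured cost, which is the sense of "within 10% of optimal"; expressed the other way round, the implementation costs at most about 1.11 times the optimum.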
ESP: An Applications Processor Architecture for ASPIRE • Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore • Well-known how to customize hardware engines for specific task • ESP challenge is using specialized engines for general-purpose code [Figure: Intel Ivy Bridge (22nm) and Qualcomm Snapdragon MSM8960 (28nm), with regions labeled ESP]
ESP: Ensembles of Specialized Processors • General-purpose hardware, flexible but inefficient • Fixed-function hardware, efficient but inflexible • Par Lab Insight: Patterns capture common operations across many applications, each with unique communication & computation structure • Build an ensemble of specialized engines, each individually optimized for particular pattern but collectively covering application needs • Bet: Will give us efficiency plus flexibility • Any given core can have a different mix of these depending on workload
Par Lab: Motifs common across apps [Figure: Applications (Audio Recognition, Scene Analysis, Object Recognition, …) mapped onto Berkeley View “Dwarfs” or Motifs (Dense, Sparse, Graph, …)]
Motif (née “Dwarf”) Popularity (Red Hot/Blue Cool) [Figure: heat map of motif popularity across Computing Domains and Par Lab Apps]
Architecting Parallel Software • Application • Identify the Key Computations: Graph Algorithms, Dynamic Programming, Dense/Sparse Linear Algebra, Un/Structured Grids, Graphical Models, Finite State Machines, Backtrack Branch-and-Bound, N-Body Methods, Circuits, Spectral Methods, Monte-Carlo • Identify the Software Structure: Pipe-and-Filter, Agent-and-Repository, Event-based, Bulk Synchronous, Map-Reduce, Layered Systems, Model-View-Controller, Arbitrary Task Graphs, Puppeteer
Mapping Software to ESP: Specializers • Capture desired functionality at high-level using patterns in a productive high-level language • Use pattern-specific compilers (Specializers) with autotuners to produce efficient low-level code • ASP specializer infrastructure, open-source download [Figure: Applications (Scene Analysis, Audio Recognition, Object Recognition, …) → Berkeley View “Dwarfs” or Motifs (Dense, Sparse, Graph, …) → Specializers with SEJITS Implementations and Autotuning → ESP Code (Glue Code, Dense Code, Sparse Code, Graph Code) → ESP Core (ILP Engine, Dense Engine, Sparse Engine, Graph Engine)]
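The autotuning half of this idea can be illustrated in a few lines. The sketch below is not the ASP/SEJITS API; it is a toy stand-in, written under that assumption, in which one dense-linear-algebra kernel has a tunable block size and a small autotuner times each candidate and keeps the fastest. All names (`blocked_matmul`, `autotune`, the candidate block sizes) are hypothetical.

```python
import time


def blocked_matmul(a, b, block):
    """Square matrix multiply with loop blocking; `block` is the tunable knob."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def _time_once(kernel, param, *args):
    start = time.perf_counter()
    kernel(*args, param)
    return time.perf_counter() - start


def autotune(kernel, candidates, *args, repeats=3):
    """Return the candidate parameter with the lowest best-of-`repeats` time."""
    timings = {
        param: min(_time_once(kernel, param, *args) for _ in range(repeats))
        for param in candidates
    }
    return min(timings, key=timings.get)


n = 24
a = [[float(i + j) for j in range(n)] for i in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

candidates = [4, 8, 24]
best_block = autotune(blocked_matmul, candidates, a, identity)
result = blocked_matmul(a, identity, best_block)
```

Which block size wins depends on the machine, which is the point of tuning empirically rather than fixing the choice in the source. A real specializer would also generate and compile low-level code for the chosen variant; that step is omitted here.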
Replacing Fixed Accelerators with Programmable Fabric • Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore • Fabric challenge is achieving extreme energy efficiency while retaining programmability [Figure: Intel Ivy Bridge (22nm) and Qualcomm Snapdragon MSM8960 (28nm), with regions labeled Fabric]
Strawman Fabric Architecture • Will never have a C compiler • Only programmed using pattern-based DSLs • More dynamic, less static than earlier approaches • Dynamic dataflow-driven execution • Dynamic routing • Large memory support [Figure: array of blocks labeled R, M, and A, 16 of each]
“Agile Hardware” Development • Current hardware design slow and arduous • But now have huge design space to explore • How to examine many design points efficiently? • Build parameterized generators, not point designs! • Adopt and adapt best practices from Agile Software • Complete LVS-DRC clean physical design of current version every ~ two weeks (“tapein”) • Incremental feature addition • Test & Verification first step
Chisel: Constructing Hardware In a Scala Embedded Language • Embed a hardware-description language in Scala, using Scala’s extension facilities • A hardware module is just a data structure in Scala • Different output routines can generate different types of output (C, FPGA-Verilog, ASIC-Verilog) from same hardware representation • Full power of Scala for writing hardware generators • Object-Oriented: Factory objects, traits, overloading etc • Functional: Higher-order funcs, anonymous funcs, currying • Compiles to JVM: Good performance, Java interoperability
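The two central ideas here, that a hardware module is just a data structure in the host language and that different output routines can walk the same structure, can be shown with a small analogy. The sketch below is written in Python, not Scala, and does not use or imitate Chisel's actual API; every name in it (`Wire`, `Add`, `adder_tree`, `emit_verilog_expr`, `simulate`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Wire:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Node"
    right: "Node"


Node = Union[Wire, Add]


def adder_tree(inputs: List[Node]) -> Node:
    """Generator: builds a balanced adder tree for any number of inputs."""
    if not inputs:
        raise ValueError("need at least one input")
    if len(inputs) == 1:
        return inputs[0]
    mid = len(inputs) // 2
    return Add(adder_tree(inputs[:mid]), adder_tree(inputs[mid:]))


def emit_verilog_expr(node: Node) -> str:
    """Backend 1: render the structure as a Verilog-style expression."""
    if isinstance(node, Wire):
        return node.name
    return f"({emit_verilog_expr(node.left)} + {emit_verilog_expr(node.right)})"


def simulate(node: Node, values: Dict[str, int]) -> int:
    """Backend 2: evaluate the same structure directly in software."""
    if isinstance(node, Wire):
        return values[node.name]
    return simulate(node.left, values) + simulate(node.right, values)


tree = adder_tree([Wire(f"in{i}") for i in range(4)])
verilog = emit_verilog_expr(tree)
total = simulate(tree, {"in0": 1, "in1": 2, "in2": 3, "in3": 4})
```

Because `adder_tree` is an ordinary recursive function, changing the input count yields a different circuit from the same source, which is the generator style the slides advocate over point designs. The sketch emits only an expression string and simulates in the host language, whereas the flow described in the slides generates Verilog and C++.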
Chisel Design Flow • Chisel Program → Scala/JVM → three outputs: • C++ code → C++ Compiler → Software Simulator • FPGA Verilog → FPGA Tools → FPGA Emulation • ASIC Verilog → ASIC Tools → GDS Layout
Chisel is much more than an HDL • The base Chisel system allows you to use the full power of Scala to describe the RTL of a design, then generate Verilog or C++ output from the RTL • But Chisel can be extended above with domain-specific languages (e.g., signal processing) for fabric • Importantly, Chisel can also be extended below with new backends or to add new tools or features (e.g., quantum computing circuits) • Only ~6,000 lines of code in current version including libraries! • BSD-licensed open source at: chisel.eecs.berkeley.edu
Many processor tapeouts in few years with small group (45nm, 28nm) [Figure: chip floorplan with labels Processor Site, Clock test site, CORE 0 (VC0), CORE 1 (VC1), CORE 2 (VC2), CORE 3 (VC3), 512KB L2 (VFIXED), DCDC test site, SRAM test site, Test Sites]
Resilient Circuits & Modeling • Future scaled technologies have high variability but want to run with lowest-possible margins to save energy • Significant increase in soft errors, need resilient systems • Technology modeling to determine tradeoff between MTBF and energy per task for logic, SRAM, & interconnect. Techniques to reduce operating voltage can be worse for energy due to rapid rise in errors
Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency [Figure: ASPIRE stack, top to bottom] • Software: • Applications: Audio Recognition, Scene Analysis, Object Recognition, … • Computational and Structural Patterns: Pipe&Filter, Map-Reduce, Dense, Sparse, Graph, … • Communication-Avoiding Algorithms: C-A GEMM, C-A SpMV, C-A BFS • Specializers with SEJITS Implementations and Autotuning: ESP Code, Glue Code, Dense Code, Sparse Code, Graph Code • Hardware: • ESP (Ensembles of Specialized Processors) Architecture: ESP Core, ILP Engine, Dense Engine, Sparse Engine, Graph Engine, Hardware Cache Coherence, Local Stores + DMA, … • Hardware Generators using Chisel HDL • Deep HW/SW Design-Space Exploration • Implementation Technologies: C++ Simulation, FPGA Emulation, FPGA Computer, ASIC SoC • Validation/Verification
ASPIRE Project • Initial $15.6M/5.5 year funding from DARPA PERFECT program • Started 9/28/2012 • Located in Par Lab space + BWRC • Looking for industrial affiliates (see Krste!) • Open House today, 5th floor Soda Hall Research funded by DARPA Award Number HR0011-12-2-0016. Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.