The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem? Krste Asanovic, Ras Bodik, Eric Brewer, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, Dave Patterson, Koushik Sen, David Wessel, and Kathy Yelick UC Berkeley Par Lab
The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem? Krste Asanovic, Ras Bodik, Eric Brewer, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, Dave Patterson, Koushik Sen, David Wessel, and Kathy Yelick UC Berkeley Par Lab June 23, 2010
The Transition to Multicore [Figure: Sequential App Performance]
P.S. Multicore Revolution Could Fail • John Hennessy, President, Stanford University: “…when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. … I would be panicked if I were in industry.” (“A Conversation with Hennessy & Patterson,” ACM Queue Magazine, 1/07) • 100% failure rate of Parallel Computer Companies • Ardent, Convex, Encore, Inmos (Transputer), MasPar, nCUBE, Kendall Square Research, Sequent, Tandem, Thinking Machines • What if IT goes from a growth industry to a replacement industry? • If SW can’t effectively use 32, 64, ... cores per chip => SW no faster on new computer => Only buy if computer wears out, or some people buy cheaper PC (netbook)
Need a Fresh Approach to Parallelism • Berkeley researchers from many backgrounds meeting since Feb. 2005 to discuss parallelism • Krste Asanovic, Eric Brewer, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Dave Patterson, Koushik Sen, Kathy Yelick, … • Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis • Tried to learn from successes in high-performance computing (LBNL) and parallel embedded (BWRC) • Led to “Berkeley View” Tech. Report 12/2006 and new Parallel Computing Laboratory (“Par Lab”) • 2008: after open competition, Intel/MS award $10M • Goal: Productive, Efficient, Correct, Portable SW for 100+ cores & scale as core counts increase every 2 years (!)
Need a Fresh Approach to Parallelism • Past parallel projects often dominated by hardware/architecture • This is the one true way to build computers: software must adapt to this breakthrough • ILLIAC IV, Thinking Machines CM-2, Transputer, Kendall Square KSR-1, Silicon Graphics Origin 2000 … • Or sometimes by programming language • This is the one true way to write programs: hardware must adapt to this breakthrough • Id, Backus Functional Language FP, Occam, Linda, High Performance Fortran, Chapel, X10, Fortress … • Apps usually an afterthought
Par Lab’s original “bets” • Let compelling applications drive research agenda • Software platform: data center + mobile client • Identify common programming patterns • Productivity versus efficiency programmers • Autotuning and software synthesis • Built-in correctness + power/performance diagnostics • OS/Architecture support applications, provide primitives not pre-packaged solutions • FPGA simulation of new parallel architectures: RAMP • Co-located integrated collaborative center • Above all, no preconceived big idea: see what works, driven by application needs.
Par Lab Research Overview: Easy to write portable code that runs efficiently on manycore • [Diagram: layered software stack] • Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser • Design Patterns/Motifs • Productivity Layer: Productivity Languages, Parallel Libraries, Parallel Frameworks, Selective Embedded JIT Specialization • Efficiency Layer: Efficiency Languages, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers • Correctness: Directed Testing, Dynamic Checking, Debugging with Replay • Diagnosing Power/Performance • OS: Legacy OS, OS Libraries & Services, Hypervisor • Arch.: Multicore/GPGPU, ParLab Manycore/RAMP
Par Lab, ~2 years in • How are our bets working out? • What big new ideas are emerging? • Where are we going in the next 3 years?
Dominant Application Platforms • Data Center or Cloud (“Server”) • Laptop/Handheld (“Mobile Client”) • Both together (“Server+Client”) • ParLab-RADLab/AMPLab collaborations • Par Lab focuses on mobile clients • But many technologies apply to data center • Shift in Key Platforms: ‘90s: Desktop/workstation; ‘00s: Laptops; ‘10s: Smartphone/Tablet/TV + Cloud • Implications? Smaller clients, fewer cores/client? More need for intra-task parallelism in data centers?
Music and Audio Applications (David Wessel) • Musicians have an insatiable appetite for computation + real-time demands • More channels, instruments, more processing, more interaction! • Jitter-free latency must be low (<5 ms) • Must be reliable (No clicks!) • Novel Instruments & User Interfaces • New composition and performance systems beyond keyboards • Input devices for Laptop/Handheld • Enhanced sound delivery systems using large microphone and speaker arrays • Music Information Retrieval (MIR) systems for client & cloud
Health Application: Stroke Treatment (Tony Keaveny) • Stroke treatment time-critical, need supercomputer performance in hospital • Image scan, blood flow analysis, then with stroke, then simulate blood thinner • 200,000 cases per year in US • 20% not treated since too long after • Potentially, 30% of those benefit • Recent progress: Build simplified 1.5D “Strokeflow” model as test case for ParLab SEJITS approach.
Content-Based Image Retrieval (Kurt Keutzer) • Built around key characteristics of personal databases • Very large number of pictures (>5K) • Non-labeled images • Many pictures of few people • Complex pictures including people, events, places, and objects • [Diagram labels: Image Database (1000’s of images), Query by example, Similarity Metric, Candidate Results, Relevance Feedback, Final Result] • SVM & Damascene code: ~10-100x speedups over existing implementations, 100s of downloads. Project expanded to greater set of machine vision applications.
Robust Speech Recognition (Nelson Morgan) • Meeting Diarist: Laptops/handhelds at meeting coordinate to create speaker-identified, partially transcribed text diary of meeting • Progress: Parallelized all main ASR components • Next: Parallelization of diarization code, SEJITS • Use cortically-inspired many-stream spatio-temporal features to tolerate noise
Parallel Browser (Ras Bodik) • Original goal: Desktop-quality browsing on handhelds (Enabled by 4G networks, better output devices) • Now: Better development environment for new browser-based applications • Language for layout and behavior • Specifying CSS layout semantics • Layout engines via synthesis and autotuning • Efficient parallel parsing components • Highlights: Identified 7 major browser-based app classes, and their needs. Developing very high-level (constraints-based) layout programming model to help productivity.
More Applications: “Beating down our doors!” • New external application collaborators: • Computer Vision (Jitendra Malik, Thomas Brox) • Computational Finance (Matthew Dixon @UCD) • Speech (Dorothea Kolossa @TU Berlin) • Natural Language Translation (Dan Klein) • Programming multitouch interfaces (Maneesh Agrawala) • Protein Docking (Henry Gabb, Intel) • Pediatric MRI (Michael Lustig, Shreyas Vasanawala @Stanford)
Fast Pediatric MRI • Pediatric MRI is difficult • Children cannot keep still or hold breath • Must put children under anesthesia for long exams: risky & costly • Need techniques to accelerate MRI acquisition (sample & multiple sensors) • Reconstruction must also be fast, or time saved in acquisition is lost in compute • Current reconstruction time: 2 hours • Non-starter for clinical use • Mark Murphy (Par Lab) starts on reconstruction 9/09: Pick SW Patterns, Good SW Architecture, Good Algorithms, Autotuning • Now 1 minute (100X): Fast enough for radiologist to decide • Starting 3/10: in use for clinical study by Dr. Shreyas Vasanawala at Lucile Packard Children's Hospital at Stanford
Types of Programming (or “types of programmer”) • Domain-Level (No CS courses): builds app with DSL and/or by customizing app framework; example languages: Max/MSP, SQL, CSS/Flash/Silverlight, Matlab, Excel, Rails • Productivity-Level (Some CS courses): uses programming frameworks, writes application frameworks (or apps) • Efficiency-Level (MS in CS): uses hardware/OS primitives, builds programming frameworks (or apps) • Hardware/OS: provides hardware primitives and OS services • Example languages from the productivity level down to the efficiency level: Python/Ruby/Lua, Haskell/OCaml/F#, Scala, Java/C#, C/C++/FORTRAN, assembler • Where & how is parallelism made visible?
Where to make parallelism visible? • Not in a Domain-Specific Language • Should focus on making domain experts productive • Too many domains, new domains, multiple domains/app • Not in a new general-purpose parallel language • An oxymoron? • Won’t get adopted. • Most big applications written in >1 language. • Par Lab: Pattern-based software components • Components use any language or parallel prog. model • Pattern-specific compilation to attain efficiency • Flexible, efficient composition of separately developed software components
Motifs common across applications [Diagram: App 1, App 2, and App 3 draw on shared Berkeley View Motifs (“Dwarfs”), e.g. Dense, Sparse, Graph Trav.]
Motif (née “Dwarf”) Popularity (Red = Hot, Blue = Cool) • How do compelling apps relate to 12 motifs?
“Our” Pattern Language (OPL-2010) (Kurt Keutzer, Tim Mattson) • Applications • Structural Patterns: Pipe-and-Filter, Agent-and-Repository, Process-Control, Event-Based/Implicit-Invocation, Puppeteer, Model-View-Controller, Iterative-Refinement, Map-Reduce, Layered-Systems, Arbitrary-Static-Task-Graph • Computational Patterns: Graph-Algorithms, Dynamic-Programming, Dense-Linear-Algebra, Sparse-Linear-Algebra, Unstructured-Grids, Structured-Grids, Graphical-Models, Finite-State-Machines, Backtrack-Branch-and-Bound, N-Body-Methods, Circuits, Spectral-Methods, Monte-Carlo • Concurrent Algorithm Strategy Patterns: Task-Parallelism, Divide and Conquer, Data-Parallelism, Pipeline, Discrete-Event, Geometric-Decomposition, Speculation • Implementation Strategy Patterns, program structure: SPMD, Data-Par/index-space, Fork/Join, Actors, Loop-Par., Task-Queue • Implementation Strategy Patterns, data structure: Shared-Queue, Shared-map, Partitioned Graph, Distributed-Array, Shared-Data • Parallel Execution Patterns: MIMD, SIMD, Thread-Pool, Task-Graph, Transactions • Concurrency Foundation constructs (not expressed as patterns): Thread creation/destruction, Process creation/destruction, Message-Passing, Collective-Comm., Transactional memory, Point-To-Point-Sync. (mutual exclusion), collective sync. (barrier), Memory sync/fence • [Diagram annotations: “Refine Towards Implementation”, “A = MxV”] • Highlight: ~20 apps architected with patterns, patterns-to-frameworks talk later today. Wide uptake, 2nd ParaPLOP Workshop
Structural Patterns describe Component Composition • Structural patterns describe how components are assembled, e.g. Map-Reduce, Agent-and-Repository, Static-task-graph • Our belief: Any large software application must have a comprehensible software architecture describable as a hierarchy of patterns => hierarchy of components
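As a small illustration of a structural pattern assembling separately written components, here is a minimal Python sketch of Map-Reduce. The names `map_reduce`, `count_words`, and `add` are hypothetical and not taken from the Par Lab code; the thread pool only stands in for whatever parallel runtime a real framework would use.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_reduce(mapper, reducer, items, initial, workers=4):
    """Structural pattern: apply `mapper` to every item independently,
    then fold the mapped values together with `reducer`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(mapper, items))  # results keep the order of `items`
    return reduce(reducer, mapped, initial)


# Two separately written components, assembled only by the pattern.
def count_words(line):
    return len(line.split())


def add(a, b):
    return a + b


if __name__ == "__main__":
    lines = ["the parallel revolution", "has started", "are you part of the solution"]
    print(map_reduce(count_words, add, lines, 0))
```

The pattern fixes how the pieces fit together; the mapper and reducer can be swapped without touching the composition code, which is the sense in which a hierarchy of patterns yields a hierarchy of components.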
Mapping Patterns to Hardware [Diagram: App 1, App 2, and App 3 map through motifs (Dense, Sparse, Graph Trav.) onto hardware] • Only a few types of hardware platform: Multicore, GPU, “Cloud”
High-level pattern constrains space of reasonable low-level mappings (Insert latest OPL chart showing path)
“Stovepipes”: pattern-specific and platform-specific compilers [Diagram: App 1, App 2, and App 3 map through motifs (Dense, Sparse, Graph Trav.) directly onto Multicore, GPU, and “Cloud”] • Allow maximum efficiency and expressibility in stovepipes by avoiding mandatory intermediary layers
Autotuning for Code Generation • Problem: generating optimized code is like searching for a needle in a haystack; use computers rather than humans • Search space for block sizes (dense matrix): axes are block dimensions, temperature is speed • Auto-tuner approach: program generates optimized code and data structures for a “motif” (~kernel) • ParLab autotuners for stencils (e.g., images), sparse matrices, particle/mesh, collectives (e.g., “reduce”) • [Chart legend: serial reference, Auto-parallelization, Auto-NUMA, Auto-tuning, OpenMP Comparison]
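The block-size search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the kernel is a pure-Python tiled matrix multiply, the candidate block sizes are arbitrary, and the names `blocked_matmul` and `autotune` are hypothetical. Real autotuners generate and time compiled code variants.

```python
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square tiles of size `block`."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(n=48, candidates=(4, 8, 16, 48), repeats=2):
    """Time every candidate block size on this machine and return the fastest."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best  # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_block, all_timings = autotune()
    print("fastest block size on this machine:", best_block)
```

The winning block size depends on the machine the search runs on, which is the point: the search is repeated per platform instead of being fixed by a human.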
SEJITS: “Selective, Embedded, Just-In-Time Specialization” • Use modern high-level scripting language (Python, Ruby) for productivity programming • Use language facilities to embed “specializers” that map high-level pattern to efficient low-level code at (develop/install/run)-time • Specializers can incorporate autotuners to generate tuned efficiency-level code for given platform • Efficiency programmers incrementally add specializers for (Pattern-Target) pairs. Fall back to scripting runtime if no specializer exists.
SEJITS makes tuning decisions per-function (not per-app) [Diagram: Productivity app (.py) with functions f(), @g(), @h(); PLL Interp; SEJITS Specializer producing .c, compiled by cc/ld into a .so, with a cache ($); OS/HW underneath. Callouts: Selective, Embedded, JIT, Specialization]
SEJITS Main Ideas • Specializer == pattern-specific compiler • exploit pattern-specific strategies that may not generalize • target specific hardware per pattern • Can happen at runtime • Productivity Level Language (PLL) program always valid even without SEJITS support • vs. incompatibly extended syntax; inspired by Dom. Spec. Embed. Lang. vs. DSL argument • Specializers can be written in PLL
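The selective, per-function mechanism with fallback can be sketched with an ordinary Python decorator. Everything here is hypothetical (the registry `SPECIALIZERS`, the decorators `specializer` and `specialize`, the pattern names); the real SEJITS specializers emit and compile low-level code, whereas this stand-in just swaps in another Python callable.

```python
import functools

# Registry of specializers keyed by (pattern, target). All names are hypothetical.
SPECIALIZERS = {}


def specializer(pattern, target):
    """Register a function that turns a PLL function into a tuned variant."""
    def register(build):
        SPECIALIZERS[(pattern, target)] = build
        return build
    return register


def specialize(pattern, target="cpu"):
    """Decorator: use a specializer if one exists, else run the plain Python."""
    def decorate(func):
        build = SPECIALIZERS.get((pattern, target))
        fast = build(func) if build else None

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return (fast or func)(*args, **kwargs)

        wrapper.specialized = fast is not None
        return wrapper
    return decorate


@specializer("reduce-sum", "cpu")
def build_sum(func):
    # Stand-in for generating and compiling C: reuse the built-in sum.
    return lambda xs: sum(xs)


@specialize("reduce-sum")
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc


@specialize("stencil")  # no specializer registered: falls back to the interpreter
def smooth(xs):
    return [(xs[i - 1] + xs[i] + xs[i + 1]) / 3 for i in range(1, len(xs) - 1)]
```

Both `total` and `smooth` remain valid Python with or without a matching specializer, so the program still runs on a plain interpreter; adding a specializer for a new (pattern, target) pair changes speed, not syntax.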
Producing Software vs. Producing An Answer • SEJITS delivers adaptive parallel software • HW variation: # cores, caches, DRAM, etc. • runtime variation: resource availability (OS+HW) • SEJITS is a highly productive way to produce exactly the code variants you need • SEJITS makes research code productive • Exploit full libraries, tools, etc. of PLL • Performance competitive with ELL code => run non-toy experiments • Develop specializer to target new HW feature => Test designs with real apps
Correctness, Testing, and Debugging • Productivity Layer • no low level data races, but non-determinism could still be present • specify and verify semantic determinism and atomicity, i.e. there is no bug due to parallelism • separates parallel correctness from functional correctness • Efficiency Layer • data races, deadlocks, memory model related bugs, atomicity violations, livelocks • Active testing: combines static and dynamic analyses so rare concurrency bugs discovered quickly and precisely • Debugging • Record and replay • Simplify a buggy concurrent trace • reduce number of context switches • reduce length of the trace • Concurrent Breakpoints • Highlight: • ACM SIGSOFT Distinguished Paper Awards at ICSE 09 and FSE 09 • IFIP TC2 Manfred Paul Best Paper Award at ICSE 10
Key Culprit: Nondeterminism • Determinism key to parallel correctness • Same input ==> semantically same output • Parallelism is wrong if some schedules give a correct answer while others don’t • Lightweight spec of parallel correctness • Independent of functional specification • Can effectively test deterministic specs • New Approach: Automatically infer deterministic specifications by observing sample program runs • Result: Recovered previous manual specifications for most benchmarks
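A lightweight determinism spec of this kind can be illustrated as follows. This is a toy sketch, not the Par Lab tooling: schedules are modeled as the order in which task results are combined, all orders are enumerated exhaustively (feasible only for a handful of tasks), and the names `outputs_over_schedules` and `is_deterministic` are hypothetical. The `same` predicate plays the role of "semantically same output".

```python
import itertools
import math


def outputs_over_schedules(task_results, combine):
    """Yield the final value for every order in which the tasks could finish."""
    for order in itertools.permutations(task_results):
        acc = order[0]
        for value in order[1:]:
            acc = combine(acc, value)
        yield acc


def is_deterministic(task_results, combine, same):
    """True if every schedule gives an output that `same` accepts as equal."""
    outputs = outputs_over_schedules(task_results, combine)
    reference = next(outputs)
    return all(same(reference, out) for out in outputs)


if __name__ == "__main__":
    partial_sums = [0.1, 0.2, 0.3, 0.4]
    # Floating-point addition may differ in the last bits across schedules,
    # so the spec compares outputs with a tolerance instead of bit equality.
    print(is_deterministic(partial_sums, lambda a, b: a + b, math.isclose))
    # "Last writer wins" depends on the schedule, so it violates the spec.
    print(is_deterministic([1, 2, 3], lambda a, b: b, lambda x, y: x == y))
```

Note that the check never states what the correct sum is; it only requires that all schedules agree, which is what keeps the parallel-correctness spec independent of the functional one.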
Tessellation OS: Space-Time Partitioning + 2-Level Scheduling • 1st level: OS determines coarse-grain allocation of resources to jobs over space and time • 2nd level: Application schedules component tasks onto available “harts” (hardware thread contexts) using Lithe • [Diagram: partitions laid out over space and time; Address Space A and Address Space B with tasks under 2nd-level scheduling, above the Tessellation Kernel (Partition Support); hardware of CPUs with L1 caches, L1 interconnect, L2 banks, DRAM & I/O interconnect, and DRAM] • Prototype ROS running on RAMP Gold and Nehalem-x86
Software must be adaptive • Environment-adaptive • Number of cores, speed of cores, soft errors, multiprogramming, battery life • SPMD/Static load-balancing/static mapping, things of the past • Input-adaptive • Degree of parallelism depends on input parameters • Adaptation managed by: • OS resource manager (“how many resources best for this app?”) • Real-time app (“how many resources to meet deadline?”) • Quality-adjustable app (“what’s possible with these resources?”) • Need hardware measurement facilities Future: Need to work on efficient adaptively parallel algorithms.
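The questions a real-time app and a quality-adjustable app would ask can be sketched as simple arithmetic. The sketch below assumes ideal linear scaling across workers and a known per-item cost, neither of which comes from the slides; the function names are hypothetical.

```python
import math
import os


def workers_for_deadline(n_items, seconds_per_item, deadline_s, available=None):
    """Real-time app: how many workers are needed to meet the deadline?

    Assumes ideal linear scaling. Returns None when even all available
    cores are not enough."""
    available = available or os.cpu_count() or 1
    needed = max(1, math.ceil(n_items * seconds_per_item / deadline_s))
    return needed if needed <= available else None


def items_within_deadline(workers, seconds_per_item, deadline_s):
    """Quality-adjustable app: how much work fits in the granted resources?"""
    return math.floor(workers * deadline_s / seconds_per_item)


if __name__ == "__main__":
    # 10 items at 0.5 s each with a 2 s deadline on a 4-core budget.
    print(workers_for_deadline(10, 0.5, 2.0, available=4))
    print(items_within_deadline(4, 0.5, 2.0))
```

In practice the per-item cost would come from hardware measurement at run time, which is why the slide pairs adaptation with measurement facilities.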
Par Lab “Multi-Paradigm” Architecture • Single “Fat” ILP-focused Tile Control Processor • Multiple “Thin” Lane Control Processors embedded in vector-thread lane • Tile Control Processor, Lane Control Processor, and Vector-Thread microthreads all run the same ISA, but microarchs optimized for different forms of parallelism • Par Lab Architecture research using a “superset” architecture that can model various forms of scalar + vector machine, and various forms of memory hierarchy and interconnect. • [Diagram: a Core/Tile containing a Fat Tile Control Processor (ILP) with L1I$ and L1D$, three Vector-Thread Lanes each with a Thin Scalar Control Proc., a Tile-Private L2U$, and a Shareable L3$/LL$]
Hardware Measurement: This we believe • Parallel HW/SW must support performance-portable parallel software • If you expect programmers to continue “Moore’s Law” by doubling amount of portable parallelism in programs every 2 years*, need hardware measurement for them to see how well they are doing • During development inside an IDE • During runtime so that app, resource scheduler, and OS can see and adapt *Shekhar Borkar, Intel, CNET News, 2009
RAMP Gold • Rapid accurate simulation of manycore architectural ideas using FPGAs • Initial version models 64 cores of SPARC v8 with shared memory system on $750 board • Hardware FPU, MMU, boots OS. • Highlight: RAMP Gold in production use (ISCA/HotPar/DAC papers based on RAMP Gold results).
Co-located Collaborative Center Approach • Continual collaboration sometimes hard work, but very beneficial. Many ideas emerge from, or are honed by, cross-subproject discussions. • E.g. Eigensolvers for Damascene application • E.g. Audio framework built on Lithe • E.g. Sparse-matrix language for library development • Integration with other components/layers forces real implementations not just paper prototypes • Whole stack demo on RAMP Gold • Can’t see how to have made this much progress otherwise
Par Lab Stats • 16 faculty, 62 PhD students, 4 postdocs • 125+ papers • Par Lab cover story, CACM, Oct. ‘09 • “The Trouble with Multicore,” IEEE Spectrum, July ‘10 • Workshops / Summer Courses @ UCB • UCB “Bootcamp”: 196 attend 2008, 397 in 2009 • Next is August 16-18, 2010; see Par Lab Web Site • 2 Founding Companies: Intel and Microsoft • 6 Affiliates: National Instruments, NEC, Nokia, NVIDIA, Oracle, and Samsung
Par Lab Summary • Multicore challenge hardest for CS in 50 years: Performance “Moore’s Law” up to programmers! • Unveil 2X parallelism/program every 2 years • Software platform: mobile client + cloud • Identify common programming patterns to reveal parallelism and use good software architecture • Target productivity versus efficiency programmers • Autotuning & Selective Embedded JIT Specialization • Active Testing • OS/Architecture support isolation, measurement • FPGA simulation of new parallel architectures: RAMP • No preconceived silver bullet; let apps decide • Already many apps with significant speedup
Backup Slides & References • Armbrust, M., A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia, “A View of Cloud Computing,” Communications of the ACM, 53:3, March 2010. • Asanović, K., R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, K. Yelick, "A View of the Parallel Computing Landscape,” Communications of the ACM, 52:10, October 2009. • J. Burnim and K. Sen. Asserting and checking determinism for multithreaded programs. Communications of the ACM, 53(6), June 2010. • Catanzaro, B., A. Fox, K. Keutzer, D. Patterson, B-Y. Su, M.Snir, K. Olukotun, P. Hanrahan, and H. Chafi, “Ubiquitous Parallel Computing from Berkeley, Illinois and Stanford,” IEEE Micro, March/April 2010. • Catanzaro, B., S. Kamil, Y. Lee, K. Asanović, J. Demmel, K. Keutzer, J. Shalf, K. Yelick, and A. Fox, "SEJITS: Getting Productivity and Performance with Selective Embedded JIT Specialization,” 1st Workshop on Programmable Models for Emerging Architecture at the 18th Int’l Conf. on Parallel Architectures and Compilation Techniques, Raleigh, North Carolina, November 2009. • J. A. Colmenares, S. Bird, H. Cook, P. Pearce, D. Zhu, J. Shalf, S. Hofmeyr, K. Asanovic, and J. Kubiatowicz. Resource management in the Tessellation manycore OS. HotPar’10, Berkeley, CA, USA, June 2010. • Tan, Z., A. Waterman, S. Bird, H. Cook, K. Asanović, and D. Patterson, “A Case for FAME: FPGA Architecture Model Execution,” Proc. ISCA, June 2010.
Par Lab Apps • What are the compelling future workloads? • Need apps of future vs. legacy to drive agenda • Improve research even if not the real killer apps • Computer Vision: Segment-Based Object Recognition, Poselet-Based Human Detection • Health: MRI Reconstruction, Stroke Simulation • Music: 3D Enhancer, Hearing Aid, Novel UI • Speech: Automatic Meeting Diary • Video Games: Analysis of Smoke 2.0 Demo • Computational Finance: Value-at-Risk Estimation, Crank-Nicolson Option Pricing • Parallel Browser: Layout, Scripting Language
Developing Parallel Software • Conventional: Measure and recode slow pieces with more threads • Par Lab: Find right SW architecture that reveals parallelism • [Flowchart: Application Specification → SW Arch. to Identify Parallelism → Thought Experiment: Map SW Arch. to HW Arch. → Write / Debug Code → Performance profile → if fast enough: Ship it; if not fast enough: loop back]
Correctness, Testing, and Debugging • Active testing: combines static and dynamic analyses so that rare concurrency bugs are discovered quickly and precisely • Burnim, Sen “Asserting and Checking Determinism for Multithreaded Programs” • ACM SIGSOFT Distinguished Paper Award + CACM Research Highlight Invitation • Naik, Park, Sen, Gay “Effective Static Deadlock Detection” • ACM SIGSOFT Distinguished Paper Award • Actively control scheduler to force potentially buggy schedules: Data races, Atomicity Violations, Deadlocks • Found parallel bugs in real production OSS code: Apache Commons Collections, Java Collections Framework, Java Swing GUI framework, and Java Database Connectivity (JDBC)
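The idea of actively controlling the scheduler to force a potentially buggy schedule can be shown on a toy example. This sketch is not the Par Lab active-testing tool: it models threads as Python generators, uses a hypothetical non-atomic increment as the program under test, and enumerates every interleaving instead of using static and dynamic analysis to pick promising ones.

```python
import itertools


def increment(shared):
    """A non-atomic increment, split into steps so a scheduler can interleave it."""
    local = shared["x"]      # step 1: read
    yield
    shared["x"] = local + 1  # step 2: write
    yield


def run(schedule):
    """Run two incrementing threads; `schedule` names which thread steps next."""
    shared = {"x": 0}
    threads = [increment(shared), increment(shared)]
    for tid in schedule:
        next(threads[tid], None)
    return shared["x"]


def find_buggy_schedules(expected=2):
    """Try every interleaving of two 2-step threads; return those that lose an update."""
    schedules = set(itertools.permutations([0, 0, 1, 1]))
    return sorted(s for s in schedules if run(s) != expected)


if __name__ == "__main__":
    for schedule in find_buggy_schedules():
        print(schedule, "->", run(schedule))
```

Under a normal scheduler the lost update shows up only occasionally; once the scheduler is under the test's control, the failing interleavings are reproduced every time, which also makes them usable for replay-based debugging.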
SHOT Functional Requirements • Standardized Hardware Operation Tracker: SHOT • Low-latency reads so it can be deployed in production code • Can be read by OS and by user apps • To be used by virtual machines, must be able to save and restore as part of context switch • Since some counters are per core, SW must read all counters as if on same clock edge • Don’t need to be perfect counts, just consistent: accuracy ± 1% OK
Minimum SHOT Architecture • Global real time clock (vs. count clock cycles) • Since clock rate varies due to Dynamic Voltage and Frequency Scaling (DVFS) • ~ 100 MHz (fast enough for apps) • Count number of instructions retired per core • Measure computation throughput • Count off-chip memory traffic (incl. prefetching) • Key to performance and energy • Standard so apps and OS can rely on them • Standardized Hardware Measurement as important as IEEE Floating Point Standard?
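How software might turn these three counters into throughput figures can be sketched as follows. The `Snapshot` layout, the byte unit for memory traffic, and the exact 100 MHz tick rate are assumptions for illustration; the slides specify only which quantities are counted.

```python
from dataclasses import dataclass
from typing import List

CLOCK_HZ = 100_000_000  # assumed ~100 MHz global real-time clock


@dataclass(frozen=True)
class Snapshot:
    ticks: int           # global real-time clock value (unaffected by DVFS)
    retired: List[int]   # instructions retired, one entry per core
    offchip_bytes: int   # off-chip memory traffic incl. prefetches (unit assumed)


def rates(before: Snapshot, after: Snapshot):
    """Return (instructions/s per core, off-chip bytes/s) between two snapshots."""
    seconds = (after.ticks - before.ticks) / CLOCK_HZ
    ips = [(b - a) / seconds for a, b in zip(before.retired, after.retired)]
    bandwidth = (after.offchip_bytes - before.offchip_bytes) / seconds
    return ips, bandwidth


if __name__ == "__main__":
    start = Snapshot(ticks=0, retired=[0, 0], offchip_bytes=0)
    end = Snapshot(ticks=50_000_000, retired=[1_000_000, 2_000_000], offchip_bytes=4_000_000)
    print(rates(start, end))
```

Dividing by real time instead of cycle counts is what keeps the rates meaningful when DVFS changes the clock rate, and reading all per-core counters in one consistent snapshot is what makes the per-core figures comparable.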
Make productivity programmers efficient, and efficiency programmers productive? • Autotuning problem: the search space is large, so exploring it takes many cycles and a long time • Search Full Parameter Space • More than 180 Days • Using machine learning + few performance counters to democratize autotuning • 12 minutes to find solution • As good as, or even better than, the expert-designed autotuner! • -1% and 16% for a 7-pt Stencil • -2% and 15% for a 27-pt Stencil • 18% and 50% for dense matrix • Enables even greater range of optimizations than we imagined
Why might we succeed this time? • No Killer Microprocessor to Save Programmers • No one is building a faster serial microprocessor • For programs to go faster, SW must use parallel HW • New Metrics for Success vs. Linear Speedup • Real Time Latency/Responsiveness and/or MIPS/Joule • Just need some new killer parallel apps vs. all legacy SW must achieve linear speedup • Necessity: All the Wood Behind One Arrow • Whole industry committed, so more working on it • If future growth of IT depends on faster processing at same price (vs. lowering costs like NetBook)