The Stanford Pervasive Parallelism Lab A. Aiken, B. Dally, R. Fedkiw, P. Hanrahan, J. Hennessy, M. Horowitz, V. Koltun, C. Kozyrakis, K. Olukotun, M. Rosenblum, S. Thrun Pervasive Parallelism Laboratory Stanford University
The Looming Crisis • Software developers will soon face systems with • > 1 TFLOP of compute power • 20+ cores, 100+ hardware threads • Heterogeneous cores (CPU+GPUs), app-specific accelerators • Deep memory hierarchies • Challenge: harness these devices productively • Improve performance, power, reliability and security • The parallelism gap • Yawning divide between the capabilities of today’s programming environments, the requirements of emerging applications, and the challenges of future parallel architectures
The Stanford Pervasive Parallelism Laboratory • Goal: the parallel computing platform for 2012 • Make parallel programming practical for the masses • Algorithms, programming models, runtimes, and architectures for scalable parallelism (10,000s of threads) • Parallel computing as a core component of CS education • PPL is a combination of • Leading Stanford researchers across multiple domains • Applications, languages, software systems, architecture • Leading companies in computer systems and software • Sun, AMD, Nvidia, IBM, Intel, HP • An exciting vision for pervasive parallelism • Open laboratory; all results released as open source
The PPL Team • Applications • Ron Fedkiw, Vladlen Koltun, Sebastian Thrun • Programming & software systems • Alex Aiken, Pat Hanrahan, Mendel Rosenblum • Architecture • Bill Dally, John Hennessy, Mark Horowitz, Christos Kozyrakis, Kunle Olukotun (Director)
The PPL Team • Research expertise • Applications: graphics, physics simulation, visualization, AI, robotics, … • Software systems: virtual machines, GPGPU, stream programming, transactional programming, speculative parallelization, optimizing compilers, bug detection, security, … • Architecture: multi-core & multithreading, scalable shared-memory, transactional memory hardware, interconnect networks, low-power processors, stream processors, vector processors, … • Commercial success • MIPS & SGI, Rambus, VMware, Niagara processors, RenderMan, Stream Processors Inc, Avici, Tableau, …
Guiding Principles • Top-down research • App & developer needs drive system • High-level info flows to low-level system • Scalability • In hardware resources (10,000s of threads) • Developer productivity (ease of use) • HW provides flexible primitives • Software synthesizes complete solutions • Build real, full-system prototypes
The PPL Vision (layered stack, top to bottom) • Applications: Virtual Worlds, Autonomous Vehicle, Financial Services • Domain-specific languages: Rendering DSL, Physics DSL, Scripting DSL, Probabilistic DSL, Analytics DSL • Parallel Object Language • Common Parallel Runtime: Explicit / Static, Implicit / Dynamic • Hardware Architecture: SIMD Cores, OOO Cores, Threaded Cores, Isolation & Atomicity, Scalable Interconnects, Partitionable Hierarchies, Scalable Coherence, Pervasive Monitoring
Demanding Applications • Leverage domain expertise at Stanford • CS research groups & national centers for scientific computing • From consumer apps to neuroinformatics • [Figure: PPL at the center of existing Stanford research centers and CS research groups: Environmental Science, Media-X, DOE ASC, Seismic modeling, NIH NCBC, Geophysics, AI/ML, Vision, Graphics, Games, Mobile, HCI, Web, Mining, Streaming, DB]
Virtual Worlds Application • Next-gen web platform • Immersive collaboration • Social gaming • Millions of players in vast landscape • Challenges • Client-side game engine • Server-side world simulation • AI, physics, large-scale rendering • Dynamic content, huge datasets • More at http://vw.stanford.edu/
Autonomous Vehicle Application • Cars that drive autonomously in traffic • Save lives & money • Improve highway throughput • Improve productivity • Challenges • Client-side sensing, perception, planning, & control • Server-side data merging, pre-processing, & post-processing, traffic control, model generation • Real-time, huge datasets • More at http://www.stanfordracing.org
Domain Specific Languages (DSL) • Leverage success of DSLs across application domains • SQL (data manipulation), Matlab (scientific), Ruby/Rails (web), … • DSLs → higher productivity for developers • High-level data types & ops tailored to domain • E.g., relations, triangles, matrices, … • Express high-level intent without specific implementation artifacts • Programmer isolated from details of specific system • DSLs → scalable parallelism for the system • Declarative description of parallelism & locality patterns • E.g., ops on relation elements, sub-array being processed, … • Portable and scalable specification of parallelism • Automatically adjust data structures, mapping, and scheduling as systems scale up
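As a rough illustration of the "ops on relation elements" idea, the sketch below uses Python (not a PPL language) to show a domain type whose operations say what to compute while the partitioning and worker count stay hidden from the caller. The `Relation` class, its `where` and `total` methods, and the `workers` parameter are hypothetical names invented for this example; they are not part of any PPL system.

```python
import os
from concurrent.futures import ThreadPoolExecutor


class Relation:
    """Hypothetical domain type: a relation stored as a list of row dicts."""

    def __init__(self, rows, workers=None):
        self.rows = list(rows)
        # The mapping onto hardware is a runtime decision, not the caller's.
        self.workers = workers or os.cpu_count() or 1

    def _partitions(self):
        # Strided split into at most `workers` non-empty partitions.
        n = max(1, min(self.workers, len(self.rows)))
        return [self.rows[i::n] for i in range(n)]

    def where(self, predicate):
        """Keep rows matching `predicate`; row order is not preserved."""
        def scan(part):
            return [row for row in part if predicate(row)]

        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            parts = list(pool.map(scan, self._partitions()))
        return Relation([row for part in parts for row in part], self.workers)

    def total(self, column):
        """Sum one column using per-partition partial sums."""
        def partial(part):
            return sum(row[column] for row in part)

        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return sum(pool.map(partial, self._partitions()))


trades = Relation(
    [{"sym": "A", "qty": 10}, {"sym": "B", "qty": 5}, {"sym": "C", "qty": 20},
     {"sym": "D", "qty": 1}, {"sym": "E", "qty": 15}],
    workers=3,
)
large_total = trades.where(lambda row: row["qty"] >= 10).total("qty")
```

The calling code is identical whether `workers` is 1 or 64, which is the portability property the slide describes. A real system would also choose data layout and scheduling; this sketch only varies the partition count.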
DSL Research & Challenges • Goal: create the tools for DSL development • Initial DSL targets • Rendering, physics simulation, analytics, probabilistic computations • Challenges • DSL implementation → embed in base PL • Start with Scala (OO, type-safe, functional, extensible) • Use Scala as a scripting DSL that ties multiple DSLs together • DSL-specific optimizations → telescoping compilers • Use domain knowledge to optimize & annotate code • Feedback to programmers? • …
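To make "embed in base PL" and "use domain knowledge to optimize" concrete, here is a minimal sketch of an embedded expression DSL. It uses Python operator overloading as a stand-in for the Scala embedding the slide proposes; the node types (`Const`, `Var`, `Add`, `Mul`) and the `simplify` and `evaluate` functions are assumptions made up for this example, and the rewrite rules are ordinary algebraic identities rather than anything taken from a telescoping compiler.

```python
from dataclasses import dataclass


class Expr:
    """Base class: host-language operators build a DSL expression tree."""

    def __add__(self, other):
        return Add(self, lift(other))

    def __mul__(self, other):
        return Mul(self, lift(other))


@dataclass(frozen=True)
class Const(Expr):
    value: float


@dataclass(frozen=True)
class Var(Expr):
    name: str


@dataclass(frozen=True)
class Add(Expr):
    left: Expr
    right: Expr


@dataclass(frozen=True)
class Mul(Expr):
    left: Expr
    right: Expr


def lift(x):
    """Wrap plain numbers so they can appear inside DSL expressions."""
    return x if isinstance(x, Expr) else Const(x)


def simplify(e):
    """Domain-specific rewrites: constant folding, x+0, x*1, x*0."""
    if isinstance(e, (Const, Var)):
        return e
    l, r = simplify(e.left), simplify(e.right)
    if isinstance(e, Add):
        if isinstance(l, Const) and isinstance(r, Const):
            return Const(l.value + r.value)
        if l == Const(0):
            return r
        if r == Const(0):
            return l
        return Add(l, r)
    # Remaining case: Mul
    if isinstance(l, Const) and isinstance(r, Const):
        return Const(l.value * r.value)
    if l == Const(1):
        return r
    if r == Const(1):
        return l
    if Const(0) in (l, r):
        return Const(0)
    return Mul(l, r)


def evaluate(e, env):
    """Interpret an expression tree given variable bindings in `env`."""
    if isinstance(e, Const):
        return e.value
    if isinstance(e, Var):
        return env[e.name]
    l, r = evaluate(e.left, env), evaluate(e.right, env)
    return l + r if isinstance(e, Add) else l * r


x = Var("x")
expr = (x * 1 + 0) * (Const(2) + 3)   # built with ordinary host syntax
optimized = simplify(expr)            # Mul(Var("x"), Const(5))
```

Because the DSL program is a data structure in the host language, the optimizer can inspect and rewrite it before anything runs; the same hook is where parallelism or locality annotations could be attached.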
Common Parallel Runtime (CPR) • Goals • Provide common, portable, abstract target for all DSLs • Write once, run everywhere model • Manages parallelism & locality • Achieve efficient execution (performance, power, …) • Handles specifics of HW system • Approach • Compile DSLs to common IR • Base language + low-level constructs & pragmas • Forall, async/join, atomic, barrier, … • Per-object capabilities • Read-only or write-only, output data, private, relaxed coherence, … • Combine static compilation + dynamic management • Explicit management of regular tasks & predictable patterns • Implicit management of irregular parallelism
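The low-level constructs named above (forall, async/join, atomic, barrier) can be approximated with standard Python threading primitives. The sketch below is only an analogy under stated assumptions: `spawn`, `join`, `forall`, and `atomic` are hypothetical helper names, a single global lock stands in for atomicity, and the pool size of 4 is arbitrary. It says nothing about how the CPR intermediate representation is actually defined.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)  # arbitrary worker count
_lock = threading.Lock()                   # one global lock models "atomic"


def spawn(fn, *args):
    """async: start a task and return a handle to it."""
    return _pool.submit(fn, *args)


def join(handles):
    """join: wait for all handles and collect their results in order."""
    return [h.result() for h in handles]


def forall(items, body):
    """forall: run `body` on every item, returning when all are done."""
    join([spawn(body, item) for item in items])


def atomic():
    """atomic: context manager guarding a critical section."""
    return _lock


# async/join: independent tasks with results.
powers = join([spawn(pow, 2, k) for k in range(4)])

# forall + atomic: concurrent updates to shared state.
histogram = {}


def count_word(word):
    with atomic():
        histogram[word] = histogram.get(word, 0) + 1


forall(["a", "b", "a", "c", "a"], count_word)


# barrier: no thread starts phase 2 until all have finished phase 1.
def phase_worker(i, barrier, data, out):
    data[i] = i * i       # phase 1: write own slot
    barrier.wait()        # wait until every slot is written
    out[i] = sum(data)    # phase 2: read all slots


n = 4
barrier = threading.Barrier(n)
data, out = [0] * n, [0] * n
threads = [threading.Thread(target=phase_worker, args=(i, barrier, data, out))
           for i in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The barrier example uses dedicated threads rather than the shared pool because a barrier needs all participants running at once; a pool with fewer live workers than participants would deadlock. That kind of constraint is one reason the slides pair these constructs with a runtime that manages scheduling.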
CPR Research & Challenges • Integrating & balancing opposing approaches • Task-level & data-level parallelism • Static & dynamic concurrency management • Explicit & implicit memory management • Utilize high-level information from DSLs • The key to overcoming difficult challenges • Adapt to changes in application behavior, OS decisions, runtime constraints • Manage heterogeneous HW resources • Utilize novel HW primitives • To reduce overhead of communication, synchronization, … • To understand runtime behavior on specific HW & adapt to it
Hardware Architecture @ 2012 • The many-core chip • 100s of cores • OOO, threaded, & SIMD • Hierarchy of shared memories • Scalable, on-chip network • The system • Few many-core chips • Per-chip DRAM channels • Global address space • The data-center • Cluster of systems • [Figure: many-core chip with threaded (TC), OOO, and SIMD cores, per-core L1s, shared L2 memories, an L3 memory, a DRAM controller, and I/O]
Architecture Challenges • Heterogeneity • Balance of resources, granularity of parallelism • Support for parallelism & locality management • Synchronization, communication, … • Explicit vs. implicit locality management • Runtime monitoring • Scalability • On-chip/off-chip bandwidth & latency • Scalability of key abstractions (e.g., coherence) • Beyond performance • Power, fault-tolerance, QoS, security, virtualization
Architecture Research • Revisit architecture & micro-architecture for parallelism • Define semantics & implementation of key primitives • Communication, atomicity, isolation, partitioning, coherence, consistency, checkpoint • Fine-grain & bulk support • Software synthesizes primitives into execution systems • Streaming system: partitioning + bulk communication • Thread-level spec: isolation + fine-grain communication • Transactional memory: atomicity + isolation + consistency • Security: partitioning + isolation • Fault tolerance: isolation + checkpoint + bulk communication • Challenges: interactions, scalability, cost, virtualization
Architecture Research • Software-managed HW primitives • Exploit high-level knowledge from DSLs & CPR • E.g., scale coherence using coarse-grain techniques • Coarse-grain in time: force coherence only when needed • Coarse-grain in space: object-based, selective coherence • Support for programmability & management • Fine-grain monitoring, HW-assisted invariants • Build upon primitives for concurrency • Efficient interface to CPR • Scalable on-chip & off-chip interconnects • High-radix network • Adaptive routing
Research Methodology • Conventional approaches are still useful • Develop app & SW system on existing platforms • Multi-core, accelerators, clusters, … • Simulate novel HW mechanisms • Need some method that bridges HW & SW research • Makes new HW features available for SW research • Does not compromise HW speed, SW features, or scale • Allows for full-system prototypes • Needed for research, convincing for industry, exciting for students • Approach: commodity chips + FPGAs in memory system • Commodity chips: fast system with rich SW environment • FPGAs: prototyping platform for new HW features • Scale through cluster arrangement
FARM: Flexible Architecture Research Machine • [Figure, four builds of the same diagram] • Build 1: four quad-core chips (Core 0 to Core 3 each) with four attached memories • Build 2: three quad-core chips plus an FPGA with SRAM and IO • Build 3: two quad-core chips, an FPGA with SRAM, a GPU/Stream unit, and IO • Build 4: three quad-core chips plus an FPGA with SRAM and IO, scalable memory, joined by an Infiniband or PCIe interconnect
Example FARM Uses • Software research • SW development for heterogeneous systems • Code generation & resource management • Scheduling system for large-scale parallelism • Thread state management & adaptive control • Hardware research • Scalable streaming & transactional HW • FPGA extends protocols throughout cluster • Scalable shared memory • FPGA provides coarse-grain tracking • Hybrid memory systems • Custom processors & accelerators • HW support for monitoring, scheduling, isolation, virtualization, …
Conclusions • PPL: a full system vision for pervasive parallelism • Applications, programming models, software systems, and hardware architecture • Key initial ideas • Domain-specific languages • Combine implicit & explicit management • Flexible HW features