

  1. An Introduction to Goblin-Core64: A Massively Parallel Processor Architecture Designed for Complex Data Analytics. John D. Leidel, jleidel<at>ttu<dot>edu

  2. Overview • Data Intensive Computing Architectural Challenges • The destruction of cache efficiency using irregular algorithms • Goblin-Core64 Architecture Infrastructure Design • Sustainable Exascale performance with data intensive applications • Progress and Roadmap • The path forward

  3. Data Intensive Computing Architectural Challenges: The destruction of cache efficiency using irregular algorithms

  4. What is Big Data?…and how does it relate to HPC? • Problem spaces outside of traditional HPC are now encountering the same problems that we find in HPC • Complexity • Time to Solution • Scale • These problems are generally not • Simulating the physical world • Bound by simple floating point performance • As the problem scales, the result set is fixed • These problems are generally • Sparse in nature • Contain complex [sometimes unconstrained] data types • As the problem scales, the result set scales • The other side of the HPC coin

  5. Three Drivers to HPC Solutions [diagram of the three drivers; the next slide pairs them as Time, Complexity, and Scale]

  6. Convergence Criteria for HPC Adoption • Time + Complexity • Fraud Detection • High Performance Trading Analytics • Time + Scale • Power Grid Analytics • Graph500 Benchmark • Complexity + Scale • Epidemiology • Agent-Based Network Analytics • Time + Complexity + Scale • Grand Challenge Problems • Cyber Analytics

  7. Dense Solver Efficiency

  8. Sparse Solver Efficiency [chart; includes cache-less architectures]

  9. Goblin-Core64 Architecture Infrastructure Design: Sustainable Exascale performance with data intensive applications

  10. Goal: Build an architecture that efficiently maps programming model concepts to hardware in order to improve data intensive [sparse] application throughput

  11. The Result: Goblin-Core64 • Hierarchical set of architectural modules that provide: • Native PGAS memory addressing • High-efficiency RISC ISA • SIMD capabilities • Architectural notion of “tasks” • Latency hiding techniques • Single cycle context/task switching • Advanced synchronization techniques • Ease the burden of barriers and sync points by eliminating spin waits • Memory coalescing/aggregation • Local requests • Global requests • AMOs • Makes use of latest high bandwidth memory technologies • Hybrid Memory Cube

  12. Goblin-Core64 Modules [block diagram: Task Unit, Task Proc, Task Group, MMU, ALU, Task Reg, SIMD, GC64 Socket, Coalesce Unit, SoC network, AMO Unit, HMC Memory Interface, Software Managed Scratchpad, Packet Engine, Peripherals]

  13. GC64 Module Hierarchy • Task Unit • Smallest divisible unit; register file + control logic • Task Proc • Multiple Task Units + context switch control logic • Task Group • Multiple Task Procs + local MMU • GC64 Socket • Coalesce Unit: coalesces adjacent memory requests into a single payload • AMO Unit: intelligently handles local AMO requests • HMC Unit: HMC packet request engine + SERDES • Software Managed Scratchpad: on-chip memory • Packet Engine: off-chip memory interface
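  As a concrete (and purely illustrative) reading of the hierarchy above, the C sketch below models a Task Unit as a register file plus control state, grouped into Task Procs and Task Groups; all type names, field names, and sizes are assumptions, not taken from the GC64 specification.

    /* Hypothetical C sketch of the GC64 module hierarchy described above.
     * Type names, field names, and sizes are illustrative only. */
    #include <stdint.h>

    #define GC64_NUM_REGS   64   /* assumed register-file size        */
    #define TASKS_PER_PROC   8   /* assumed task units per task proc  */
    #define PROCS_PER_GROUP  4   /* assumed task procs per task group */

    typedef struct {             /* Task Unit: register file + control state */
      uint64_t regs[GC64_NUM_REGS];
      uint64_t pc;
      uint32_t outstanding_reqs; /* feeds the context-switch pressure model */
    } gc64_task_unit;

    typedef struct {             /* Task Proc: task units + switch control */
      gc64_task_unit units[TASKS_PER_PROC];
      unsigned active;           /* index of the task unit that owns the ALU */
    } gc64_task_proc;

    typedef struct {             /* Task Group: task procs + a local MMU */
      gc64_task_proc procs[PROCS_PER_GROUP];
      /* local MMU state would live here */
    } gc64_task_group;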

  14. GC64 Scalable Units [diagram]

  15. GC64 Execution Model • GC64 execution model provides “pressure-driven”, single-cycle context switching between threads/tasks • Pressure state machine provides fair sharing of the ALU based upon: • Number of outstanding requests • Statistical probability of a register stall • Number of cycles in current execution context • Minimum execution is two instructions • Based upon instruction format [diagram: ALU, SIMD, context switch state machine, task unit mux]
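  To make the pressure idea concrete, here is a minimal sketch of a switch decision that weighs the three inputs named on the slide and honors the two-instruction minimum; the weights and the threshold are invented for illustration and are not part of the GC64 design.

    /* Illustrative pressure function for single-cycle task switching.
     * Only the three inputs and the two-instruction minimum come from
     * the slide; the weights and threshold below are invented. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
      uint32_t outstanding_reqs;  /* memory requests still in flight      */
      uint32_t stall_probability; /* scaled 0..100: chance of a reg stall */
      uint32_t cycles_in_context; /* cycles since this task won the ALU   */
      uint32_t insts_issued;      /* instructions issued in this context  */
    } task_pressure;

    static bool should_switch(const task_pressure *p)
    {
      if (p->insts_issued < 2)    /* minimum of two instructions per visit */
        return false;
      uint32_t pressure = 4 * p->outstanding_reqs
                        + 2 * p->stall_probability
                        + 1 * p->cycles_in_context;
      return pressure > 128;      /* hypothetical switch threshold */
    }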

  16. GC64 Unified [PGAS] Addressing • GC64 physical addressing provides block access to: • Local HMC [main] memory • Local software-managed scratchpad • Globally mapped [remote] memory • Pointer arithmetic between memory spaces • Obeys all the constraints of paged, virtual memory • Physical address specification: Unused [63:50], Socket [49:42] (remote socket ID), Reserved [41:38], CUB [37:33], Base Physical Address [33:0] • Physical address destinations [diagram]: remote memory (via remote socket ID), scratchpad (CUB = 0xF), local HMC memory (one of 8 local HMC devices)
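  A hedged decoding sketch for the address fields above: as transcribed, the CUB and base fields share bit 33, so the sketch assumes CUB occupies [37:33] and the base occupies [32:0], and the meaning of the CUB values is read off the diagram residue rather than the specification.

    /* Decoding the GC64 physical address fields listed above (assumed
     * layout: CUB in [37:33], base in [32:0]). The meaning of CUB = 0xF
     * versus CUB = 0..7 is an assumption, not a confirmed spec detail. */
    #include <stdint.h>

    #define GC64_SOCKET(a) (((a) >> 42) & 0xFFull)   /* bits [49:42]: remote socket ID   */
    #define GC64_CUB(a)    (((a) >> 33) & 0x1Full)   /* bits [37:33]: cube/device select */
    #define GC64_BASE(a)   ((a) & 0x1FFFFFFFFull)    /* bits [32:0]: base address        */

    #define GC64_CUB_SCRATCHPAD 0xF                  /* assumed scratchpad selector */

    static int gc64_targets_scratchpad(uint64_t paddr)
    {
      return GC64_CUB(paddr) == GC64_CUB_SCRATCHPAD;
    }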

  17. GC64 ISA • Simple 64-bit instruction format • Two instructions per payload • Optional immediate payload • Instruction control block • Specifies immediate values, breakpoints and vector register aliasing
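  One speculative reading of those bullets in C: a 64-bit payload holding two instruction slots plus a control block, optionally followed by an immediate word. All field widths and names below are guesses, since the slide does not give the bit layout.

    /* Speculative GC64 instruction payload layout. Only "two instructions
     * per payload", the control block, and the optional immediate come
     * from the slide; every width here is an assumption. */
    #include <stdint.h>

    typedef struct {
      uint64_t inst0 : 28;   /* first instruction slot (width assumed)  */
      uint64_t inst1 : 28;   /* second instruction slot (width assumed) */
      uint64_t ctrl  : 8;    /* control block: immediate, breakpoint,
                                and vector-register-aliasing flags */
    } gc64_payload;

    typedef struct {
      gc64_payload payload;
      uint64_t     imm;      /* optional immediate payload, present only
                                when the control block requests it */
    } gc64_bundle;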

  18. GC64 Vector Register Aliasing • Vector register aliasing provides access to scalar register file from SIMD unit • No need for additional vector register file • Increasing the data path, not the physical storage • Compiler optimizations can be used to perform complex, irregular operations • Vector-Scalar-Vector arithmetic • Vector Fill • Scatter/Gather
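  The aliasing idea can be illustrated with a few lines of C: a "vector register" is simply a window over the scalar register file, so the SIMD unit widens the data path without adding storage. The register count and lane width below are assumptions.

    /* Vector register aliasing sketch: SIMD lanes read a window of the
     * scalar register file rather than a separate vector file. The
     * register count and SIMD width are assumed values. */
    #include <stdint.h>

    #define NUM_SCALAR_REGS 64
    #define SIMD_WIDTH       4

    static uint64_t scalar_rf[NUM_SCALAR_REGS];

    /* "Vector register" vbase is scalar registers [vbase, vbase+SIMD_WIDTH). */
    static uint64_t *alias_vector(unsigned vbase)
    {
      return &scalar_rf[vbase];      /* no copy, no second register file */
    }

    static void vadd(unsigned vd, unsigned va, unsigned vb)
    {
      uint64_t *d = alias_vector(vd), *a = alias_vector(va), *b = alias_vector(vb);
      for (unsigned i = 0; i < SIMD_WIDTH; i++)
        d[i] = a[i] + b[i];          /* wider data path over the same storage */
    }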

  19. GC64 Potential Performance [chart] • SIMD width = 4 • Task issue rate = 2 • Clock = 1 GHz • *Max config
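  Taking those parameters at face value, a rough upper bound is 4 SIMD lanes × 2 instructions issued per cycle × 1 GHz ≈ 8 billion operations per second per issuing unit; this back-of-the-envelope product is an assumption about how the listed rates combine, not a figure read from the chart.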

  20. Progress and Roadmap: The path forward

  21. GC64 Progress Report • Complete • GC64 ISA definition • Physical Address Format • Execution Model • HMC Simulator [to be used in the GC64 sim] • Two papers submitted, third paper in progress • First academic publications on Hybrid Memory Cube technology • In Progress • Architecture specification document • GC64 Simulator • ABI definition • Virtual addressing model • Compiler & Binutils [LLVM] • Active Research Topics • Memory coalescing & AMO techniques [Spring 2014] • Context switch pressure model • Software managed scratchpad organization • Off-chip network protocol • Thread/task runtime optimization

  22. HMC-Sim Stream Triad Results
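  For reference, the kernel being measured is the standard STREAM Triad, a[i] = b[i] + q·c[i]; the sketch below shows the kernel only, with a placeholder array size and scalar rather than the configuration actually run under HMC-Sim.

    /* Standard STREAM Triad kernel, shown for reference only; the
     * results chart itself is not reproduced. N and the scalar are
     * placeholders, not the HMC-Sim configuration. */
    #include <stddef.h>

    #define N (1 << 20)              /* placeholder array length */

    static double a[N], b[N], c[N];

    static void stream_triad(double scalar)
    {
      for (size_t i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i]; /* 2 reads + 1 write per element */
    }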

  23. Goblin-Core64 • Source code and specification are licensed under a BSD license • www.gc64.org • Source code • Architecture documentation • Developer documentation

  24. References
  [1] John D. Leidel. Convey ThreadSim: A Simulation Framework for Latency-Tolerant Architectures. High Performance Computing Symposium: HPC2011, Boston, MA. April 6, 2011.
  [2] John D. Leidel. Designing Heterogeneous Multithreaded Instruction Sets from the Programming Model Down. 2012 SIAM Conference on Parallel Processing for Scientific Computing, Savannah, Georgia. February 2012.
  [3] John D. Leidel, Kevin Wadleigh, Joe Bolding, Tony Brewer, and Dean Walker. 2012. CHOMP: A Framework and Instruction Set for Latency Tolerant, Massively Multithreaded Processors. In Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC ’12). IEEE Computer Society, Washington, DC, USA, 232-239.
  [4] John D. Leidel. Toward a General Purpose Partitioned Global Virtual Address Specification for Heterogeneous Exascale Architectures. 2013 Exascale Applications and Software Conference, Edinburgh, Scotland, UK. April 2013.
  [5] John D. Leidel, Geoffrey Rogers, Joe Bolding. Toward a Scalable Heterogeneous Runtime System for the Convey MX Architecture. 2013 Workshop on Multithreaded Architectures and Applications, Boston, MA. May 2013.
  [6] https://code.google.com/p/goblin-core/
  [7] http://www.hybridmemorycube.org/
  [8] http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.18.3-memory-FPGA/HC23.18.320-HybridCube-Pawlowski-Micron.pdf
  [9] John D. Leidel, Yong Chen. A High Fidelity and Accurate Simulation Framework for Hybrid Memory Cube Devices. 2014 International Parallel and Distributed Processing Symposium. Submitted.

  25. Backup

  26. What did we learn from CHOMP? • Pros • We can design tightly coupled ISAs and runtime models that are extremely efficient • Each instruction becomes precious & necessary • Code generation is quite natural • Allows the compiler to find the best opportunities for optimization • Latency hiding characteristics function as designed • Single-cycle context switch mechanisms function as designed • AMOs are increasingly useful • For more than just memory protection • RISC ISAs are still providing high performance architectures • VLIW and JIT’ing is unnecessary overhead/area

  27. What did we learn from CHOMP? • Cons • NEED MORE BANDWIDTH! • Tests with dense per-thread memory operations could utilize ~4X more bandwidth with no other changes • Designing for an FPGA has its constraints • The lack of ILP hinders arithmetic throughput in some applications • Even SpMV kernels can utilize ILP • Paged virtual memory is always expensive • Not all applications/programmers exploit large pages • Use the native runtime! • Don’t simply rely on the compiler to generate efficient code for higher-level parallel languages [OpenMP, et al.]; use the machine-level runtime • Cache is bad, but coalescing is good • Cache is expensive to implement and often impedes performance • We can occasionally take advantage of spatial locality at access time

  28. GC64 Research • PGAS Addressing • How do we develop a segmented address directory w/o the use of a TLB? • How do we distribute and maintain this directory while providing applications “virtual memory” security? • Efficient Synchronization • Building efficient synchronization algorithms using the GC64 ISA mechanisms and available bandwidth is orthogonal to traditional implementations • Memory Coalescing • Development of memory request coalescing algorithms for local, remote and AMO-type requests [see the sketch after this slide] • How do our synchronization techniques play into this? • Software-Managed Scratchpad • Is there room/desire to add features such as this? • What additional pressure does this put on the programming model/compiler? • Compilation Techniques • Providing MTA-C style loop transformations • https://sites.google.com/site/parallelizationforllvm/loop-transforms • Runtime Models • Building a machine-level optimized runtime that is programming model agnostic • HMC Integration • HMC 1.0 specification is available; very different from traditional DRAM technology
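  As one possible shape of the coalescing question above, the sketch below merges address-adjacent requests in a sorted queue into single payloads; the request format, the in-place queue, and the adjacency test are all invented for illustration and say nothing about how GC64 will actually implement coalescing.

    /* Hypothetical memory-request coalescing pass: merge requests whose
     * address ranges touch into one payload. Assumes the queue is
     * already sorted by address; every detail here is illustrative. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
      uint64_t addr;   /* starting physical address */
      uint32_t bytes;  /* request size in bytes     */
    } mem_req;

    /* Compacts the queue in place and returns the new request count. */
    static size_t coalesce(mem_req *q, size_t n)
    {
      size_t out = 0;
      for (size_t i = 0; i < n; i++) {
        if (out > 0 && q[out - 1].addr + q[out - 1].bytes == q[i].addr)
          q[out - 1].bytes += q[i].bytes;  /* extend the previous payload */
        else
          q[out++] = q[i];
      }
      return out;
    }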
