An Introduction to Goblin-Core64: A Massively Parallel Processor Architecture Designed for Complex Data Analytics John D. Leidel jleidel <at> ttu <dot> edu
Overview • Data Intensive Computing Architectural Challenges • The destruction of cache efficiency using irregular algorithms • Goblin-Core64 Architecture Infrastructure Design • Sustainable Exascale performance with data intensive applications • Progress and Roadmap • The path forward
Data Intensive Computing Architectural Challenges: The destruction of cache efficiency using irregular algorithms
What is Big Data?…and how does it relate to HPC? • Problem spaces outside of traditional HPC are now encountering the same problems that we find in HPC • Complexity • Time to Solution • Scale • These problems are generally not • Simulating the physical world • Bound by simple floating point performance • As the problem scales, the result set is fixed • These problems are generally • Sparse in nature • Contain complex [sometimes unconstrained] data types • As the problem scales, the result set scales • The other side of the HPC coin
Three Drivers to HPC Solutions [Venn diagram: three drivers, Time, Complexity, and Scale, converging on HPC]
Convergence Criteria for HPC Adoption • Time + Complexity • Fraud Detection • High Performance Trading Analytics • Time + Scale • Power Grid Analytics • Graph500 Benchmark • Complexity + Scale • Epidemiology • Agent-Based Network Analytics • Time + Complexity + Scale • Grand Challenge Problems • Cyber Analytics
Sparse Solver Efficiency: Cache-less Architectures
Goblin-Core64 Architecture Infrastructure Design: Sustainable Exascale performance with data intensive applications
Goal Build an architecture that efficiently maps programming model concepts to hardware in order to improve data intensive [sparse] application throughput
The Result: Goblin-Core64 • Hierarchical set of architectural modules that provide: • Native PGAS memory addressing • High efficiency RISC ISA • SIMD capabilities • Architectural notion of “tasks” • Latency hiding techniques • Single cycle context/task switching • Advanced synchronization techniques • Ease the burden of barriers and sync points by eliminating spin waits • Memory coalescing/aggregation • Local requests • Global requests • AMOs [atomic memory operations] • Makes use of latest high bandwidth memory technologies • Hybrid Memory Cube
Goblin-Core64 Modules [block diagram: Task Units within Task Procs within Task Groups, each with task registers, MMU, ALU, and SIMD units; the GC64 socket adds the Coalesce Unit, AMO Unit, HMC memory interface, software-managed scratchpad, packet engine, and peripherals on the SoC network]
GC64 Module Hierarchy • Task Unit • Smallest divisible unit; register file + control logic • Task Proc • Multiple Task Units + context switch control logic • Task Group • Multiple Task Procs + local MMU • GC64 Socket • Coalesce Unit: coalesces adjacent memory requests into a single payload • AMO Unit: intelligently handles local AMO requests • HMC Unit: HMC packet request engine + SERDES • Software-Managed Scratchpad: on-chip memory • Packet Engine: off-chip memory interface
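The Coalesce Unit's job, merging adjacent memory requests into a single payload, can be sketched in a few lines. This is an illustrative model only: the function names and the 128-byte payload cap are assumptions, not taken from the GC64 specification.

```python
# Hypothetical sketch of the Coalesce Unit: address-adjacent (addr, size)
# requests merge into one larger payload before being issued to the HMC
# interface. The 128-byte max payload is an assumed parameter.

def coalesce(requests, max_payload=128):
    """Merge address-adjacent requests, capped at max_payload bytes each."""
    merged = []
    for addr, size in sorted(requests):
        if merged:
            last_addr, last_size = merged[-1]
            # Adjacent (or overlapping) and still under the payload cap?
            if addr <= last_addr + last_size and (addr + size) - last_addr <= max_payload:
                merged[-1] = (last_addr, max(last_size, addr + size - last_addr))
                continue
        merged.append((addr, size))
    return merged

reqs = [(0x1000, 8), (0x1008, 8), (0x1010, 8), (0x2000, 8)]
print(coalesce(reqs))  # three contiguous 8-byte requests collapse into one 24-byte payload
```

In hardware the merge window would be bounded in time and entries rather than sorting a full list, but the payoff is the same: fewer, larger packets on the memory network.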
GC64 Scalable Units [scaling diagram: Task Units replicated across Task Procs, Task Groups, and sockets]
GC64 Execution Model • GC64 execution model provides “pressure-driven”, single cycle context switching between threads/tasks • Pressure state machine provides fair-sharing of the ALU based upon: • Number of outstanding requests • Statistical probability of a register stall • Number of cycles in current execution context • Minimum execution is two instructions • Based upon instruction format [diagram: context-switch state machine multiplexing Task Units onto the ALU/SIMD units]
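The pressure-driven switch above can be modeled as a scoring function over the three named inputs. The weights below are pure assumptions for the sketch; the slide names the inputs but not the actual state machine.

```python
# Toy model of the "pressure-driven" context switch: the runnable task with
# the lowest pressure wins the ALU. Weights (1, 10, 0.1) are illustrative
# assumptions, not the GC64 state machine.

class Task:
    def __init__(self, name, outstanding, stall_prob, cycles_in_context):
        self.name = name
        self.outstanding = outstanding      # outstanding memory requests
        self.stall_prob = stall_prob        # probability of a register stall
        self.cycles = cycles_in_context     # cycles in current execution context

def pressure(task):
    # Higher pressure means the task is more likely to be switched out.
    return task.outstanding + 10 * task.stall_prob + 0.1 * task.cycles

def next_task(current, ready, issued, min_insts=2):
    # Minimum execution is two instructions before a switch is considered.
    if issued < min_insts:
        return current
    return min(ready + [current], key=pressure)  # fair-share the ALU

cur = Task("t0", outstanding=4, stall_prob=0.9, cycles_in_context=100)
rdy = [Task("t1", 0, 0.1, 0), Task("t2", 2, 0.5, 10)]
print(next_task(cur, rdy, issued=2).name)  # t1
```

In the real design this decision happens in a single cycle in hardware; the point of the sketch is only the shape of the fairness policy.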
GC64 Unified [PGAS] Addressing • GC64 physical addressing provides block access to: • Local HMC [main] memory • Local software-managed scratchpad • Globally mapped [remote] memory • Pointer arithmetic between memory spaces • Obeys all the constraints of paged, virtual memory
Physical Address Specification: Unused [63:50] | Socket [49:42] | Reserved [41:38] | CUB [37:33] | Base Physical Address [32:0]. The Socket field carries the remote socket ID; CUB selects one of 8 local HMC devices, with CUB = 0xF routing to remote memory. Physical address destinations: remote memory, scratchpad, or local HMC memory.
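The field layout above decodes with plain shifts and masks. In this sketch the field positions follow the slide (with the base taken as bits [32:0] so fields do not overlap); the helper names are illustrative, not part of the specification.

```python
# Decoding sketch for the GC64 unified physical address layout:
#   Unused [63:50] | Socket [49:42] | Reserved [41:38] | CUB [37:33] | Base [32:0]

def decode_gc64_addr(pa):
    return {
        "socket": (pa >> 42) & 0xFF,      # bits [49:42]: remote socket ID
        "cub":    (pa >> 33) & 0x1F,      # bits [37:33]: CUB device select
        "base":   pa & ((1 << 33) - 1),   # bits [32:0]: base physical address
    }

def is_remote(pa):
    # CUB = 0xF routes the request to remote [globally mapped] memory.
    return decode_gc64_addr(pa)["cub"] == 0xF

pa = (3 << 42) | (0xF << 33) | 0x1000   # socket 3, remote, offset 0x1000
print(decode_gc64_addr(pa), is_remote(pa))
```

Because the whole space is one flat physical encoding, pointer arithmetic between memory spaces is just integer arithmetic on `pa`, which is exactly what the unified addressing scheme is after.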
GC64 ISA • Simple 64-bit instruction format • Two instructions per payload • Optional immediate payload • Instruction control block • Specifies immediate values, breakpoints and vector register aliasing
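The "two instructions per payload" format can be illustrated with simple packing helpers. The 32-bit width of each instruction word is an assumption for the sketch; the slide states only that the 64-bit payload carries two instructions plus an optional immediate payload.

```python
# Sketch of the two-instructions-per-payload format: two assumed-32-bit
# instruction words packed into one 64-bit payload word.

def pack_payload(insn0, insn1):
    assert insn0 < (1 << 32) and insn1 < (1 << 32)
    return (insn1 << 32) | insn0          # insn0 in the low half, insn1 high

def unpack_payload(payload):
    return payload & 0xFFFFFFFF, payload >> 32

p = pack_payload(0xDEADBEEF, 0x0BADF00D)
print(hex(p), [hex(i) for i in unpack_payload(p)])
```

In the real format an instruction control block would additionally flag immediate values, breakpoints, and vector register aliasing; that metadata is omitted here.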
GC64 Vector Register Aliasing • Vector register aliasing provides access to scalar register file from SIMD unit • No need for additional vector register file • Increasing the data path, not the physical storage • Compiler optimizations can be used to perform complex, irregular operations • Vector-Scalar-Vector arithmetic • Vector Fill • Scatter/Gather
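The aliasing idea, widening the data path without adding physical storage, can be modeled by letting SIMD operations address a window of the scalar register file directly. The 32-entry register file is an assumption; the 4-lane width matches the SIMD width quoted later in the deck.

```python
# Toy model of vector register aliasing: the SIMD unit reads a window of the
# scalar register file as vector lanes -- there is no separate vector file.

SIMD_WIDTH = 4
regs = [0] * 32                      # scalar register file (size assumed)

def valias(start):
    """View scalar registers [start, start+SIMD_WIDTH) as vector lanes."""
    return regs[start:start + SIMD_WIDTH]

def vadd(dst, a, b):
    """Vector add over aliased scalar registers."""
    for lane in range(SIMD_WIDTH):
        regs[dst + lane] = regs[a + lane] + regs[b + lane]

regs[8:12]  = [1, 2, 3, 4]
regs[12:16] = [10, 20, 30, 40]
vadd(0, 8, 12)
print(valias(0))  # [11, 22, 33, 44]
```

Because scalar and vector code share one register file, the compiler can mix vector-scalar-vector arithmetic, vector fills, and scatter/gather without moving data between files, which is the optimization opportunity the slide points at.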
GC64 Potential Performance [chart] • SIMD Width = 4 • Task Issue Rate = 2 • Cycle = 1 GHz • *Max config
Progress and Roadmap: The path forward
GC64 Progress Report • Complete • GC64 ISA definition • Physical Address Format • Execution Model • HMC Simulator [to be used in the GC64 sim] • Two papers submitted, a third in progress • First academic publications on Hybrid Memory Cube technology • In Progress • Architecture specification document • GC64 Simulator • ABI definition • Virtual addressing model • Compiler & Binutils [LLVM] • Active Research Topics • Memory coalescing & AMO techniques [Spring 2014] • Context switch pressure model • Software-managed scratchpad organization • Off-chip network protocol • Thread/task runtime optimization
Goblin-Core64 • Source code and specification is licensed under a BSD license • www.gc64.org • Source code • Architecture documentation • Developer documentation
References
[1] John D. Leidel. Convey ThreadSim: A Simulation Framework for Latency-Tolerant Architectures. High Performance Computing Symposium: HPC2011, Boston, MA. April 6, 2011.
[2] John D. Leidel. Designing Heterogeneous Multithreaded Instruction Sets from the Programming Model Down. 2012 SIAM Conference on Parallel Processing for Scientific Computing, Savannah, Georgia. February 2012.
[3] John D. Leidel, Kevin Wadleigh, Joe Bolding, Tony Brewer, and Dean Walker. 2012. CHOMP: A Framework and Instruction Set for Latency Tolerant, Massively Multithreaded Processors. In Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC ’12). IEEE Computer Society, Washington, DC, USA, 232-239.
[4] John D. Leidel. Toward a General Purpose Partitioned Global Virtual Address Specification for Heterogeneous Exascale Architectures. 2013 Exascale Applications and Software Conference, Edinburgh, Scotland, UK. April 2013.
[5] John D. Leidel, Geoffrey Rogers, Joe Bolding. Toward a Scalable Heterogeneous Runtime System for the Convey MX Architecture. 2013 Workshop on Multithreaded Architectures and Applications, Boston, MA. May 2013.
[6] https://code.google.com/p/goblin-core/
[7] http://www.hybridmemorycube.org/
[8] http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.18.3-memory-FPGA/HC23.18.320-HybridCube-Pawlowski-Micron.pdf
[9] John D. Leidel, Yong Chen. A High Fidelity, and Accurate Simulation Framework for Hybrid Memory Cube Devices. 2014 International Parallel and Distributed Processing Symposium. Submitted.
What did we learn from CHOMP? • Pros • We can design tightly coupled ISAs and runtime models that are extremely efficient • Each instruction becomes precious & necessary • Code generation is quite natural • Gives the compiler the best opportunities for optimization • Latency hiding characteristics function as designed • Single-cycle context switch mechanisms function as designed • AMOs are increasingly useful • For more than just memory protection • RISC ISAs are still providing high performance architectures • VLIW and JIT’ing is unnecessary overhead/area
What did we learn from CHOMP? • Cons • NEED MORE BANDWIDTH! • Tests with dense per-thread memory operations could utilize ~4X more bandwidth with no other changes • Designing for an FPGA has its constraints • The lack of ILP hinders arithmetic throughput in some applications • Even SpMV kernels can utilize ILP • Paged virtual memory is always expensive • Not all applications/programmers exploit large pages • Use the native runtime! • Don’t simply rely on the compiler to generate efficient code for higher-level parallel languages [OpenMP, et al.]. Use the machine-level runtime • Cache is bad, but coalescing is good • Cache is expensive to implement and often impedes performance • We can occasionally take advantage of spatial locality at access time
GC64 Research • PGAS Addressing • How do we develop a segmented address directory w/o the use of a TLB? • How do we distribute and maintain this directory while providing applications “virtual memory” security? • Efficient Synchronization • Building efficient synchronization algorithms using the GC64 ISA mechanisms and available bandwidth is orthogonal to traditional implementations • Memory Coalescing • Development of memory request coalescing algorithms for local, remote and AMO-type requests • How do our synchronization techniques play into this? • Software-Managed Scratchpad • Is there room/desire to add features such as this? • What additional pressure does this put on the programming model/compiler? • Compilation Techniques • Providing MTA-C style loop transformations • https://sites.google.com/site/parallelizationforllvm/loop-transforms • Runtime Models • Building a machine-level optimized runtime that is programming model agnostic • HMC Integration • HMC 1.0 specification is available; very different from traditional DRAM technology