HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
DARPA DIS PI Meeting
Santa Fe, NM, March 26-27, 2002
[Figure: HiDISC system overview - sensor inputs (FLIR, SAR, video, ATR/SLD, scientific applications) feed a decoupling compiler that targets the HiDISC processor (registers, cache, memory, with a processor per level), supporting a dynamic database and situational awareness.]

HiDISC: Hierarchical Decoupled Instruction Set Computer

New Ideas
• A dedicated processor for each level of the memory hierarchy
• Explicit management of each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput

Impact
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidth (e.g., PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes

Schedule (April 2001 - March 2002)
• Defined benchmarks
• Completed the simulator
• Performed instruction-level simulations on hand-compiled benchmarks
• Continued simulations of more benchmarks (SAR)
• Defined the HiDISC architecture
• Benchmark results
• Updated the simulator
• Developed and tested a complete decoupling compiler
• Generated performance statistics and evaluated the design
Accomplishments
• Design of the HiDISC model
• Compiler development (operational)
• Simulator design (operational)
• Performance evaluation:
  • Three DIS benchmarks (Multidimensional Fourier Transform, Method of Moments, Data Management)
  • Five stressmarks (Pointer, Transitive Closure, Neighborhood, Matrix, Field)
• Results: mostly higher performance (some lower) across a range of applications
• HiDISC of the future
Outline
• Original Technical Objective
• HiDISC Architecture Review
• Benchmark Results
• Conclusion and Future Work
[Figure: the same HiDISC system overview as above.]

HiDISC: Hierarchical Decoupled Instruction Set Computer
• Technological trend: memory latency is getting longer relative to microprocessor speed (40% per year)
• Problem: some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
• Domain: benchmarks with large data sets - symbolic, signal processing, and scientific programs
• Present solutions: larger caches, prefetching (software and hardware), simultaneous multithreading
Present Solutions and Their Limitations
• Larger caches
  • Slow
  • Work well only if the working set fits in the cache and there is temporal locality
• Hardware prefetching
  • Cannot be tailored to each application
  • Behavior is based on past and present execution-time behavior
• Software prefetching
  • Prefetch overheads must not outweigh the benefits -> conservative prefetching
  • Adaptive software prefetching is required to change the prefetch distance at run time
  • Hard to insert prefetches for irregular access patterns
• Multithreading
  • Solves the throughput problem, not the memory latency problem
The HiDISC Approach
• Observation:
  • Software prefetching impacts compute performance
  • PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
• Approach:
  • Add a processor to manage prefetching and hide its overhead
  • The compiler explicitly manages the memory hierarchy
  • The prefetch distance adapts to the program's run-time behavior
What is HiDISC?

[Figure: HiDISC pipeline - a 2-issue Computation Processor (CP) with registers, a 3-issue Access Processor (AP) working against the L1 cache, and a 3-issue Cache Management Processor (CMP) working against the L2 cache and higher levels, connected by the Load Data Queue, Store Address Queue, Store Data Queue, and Slip Control Queue.]

• A dedicated processor for each level of the memory hierarchy
• Each level of the memory hierarchy is explicitly managed using instructions generated by the compiler
• Hides memory latency by converting data access predictability to data access locality (just-in-time fetch)
• Exploits instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput
Slip Control Queue
• The Slip Control Queue (SCQ) adapts dynamically:
  • Late prefetches: prefetched data arrived after the load had been issued
  • Useful prefetches: prefetched data arrived before the load had been issued

    if (prefetch_buffer_full())
        don't change the size of the SCQ;
    else if ((2 * late_prefetches) > useful_prefetches)
        increase the size of the SCQ;
    else
        decrease the size of the SCQ;
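Below is a minimal C sketch of the SCQ sizing rule above. The counter names follow the slide, but the bounds, step size, and sampling interval are illustrative assumptions; the slide does not specify the actual hardware interface.

    /* Sketch of the SCQ adaptation rule; SCQ_MIN/SCQ_MAX and the +-1
     * step are assumptions, not documented HiDISC parameters. */
    #include <stddef.h>

    #define SCQ_MIN 1
    #define SCQ_MAX 64

    typedef struct {
        size_t   size;              /* current slip distance               */
        unsigned late_prefetches;   /* data arrived after the load issued  */
        unsigned useful_prefetches; /* data arrived before the load issued */
        int      buffer_full;       /* prefetch buffer back-pressure flag  */
    } scq_state;

    /* Called once per sampling interval (interval length assumed). */
    void scq_adapt(scq_state *s)
    {
        if (s->buffer_full) {
            /* Prefetch buffer full: leave the slip distance unchanged. */
        } else if (2u * s->late_prefetches > s->useful_prefetches) {
            if (s->size < SCQ_MAX) s->size++;  /* run the AP further ahead */
        } else {
            if (s->size > SCQ_MIN) s->size--;  /* limit cache pollution    */
        }
        s->late_prefetches = s->useful_prefetches = 0;  /* reset counters */
    }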
Decoupling Programs for HiDISC (Discrete Convolution - Inner Loop)

Inner loop of the convolution:

    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:

    while (not EOD)
        y = y + (x * h);
    send y to SDQ

Access Processor code:

    for (j = 0; j < i; ++j) {
        load (x[j]);
        load (h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ

Cache Management Processor code:

    for (j = 0; j < i; ++j) {
        prefetch (x[j]);
        prefetch (h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
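To make the queue traffic concrete, here is a single-threaded C model of the decoupled inner loop; it is only a sketch. Real HiDISC hardware runs the streams on separate processors connected by bounded queues, with the slip distance governed by the SCQ, whereas here the LDQ is a plain array, the AP stream simply runs before the CP stream in each outer iteration, and the CMP stream is omitted.

    /* Single-threaded model of the decoupled convolution inner loop.
     * The LDQ is modeled as a flat buffer; input values are made up. */
    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double x[N], h[N], y[N];
        double ldq[2 * N];                 /* Load Data Queue stand-in */
        for (int i = 0; i < N; ++i) { x[i] = i + 1; h[i] = 1.0; y[i] = 0.0; }

        for (int i = 0; i < N; ++i) {
            int head = 0, tail = 0;
            /* AP stream: issue the loads, filling the LDQ. */
            for (int j = 0; j < i; ++j) {
                ldq[tail++] = x[j];
                ldq[tail++] = h[i - j - 1];
            }
            /* CP stream: consume operand pairs until the queue drains (EOD). */
            while (head < tail) {
                double xv = ldq[head++];
                double hv = ldq[head++];
                y[i] = y[i] + xv * hv;     /* in hardware, y exits via SDQ/SAQ */
            }
            printf("y[%d] = %g\n", i, y[i]);
        }
        return 0;
    }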
General View of the Compiler

HiDISC compilation overview:

    source program -> gcc -> binary code -> disassembler -> stream separator
                                                            -> computation code
                                                            -> access code
                                                            -> cache mgmt. code
HiDISC Stream Separator

    sequential source
      -> build the data dependency graph
      -> define the load/store instructions
      -> chase instructions for the backward slice
      -> split into access stream and computation stream
      -> insert prefetch instructions (yielding the cache management code)
      -> insert communication instructions
      -> emit access code, computation code, and cache management code
Stream Separation: Backward Load/Store Chasing
• Load/store instructions and their "backward slice" are included in the Access stream
• All remaining instructions are sent to the Computation stream
• Communication between the two streams is also defined at this point
Stream Separation: Creating an Access Stream
• Communication needs to take place via the various hardware queues
• Communication instructions are inserted accordingly
• CMP instructions are copied from Access stream instructions, except that load instructions are replaced by prefetch instructions (see the sketch below)
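The following C sketch illustrates the backward load/store chasing and stream assignment from the last two slides on a toy straight-line sequence. The instruction encoding is invented for the example; a single reverse pass stands in for a fixed-point computation over the full data dependency graph; and the store's data register is deliberately not chased, since that value is produced by the CP and crosses through the Store Data Queue. The CMP stream would then be derived by copying the Access stream and replacing each load with a prefetch.

    /* Toy stream separator: assign each instruction to the Access or
     * Computation stream by chasing load/store inputs backwards. */
    #include <stdio.h>

    #define N_INSNS 6
    #define N_REGS  8

    typedef struct {
        const char *text;
        int def;      /* register defined, -1 if none                  */
        int use[2];   /* source registers; for stores: {data, address} */
        int kind;     /* 0 = ALU, 1 = load, 2 = store                  */
    } insn;

    int main(void)
    {
        insn code[N_INSNS] = {
            { "addi r1, r0, 4",   1, { 0, -1 }, 0 },  /* index setup      */
            { "slli r3, r1, 2",   3, { 1, -1 }, 0 },  /* address arith    */
            { "load r2, 0(r3)",   2, { 3, -1 }, 1 },  /* feeds the CP     */
            { "mul  r4, r2, r2",  4, { 2, -1 }, 0 },  /* pure computation */
            { "addi r5, r0, 64",  5, { 0, -1 }, 0 },  /* address arith    */
            { "store r4, 0(r5)", -1, { 4,  5 }, 2 },  /* data via SDQ     */
        };
        int in_access[N_INSNS] = { 0 };
        int needed[N_REGS] = { 0 };  /* regs consumed by the Access stream */

        /* Reverse pass: seed with loads/stores, then pull in any insn
         * defining a register the Access stream needs. */
        for (int i = N_INSNS - 1; i >= 0; --i) {
            insn *p = &code[i];
            if (p->kind == 0 && !(p->def >= 0 && needed[p->def]))
                continue;                       /* stays in Computation */
            in_access[i] = 1;
            if (p->kind == 2)                   /* store: address only  */
                needed[p->use[1]] = 1;
            else
                for (int k = 0; k < 2; ++k)
                    if (p->use[k] >= 0) needed[p->use[k]] = 1;
        }
        for (int i = 0; i < N_INSNS; ++i)
            printf("%-18s -> %s stream\n", code[i].text,
                   in_access[i] ? "Access" : "Computation");
        return 0;
    }

Running the sketch puts everything except the mul in the Access stream: the multiply consumes loaded data and produces the store data, so it stays with the CP while the address arithmetic travels with the AP.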
DIS Benchmarks
• Application-oriented benchmarks
• Main characteristic: large data sets with non-contiguous memory accesses and no temporal locality
Atlantic Aerospace Stressmark Suite
• Smaller and more specific procedures: seven individual data-intensive benchmarks
• Directly illustrate particular elements of the DIS problem
• Fewer computation operations than the DIS benchmarks, focusing on irregular memory accesses
Stressmark Suite

[Table omitted: stressmark parameters from the DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division.]
Pointer Stressmarks
• Basic idea: repeatedly follow pointers to randomized locations in memory
• The memory access pattern is unpredictable (input-data dependent)
• Randomized access provides insufficient temporal and spatial locality for conventional cache architectures
• The HiDISC architecture should provide lower average memory access latency: the pointer chasing in the AP can run ahead without being blocked by the CP
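Below is a minimal C sketch of a pointer-chasing kernel in the spirit of this stressmark; the field size, hop count, and random successor layout are illustrative, not the official DIS code.

    /* Pointer-chase sketch: each load's address is the previous load's
     * value, so there is no spatial locality and little temporal
     * locality. Parameters are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>

    #define FIELD_WORDS (1u << 20)    /* ~4 MB field (assumed size) */
    #define HOPS        1000000L

    int main(void)
    {
        unsigned *field = malloc(FIELD_WORDS * sizeof *field);
        if (!field) return 1;

        srand(42);
        for (unsigned i = 0; i < FIELD_WORDS; ++i)
            field[i] = (unsigned)rand() % FIELD_WORDS;  /* random successor */

        unsigned idx = 0;
        for (long hop = 0; hop < HOPS; ++hop)
            idx = field[idx];       /* the chase: serially dependent loads */

        printf("final index: %u\n", idx);
        free(field);
        return 0;
    }

On HiDISC, the AP executes this chain without waiting for the CP's per-datum computation, which is where the latency tolerance comes from.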
Superscalar and HiDISC
• The simulation environment can vary the memory latency
• The superscalar baseline supports out-of-order issue with a 16-entry RUU (Register Update Unit) and an 8-entry LSQ (Load/Store Queue)
• The AP and CP each issue instructions in order
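The RUU/LSQ terminology matches SimpleScalar's sim-outorder; assuming a SimpleScalar-derived simulator (an inference from the terminology, not something the slide states), the baseline might be invoked along these lines, with the memory latency varied via -mem:lat (values illustrative):

    sim-outorder -ruu:size 16 -lsq:size 8 -mem:lat 200 2 benchmark    (out-of-order baseline)
    sim-outorder -issue:inorder true -mem:lat 200 2 benchmark         (in-order issue)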
DIS Benchmark/Stressmark Results

[Results table omitted; in the original slide, ↑ marks a result that is better with HiDISC and ↓ one that is better with the superscalar baseline.]
DIS Benchmark Performance
• The DIS benchmarks perform very well in general with decoupled processing, for the following reasons:
  • Many long-latency floating-point operations
  • Robustness to longer memory latency (e.g., Method of Moments is a more stream-like process)
• In FFT, the HiDISC architecture also suffers from longer memory latencies, due to the data dependencies between the two streams
• DM is not affected by longer memory latency in either case
Stressmark Results - Good Performance
• Pointer chasing can be executed far ahead by using the decoupled access stream: it does not require computation results from the CP
• Transitive closure also produces good results: little in the AP depends on the results of the CP, so the AP can run ahead (see the sketch below)
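For concreteness, here is an illustrative C version of a transitive-closure (shortest-path) kernel in the spirit of the stressmark, not the official DIS source. The din[] addresses depend only on loop indices, so the AP can stream the loads far ahead of the CP's compare-and-update work:

    /* Floyd-Warshall-style kernel; sizes and inputs are made up. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define N 64

    int main(void)
    {
        static unsigned din[N * N], dout[N * N];
        srand(1);
        for (int i = 0; i < N * N; ++i)
            din[i] = dout[i] = (unsigned)rand() % 100u + 1u;

        for (int k = 0; k < N; ++k) {
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j) {
                    unsigned via = din[i*N + k] + din[k*N + j]; /* AP: index-driven loads */
                    if (via < dout[i*N + j])                    /* CP: compare/update     */
                        dout[i*N + j] = via;
                }
            memcpy(din, dout, sizeof din);  /* next round reads this round's result */
        }
        printf("dout[0] = %u\n", dout[0]);
        return 0;
    }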
Performance Bottleneck
• Too much synchronization causes loss of decoupling
• Unbalanced code size between the two streams: the stressmark suite is highly "access-heavy"
• The application domain for decoupled architectures is one with a balanced ratio between computation and memory access
Synthesis
• Synchronization degrades performance: when the dependency of the AP on the CP increases, the slip distance of the AP cannot be increased
• More stream-like applications will benefit from using HiDISC; multithreading support is needed
• Applications should contain enough computation (a 1:1 ratio) to hide the memory access latency
• The CMP should be simple: it executes redundant operations if the data is already in the cache, and cache pollution can occur
Flexi-DISC
• Fundamental characteristic: inherently highly dynamic at execution time
• A dynamically reconfigurable central computation kernel (CK)
• Multiple levels of caching and processing around the CK, with adjustable prefetching
• Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources
Flexi-DISC (continued)
• Partitioning of the computation kernel: it can be allocated to different portions of one application or to different applications
• The CK requires separation of the next ring to feed it with data, since the variety of target applications makes memory accesses unpredictable
• Identical processing units for the outer rings enable highly efficient dynamic partitioning of the resources and their run-time allocation