The Raw Architecture: Signal Processing on a Scalable Composable Computation Fabric.
The Raw Architecture: Signal Processing on a Scalable Composable Computation Fabric
David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat, Ben Greenwald, Paul Johnson, Walter Lee, Albert Ma, Nathan Shnidman, Henry Hoffmann, Arvind Saraf, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal
http://www.cag.lcs.mit.edu/raw
MIT Laboratory for Computer Science
Outline • Motivation • Architecture • Raw Prototype • Networks • Signal Processing Applications • Status
Wire Delay and Tiled Architectures • Problem: The number of gates we can reach in one cycle is staying constant, but our chips are getting bigger. • Solutions: • Hide wire-delay latency in the micro-architecture (clustering, hidden communication stalls) • Expose the communication at the instruction-set level and allow the software to exploit locality • Fact 1: The number of transistors is growing • Fact 2: Wires are not getting proportionally faster
Wire Delay and Tiled Architectures • Expose the communication at the instruction-set level and allow the software to exploit locality • Make a tile as big as the distance a signal can travel in one clock cycle, and expose longer communication to the programmer
What Are We Building? The Raw Prototype • 16 Replicated Tiles (Processors) • What is in a tile? • 8-stage Pipelined MIPS-like 32-bit processor • Pipelined Floating Point Unit • 32KB Data Cache • 32KB Instruction Memory • Interconnect Routers
Raw’s Networking Resources • 2 Dynamic Networks • Fire and Forget • Header encodes destination • 2 Stage router pipeline • 2 Static Networks • Software configurable crossbar • Interlocked and Flow Controlled • 5 Stage static router pipeline • 3 cycle nearest-neighbor ALU to ALU communication latency • No header overhead, but requires knowledge of communication patterns at compile time
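The "header encodes destination" idea of the dynamic networks can be illustrated with a small Python sketch. This is not the Raw router: the row-major tile numbering, the X-then-Y (dimension-ordered) hop order, and the names `make_message` and `route` are assumptions made only for illustration.

```python
# Hypothetical sketch of header-routed delivery on a 4 x 4 tile mesh.
# Assumptions (not stated on the slides): tiles are numbered row-major
# and a message travels along X first, then along Y.

MESH_WIDTH = 4  # 16 tiles arranged as 4 x 4


def tile_xy(tile):
    """Convert a row-major tile number into (x, y) mesh coordinates."""
    return tile % MESH_WIDTH, tile // MESH_WIDTH


def make_message(dest_tile, payload):
    """Build a message whose header word names the destination tile."""
    return {"header": dest_tile, "payload": list(payload)}


def route(src_tile, message):
    """Return the list of tiles visited on the way to the header's destination."""
    x, y = tile_xy(src_tile)
    dest_x, dest_y = tile_xy(message["header"])
    path = [src_tile]
    while x != dest_x:  # move along X until the column matches
        x += 1 if dest_x > x else -1
        path.append(y * MESH_WIDTH + x)
    while y != dest_y:  # then move along Y until the row matches
        y += 1 if dest_y > y else -1
        path.append(y * MESH_WIDTH + x)
    return path


print(route(0, make_message(10, [0xCAFE])))
```

The sender only names a destination and lets the routers work out the hops, which is the "fire and forget" property. The static networks avoid the header word, but the hop pattern must then be fixed at compile time.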
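Memory Mapped Communication is Not a First Class Citizen • Communication to other tiles goes through a memory system that happens to go over a network. [Figure: processor pipeline diagram; stage labels as extracted: E, M1, M2, A, TL, TV, IF, D, RF, F, P, U, F4, WB]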
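Raw’s First Class Register-Mapped Communication • Ex: add r26, r25, r24 [Figure: processor pipeline diagram with registers r24 through r27 connected to the Network Input FIFOs and Network Output FIFOs; stage labels as extracted: E, M1, M2, A, TL, TV, IF, D, RF, F, P, U, F4, WB]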
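A toy Python model may help show what register-mapped communication means for the example instruction `add r26, r25, r24`. It is a sketch under assumptions: the `Tile` class, its FIFO layout, and the choice to give every network register both an input and an output queue are hypothetical, and a real tile would stall rather than raise an error when an input FIFO is empty.

```python
from collections import deque

# Register numbers shown on the slide as network-mapped.
NETWORK_REGS = (24, 25, 26, 27)


class Tile:
    """Hypothetical model: reads of r24-r27 pop a network input FIFO,
    writes to r24-r27 push onto a network output FIFO."""

    def __init__(self):
        self.regs = [0] * 32
        self.net_in = {r: deque() for r in NETWORK_REGS}
        self.net_out = {r: deque() for r in NETWORK_REGS}

    def read(self, r):
        if r in self.net_in:
            # Hardware would stall here until a word arrives.
            return self.net_in[r].popleft()
        return self.regs[r]

    def write(self, r, value):
        if r in self.net_out:
            self.net_out[r].append(value)
        else:
            self.regs[r] = value

    def add(self, rd, rs, rt):
        """Model of `add rd, rs, rt` with network-mapped operands."""
        self.write(rd, self.read(rs) + self.read(rt))


tile = Tile()
tile.net_in[25].append(3)   # word arriving from one neighbor
tile.net_in[24].append(4)   # word arriving from another neighbor
tile.add(26, 25, 24)        # add r26, r25, r24
print(list(tile.net_out[26]))
```

In this model a single instruction receives two operands, computes, and sends the result, with no separate load or store instructions for communication.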
Signal Processing Applications • Problem: Increase performance of Signal Processing in a scalable fashion • Solution: Exploit parallelism in Signal Processing Applications at all levels
Types of Parallelism in Signal Processing • DSP Filter Style • Fine Grain Dataflow • Instruction Level Parallelism • Data Parallel • Thread Level Parallelism (MPI) [Figure labels: Raw; Current Architectures]
Instruction Level Parallelism • RawCC • Maps dataflow graphs across tiles • ILP across Multiprocessor • Heavily Latency sensitive • Single cycle reconfigurable communication
Fine Grain Dataflow • Ex: Pipelined FIR Filter [Figure: four-tap filter pipeline; inputs xn, xn-1, xn-2, xn-3; weights W0, W1, W2, W3] • Computation: mul, add • Input Operands: xi, l • Output Operands: k
Cycle count
Class        First  Second
Compute      2      2
Communicate  0      3
Overall      2      5
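The FIR pipeline stage described above (a mul and an add, taking xi and l as inputs and producing k) can be sketched functionally in Python. This is an illustrative model, not RawCC output or Raw assembly: the function names are hypothetical, and the loop runs the stages one after another instead of in parallel on separate tiles.

```python
def fir_stage(weight, x_i, l):
    """One pipeline stage: multiply the sample by the weight, add the
    incoming partial sum l, and pass the result k to the next stage."""
    k = l + weight * x_i
    return k


def pipelined_fir(weights, samples):
    """Functional model of a FIR filter built from chained stages.
    Stage i holds weight W_i and sees the delayed sample x_{n-i}."""
    delay_line = [0] * len(weights)  # x_n, x_{n-1}, ..., initially zero
    outputs = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]  # shift the new sample in
        partial = 0
        for w, x_i in zip(weights, delay_line):
            partial = fir_stage(w, x_i, partial)
        outputs.append(partial)
    return outputs


# An impulse input reproduces the weights W0..W3 at the output.
print(pipelined_fir([1, 2, 3, 4], [1, 0, 0, 0, 0]))
```

Each call to `fir_stage` stands for the two compute cycles in the table. The communicate row then counts what it costs to move xi, l, and k between stages.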
DSP Filter Style [Figure: block diagram of a filter pipeline; block labels: FFT, FFT-1 (inverse FFT), Down-Sample, Frequency Domain Filter; "Off-chip" at both ends]
Raw is Composable • Mix and match types of parallelism [Figure: tile array running mixed workloads; labels: White balance, Aliasing filter, White balance, mem, mem, 2-way RawCC Application, 4-way Threaded Java Application, httpd, Zzz.]
Raw Status • Stats • IBM SA-27E, 0.15 µm, 6-layer copper • 18.2 mm × 18.2 mm die • 0.122 billion transistors • 2048 KB SRAM on-chip • 1657-pin CCGA package • 1080 HSTL signal I/O operating at core speed • 225 MHz • ~25 Watts
The Raw Performance • 16 OPS/FLOPS per cycle (@225MHz = 3.6 GFLOPS) • 230 Gb/s of on-chip “bisection bandwidth” • 201 Gb/s of off-chip I/O bandwidth • 115 Gb/s of on-chip memory bandwidth
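The peak figure above follows directly from the tile count and clock rate, as this short Python check shows. The two bandwidth reconstructions are assumptions, not breakdowns given on the slides: they suppose 32-bit words, one word per tile per cycle for memory, and a bisection cut crossing 4 tile boundaries with 4 networks in 2 directions. They happen to land on the slide's rounded numbers.

```python
TILES = 16
CLOCK_HZ = 225_000_000
WORD_BITS = 32  # assumption: 32-bit words, matching the 32-bit processor

# Peak throughput: one operation per tile per cycle (from the slide).
peak_gflops = TILES * CLOCK_HZ / 1e9

# Assumption: each tile moves one word per cycle to or from its local memory.
memory_gbps = TILES * WORD_BITS * CLOCK_HZ / 1e9

# Assumption: cutting the 4 x 4 mesh in half crosses 4 tile boundaries,
# each carrying 4 networks (2 static + 2 dynamic) in 2 directions.
BOUNDARIES, NETWORKS, DIRECTIONS = 4, 4, 2
bisection_gbps = BOUNDARIES * NETWORKS * DIRECTIONS * WORD_BITS * CLOCK_HZ / 1e9

print(f"peak:      {peak_gflops:.1f} GFLOPS")
print(f"memory:    {memory_gbps:.1f} Gb/s")
print(f"bisection: {bisection_gbps:.1f} Gb/s")
```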
Raw Status • Working: • Cycle Accurate Software Simulator • RTL Simulation • Emulation System • RawCC ILP Compiler • Current: • Verification • Backend Completion • Tapeout December 2001 • Chips Back Summer 2002
Summary • Raw’s First Class communication facilitates exploitation of new forms of parallelism in Signal Processing applications