Compiling for Coarse-Grained Adaptable Architectures

Carl Ebeling • Affiliates Meeting • February 26, 2002

Outline • Embedded Systems Platforms • Processor + ASICs • The Performance/Power/Price crunch • Bridging the Processor/ASIC gap • The role of “adaptable hardware” • Coarse-grained vs. fine-grained • Compiling to coarse-grained architectures • Scheduling via Place and Route

Platform-Based Systems • One platform for many different systems • Leverage economy of scale • Beyond “SOC” • Mix of processors and ASIC components • Processor • General system code • ASICs • High performance/low power

Processor/ASIC Efficiency [figure: ASIC vs. processor]

The Platform Problem • Performance/power demands increasing rapidly • More functionality pushed into ASICs • Platform becomes special-purpose • Lose economy of scale • Solution: “Programmable ASICs” • Hardware that looks like software

Adaptive Computing • ASIC is a “fixed instruction” architecture • Construct dataflow graph at fab time • Arbitrary size, complexity • Adaptive computing – change that instruction • Construct the dataflow graphs “on the fly” • Adapt the architecture to the problem

FPGA-Based Adaptive Computing • FPGAs can be used to implement arbitrary circuits • Build ASIC components on-the-fly • Many styles • Configurable function units • Configurable co-processors

The Problem with FPGAs • Cost • >100x overhead • Bit logic functions and routing • Great for bit-ops, FSMs; lousy for arithmetic • Power • Functions are widely spaced → long wires • Programming model • Mapping computation to HW is time-consuming • Weeks/months for relatively small problems

Coarse-Grained Architectures • LUTs → Arithmetic operations • Wires → Data buses • Decreases overhead substantially • 25x - 100x gain • Large potential impact on embedded platforms • Compiling is the big challenge

Rapid Architecture • Merging of processor and FPGA ideas • Start with array of function units

Rapid • Add registers • “Distributed register file”

Rapid • Add small distributed memories • Save data locally for reuse

Rapid • Add I/O ports • Streaming data interfaces

Rapid • Add interconnect network • Segmented buses • Multiplexers

Interconnect Control • Interconnect is modeled using muxes • FU inputs are muxed • Bus inputs are muxed • Bus hierarchy possible • Bus inputs from FUs and other buses
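One way to picture the mux model is the minimal C sketch below, in which each function-unit input selects one of a few buses under a control value. The names `Mux`, `mux_select`, and `fu_add`, the bus count, and the 16-bit data width are illustrative assumptions, not details taken from the Rapid design.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BUSES 4  /* assumed bus count, for illustration only */

/* Hypothetical model of one multiplexer: a control value picks a source. */
typedef struct {
    unsigned select;  /* control signal: index of the chosen bus */
} Mux;

/* Return the value on the bus chosen by the mux's control signal. */
static int16_t mux_select(const Mux *m, const int16_t buses[NUM_BUSES]) {
    assert(m->select < NUM_BUSES);
    return buses[m->select];
}

/* An adder FU whose two inputs are each fed through a mux. */
static int16_t fu_add(const Mux *in_a, const Mux *in_b,
                      const int16_t buses[NUM_BUSES]) {
    return (int16_t)(mux_select(in_a, buses) + mux_select(in_b, buses));
}
```

Changing a `select` field reroutes data without touching the adder, which is the sense in which routing decisions become control signals.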

Control Signals • Function units • e.g. ALU opcodes • e.g. Memory R/W • Mux controls • How many? • ~20/FU (including muxes) • >50 FUs • >1000 control signals

Configuring Control • “Hard” control (~60%) • Fab-time decision • “Soft” control (~40%) • Configurable, à la FPGA • Compile-time decision • Static (~30%) • Statically configured • Does not change during current app • Dynamic (~10%) • Changes under program control
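As a rough worked example, the sketch below combines the approximate figures from the slides (about 20 control signals per FU, more than 50 FUs, and a 60/30/10 split between hard, static, and dynamic control). The struct and function names are hypothetical, and the percentages are the slides' estimates rather than exact values.

```c
#include <assert.h>

/* Rough control-signal budget; all fields are signal counts. */
typedef struct {
    int total;         /* every control signal in the array */
    int hard;          /* fixed at fab time (~60%) */
    int soft_static;   /* configured once per application (~30%) */
    int soft_dynamic;  /* changes under program control (~10%) */
} ControlBudget;

/* Split the total using the approximate percentages from the slides. */
static ControlBudget control_budget(int num_fus, int signals_per_fu) {
    ControlBudget b;
    assert(num_fus > 0 && signals_per_fu > 0);
    b.total = num_fus * signals_per_fu;
    b.hard = b.total * 60 / 100;
    b.soft_static = b.total * 30 / 100;
    b.soft_dynamic = b.total - b.hard - b.soft_static;
    return b;
}
```

With 50 FUs and 20 signals each, this gives 1000 signals in total, of which only about 100 would need to change from cycle to cycle.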

Proposed Tool Flow

Control-Dataflow Graph

Example Dataflow Graph • Add subsequences of length 3
    for (i=2; i<N; i++) { Y[i] = X[i-2] + X[i-1] + X[i]; }
• transform to:
    A = X[1]; B = X[0];
    for (i=2; i<N; i++) { Y[i] = A + B + X[i]; B = A; A = X[i]; }
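To check that the two loops agree, both forms can be wrapped as C functions and run on the same input. This is a minimal sketch: the function names `sum3_direct` and `sum3_registers` and the `int` element type are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: three array reads per iteration. */
static void sum3_direct(const int *X, int *Y, size_t n) {
    assert(n >= 2);
    for (size_t i = 2; i < n; i++) {
        Y[i] = X[i - 2] + X[i - 1] + X[i];
    }
}

/* Transformed form: A and B carry X[i-1] and X[i-2] between
   iterations, so each iteration reads only X[i]. */
static void sum3_registers(const int *X, int *Y, size_t n) {
    assert(n >= 2);
    int A = X[1];
    int B = X[0];
    for (size_t i = 2; i < n; i++) {
        Y[i] = A + B + X[i];
        B = A;
        A = X[i];
    }
}
```

The transformed loop trades two memory reads per iteration for two register moves, which suits a datapath with one input stream and local registers.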

Example Dataflow Graph • DFG for one iteration • Combinational - executed in one cycle • DFG is in a loop, executed repeatedly • Linked to other DFGs via registers
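The idea of a combinational DFG linked to its next execution through registers can be sketched in C as a step function over a small register struct. `DfgRegs` and `dfg_step` are hypothetical names, and the body reuses the sum-of-three example; it is an illustration of the model, not output of the actual compiler.

```c
#include <assert.h>

/* Registers that link one execution of the DFG to the next. */
typedef struct {
    int A;  /* holds X[i-1] */
    int B;  /* holds X[i-2] */
} DfgRegs;

/* One execution of the DFG: combinational work on the current register
   values and input, then the register update that takes effect at the
   clock tick. Returns the output Y[i]. */
static int dfg_step(DfgRegs *r, int x) {
    assert(r);
    int y = r->A + r->B + x;      /* two additions in the same cycle */
    DfgRegs next = { x, r->A };   /* A <- X[i], B <- old A */
    *r = next;
    return y;
}
```

Calling `dfg_step` once per clock cycle, with the registers initialized to `X[1]` and `X[0]`, reproduces the loop one iteration at a time.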

Stitching Dataflow Graphs

Scheduling Dataflow Graphs • Mapping operations/values in space and time • Key problems • Data interconnect • No crossbar, no central register file • Control constraints • Hard control – one decision for all time • Control optimization • Soft control – maximize sharing • Place & Route formulation allows simultaneous solution of all constraints

Example Datapath Graph • Two adders with pipeline registers • Two input streams, two output streams • Two datapath registers • Two pairs of interconnect registers

Datapath Execution • Control determines what datapath does • Possibly different each clock cycle • Datapath Execution (DPE) • Computation performed in one clock cycle • Starts/ends with clock tick • Combinational logic

Space/Time Execution [figure built up over three slides: Cycle 1, Cycle 2, Cycle 3]

Dataflow Graph Execution [figure: Cycle 1, Cycle 2, Cycle 3]

Start an Execution Every Cycle [figure: Cycle 1, Cycle 2, Cycle 3]

Connect DFG Outputs to Inputs [figure: Cycle 1, Cycle 2, Cycle 3]

Dataflow Graph is in a loop • Initiation interval is one clock cycle • Wrap register outputs back to the top • Gives an iterative modulo schedule
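A schedule with an initiation interval of one can be illustrated by splitting the sum-of-three DFG across two cycles, so that each clock tick starts a new execution while finishing the previous one. The two-stage split, the `Pipe` struct, and `pipe_tick` are assumptions made for this sketch and are not taken from the slides' figures.

```c
#include <assert.h>

typedef struct {
    int A, B;    /* loop-carried registers: X[i-1], X[i-2] */
    int t, x_d;  /* pipeline registers between the two adders */
    int valid;   /* 1 once the pipeline registers hold real data */
} Pipe;

/* Advance one clock cycle with input x. Returns 1 and writes *y when
   an output is produced in this cycle, otherwise returns 0. */
static int pipe_tick(Pipe *p, int x, int *y) {
    assert(p && y);
    int produced = p->valid;
    if (produced) {
        *y = p->t + p->x_d;  /* second adder finishes the older execution */
    }
    Pipe next;
    next.t = p->A + p->B;    /* first adder starts a new execution */
    next.x_d = x;            /* delay x by one cycle to meet t */
    next.A = x;              /* wrap register outputs back to the top */
    next.B = p->A;
    next.valid = 1;
    *p = next;
    return produced;
}
```

After a one-cycle fill, the pipeline delivers one result per tick, and a final tick with a dummy input flushes the last result.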

Result of Scheduling: Control Matrix • Control signals • Only soft control • Control values • 0/1 • x – don’t care • f( ) of status signals and control variables • Control optimization • Compress matrix [figure axis: time]

Control Optimization • Static control • Unused or constant • No instruction bits • Shared control • Same value • Complemented value • One instruction bit • Pipelined control • Pipelined DFG • Control offset in time • One instruction bit • Increases sharing
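The static and shared cases can be sketched as a pass over a control matrix with entries '0', '1', and 'x' (don't care). This is a greedy illustration under stated assumptions: the matrix layout `m[signal][cycle]`, the fixed `CYCLES` width, and the function names are hypothetical, the result is not guaranteed to be minimal, and pipelined control (sharing with a time offset) is left out.

```c
#include <assert.h>

#define CYCLES 4        /* assumed schedule length */
#define MAX_SIGNALS 16  /* assumed upper bound on control signals */

/* Can entries a and b be driven by the same bit (inv = 0) or by
   complemented bits (inv = 1)? 'x' is compatible with anything. */
static int entry_ok(char a, char b, int inv) {
    if (a == 'x' || b == 'x') return 1;
    return inv ? (a != b) : (a == b);
}

/* Count the instruction bits needed for control matrix m[signal][cycle]. */
static int instruction_bits(char m[][CYCLES], int nsignals) {
    char rep[MAX_SIGNALS][CYCLES];  /* one merged pattern per bit */
    int nbits = 0;
    assert(nsignals <= MAX_SIGNALS);
    for (int s = 0; s < nsignals; s++) {
        int has0 = 0, has1 = 0, shared = 0;
        for (int t = 0; t < CYCLES; t++) {
            if (m[s][t] == '0') has0 = 1;
            if (m[s][t] == '1') has1 = 1;
        }
        if (!(has0 && has1)) continue;  /* static: no instruction bit */
        for (int b = 0; b < nbits && !shared; b++) {
            for (int inv = 0; inv <= 1 && !shared; inv++) {
                int ok = 1;
                for (int t = 0; t < CYCLES; t++)
                    ok = ok && entry_ok(rep[b][t], m[s][t], inv);
                if (!ok) continue;
                /* Shared: fill the bit's don't-cares from this signal. */
                for (int t = 0; t < CYCLES; t++) {
                    if (rep[b][t] == 'x' && m[s][t] != 'x')
                        rep[b][t] = inv ? (m[s][t] == '0' ? '1' : '0')
                                        : m[s][t];
                }
                shared = 1;
            }
        }
        if (!shared) {  /* needs its own instruction bit */
            for (int t = 0; t < CYCLES; t++) rep[nbits][t] = m[s][t];
            nbits++;
        }
    }
    return nbits;
}
```

Merging each shared signal's values into the representative pattern keeps later comparisons consistent, since don't-care compatibility is not transitive on its own.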

Conclusion • New role for adaptive computing • Solution for embedded systems platforms • Coarse-grained architectures • Reduce configurability overhead • Merge ideas from processors and FPGAs • Compiling is the key challenge • Finding parallelism is not the problem • Scheduling data movement • Use Place & Route to solve many simultaneous constraints
