Compiling for Coarse-Grained Adaptable Architectures • Carl Ebeling • Affiliates Meeting • February 26, 2002
Outline • Embedded Systems Platforms • Processor + ASICs • The Performance/Power/Price crunch • Bridging the Processor/ASIC gap • The role of “adaptable hardware” • Coarse-grained vs. fine-grained • Compiling to coarse-grained architectures • Scheduling via Place and Route
Platform-Based Systems • One platform for many different systems • Leverage economy of scale • Beyond “SOC” • Mix of processors and ASIC components • Processor: general system code • ASICs: high performance/low power
Processor/ASIC Efficiency • [Figure: efficiency of ASIC vs. processor]
The Platform Problem • Performance/power demands increasing rapidly • More functionality pushed into ASICs • Platform becomes special-purpose • Lose economy of scale • Solution: “Programmable ASICs” • Hardware that looks like software
Adaptive Computing • ASIC is a “fixed instruction” architecture • Construct dataflow graph at fab time • Arbitrary size, complexity • Adaptive computing – change that instruction • Construct the dataflow graphs “on the fly” • Adapt the architecture to the problem
FPGA-Based Adaptive Computing • FPGAs can be used to implement arbitrary circuits • Build ASIC components on-the-fly • Many styles • Configurable function units • Configurable co-processors
The Problem with FPGAs • Cost: >100x overhead; bit-level logic functions and routing; great for bit-ops and FSMs, lousy for arithmetic • Power: functions are widely spaced, so wires are long • Programming model: mapping computation to HW is time-consuming; weeks/months for relatively small problems
Coarse-Grained Architectures • LUTs → Arithmetic operations • Wires → Data busses • Decreases overhead substantially • 25x - 100x gain • Large potential impact on embedded platforms • Compiling is the big challenge
Rapid Architecture • Merging of processor and FPGA ideas • Start with array of function units
Rapid • Add registers • “Distributed register file”
Rapid • Add small distributed memories • Save data locally for reuse
Rapid • Add I/O ports • Streaming data interfaces
Rapid • Add interconnect network • Segmented busses • Multiplexers
Interconnect Control • Interconnect is modeled using muxes • FU inputs are muxed • Bus inputs are muxed • Bus hierarchy possible • Bus inputs come from FUs and other buses
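The mux model of the interconnect can be sketched in C as below. This is only an illustration under assumptions: the bus count `NUM_BUSES` and the function names `mux`, `adder_fu`, and `bus_driver` are hypothetical and do not come from the slides. Each select value stands for one group of control signals.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BUSES 4  /* hypothetical number of bus segments */

/* A multiplexer passes through the source picked by its control value. */
static int mux(const int *sources, size_t n_sources, unsigned select) {
    assert(select < n_sources);
    return sources[select];
}

/* An adder function unit: each input is fed by its own mux over the buses. */
int adder_fu(const int buses[NUM_BUSES], unsigned sel_a, unsigned sel_b) {
    int a = mux(buses, NUM_BUSES, sel_a);
    int b = mux(buses, NUM_BUSES, sel_b);
    return a + b;
}

/* A bus input is also muxed. Its sources are the FU outputs followed by
   other buses, which is what makes a bus hierarchy possible. */
int bus_driver(const int *fu_outputs, size_t n_fus,
               const int *other_buses, size_t n_buses, unsigned select) {
    assert(select < n_fus + n_buses);
    return select < n_fus ? fu_outputs[select]
                          : other_buses[select - n_fus];
}
```

In this model, routing a value is nothing more than choosing select values, so placement and routing decisions end up as control-signal settings.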
Control Signals • Function units • e.g. ALU opcodes • e.g. Memory R/W • Mux controls • How many? • ~20/FU (including muxes) • >50 FUs • >1000 control signals
Configuring Control • Control • “Hard” (~60%): fab-time decision • “Soft” (~40%): configurable, à la FPGA; compile-time decision • Static (~30%): statically configured; does not change during current app • Dynamic (~10%): changes under program control
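As a rough worked example of these proportions, the sketch below applies the approximate percentages to the lower-bound figures from the Control Signals slide (50 FUs at about 20 signals each). The struct and function names are hypothetical, and the results are only as precise as the "~" estimates they are built from.

```c
#include <assert.h>

/* Control-signal budget, using the approximate figures from the slides. */
struct control_budget {
    int total;       /* all control signals */
    int hard;        /* fixed at fab time (~60%) */
    int static_cfg;  /* soft, set once per application (~30%) */
    int dynamic;     /* soft, changes under program control (~10%) */
};

struct control_budget estimate_budget(int num_fus, int signals_per_fu) {
    struct control_budget b;
    assert(num_fus > 0 && signals_per_fu > 0);
    b.total      = num_fus * signals_per_fu;
    b.hard       = b.total * 60 / 100;
    b.static_cfg = b.total * 30 / 100;
    b.dynamic    = b.total - b.hard - b.static_cfg;  /* the remainder */
    return b;
}
```

With `estimate_budget(50, 20)` the total is 1000 signals, of which about 100 are dynamic. Those dynamic signals are the ones the compiler has to drive cycle by cycle.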
Example Dataflow Graph • Add subsequences of length 3
    for (i=2; i<N; i++) {
        Y[i] = X[i-2] + X[i-1] + X[i];
    }
• transform to:
    A = X[1]; B = X[0];
    for (i=2; i<N; i++) {
        Y[i] = A + B + X[i];
        B = A;
        A = X[i];
    }
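To check that the two forms agree, both loops can be wrapped as C functions. This is a minimal sketch: the function names `sum3_direct` and `sum3_registers`, the `int` element type, and the guard for short inputs are assumptions added for illustration.

```c
#include <assert.h>

/* Original form: three reads of X per iteration. */
void sum3_direct(const int *X, int *Y, int N) {
    assert(X && Y);
    for (int i = 2; i < N; i++) {
        Y[i] = X[i - 2] + X[i - 1] + X[i];
    }
}

/* Transformed form: A and B carry X[i-1] and X[i-2] between iterations,
   so each iteration reads only one new element of X. */
void sum3_registers(const int *X, int *Y, int N) {
    assert(X && Y);
    if (N < 3) return;  /* no iterations to run; avoids reading X[1] */
    int A = X[1];
    int B = X[0];
    for (int i = 2; i < N; i++) {
        Y[i] = A + B + X[i];
        B = A;
        A = X[i];
    }
}
```

The transformed loop replaces two of the three memory reads per iteration with register-to-register moves. This suits a datapath with distributed registers and no central register file.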
Example Dataflow Graph • DFG for one iteration • Combinational - executed in one cycle • DFG is in a loop, executed repeatedly • Linked to other DFGs via registers
Scheduling Dataflow Graphs • Mapping operations/values in space and time • Key problems • Data interconnect • No crossbar, no central register file • Control constraints • Hard control – one decision for all time • Control optimization • Soft control – maximize sharing • Place & Route formulation allows simultaneous solution of all constraints
Example Datapath Graph • Two adders with pipeline registers • Two input streams, two output streams • Two datapath registers • Two pairs of interconnect registers
Datapath Execution • Control determines what datapath does • Possibly different each clock cycle • Datapath Execution (DPE) • Computation performed in one clock cycle • Starts/ends with clock tick • Combinational logic
Space/Time Execution • [Figure: datapath execution shown for Cycles 1, 2, and 3]
Dataflow Graph Execution • [Figure: Cycles 1 to 3]
Start an Execution Every Cycle • [Figure: Cycles 1 to 3]
Connect DFG Outputs to Inputs • [Figure: Cycles 1 to 3]
Dataflow Graph is in a loop • Initiation interval is one clock cycle • Wrap register outputs back to the top • Gives an iterative modulo schedule
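A schedule with an initiation interval of one cycle can be illustrated by simulating the length-3 sum on a two-stage pipeline. The split used here is an assumption made for this sketch and is not taken from the slides: the first adder computes A + B, and the second adder adds X[i] one cycle later. The function and register names are hypothetical.

```c
#include <assert.h>

/* Simulates a two-stage pipelined datapath for Y[i] = X[i-2] + X[i-1] + X[i].
   A new iteration starts every cycle (initiation interval = 1), and each
   result appears one cycle after its iteration starts. */
void sum3_pipelined(const int *X, int *Y, int N) {
    assert(X && Y);
    if (N < 3) return;

    int A = X[1], B = X[0];   /* datapath registers carried around the loop */
    int p_sum = 0, p_x = 0;   /* pipeline registers between the two adders */
    int p_valid = 0;          /* set once stage 1 has produced something */

    /* The extra final cycle drains the last iteration from the pipeline. */
    for (int cycle = 2; cycle <= N; cycle++) {
        /* Stage 2 reads the pipeline registers written in the previous
           cycle and finishes iteration (cycle - 1). */
        if (p_valid) {
            Y[cycle - 1] = p_sum + p_x;
        }
        /* Stage 1 starts iteration `cycle` and updates the registers,
           which models the clock tick at the end of the cycle. */
        if (cycle < N) {
            p_sum = A + B;
            p_x = X[cycle];
            p_valid = 1;
            B = A;
            A = X[cycle];
        } else {
            p_valid = 0;
        }
    }
}
```

In every steady-state cycle both stages are busy on different iterations, so the same control pattern repeats each cycle. This is what wrapping the register outputs back to the top expresses.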
Result of Scheduling: Control Matrix • Matrix of control signals over time • Only soft control • Control values: 0/1; x (don’t care); f( ) of status signals and control variables • Control optimization: compress matrix
Control Optimization • Static control: unused or constant; no instruction bits • Shared control: same value or complemented value; one instruction bit • Pipelined control: pipelined DFG; control offset in time; one instruction bit; increases sharing
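The static and shared cases can be sketched as a greedy pass over a small control matrix. This is an illustrative simplification, not the compiler's actual algorithm. The matrix dimensions, the `DC` encoding for "x", and all names are assumptions, and the pipelined (time-offset) case is left out.

```c
#include <assert.h>

#define CYCLES  4   /* rows: time steps (hypothetical size) */
#define SIGNALS 5   /* columns: soft control signals (hypothetical size) */
#define DC      2   /* encodes "x", a don't-care entry */

struct control_matrix { int v[CYCLES][SIGNALS]; };

/* Static control: all specified entries agree, so no instruction bit. */
static int is_static(const struct control_matrix *m, int col) {
    int seen = DC;
    for (int t = 0; t < CYCLES; t++) {
        int v = m->v[t][col];
        if (v == DC) continue;
        if (seen != DC && seen != v) return 0;
        seen = v;
    }
    return 1;
}

/* Can column `col` be driven by `bit` (invert = 0) or by its
   complement (invert = 1)? Don't-cares match anything. */
static int compatible(const struct control_matrix *m, int col,
                      const int bit[CYCLES], int invert) {
    for (int t = 0; t < CYCLES; t++) {
        int v = m->v[t][col];
        if (v != DC && bit[t] != DC && (bit[t] ^ invert) != v) return 0;
    }
    return 1;
}

/* Greedily assigns each non-static column to an instruction bit. */
int count_instruction_bits(const struct control_matrix *m) {
    int bits[SIGNALS][CYCLES];
    int n_bits = 0;
    assert(m);
    for (int c = 0; c < SIGNALS; c++) {
        if (is_static(m, c)) continue;
        int host = -1, invert = 0;
        for (int r = 0; r < n_bits && host < 0; r++) {
            if (compatible(m, c, bits[r], 0))      { host = r; invert = 0; }
            else if (compatible(m, c, bits[r], 1)) { host = r; invert = 1; }
        }
        if (host < 0) {  /* no existing bit fits: allocate a new one */
            host = n_bits++;
            invert = 0;
            for (int t = 0; t < CYCLES; t++) bits[host][t] = DC;
        }
        /* Pin down the host bit wherever this column specifies a value. */
        for (int t = 0; t < CYCLES; t++) {
            int v = m->v[t][c];
            if (v != DC) bits[host][t] = v ^ invert;
        }
    }
    return n_bits;
}
```

A greedy assignment like this is not guaranteed to find the minimum number of instruction bits. It only shows how constant, equal, and complemented columns collapse.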
Conclusion • New role for adaptive computing • Solution for embedded systems platforms • Coarse-grained architectures • Reduce configurability overhead • Merge ideas from processors and FPGAs • Compiling is the key challenge • Finding parallelism is not the problem • Scheduling data movement • Use Place & Route to solve many simultaneous constraints