ECE 636 Reconfigurable Computing Lecture 15 Reconfigurable Coprocessors
Overview • Differences between reconfigurable coprocessors, reconfigurable functional units, and soft processors • Motivation • Compute models: how reconfigurable logic fits into the computation • Interaction with memory is key • Acknowledgment: DeHon
Overview • Processors are efficient at sequential code and regular arithmetic operations • FPGAs are efficient at fine-grained parallelism and unusual bit-level operations • Tight coupling is important: allows sharing of data and control • Efficiency issues: • Context switches • Memory coherency • Synchronization
Compute Models • I/O pre-/post-processing • Application-specific operation • Reconfigurable coprocessor • Coarse-grained • Mostly independent • Reconfigurable functional unit • Tightly integrated with the processor pipeline • Register file sharing becomes an issue
Instruction Augmentation (figure: bits a31…a0 with positions swapped to b31…b0) • Processor can only describe a small number of basic computations in a cycle • I bits -> 2^I operations • Many more operations could be performed on two W-bit words • ALU implementations restrict execution of some simple operations • e.g. bit reversal — see the sketch below
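To make the cost concrete: in an FPGA, reversing the bits of a word is pure wiring, while on a conventional ALU it takes a multi-instruction shuffle. A minimal C sketch of the software side (the standard log-step technique, added here for illustration):

```c
#include <stdint.h>

/* Bit reversal of a 32-bit word: free wiring in an FPGA, but a
 * log2(32) = 5-step mask-and-shift shuffle on a conventional ALU. */
uint32_t bit_reverse32(uint32_t x)
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);  /* swap adjacent bits */
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);  /* swap bit pairs     */
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);  /* swap nibbles       */
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);  /* swap bytes         */
    x = (x << 16) | (x >> 16);                                /* swap halfwords     */
    return x;
}
```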
Instruction Augmentation • Provide a way to augment the processor instruction set for an application • Avoid the mismatch between hardware and software • What's required? • Fit augmented instructions into the data and control stream • Create a functional unit for the augmented instructions • Compiler techniques to identify and use the new functional unit
PRISC • Architecture: • Couples into the register file as a "superscalar" functional unit • Flow-through array (no state)
PRISC Results • All compiled, working from MIPS binary • <200 4-LUTs (64 x 3) • 200 MHz MIPS base • [Razdan, MICRO-27]
Chimaera • Start from the PRISC idea • Integrate as a functional unit • No state • RFU ops (like expfu) • Stall processor on instruction miss • Additions over PRISC (see the sketch below): • Multiple instructions resident at a time • More than 2 inputs possible • Hauck: University of Washington
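To see why more than two inputs matter, consider a bitwise select r = (m & a) | (~m & b): three dependent ALU instructions on a 2-input datapath, but a single RFU op if the array can read three registers at once. A C sketch of the software equivalent (illustrative, not Chimaera's actual instruction encoding):

```c
#include <stdint.h>

/* 3-input bitwise multiplexer: for each bit, pick a where the mask m
 * is 1 and b where it is 0. Three dependent ALU ops in software; a
 * single combinational row on a 3-input RFU. */
uint32_t bit_select(uint32_t m, uint32_t a, uint32_t b)
{
    return (m & a) | (~m & b);
}
```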
Chimaera Architecture • Live copy of register file values feeds into the array • Each row of the array may compute from registers or intermediate values • Tags on the array rows indicate which RFUOP they compute
Chimaera Architecture • Array can operate on values as soon as they are placed in the register file • Logic is combinational • When an RFUOP matches: • Stall until the result is ready • Drive the result from the matching row
Chimaera Timing (timing diagram: registers R5, R3, R2, R1 arriving over successive cycles) • If R1 is presented late, the processor stalls • Might be helped by instruction reordering • Physical implementation is an issue • Relies on considerable processor interaction for support
Chimaera Results • Three Spec92 benchmarks: • Compress 1.11x speedup • Eqntott 1.8x • Life 2.06x • Small arrays with limited state • Small speedup • Perhaps focus on global routing rather than local optimization
Garp • Integrate as a coprocessor • Similar bandwidth to the processor as a functional unit would have • Own access to memory • Support multi-cycle operation • Allow state • Cycle counter to track operation • Configuration cache, path to memory
Garp – UC Berkeley • ISA – coprocessor operations • Issue gaconfig to make a particular configuration present • Explicitly move data to/from the array • Processor suspends during coprocessor operation • Use cycle counter to track progress • Array may directly access memory • Processor and array share memory • Exploits streaming data operations • Cache/MMU maintains data consistency • See the usage sketch below
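A sketch of the resulting usage pattern from the processor's point of view. The operations (gaconfig, explicit moves, cycle counter) come from the slides; the C intrinsic names below are invented for illustration:

```c
#include <stdint.h>

/* Hypothetical C bindings for the Garp coprocessor interface; the
 * function names are invented, but the sequence mirrors the ISA:
 * configure, move data in, run for a counted number of cycles,
 * move data out. */
extern void     garp_config(const void *cfg);  /* gaconfig: make a configuration present */
extern void     garp_put(int r, uint32_t v);   /* explicit move: processor -> array      */
extern uint32_t garp_get(int r);               /* explicit move: array -> processor      */
extern void     garp_start(unsigned cycles);   /* start array; cycle counter counts down */

uint32_t run_kernel(const void *cfg, uint32_t x)
{
    garp_config(cfg);    /* hits the configuration cache if recently used  */
    garp_put(0, x);
    garp_start(64);      /* processor may suspend until the counter hits 0 */
    return garp_get(0);  /* the array may also have streamed to memory     */
}
```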
Garp Instructions • Interlock indicates whether the processor waits for the array to count down to zero • Last three instructions useful for context swaps • Processor decode hardware augmented to recognize the new instructions
Garp Array • Row-oriented logic • Dedicated path for processor/memory traffic • Processor does not have to be involved in the array-memory path
Garp Results • General results: • 10-20x improvement on streaming, feed-forward operations • 2-3x when data dependencies limit pipelining • [Hauser, FCCM '97]
PRISC/Chimaera vs. Garp • PRISC/Chimaera • Basic op is single-cycle: expfu • No state • Could have multiple PFUs • Fine-grained parallelism • Not effective for deep pipelines • Garp • Basic op is multi-cycle: gaconfig • Effective for deep pipelining • Single array • Requires consideration of state swapping
Common Theme • To overcome instruction expression limits: • Define new array instructions; makes decode hardware slower and more complicated • Many bits of configuration -> swap time is an issue (recall techniques for dynamic reconfiguration) • Give each array configuration a short "name" which the processor can call out • Store multiple configurations in the array and access them as needed (DPGA) — see the sketch below
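A sketch of the short-"name" idea (hypothetical code): the instruction carries only a small configuration ID, and a handful of on-array contexts cache recently used configurations so the full swap cost is paid only on a miss.

```c
/* Hypothetical configuration cache: the processor names a configuration
 * with a short ID; the array stores a few contexts (DPGA-style) and
 * reloads from memory only on a miss. */
#define NUM_CONTEXTS 4
extern void load_config_from_memory(int cfg_id, int ctx);  /* invented helper */
static int loaded_id[NUM_CONTEXTS] = { -1, -1, -1, -1 };

int select_context(int cfg_id)
{
    for (int i = 0; i < NUM_CONTEXTS; i++)
        if (loaded_id[i] == cfg_id)
            return i;                         /* hit: fast context switch     */
    int victim = 0;                           /* naive replacement policy     */
    load_config_from_memory(cfg_id, victim);  /* miss: pay the full swap time */
    loaded_id[victim] = cfg_id;
    return victim;
}
```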
Observation • All of these coprocessors have been single-threaded • Performance improvement limited by application parallelism • Potential for task/thread parallelism • DPGA • Fast context switch • Concurrent threads, as seen in the discussion of the I/O/stream processor • Added complexity needs to be addressed in software
Parallel Computation: Processor and FPGA • What would it take to let the processor and FPGA run in parallel? • Modern processors deal with: • Variable data delays • Data dependencies • Multiple heterogeneous functional units • Via: • Register scoreboarding (see the sketch below) • Runtime dataflow (Tomasulo)
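A minimal sketch of the simpler of the two mechanisms, register scoreboarding: one busy bit per register, set at issue and cleared at writeback, stalls any instruction whose operands are still pending (an illustrative simplification, not any particular machine's design):

```c
#include <stdbool.h>

/* Minimal register scoreboard: a busy bit per architectural register.
 * An instruction issues only when its sources are not pending and its
 * destination is free; the producing unit (ALU or FPGA) clears the bit
 * on writeback. */
static bool busy[32];

bool can_issue(int src1, int src2, int dst)
{
    return !busy[src1] && !busy[src2] && !busy[dst];
}

void on_issue(int dst)     { busy[dst] = true;  }  /* result now pending   */
void on_writeback(int dst) { busy[dst] = false; }  /* result now available */
```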
OneChip • Want array to have direct memory->memory operations • Want to fit into the programming model/ISA • Without forcing exclusive processor/FPGA operation • Allowing decoupled processor/array execution • Key idea: • FPGA operates on memory->memory regions • Make regions explicit to processor issue • Scoreboard memory blocks • [Jacob+Chow: Toronto]
OneChip Instructions • Basic operation is: • FPGA: MEM[Rsource] -> MEM[Rdst] • Block sizes are powers of 2 • Supports 14 "loaded" functions • DPGA/contexts, so 4 can be cached • Fits well into the soft-core processor model
OneChip • Basic op is: FPGA MEM -> MEM • No state between these ops • Coherence: ops appear sequential • Could have multiple/parallel FPGA compute units • Scoreboarded with the processor and each other • Open questions: • Single-source operations? • Can't chain FPGA operations?
OneChip Extensions (figure: address map with FPGA and processor regions at 0x0, 0x1000, 0x10000 — data pages are tracked like a virtual memory system) • FPGA operates on certain memory regions only • Makes regions explicit to processor issue • Scoreboard memory blocks — see the sketch below
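A hedged sketch of block-level memory scoreboarding (a hypothetical simplification of the OneChip scheme): each outstanding FPGA op locks a naturally aligned, power-of-2-sized block, and the processor stalls any load/store that falls inside a pending block.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical memory-block scoreboard: a pending FPGA op locks a
 * naturally aligned block named by a base address and log2(size). */
typedef struct { uint32_t base; unsigned log2_size; bool pending; } mem_block;

static bool overlaps(const mem_block *b, uint32_t addr)
{
    uint32_t mask = ~((1u << b->log2_size) - 1u);  /* block-alignment mask */
    return b->pending && ((addr & mask) == (b->base & mask));
}

/* A processor load/store must stall while any pending block covers it. */
bool must_stall(const mem_block *blocks, int n, uint32_t addr)
{
    for (int i = 0; i < n; i++)
        if (overlaps(&blocks[i], addr))
            return true;
    return false;
}
```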
Model Roundup • Interfacing • I/O processor (asynchronous) • Instruction augmentation • PFU (like FU, no state) • Synchronous coprocessor • VLIW • Configurable vector • Asynchronous coroutine/coprocessor • Memory->memory coprocessor
Shadow Registers for Functional Units • Reconfigurable functional units require tight integration with register file • Many reconfigurable operations require more than two operands at a time Cong: 2005
Multi-operand Operations • What’s the best speedup that could be achieved? • Provides upper bound • Assumes all operands available when needed
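One way to compute this bound (standard Amdahl-style reasoning, offered here as an illustration rather than the paper's exact model): if a fraction f of run time is spent in the multi-operand operations and the RFU speeds them up by a factor s, with all operands assumed free, the best overall speedup is S_max = 1 / ((1 - f) + f/s). For example, f = 0.5 and s = 10 gives S_max ≈ 1.8.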
Providing Additional Register File Access • Dedicated link – move data as needed • Incurs latency • Extra register port – consumes resources • May not be used often • Replicate whole (or most of) register file • Can be wasteful
Shadow Register Approach • Small number of registers needed (3 or 4) • Use extra bits in each instruction • Can be scaled for necessary port size
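A hedged sketch of the encoding idea (hypothetical, not the paper's exact scheme): a couple of extra bits per instruction select one of the shadow registers to also capture the result, so the RFU gains operand ports without extra register-file read ports.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical shadow-register writeback: 2 extra instruction bits pick
 * one of 4 shadow registers to also receive the result, which the RFU
 * can later read directly. */
#define SHADOW_COUNT 4
static uint32_t shadow[SHADOW_COUNT];

void writeback(uint32_t result, unsigned shadow_sel, bool copy_to_shadow)
{
    if (copy_to_shadow)
        shadow[shadow_sel & (SHADOW_COUNT - 1u)] = result;
    /* ...normal register-file writeback proceeds as usual... */
}
```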
Shadow Register Approach • Approach achieves 89% of the ideal speedup for 3-input functions • Paper also describes supporting algorithms
Summary • Many different models for coprocessor implementation • Functional unit • Stand-alone coprocessor • Programming models for these systems are key • Recent compiler advancements open the door for future development • Need tie-in with applications