ECE 697F Reconfigurable Computing, Lecture 19: Reconfigurable Coprocessors
Overview • Focus on processor/array hybrids • Motivation • Compute models: how the array fits into the computation • Examples: Garp, PRISM, REMARC, OneChip, PRISC • Some lecture material taken with permission from André DeHon's lectures on reconfigurable computing.
Motivation • Processors are efficient at sequential code and regular arithmetic operations • FPGAs are efficient at fine-grained parallelism and unusual bit-level operations • Tight coupling is important: it allows sharing of data and control • Converging technologies: SRAM is being migrated onto the same die as the processor anyway. Why not integrate?
Motivation: Other Viewpoints • Replace interface glue logic • I/O pre/post-processing • Handle real-time responsiveness • Provide powerful, application-specific operations • Allow migration of function/performance over time.
Compute Models • Glue logic for buses, adapters. • Dedicated I/O processor • Instruction augmentation • Special instructions/coprocessor ops • VLIW/microcoded extension to processor • Configurable vector unit • Autonomous co/stream processor
Interfacing • Logic replaces: • ASIC customization • External FPGA/CPLD • Example • Bus protocols • Peripherals • Sensors, actuators • Argument • Need customization • Modern chips have capacity • Reduce part count • Migrate to system-on-a-chip • Performance/power
I/O Processor • Array dedicated to servicing an I/O channel • Sensor, LAN, WAN, peripheral • Many protocols, services • Don't need all of them at the same time • Provides protocol handling • Stream computation • Compression, encryption • Effectively looks like an I/O peripheral to the processor • Offloads function from the processor.
I/O Processing • Single-threaded processor created in reconfigurable logic • No support for multiple data pipes or multiple contexts • Need some minimal, local control to handle events • For performance or real-time guarantees, may need to service events rapidly • Checksumming and acknowledging packets, for example (see the sketch below).
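As a concrete example of the kind of per-packet stream task such an I/O coprocessor could absorb, here is a minimal C sketch of the standard 16-bit ones'-complement Internet checksum (RFC 1071 style); the function name and framing are illustrative, not from the lecture.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit ones'-complement checksum (RFC 1071 style) over a packet
 * buffer: the sort of per-packet task an I/O coprocessor can handle
 * without involving the main processor. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {               /* sum successive 16-bit words */
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }
    if (len == 1)                   /* pad a trailing odd byte */
        sum += (uint32_t)data[0] << 8;

    while (sum >> 16)               /* fold carries back into 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```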
[Figure: bit reversal. Input bits a31…a0 have their positions swapped to produce output bits b31…b0]
Instruction Augmentation • Processor can only describe a small number of basic computations in a cycle • I instruction bits -> 2^I operations • Recall that for Boolean functions a total of ______ operations could be performed on 2 W-bit words • ALU implementations restrict execution to a few simple operations • e.g. bit reversal (see the sketch below)
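To make the bit-reversal example concrete: on a conventional ALU the operation costs a loop of shifts, masks, and ORs, while in an FPGA it is pure wiring. A minimal C sketch of the software side:

```c
#include <stdint.h>

/* On a processor, reversing the 32 bits of a word costs a loop (or a
 * log2(32)-step shuffle) of shifts, masks, and ORs. */
uint32_t bit_reverse(uint32_t a)
{
    uint32_t b = 0;
    for (int i = 0; i < 32; i++) {
        b = (b << 1) | (a & 1);   /* peel the LSB off a, push into b */
        a >>= 1;
    }
    return b;
}
/* In an FPGA the same operation is pure wiring: route bit a[i] to
 * output bit b[31-i], at zero logic cost and single-cycle latency. */
```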
Instruction Augmentation • Provide a way to augment the processor instruction set for an application • Avoid mismatch between hardware and software. What's required? • Fit augmented instructions into the data and control stream • Create a functional unit for augmented instructions • Compiler techniques to identify/use the new functional unit.
Chimaera • Start from the PRISC idea • Integrate as a functional unit • No state • RFU ops (like expfu) • Stall the processor on an instruction miss • Add: • Multiple instructions at a time • More than 2 inputs possible • Hauck: University of Washington (a usage sketch follows below).
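One way to picture the PRISC/Chimaera programming model: the compiler collapses a recognized bit-manipulation sequence into a single RFU instruction. The slide names PRISC's expfu; the C wrapper below, including its signature and its software stand-in body, is purely an assumption for illustration.

```c
#include <stdint.h>

/* Hypothetical intrinsic standing in for PRISC's expfu / Chimaera's
 * RFUOP: run programmable-functional-unit operation `op` on two
 * register operands, result back to a register -- stateless and
 * single-cycle, like any other ALU op. The body is a software
 * stand-in so the sketch compiles. */
static uint32_t expfu(int op, uint32_t rs1, uint32_t rs2)
{
    (void)op;
    return (rs1 ^ (rs2 << 7)) ^ (rs1 >> 13);  /* some fused bit operation */
}

uint32_t hash_step(uint32_t key, uint32_t salt)
{
    /* Several shift/mask/xor instructions collapse into one RFU op;
     * op number 3 is an arbitrary illustration. */
    return expfu(3, key, salt);
}
```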
Chimaera Architecture • Live copy of register file values feeds into the array • Each row of the array may compute from register values or intermediate results • Tag on the array indicates an RFUOP
Chimaera Architecture • Array can operate on values as soon as they are placed in the register file • Logic is combinational • When an RFUOP matches: • Stall until the result is ready • Drive the result from the matching row
Chimaera Timing [Figure: a result depending on registers R5, R3, R2, R1, which arrive over successive cycles] • If R1 is presented last, the processor stalls • Might be helped by instruction reordering • Physical implementation is an issue.
Chimaera Results • Three SPEC92 benchmarks: • Compress: 1.11x speedup • Eqntott: 1.8x • Life: 2.06x • Small arrays with limited state • Small speedup • Perhaps focus on global routing rather than local optimization.
Garp • Integrate as a coprocessor • Bandwidth to the processor similar to a functional unit's • Own access to memory • Support multi-cycle operation • Allow state • Cycle counter to track operation • Configuration cache, path to memory
Garp – UC Berkeley • ISA – coprocessor operations • Issue gaconfig to make particular configuration present. • Explicitly move data to/from array • Processor suspension during coproc operation • Use cycle counter to track progress • Array may directly access memory • Processor and array share memory • Exploits streaming data operations • Cache/MMU maintains data consistency
Garp Instructions • Interlock bit indicates whether the processor waits for the array to count down to zero • Last three instructions are useful for context swaps • Processor decode hardware is augmented to recognize the new instructions • A call-pattern sketch follows below.
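A hedged sketch of the resulting call pattern in C. The slide names gaconfig; the move and start operations follow the mtga/mfga flavor of the Garp papers, but every signature here is an assumption, and the bodies are empty software stand-ins so the sketch compiles.

```c
#include <stdint.h>

/* Software stand-ins for the Garp coprocessor interface. gaconfig is
 * named on the slide; mtga/mfga/gastart follow the flavor of the Garp
 * papers, but these C signatures and bodies are assumptions, not the
 * real ISA. */
static uint32_t array_reg[4];                         /* stand-in array state */
static void gaconfig(const void *cfg) { (void)cfg; }  /* make config present  */
static void mtga(int r, uint32_t v)   { array_reg[r] = v;    } /* move to array   */
static uint32_t mfga(int r)           { return array_reg[r]; } /* move from array */
static void gastart(uint32_t cycles)  { (void)cycles; }
/* real interlocked start: set the cycle counter; the processor
 * suspends until the array counts down to zero */

static const void *filter_cfg;        /* hypothetical configuration handle */

uint32_t run_filter(uint32_t sample)
{
    gaconfig(filter_cfg);    /* no-op if the configuration is resident    */
    mtga(0, sample);         /* explicitly move the operand into the array */
    gastart(16);             /* run 16 cycles; processor suspends meanwhile */
    return mfga(0);          /* explicitly move the result back            */
}
```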
Garp Array • Row-oriented logic • Dedicated path for processor/memory • Processor does not have to be involved in array-memory path
Garp Results • General results • 10-20X improvement on stream, feed-forward operation • 2-3x when data dependencies limit pipelining • [Hauser-FCCM97]
PRISC/Chimaera vs. Garp • PRISC/Chimaera: • Basic op is single-cycle: expfu • No state • Could have multiple PFUs • Fine-grained parallelism • Not effective for deep pipelines • Garp: • Basic op is multi-cycle: gaconfig • Effective for deep pipelining • Single array • Requires consideration of state swapping
Common Theme • To overcome instruction-expression limits: • Define new array instructions; decode hardware becomes slower and more complicated • Many bits of configuration mean swap time is an issue (recall the techniques for dynamic reconfiguration) • Give each array configuration a short "name" which the processor can call out (see the table sketch below) • Store multiple configurations in the array and access them as needed (DPGA)
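A minimal sketch of the short-"name" idea: the processor keeps a small table mapping configuration IDs to resident array contexts and streams in the bits only on a miss, much like a cache. The slot count, names, and loader call are all illustrative.

```c
#define NUM_CONTEXTS 4   /* on-array configuration slots (DPGA-style) */

/* Map short configuration IDs to resident array contexts; -1 = empty.
 * The processor names a configuration with a small ID instead of
 * streaming all the configuration bits on every use. */
static int resident[NUM_CONTEXTS] = { -1, -1, -1, -1 };
static unsigned next_victim;

int activate_config(int config_id)
{
    /* Hit: the configuration is already on-array; switching contexts
     * is fast, like a cache hit. */
    for (int slot = 0; slot < NUM_CONTEXTS; slot++)
        if (resident[slot] == config_id)
            return slot;

    /* Miss: evict a slot and stream in the bits (the expensive path). */
    int slot = (int)(next_victim++ % NUM_CONTEXTS);
    /* load_bitstream(slot, config_id);   -- hypothetical loader call */
    resident[slot] = config_id;
    return slot;
}
```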
REMARC • Miyamori/Olukotun – Stanford • Array of "nano-processors" • 16-bit, 32 instructions each • VLIW-like instruction • Coprocessor interface (similar to Garp) • No direct array-to-memory path
REMARC Architecture • 8x8 array of nanoprocessors • Reminiscent of the DPGA, except the processing element is an ALU
Nanoprocessor Tile • Each tile has its own instruction RAM • Communication with nearest-neighbor tiles • A global sequencer broadcasts the instruction index; tiles have no local PC • 16-bit output (a structural sketch follows below)
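The tile structure can be captured in a few lines of C: each tile holds a private 32-entry instruction RAM but no program counter, and a global sequencer broadcasts one instruction index per cycle. The struct layout and step function are an illustration, not REMARC's actual implementation.

```c
#include <stdint.h>

#define ROWS 8
#define COLS 8
#define IRAM_DEPTH 32          /* 32 instructions per nanoprocessor */

/* Illustrative model of one tile: a 16-bit datapath, a private
 * instruction RAM, nearest-neighbor outputs -- but no program counter. */
struct tile {
    uint32_t iram[IRAM_DEPTH]; /* local VLIW-like instruction words */
    uint16_t out;              /* 16-bit output visible to neighbors */
};

static struct tile grid[ROWS][COLS];

/* One global step: the sequencer broadcasts a single instruction index
 * and every tile executes its own word at that index, so control is
 * shared while per-tile behavior differs. */
void step(unsigned global_pc)
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            uint32_t insn = grid[r][c].iram[global_pc % IRAM_DEPTH];
            /* execute(insn, &grid[r][c]);  -- hypothetical ALU step */
            (void)insn;
        }
}
```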
REMARC Results • REMARC is 60x smaller than an FPGA • Performance is comparable
Observation • All of these coprocessors have been single-threaded • Performance improvement is limited by application parallelism • Potential for task/thread parallelism • DPGA • Fast context switch • Concurrent threads, as seen in the discussion of the I/O/stream processor • Added complexity needs to be addressed in software.
Scalability? • Can scale: • Number of inactive contexts (similar to a cache model) • Number of PFUs in PRISC/Chimaera • Still limited by the single execution thread • Exacerbates pressure on, and complexity of, the reconfigurable logic and interconnect • Cannot scale: • Amount of active resources • Perhaps take a coarser-grained approach to parallel processing.
Parallel Computation: Processor and FPGA • What would it take to let the processor and FPGA run in parallel? • Modern processors deal with: • Variable data delays • Data dependencies • Multiple heterogeneous functional units • Via: • Register scoreboarding • Runtime dataflow (Tomasulo) • A scoreboard sketch follows below.
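A register scoreboard in miniature, as a hedged C sketch: each register carries a busy bit that is set at issue and cleared at writeback, and an instruction issues only when its sources and destination are free. OneChip applies the same idea at the granularity of memory blocks, as sketched after the OneChip slides.

```c
#include <stdbool.h>

#define NREGS 32

static bool busy[NREGS];      /* one pending-write bit per register */

/* Issue check: an instruction may issue only if no in-flight
 * instruction still owes a write to its sources or destination. */
bool can_issue(int src1, int src2, int dst)
{
    return !busy[src1] && !busy[src2] && !busy[dst];
}

void issue(int dst)     { busy[dst] = true;  }  /* mark result pending */
void writeback(int dst) { busy[dst] = false; }  /* result now valid    */
```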
OneChip – Toronto • Allow the array more memory-to-memory operations • Want to fit into the programming model/ISA without forcing exclusive processor/FPGA operation • Also allow decoupled processor/array execution • Allow interlocking of data via a special "scoreboard" mechanism.
[Figure: memory map with region boundaries at 0x0, 0x1000, and 0x10000, labeled FPGA and Proc; usage of data pages is tracked like a virtual memory system]
OneChip Innovations • FPGA operates only on certain memory regions • Makes those regions explicit to processor issue logic • Scoreboards memory blocks
OneChip • Basic op is an FPGA mem -> mem operation • No state between ops • Ops must appear sequential • Could have multiple/parallel FPGA compute units • Scoreboard between all of them • Multiprocessing? • A memory-scoreboard sketch follows below.
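Extending the scoreboard from registers to memory, a hedged sketch of the OneChip-style check: each in-flight FPGA mem -> mem operation claims the address ranges it reads and writes, and a processor load/store to an overlapping range stalls. All structure and field names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LOCKS 8   /* source + destination ranges of in-flight FPGA ops */

/* One entry per address range an in-flight FPGA mem -> mem operation
 * has claimed. */
struct mem_lock {
    bool      active;
    uintptr_t lo, hi;          /* locked region [lo, hi) */
};

static struct mem_lock locks[MAX_LOCKS];

static bool overlaps(uintptr_t lo1, uintptr_t hi1,
                     uintptr_t lo2, uintptr_t hi2)
{
    return lo1 < hi2 && lo2 < hi1;
}

/* FPGA side: claim a region when an op issues, release on completion. */
void lock_region(int i, uintptr_t lo, uintptr_t hi)
{
    locks[i] = (struct mem_lock){ true, lo, hi };
}
void unlock_region(int i) { locks[i].active = false; }

/* Processor side: a load/store to [addr, addr+len) must stall while any
 * FPGA operation holds an overlapping region -- this is what keeps the
 * decoupled processor/FPGA execution looking sequential. */
bool must_stall(uintptr_t addr, size_t len)
{
    for (int i = 0; i < MAX_LOCKS; i++)
        if (locks[i].active &&
            overlaps(addr, addr + len, locks[i].lo, locks[i].hi))
            return true;
    return false;
}
```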
Summary • Several different models and uses for a "reconfigurable processor" • Some move toward parallel computing, others toward single processors • Exploit the density and expressiveness of fine-grained, parallel operations • A number of ways to integrate; need to work around the limitations of each.