Warp Processors Towards Separating Function and Architecture. Frank Vahid Professor Department of Computer Science and Engineering University of California, Riverside Faculty member, Center for Embedded Computer Systems, UC Irvine
Warp Processors: Towards Separating Function and Architecture
Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside; Faculty member, Center for Embedded Computer Systems, UC Irvine
This research is supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Motorola
Main Idea: Warp Processors – Dynamic HW/SW Partitioning
1 Initially execute application in software only
2 Profile application to determine critical regions
3 Partition critical regions to hardware
4 Program configurable logic & update software binary
5 Partitioned application executes faster with lower energy consumption (speed has been “warped”)
[Figure: µP with I$ and D$, Profiler, Warp Config. Logic Architecture, and Dynamic Part. Module (DPM)]
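Step 2 can be pictured with a small sketch. It assumes, purely for illustration, that the profiler sees a trace of taken branches and treats any taken branch whose target lies at or before the branch as a loop back edge; the function name `find_critical_loops` and the trace format are hypothetical and not taken from the slides.

```python
from collections import Counter

def find_critical_loops(branch_trace, top=1):
    """Return the `top` most frequently taken backward branches.

    branch_trace: iterable of (branch_pc, target_pc) pairs for taken branches.
    Assumption of this sketch: a taken branch whose target is at or before
    the branch itself closes a loop, so its count approximates loop heat.
    """
    counts = Counter()
    for branch_pc, target_pc in branch_trace:
        if target_pc <= branch_pc:
            # Key each loop by (loop start, loop end).
            counts[(target_pc, branch_pc)] += 1
    return counts.most_common(top)

# Hypothetical trace: one hot loop, one cold loop, a few forward branches.
trace = [(0x120, 0x100)] * 1000 + [(0x200, 0x1F0)] * 10 + [(0x80, 0x300)] * 5
hot = find_critical_loops(trace)
```

Here `hot` names the loop spanning 0x100 to 0x120 as the candidate critical region; the forward branches are ignored.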
Separating Function and Architecture
• Benefits to “standard binary” for microprocessor
  • Concept: separate function from detailed architecture
  • Uniform, mature development tools
  • Same binary can run on variety of architectures
  • New architectures can be developed and introduced for existing applications
• Trend towards dynamic translation and optimization of function in mapping to architecture
[Figure: SW → Standard Compiler → Binary → Profiling; the same binary runs on Processor1, Processor2, and Processor3]
Introduction: Partitioning software kernels or all of sw to FPGA
• Improvements eclipse those of dynamic software methods
  • Speedups of 10x to 1000x
  • Far more potential than dynamic SW optimizations (1.3x, maybe 2-3x?)
  • Energy reductions of 90% or more
• Why not more popular?
  • Loses benefits of standard binary
  • Non-standard chips (didn’t even exist a few years ago)
  • Special tools, harder to design/test/debug, ...
[Figure: Profiler identifies critical regions of the SW, which move to HW; Processor and FPGA, commonly one chip today]
Single-Chip Microprocessor/FPGA Platforms Appearing Commercially
FPGAs are big, but Moore’s Law continues on… and mass-produced platforms can be very cost effective
FPGA next to processor increasingly common
[Figure: commercial single-chip platforms, courtesy of Atmel, Altera, Triscend, and Xilinx; one image is labeled “PowerPCs”]
Introduction: Binary-Level Hardware/Software Partitioning
• Can we dynamically move software kernels to FPGA?
• Enabler – binary-level partitioning and synthesis
  • [Stitt & Vahid, ICCAD’02]
  • Partition and synthesize starting from SW binary
  • Initially desktop based
• Advantages
  • Any compiler, any language, multiple sources, assembly/object support, legacy code support
• Disadvantage
  • Loses high-level information
  • Quality loss?
[Figure: SW → Standard Compiler → Binary → Profiling → Binary Partitioner → Modified Binary (to Processor) and Netlist (to ASIC/FPGA); annotation: “Traditional partitioning done here”]
Introduction: Binary-Level Hardware/Software Partitioning
[Figure not recovered; credited to Stitt/Vahid’04]
Introduction: Binary Partitioning Enables Dynamic Partitioning
• Dynamic HW/SW Partitioning
  • Embed partitioning CAD tools on-chip
  • Feasible in era of billion-transistor chips
• Advantages
  • No special desktop tools
  • Completely transparent
  • Avoid complexities of supporting different FPGA types
• Complements other approaches
  • Desktop CAD best from purely technical perspective
  • Dynamic opens additional market segments (i.e., all software developers) that otherwise might not use desktop CAD
• Back to “standard binary” – opens processor architects to world of speedup using FPGAs
[Figure: SW → Standard Compiler → Binary → Profiling; CAD on-chip alongside Proc. and FPGA]
Warp Processors: Tools & Requirements
• Warp Processor Architecture
  • On-chip profiling architecture
  • Configurable logic architecture
  • Dynamic partitioning module
• DPM with uP overkill? Consider that the FPGA is much bigger than a uP. Also consider that there may be dozens of uPs, but all can share one DPM.
[Figure: Profiler, uP with I$ and D$, Config. Logic Arch., and DPM; the DPM tool flow comprises Decompilation, Partitioning, RT Synthesis, Logic Synthesis, Technology Mapping, Placement & Routing, and Binary Updater, turning the Binary into HW and an Updated Binary]
Warp Processors: All that CAD on-chip?
• CAD people may first think dynamic HW/SW partitioning is “absurd”
• Those CAD tools are complex
  • Require long execution times on powerful desktop workstations
  • Require very large memory resources
  • Usually require GBytes of hard drive space
  • Costs of complete CAD tools package can exceed $1 million
• All that on-chip?
[Figure: desktop CAD tools (Decomp., Partitioning, RT Syn., Log. Syn., Tech. Map, Place, Route) annotated with 10 MB to 60 MB of memory and 30 s to 30 mins of execution time per tool]
Warp Processors: Tools & Requirements
• But, in fact, on-chip CAD may be practical since specialized
• CAD
  • Traditional CAD – huge, arbitrary input
  • Warp Processor CAD – critical sw kernels
• FPGA
  • Traditional FPGA – huge, arbitrary netlists, ASIC prototyping, varied I/O
  • Warp Processor FPGA – kernel speedup
• Careful simultaneous design of FPGA and CAD
  • FPGA features evaluated for impact on CAD
  • CAD influences FPGA features
  • Add architecture features for kernels
[Figure: Profiler, uP with I$ and D$, Config. Logic Arch., and DPM, next to the DPM tool flow from Binary to HW and Updated Binary]
Warp Processors: Configurable Logic Architecture
• Loop support hardware
  • Data address generators (DADG) and loop control hardware (LCH), found in digital signal processors – fast loop execution
  • Supports memory accesses with regular access pattern
  • Synthesis of FSM not required for many critical loops
• 32-bit fast Multiply-Accumulate (MAC) unit
[Figure: DADG & LCH with registers Reg0, Reg1, Reg2; 32-bit MAC; Configurable Logic Fabric. Lysecky/Vahid, DATE’04]
Warp Processors: Configurable Logic Fabric
• Simple fabric: array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
• Simple CLB: Two 3-input 2-output LUTs
  • carry-chain support
• Simple switch matrices: 4-short, 4-long channels
• Designed for simple fast CAD
[Figure: fabric of CLBs and SMs next to the ARM, DADG, LCH, and 32-bit MAC; CLB detail with two LUTs, inputs a to f, outputs o1 to o4, and links to adjacent CLBs; SM detail with short channels 0 to 3 and long channels 0L to 3L. Lysecky/Vahid, DATE’04]
Warp Processors: Dynamic Partitioning Module (DPM)
• Dynamic Partitioning Module
  • Executes on-chip partitioning tools
  • Consists of small low-power processor (ARM7)
    • Current SoCs can have dozens
  • On-chip instruction & data caches
  • Memory: a few megabytes
[Figure: DPM built from an ARM with I$, D$, and Memory, running the tool flow from Binary to HW and Updated Binary; Decompilation highlighted]
Warp Processors: Decompilation
• Goal: recover high-level information lost during compilation
  • Otherwise, synthesis results will be poor
  • Utilize sophisticated decompilation methods
    • Developed over past decades for binary translation
• Indirect jumps hamper CDFG recovery
  • But not too common in critical loops (function pointers, switch statements)
[Figure: decompilation flow]
Software Binary → Binary Parsing → CDFG Creation → Control Structure Recovery (discover loops, if-else, etc.) → Removing Instruction-Set Overhead (reduce operation sizes, etc.) → Undoing Back-End Compiler Optimizations (reroll loops, etc.) → Alias Analysis (allows parallel memory access) → Annotated CDFG
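The “discover loops” part of control structure recovery can be sketched with a textbook technique: a depth-first search over the control-flow graph reports back edges, and each back edge defines a natural loop. This is a generic sketch, not the tool's actual method; the graph encoding (a dict of successor lists) and the block names are assumptions made for the example.

```python
def find_back_edges(cfg, entry):
    """Return edges (tail, header) that point back to a block still on the DFS stack.

    cfg: dict mapping each basic block to a list of successor blocks.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in cfg}
    back_edges = []

    def visit(node):
        color[node] = GREY
        for succ in cfg.get(node, []):
            state = color.get(succ, WHITE)
            if state == GREY:
                back_edges.append((node, succ))  # succ is an ancestor: a loop header
            elif state == WHITE:
                visit(succ)
        color[node] = BLACK

    visit(entry)
    return back_edges

def natural_loop(cfg, tail, header):
    """Return the set of blocks in the loop closed by the back edge tail -> header."""
    preds = {}
    for node, succs in cfg.items():
        for succ in succs:
            preds.setdefault(succ, []).append(node)
    body = {header}
    stack = [tail]
    while stack:
        node = stack.pop()
        if node not in body:
            body.add(node)
            stack.extend(preds.get(node, []))  # walk backwards until the header
    return body

# Hypothetical CFG of a simple counted loop.
cfg = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
edges = find_back_edges(cfg, "entry")
loop_blocks = natural_loop(cfg, *edges[0])
```

The recursive search is adequate for the small graphs of a critical kernel; a large binary would call for an iterative traversal.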
Warp Processors: Decompilation Results
• In most situations, we can recover all high-level information
• Recovery success for dozens of benchmarks, using several different compilers and optimization levels:
[Table of per-benchmark results not recovered]
Warp Processors: Execution Time and Memory Requirements
[Figure: CAD tool flow (Decomp., Partitioning, RT Syn., Log. Syn., Tech. Map, Place, Route). Desktop tools: 10 MB to 60 MB of memory and 30 s to 30 mins of execution time per tool. On-chip figures shown so far: <1 s and 1 MB.]
Warp Processors: Dynamic Partitioning Module (DPM)
[Figure: DPM tool flow from Binary to HW and Updated Binary, with RT Synthesis highlighted]
Warp Processors: RT Synthesis
• Converts decompiled CDFG to Boolean expressions
• Maps memory accesses to our data address generator architecture
  • Detects read/write, memory access pattern, memory read/write ordering
• Optimizes dataflow graph
  • Removes address calculations and loop counter/exit conditions
  • Loop control handled by Loop Control Hardware
[Figure: dataflow graph over r1, r2, r3 in which a memory read and its address increment are replaced by a DADG Read; annotations: Memory Read, Increment Address. Stitt/Lysecky/Vahid, DAC’03]
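What a “regular access pattern” means for the data address generator can be shown with a minimal sketch. It checks a list of observed addresses for a constant stride, which is the kind of pattern a base-plus-increment address generator can reproduce. The real tool presumably derives this statically from the CDFG rather than from an address list; the function name `detect_stride` and the input format are assumptions for illustration only.

```python
def detect_stride(addresses):
    """Return (base, stride) if the addresses form an arithmetic sequence, else None."""
    if len(addresses) < 2:
        return None
    stride = addresses[1] - addresses[0]
    for prev, cur in zip(addresses, addresses[1:]):
        if cur - prev != stride:
            return None  # irregular pattern: not expressible as base + i * stride
    return addresses[0], stride

# Hypothetical word-by-word array walk starting at 0x1000.
pattern = detect_stride([0x1000, 0x1004, 0x1008, 0x100C])
```

A result such as `(0x1000, 4)` is all an address generator needs, so the address arithmetic can be dropped from the dataflow graph.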
Warp Processors: RT Synthesis
• Maps dataflow operations to hardware components
  • We currently support adders, comparators, shifters, Boolean logic, and multipliers
• Creates Boolean expression for each output bit of dataflow graph
[Figure: dataflow graph in which r1 and r2 feed a 32-bit adder producing r4, and a 32-bit comparator (<) involving r3 and the constant 8 produces r5]
r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = …
…
Stitt/Lysecky/Vahid, DAC’03
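The per-bit expressions for the adder follow a ripple-carry pattern, which a short sketch can generate for any width. The sum expressions match the two lines shown on the slide; the carry expression for bits above 0 is elided there, so the standard full-adder carry used below is an assumption, as are the function names.

```python
def ripple_adder_expressions(dst, a, b, width):
    """Return per-bit Boolean expressions (as strings) for dst = a + b."""
    lines = []
    for i in range(width):
        if i == 0:
            sum_expr = f"{a}[0] xor {b}[0]"
            carry_expr = f"{a}[0] and {b}[0]"
        else:
            sum_expr = f"({a}[{i}] xor {b}[{i}]) xor carry[{i - 1}]"
            # Assumed standard full-adder carry; the slide elides this term.
            carry_expr = (f"({a}[{i}] and {b}[{i}]) or "
                          f"(carry[{i - 1}] and ({a}[{i}] xor {b}[{i}]))")
        lines.append(f"{dst}[{i}] = {sum_expr}")
        lines.append(f"carry[{i}] = {carry_expr}")
    return lines

def add_bits(x, y, width):
    """Evaluate the same equations numerically, one bit at a time."""
    carry = 0
    result = 0
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        result |= (xi ^ yi ^ carry) << i
        carry = (xi & yi) | (carry & (xi ^ yi))
    return result

expressions = ripple_adder_expressions("r4", "r1", "r2", 32)
```

`add_bits` exists only to check that the generated equations really describe addition modulo 2^width.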
Warp Processors: Execution Time and Memory Requirements
[Figure: CAD tool flow (Decomp., Partitioning, RT Syn., Log. Syn., Tech. Map, Place, Route). Desktop tools: 10 MB to 60 MB of memory and 30 s to 30 mins of execution time per tool. On-chip figures shown so far: <1 s per tool and 0.5 MB to 1 MB of memory.]
Warp Processors: Dynamic Partitioning Module (DPM)
[Figure: DPM tool flow from Binary to HW and Updated Binary, with Logic Synthesis highlighted]
Warp Processors: Logic Synthesis
• Optimize hardware circuit created during RT synthesis
• Large opportunity for logic minimization due to use of immediate values in the binary code
• Utilize simple two-level logic minimization approach
Example: r2 = r1 + 4
Before logic synthesis:
r2[0] = r1[0] xor 0 xor 0
r2[1] = r1[1] xor 0 xor carry[0]
r2[2] = r1[2] xor 1 xor carry[1]
r2[3] = r1[3] xor 0 xor carry[2]
…
After logic synthesis:
r2[0] = r1[0]
r2[1] = r1[1] xor carry[0]
r2[2] = r1[2] xor carry[1]
r2[3] = r1[3] xor carry[2]
…
Stitt/Lysecky/Vahid, DAC’03
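Why immediates create so much room for minimization can be seen by propagating the constant bits of r2 = r1 + 4 through the adder equations. The sketch below uses a tiny expression tree with simplifying constructors (x xor 0 = x, x xor 1 = not x, x and 0 = 0, and so on). It is a plain constant-propagation illustration, not the two-level minimizer the slides describe, and every name in it is hypothetical. With the carries resolved as well, bits 0 and 1 pass straight through and bit 2, which is xor-ed with the constant 1, comes out inverted.

```python
def mk_not(a):
    return 1 - a if a in (0, 1) else ("not", a)

def mk_xor(a, b):
    if a == 0:
        return b
    if b == 0:
        return a
    if a == 1:
        return mk_not(b)
    if b == 1:
        return mk_not(a)
    return ("xor", a, b)

def mk_and(a, b):
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    return ("and", a, b)

def mk_or(a, b):
    if a == 1 or b == 1:
        return 1
    if a == 0:
        return b
    if b == 0:
        return a
    return ("or", a, b)

def add_constant(src, const, width):
    """Return simplified per-bit expressions for src + const (low `width` bits)."""
    carry, bits = 0, []
    for i in range(width):
        a, k = f"{src}[{i}]", (const >> i) & 1
        half = mk_xor(a, k)
        bits.append(mk_xor(half, carry))
        carry = mk_or(mk_and(a, k), mk_and(carry, half))
    return bits

def show(e):
    """Render an expression tree in the slide's xor/and/or notation."""
    if isinstance(e, tuple):
        if e[0] == "not":
            return f"not {show(e[1])}"
        return f"({show(e[1])} {e[0]} {show(e[2])})"
    return str(e)

def evaluate(e, env):
    """Evaluate an expression tree given a dict of variable bit values."""
    if isinstance(e, tuple):
        if e[0] == "not":
            return 1 - evaluate(e[1], env)
        x, y = evaluate(e[1], env), evaluate(e[2], env)
        return {"xor": x ^ y, "and": x & y, "or": x | y}[e[0]]
    return env[e] if isinstance(e, str) else e

bits = add_constant("r1", 4, 4)
listing = [f"r2[{i}] = {show(b)}" for i, b in enumerate(bits)]
```

`listing` holds one simplified equation per output bit, and `evaluate` lets the simplified form be checked against ordinary addition.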
Warp Processors: ROCM
• ROCM – Riverside On-Chip Minimizer
  • Two-level minimization tool
  • Utilizes a combination of approaches from Espresso-II [Brayton, et al. 1984] and Presto [Svoboda & White, 1979]
  • Eliminates the need to compute the off-set to reduce memory usage
  • Utilizes a single expand phase instead of multiple iterations
  • On average only 2% larger than optimal solution for benchmarks
[Figure: Expand, Reduce, and Irredundant steps operating on the on-set, dc-set, and off-set]
Lysecky/Vahid, DAC’03; Lysecky/Vahid, CODES+ISSS’03
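The flavour of a single-expand, off-set-free minimizer can be sketched on toy inputs. Cubes are strings over `0`, `1`, and `-`; a literal is raised to `-` only if the enlarged cube stays inside the on-set and dc-set, so no off-set is ever built. The containment check here enumerates minterms, which is exponential and only workable for tiny examples; it is an assumption of this sketch and certainly not how ROCM itself performs the check. There is no reduce step, and all names are hypothetical.

```python
from itertools import product

def minterms(cube):
    """All fully specified points covered by a cube such as '1-0'."""
    choices = [("0", "1") if c == "-" else (c,) for c in cube]
    return {"".join(p) for p in product(*choices)}

def expand(cube, care):
    """Raise literals to '-' while the cube stays inside on-set plus dc-set."""
    for i, c in enumerate(cube):
        if c == "-":
            continue
        trial = cube[:i] + "-" + cube[i + 1:]
        if minterms(trial) <= care:
            cube = trial
    return cube

def irredundant(cover, on_points):
    """Drop every cube whose removal still leaves the on-set fully covered."""
    kept = list(cover)
    for cube in list(kept):
        others = [c for c in kept if c != cube]
        covered = set().union(*map(minterms, others))
        if on_points <= covered:
            kept = others
    return kept

def minimize(on_set, dc_set=()):
    """Single expand pass followed by an irredundant pass."""
    on_points = set().union(*map(minterms, on_set))
    care = on_points | set().union(*map(minterms, dc_set))
    expanded = []
    for cube in on_set:
        grown = expand(cube, care)
        if grown not in expanded:
            expanded.append(grown)
    return irredundant(expanded, on_points)

# Toy three-input function given as four on-set minterms.
cover = minimize(["000", "001", "011", "111"])
```

For this input the four minterms collapse to the two cubes `00-` and `-11`; the intermediate cube `0-1` is dropped as redundant.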
Warp Processors: ROCM Results
ROCM executing on 40 MHz ARM7 requires less than 1 second
Small code size of only 22 kilobytes
Average data memory usage of only 1 megabyte
[Figure: results chart comparing a 40 MHz ARM7 (Triscend A7) with a 500 MHz Sun Ultra60]
Lysecky/Vahid, DAC’03; Lysecky/Vahid, CODES+ISSS’03
Warp Processors: Execution Time and Memory Requirements
[Figure: CAD tool flow (Decomp., Partitioning, RT Syn., Log. Syn., Tech. Map, Place, Route). Desktop tools: 10 MB to 60 MB of memory and 30 s to 30 mins of execution time per tool. On-chip figures shown so far: <1 s to 1 s per tool and 0.5 MB to 1 MB of memory.]
Warp Processors: Dynamic Partitioning Module (DPM)
[Figure: DPM tool flow from Binary to HW and Updated Binary, with Technology Mapping and Placement and Routing highlighted]
Warp Processors: Technology Mapping/Packing
• ROCPAR – Technology Mapping/Packing
  • Decompose hardware circuit into basic logic gates (AND, OR, XOR, etc.)
  • Traverse logic network combining nodes to form single-output LUTs
  • Combine LUTs with common inputs to form final 2-output LUTs
  • Pack LUTs in which output from one LUT is input to second LUT
  • Pack remaining LUTs into CLBs
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03
Warp Processors: Placement
• ROCPAR – Placement
  • Identify critical path, placing critical nodes in center of configurable logic fabric
  • Use dependencies between remaining CLBs to determine placement
  • Attempt to use adjacent cell routing whenever possible
[Figure: grid of CLBs]
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03
Warp Processors: Execution Time and Memory Requirements
[Figure: CAD tool flow (Decomp., Partitioning, RT Syn., Log. Syn., Tech. Map, Place, Route). Desktop tools: 10 MB to 60 MB of memory and 30 s to 30 mins of execution time per tool. On-chip figures shown so far: <1 s to 1 s per tool and 0.5 MB to 1 MB of memory.]
Warp Processors: Routing
• FPGA Routing
  • Find a path within FPGA to connect source and sinks of each net
• VPR – Versatile Place and Route [Betz, et al., 1997]
  • Modified Pathfinder algorithm
    • Allows overuse of routing resources during each routing iteration
    • If illegal routes exist, update routing costs, rip-up all routes, and reroute
  • Increases performance over original Pathfinder algorithm
  • Routability-driven routing: Use fewest tracks possible
  • Timing-driven routing: Optimize circuit speed
[Figure: flowchart: Routing Resource Graph → Route → illegal?/congestion? → yes: Rip-up and route again; no: Done!]
Warp Processors: Routing
• Riverside On-Chip Router (ROCR)
  • Represent routing nets between CLBs as routing between SMs
  • Resource Graph
    • Nodes correspond to SMs
    • Edges correspond to short and long channels between SMs
  • Routing
    • Greedy, depth-first routing algorithm routes nets between SMs
    • Assign specific channels to each route, using Brelaz’s greedy vertex coloring algorithm
  • Requires much less memory than VPR as resource graph is much smaller
[Figure: flowchart: Routing Resource Graph → Route → illegal?/congestion? → yes: Rip-up and route again; no: Done!]
Lysecky/Vahid/Tan, submitted to DAC’04
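Brelaz’s greedy vertex coloring (commonly called DSATUR) can be sketched as follows. Each vertex stands for a route, an edge joins two routes that cannot share a channel, and a color is a channel index. How ROCR actually builds its conflict graph is not described on the slide, so the graph below, and the reading of colors as channels, are assumptions made for illustration.

```python
def dsatur(conflicts):
    """Color a graph greedily, always picking the most saturated vertex next.

    conflicts: dict mapping each vertex to the set of vertices it conflicts with.
    Returns a dict mapping each vertex to a color index starting at 0.
    """
    colors = {}
    uncolored = set(conflicts)
    while uncolored:
        def priority(n):
            # Saturation: number of distinct colors already used by neighbours.
            saturation = len({colors[m] for m in conflicts[n] if m in colors})
            # Tie-break on remaining degree, then on name for determinism.
            degree = sum(1 for m in conflicts[n] if m in uncolored)
            return (saturation, degree, str(n))
        node = max(uncolored, key=priority)
        used = {colors[m] for m in conflicts[node] if m in colors}
        color = 0
        while color in used:
            color += 1  # smallest color not used by any neighbour
        colors[node] = color
        uncolored.remove(node)
    return colors

# Hypothetical conflict graph: routes A, B, C pairwise conflict; D conflicts with A.
conflicts = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
channels = dsatur(conflicts)
```

If the coloring needs more colors than the channels available between two switch matrices (4 short and 4 long in the fabric described earlier), some route would have to be ripped up and rerouted; that step is outside this sketch.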
Warp Processors: Routing – Performance and Memory Usage Results
• Average 10X faster than VPR (TD)
  • Up to 21X faster for ex5p
• Memory usage of only 3.6 MB
  • 13X less than VPR
Lysecky/Vahid/Tan, to appear in DAC’04
Warp Processors: Routing – Critical Path Results
32% longer critical path than VPR (Timing Driven)
10% shorter critical path than VPR (Routability Driven)
Lysecky/Vahid/Tan, submitted to DAC’04
Warp Processors: Execution Time and Memory Requirements
[Figure: CAD tool flow (Decomp., Partitioning, RT Syn., Log. Syn., Tech. Map, Place, Route). Desktop tools: 10 MB to 60 MB of memory and 30 s to 30 mins of execution time per tool. On-chip tools: <1 s to 10 s per tool and 0.5 MB to 3.6 MB of memory.]
Warp Processors: Experimental Setup
• Warp Processor
  • Embedded microprocessor
  • Configurable logic fabric with fixed frequency 80% that of the microprocessor
    • Based on commercial single-chip platform (Triscend A7)
  • Used dynamic partitioning module to map critical region to hardware
    • Our CAD tools executed on a 75 MHz ARM7 processor
    • DPM active for ~10 seconds
  • Experiment: key tools automated; some other tasks assisted by hand
• Versus traditional HW/SW Partitioning
  • ARM processor
  • Xilinx Virtex-E FPGA (maximum possible speed)
  • Manually partitioned software using VHDL
  • VHDL synthesized using Xilinx ISE 4.1 on desktop
[Figure: warp processor (Profiler, ARM with I$ and D$, Config. Logic Arch., DPM) versus ARM7 with I$ and D$ plus Xilinx Virtex-E FPGA]
Warp Processors: Initial Results – Speedup (Critical Region/Loop)
Average loop speedup of 29x
[Figure: per-benchmark speedup chart not recovered]
Warp Processors: Initial Results – Speedup (overall application with ONLY 1 loop sped up)
Average speedup of 2.1 vs. 2.2 for Virtex-E
[Figure: per-benchmark speedup chart not recovered; one bar is labeled 4.1]
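The gap between a 29x loop speedup and a roughly 2x application speedup is what Amdahl’s law predicts when only part of the run time is accelerated. The slides do not name this law or give per-benchmark loop fractions, so the sketch below is an illustrative reading of the averages only: averages of ratios do not compose exactly, and the model ignores communication overhead.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: speedup of the whole application when only the loop is faster."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)

def implied_fraction(overall, loop_speedup):
    """Fraction of original run time the loop must occupy to yield `overall`."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / loop_speedup)

# Plugging in the two averages quoted on the results slides.
fraction = implied_fraction(2.1, 29.0)
ceiling = overall_speedup(fraction, float("inf"))
```

Under these assumptions the single loop would account for a little over half of the original execution time, and even an infinitely fast loop would cap the overall speedup at about 2.2x, which is why speeding up more than one loop matters.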
Warp Processors: Initial Results – Energy Reduction (overall application, 1 loop ONLY)
Average energy reduction of 33% vs. 36% for Xilinx Virtex-E
[Figure: per-benchmark energy chart not recovered; one bar is labeled 74%]
Warp Processors: Execution Time and Memory Requirements (on PC)
• Xilinx ISE: 9.1 s, 60 MB
• ROCPAR: 0.2 s, 3.6 MB
• 46x improvement
• On a 75 MHz ARM7: only 1.4 s
[Figure: tool flow comparison; one step is annotated “Manually performed”]
Multi-processor platforms
• Multiple processors can share a single DPM
  • Time-multiplex
  • Just another processor whose task is to help the other processors
  • Processors can even be soft cores in FPGA
• DPM can even re-visit same application in case use or data has changed
[Figure: many uPs and a Config. Logic Arch. with one DPM, shared by all uPs]
Idea of Warp Processing can be Viewed as JIT FPGA Compilation
• JIT FPGA Compilation
  • Idea: standard binary for FPGA
  • Similar benefits as standard binary for microprocessor
    • Portability, transparency, standard tools
  • May involve microprocessor for compactness of non-critical behavior
[Figure: VHDL/Verilog → Standard CAD Tools → Std. HW Binary → Profiling → JIT FPGA Comp. → FPGA containing adders, multipliers, and MEM]
Future Directions
• Already widely known that mapping sw to FPGA has great potential
• Our work has shown that mapping sw to FPGA dynamically may be feasible
• Extensive future work needed on tools/fabric to achieve overall application speedups/energy improvements of 100x-1000x
Ultimately…
• Working towards separation of function from architecture
  • Write application, create “standard binary”
  • Map binary to any microprocessor (one or more), any FPGA, or combination thereof
  • Enables improvements in function and architecture without the heavy interdependence of today
[Figure: SW → Standard Compiler → Binary (function) → Profiling → mapped to Processor1, FPGA1, multiple Processors, or Processor + FPGA]
Publications & Acknowledgements
All these publications are available at http://www.cs.ucr.edu/~vahid/pubs
• Dynamic FPGA Routing for Just-in-Time FPGA Compilation, R. Lysecky, F. Vahid, S. Tan, Design Automation Conference, 2004.
• A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning, R. Lysecky and F. Vahid, Design Automation and Test in Europe Conference (DATE), February 2004.
• Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware, A. Gordon-Ross and F. Vahid, ACM/IEEE Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2003; to appear in special issue “Best of CASES/MICRO” of IEEE Trans. on Comp.
• A Codesigned On-Chip Logic Minimizer, R. Lysecky and F. Vahid, ACM/IEEE ISSS/CODES conference, 2003.
• Dynamic Hardware/Software Partitioning: A First Approach, G. Stitt, R. Lysecky and F. Vahid, Design Automation Conference, 2003.
• On-Chip Logic Minimization, R. Lysecky and F. Vahid, Design Automation Conference, 2003.
• The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic, G. Stitt and F. Vahid, IEEE Design and Test of Computers, November/December 2002.
• Hardware/Software Partitioning of Software Binaries, G. Stitt and F. Vahid, IEEE/ACM International Conference on Computer Aided Design, November 2002.
We gratefully acknowledge financial support from the National Science Foundation and the Semiconductor Research Corporation for this work. We also appreciate the collaborations and support from Motorola, Triscend, and Philips/TriMedia.