DEFACTO: Combining Parallelizing Compiler Technology with Hardware Behavioral Synthesis* Pedro C. Diniz, Mary W. Hall, Joonseok Park, Byoungro So and Heidi Ziegler University of Southern California / Information Sciences Institute 4676 Admiralty Way, Suite 1001 Marina del Rey, California 90292 * The DEFACTO project was funded by the Information Technology Office (ITO) of the Defense Advanced Research Projects Agency (DARPA) under contract #F30602-98-2-0113.
Outline • Background & Motivation • Part 1: Application Mapping Example • Part 2: Design Space Exploration • Part 3: Challenges for Future FPGAs • Related Work • Conclusion
DEFACTO Objectives & Goal • Objectives: • Automatically Map High-Level Applications to Field-Programmable Gate Arrays (FPGAs) • Explore Multiple Design Choices • Goal: • Make Reconfigurable Technology Accessible to the Average Programmer
What Are FPGAs? • Key Concepts • Configurable Hardware • Reprogrammable (ms latency) • Architecture • Configurable Logic Blocks (CLBs) • “Universal” logic • Some inputs/outputs latched • Passive routing network between CLBs • Memories, processor cores
Why Use FPGAs? • Advantages over Application-Specific Integrated Circuits (ASICs) • Faster Time to Market • “Post silicon” Modification Possible • Reconfigurable, Possibly Even at Run-time • Advantages Over General-Purpose Processors • Application-Specific Customization (e.g., parallelism, small data-widths, arithmetic, bandwidth) • Disadvantages • Slow (typical automatic design @25MHz) • Low Density of Transistors
How to Program FPGAs? • Hardware-Oriented Languages • VHDL or Verilog • Very Low-Level Programming • Commercial Tools (e.g., Monet™) • Choose Implementation Based on User Constraints • Time and Space Trade-Off • Provide Estimates for Implementations • Problem: Too Slow for Large, Complex Designs • Place-and-Route Can Take up to 8 Hours for Large Designs • Unclear What to Do When Things Go Wrong
Behavioral Synthesis Example

    variable A : std_logic_vector(7 downto 0);
    ...
    X <= (A * B) - (C * D) + F;

Three alternative implementations of the same expression, trading space for clock cycles:
• 13 Registers, 1 Multiplier, 2 Adders/Subtractors, 3 (shorter) clock cycles
• 9 Registers, 2 Multipliers, 2 Adders/Subtractors, 2 (shorter) clock cycles
• 6 Registers, 2 Multipliers, 2 Adders/Subtractors, 1 (long) clock cycle
Synthesizing FPGA Designs: Status • Technology Advances have led to Increasingly Large Parts • FPGAs now have Millions of “gates” • Current Practice is to Handcode Designs for FPGAs in Structural VHDL • Tedious and Error Prone • Requires Weeks to Months Even for Fairly Simple Designs • Higher-level Approach Needed!
DEFACTO: Key Ideas • Parallelizing Compiler Technology • Complements Behavioral Synthesis • Adjusts Parallelism and Data Reuse • Optimizes External Memory Accesses • Design Space Exploration • Evaluates and Compares Designs before Committing to Hardware • Improves Design Time Efficiency • a form of Feedback-directed Optimization
Opportunities: Parallelism & Storage

Behavioral Synthesis:
• Optimizations: Scalar Variables only, inside the Loop Body
• Supports User-Controlled Loop Unrolling
• Manages Registers and Inter-operator Communication
• Considers One FPGA
• Performs Allocation, Binding & Scheduling of Hardware

Parallelizing Compiler:
• Optimizations: Scalars & Multi-Dimensional Arrays, inside the Loop Body & Across Iterations
• Analysis Guides Automatic Loop Transformations
• Evaluates Tradeoffs of Different Memories, On- and Off-chip
• System-level View
• No Knowledge of Hardware Implementation
Part 1: Mapping Complete Designs from C to FPGAs Sobel Edge Detection Example
Example - Sobel Edge Detection

    char img[IMAGE_SIZE][IMAGE_SIZE], edge[IMAGE_SIZE][IMAGE_SIZE];
    int uh1, uh2, threshold;
    for (i = 0; i < IMAGE_SIZE - 4; i++) {
      for (j = 0; j < IMAGE_SIZE - 4; j++) {
        uh1 = (((-img[i][j]) + (-(2 * img[i+1][j])) + (-img[i+2][j]))
             + ((img[i][j-2]) + (2 * img[i+1][j-2]) + (img[i+2][j-2])));
        uh2 = (((-img[i][j]) + (img[i+2][j])) + (-(2 * img[i][j-1])) + (2 * img[i+2][j-1])
             + ((-img[i][j-2]) + (img[i+2][j-2])));
        if ((abs(uh1) + abs(uh2)) < threshold)
          edge[i][j] = 0xFF;
        else
          edge[i][j] = 0x00;
      }
    }

[Figure: the two 3x3 Sobel kernels, 1 0 -1 / 2 0 -2 / 1 0 -1 and -1 -2 -1 / 0 0 0 / 1 2 1, applied to img to produce edge against threshold]
Sobel - A Naïve Implementation
[Figure: naïve datapath reading img[i][j], img[i][j+1], img[i][j+2], img[i+1][j], img[i+1][j+2], img[i+2][j], img[i+2][j+1], img[i+2][j+2] and writing edge[i][j] through a 0xFF/0x00 mux]
• Large Number of Adders and Multipliers (Shifts in This Case)
• Too Many Memory Accesses!
  • 8 Reads and 1 Write per Iteration of the Loop
• Observation: Across 2 Iterations, 4 out of the 8 Values Can Be Reused
Data Reuse Analysis - Sobel
[Figure: reuse distance vectors among the eight accesses: d = (1,0) and d = (2,0) across the rows img[i][*], img[i+1][*], img[i+2][*]; d = (0,1) and d = (0,2) across the columns]
Data Reuse Using Tapped-Delay Lines
[Figure: datapath with tapped-delay lines in place of the redundant reads, writing edge[i][j] through the 0xFF/0x00 mux]
• Reduce the Number of Memory Accesses (see the sketch below)
• Exploit Array Layout and Distribution
  • Packing
  • Stripping
• Examples: the row accesses img[i][j], img[i][j+1], img[i][j+2] and the column accesses img[i][j], img[i+1][j], img[i+2][j]
  • Access counts from the figure: 1.0 + 1.0 + 1.0 + 1.0 = 4.0 versus 0.25 + 0.25 + 0.25 + 0.25 = 1.0 (four 8-bit values packed per fetched word)
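To make the transformation concrete, here is a minimal C sketch of the tapped-delay-line idea, written against the 3x3 Sobel kernels shown earlier; it is illustrative, not DEFACTO's generated code, and the function name and bounds are assumptions. Values read in earlier iterations are kept in scalars that shift each iteration, so the inner loop issues 3 reads instead of 8; in hardware, each scalar becomes a register tap.

    #include <stdlib.h>

    #define IMAGE_SIZE 512

    void sobel_tapped(unsigned char img[IMAGE_SIZE][IMAGE_SIZE],
                      unsigned char edge[IMAGE_SIZE][IMAGE_SIZE],
                      int threshold)
    {
        int w00, w01, w02;   /* window over row i   */
        int w10, w11, w12;   /* window over row i+1 */
        int w20, w21, w22;   /* window over row i+2 */

        for (int i = 0; i < IMAGE_SIZE - 2; i++) {
            /* prime the window with the first two columns of the strip */
            w00 = img[i][0];   w01 = img[i][1];
            w10 = img[i+1][0]; w11 = img[i+1][1];
            w20 = img[i+2][0]; w21 = img[i+2][1];
            for (int j = 0; j < IMAGE_SIZE - 2; j++) {
                /* only the leading column is fetched from memory: 3 reads */
                w02 = img[i][j+2];
                w12 = img[i+1][j+2];
                w22 = img[i+2][j+2];
                /* horizontal and vertical responses, 3x3 Sobel kernels */
                int uh1 = (w00 + 2*w10 + w20) - (w02 + 2*w12 + w22);
                int uh2 = (w20 + 2*w21 + w22) - (w00 + 2*w01 + w02);
                edge[i][j] = (abs(uh1) + abs(uh2)) < threshold ? 0xFF : 0x00;
                /* shift the taps: the delay line advances by one column */
                w00 = w01; w01 = w02;
                w10 = w11; w11 = w12;
                w20 = w21; w21 = w22;
            }
        }
    }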
Overall Design Approach
• Application Data-paths
  • Extract Body of Loops
  • Uses Behavioral Synthesis
• Memory Interfaces
  • Uses Data Access Patterns to Generate Channel Specs
  • VHDL Library Templates
[Figure: two application data-paths, each connected to a pair of memories (MEM)]
WildStar™: A Complex Memory Hierarchy
[Figure: board diagram with three FPGAs (FPGA 0, FPGA 1, FPGA 2), four shared memories (Shared Memory0-3), four SRAM banks (SRAM0-3), 32-bit and 64-bit data paths, and a PCI controller leading to off-board memory]
Project Status
[Figure: tool flow: Algorithm Description → Compiler Analysis → Design Space Exploration → Code Transformations and Annotations → SUIF2VHDL → Computation & Data Partitioning → Behavioral Synthesis & Estimation (Monet) → Memory Access Protocols → Logic Synthesis (Synplicity) → Place & Route (Xilinx Foundations) → Annapolis WildStar Board]
• Complex Infrastructure
  • Different Programming Languages (C vs. VHDL), Different EDA Tools, Different Vendors
  • Experimental Target: In-House Tools Combine Compiler Techniques and Behavioral Synthesis
  • Different Execution Models: Representations Must Be Reconciled
• It Works!
  • Fully Automated for Single-FPGA Designs
  • Modest Manual Intervention for Multi-FPGA Designs (Simulation OK)
Sobel on the Annapolis WildStar Board
[Figure: input image and output edge-detected image; manual vs. automated implementations]
Part 2: Design Space Exploration Using Behavioral Synthesis Estimates
Design Space Exploration (Current Practice)
[Figure: iterative loop: Design Specification (Low-level VHDL) → Logic Synthesis / Place & Route → Validation / Evaluation → Correct? Good Design? → Design Modification, and back to the specification]
• 2 Weeks for a Working Design
• 2 Months for an Optimized Design
Design Space Exploration (Our Approach)
[Figure: flow: Algorithm (C/Fortran) → Compiler Optimizations (SUIF): Unroll and Jam, Scalar Replacement, Custom Data Layout → Unroll Factor Selection → SUIF2VHDL Translation → Behavioral Synthesis Estimation → Logic Synthesis / Place & Route]
• Overall, Less Than 2 Hours
• 5 Minutes for Optimized Design Selection
Problem Statement
[Figure: tradeoff: exploiting parallelism and reusing data on chip lowers execution time but raises space requirements (more copies of operators, more on-chip registers)]
• Constraint: Size of Design Less Than FPGA Capacity
• Goal: Minimal Execution Time
• Selection Criteria: For a Given Performance, Minimal Space (see the sketch below)
  • Frees Up More Space for Other Computations
  • Better Clock Rate Achieved
  • Desirable to Use On-chip Space Efficiently
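As an illustration of these criteria, a hypothetical selection routine might look as follows; the Design fields and capacity units are assumptions for the sketch, not DEFACTO's actual data structures.

    #include <stddef.h>

    typedef struct {
        long cycles;  /* estimated execution time of the design */
        long space;   /* estimated area, e.g., in FPGA slices   */
    } Design;

    /* Returns the index of the best candidate, or -1 if none fits:
     * constraint  - the design must fit within the FPGA capacity,
     * goal        - minimal execution time,
     * tie-breaker - for equal performance, minimal space. */
    int select_design(const Design *d, size_t n, long capacity)
    {
        int best = -1;
        for (size_t i = 0; i < n; i++) {
            if (d[i].space > capacity)
                continue;
            if (best < 0 || d[i].cycles < d[best].cycles ||
                (d[i].cycles == d[best].cycles && d[i].space < d[best].space))
                best = (int)i;
        }
        return best;
    }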
Balance
• Definition: Balance = Data Fetch Rate / Data Consumption Rate
• Consumption Rate [bits/cycle]: Data Bits Consumed per Unit of Computation Time
  • Limited by the Data Dependences of the Computation
• Data Fetch Rate [bits/cycle]: Data Bits Required per Unit of Computation Time
  • Limited by the FPGA's Effective Memory Bandwidth
• If Balance > 1, Compute Bound; If Balance < 1, Memory Bound
• Balance Suggests Whether More Resources Should Be Devoted to Computation or to Storage
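In code, the definition reduces to a ratio; a minimal sketch, assuming both rates have already been derived (the fetch rate from the memory interface analysis, the consumption rate from the synthesized datapath):

    /* balance > 1.0: compute bound - memory can supply data faster than
     *                the datapath consumes it; add operators.
     * balance < 1.0: memory bound  - the datapath starves; add storage
     *                or improve the access pattern.                    */
    double balance(double fetch_rate, double consumption_rate)
    {
        return fetch_rate / consumption_rate;  /* both in bits/cycle */
    }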
Loop Unrolling
• Exposes Fine-Grain Parallelism by Replicating the Loop Body

Original loop:

    DO I = 1, N
      A(I) = A(I-2) + B(I)

Unrolled by a factor of 2:

    DO I = 1, N, 2
      A(I)   = A(I-2) + B(I)
      A(I+1) = A(I-1) + B(I+1)

[Figure: dataflow graphs of the original and unrolled loop bodies]
• As the Unroll Factor Increases, Both the Data Fetch and Consumption Rates Increase
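The same example rendered in C, as a sketch (0-based indexing, so the loop starts at 2 to keep A[i-2] in bounds; N is assumed even):

    void unrolled(int N, int A[], int B[])
    {
        /* Unrolled by 2: A[i] depends on A[i-2], so the two statements
         * in the body carry no dependence on each other and can execute
         * in parallel on the FPGA datapath. */
        for (int i = 2; i < N; i += 2) {
            A[i]   = A[i-2] + B[i];
            A[i+1] = A[i-1] + B[i+1];
        }
    }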
Monotonicity Properties
[Figure: data fetch rate (bits/cycle), data consumption rate (bits/cycle), and balance (= fetch/consumption) plotted against the unroll factor; each curve flattens past the saturation point]
• Saturation Point: the Unroll Factor That Saturates Memory Bandwidth for a Given Architecture
Balance & Optimal Unroll Factor
[Figure: data fetch rate and data consumption rate (bits/cycle) plotted against unroll factors 1 through max, marking the optimal solution and the saturation point]
• Balance Guides the Design Space Exploration (a search sketch follows)
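Because both rates grow monotonically with the unroll factor until memory bandwidth saturates, the exploration can stop early. A hedged sketch of such a search; fetch_rate, consumption_rate, and space are stand-ins for the compiler's analysis and the behavioral synthesis estimates, not real DEFACTO interfaces:

    extern double fetch_rate(int unroll);        /* compiler analysis  */
    extern double consumption_rate(int unroll);  /* synthesis estimate */
    extern long   space(int unroll);             /* estimated area     */

    /* Walk increasing unroll factors; stop once the design no longer
     * fits or becomes memory bound (balance <= 1), since by
     * monotonicity larger factors only add space, not performance. */
    int pick_unroll(int max_unroll, long capacity)
    {
        int best = 1;
        for (int u = 1; u <= max_unroll; u *= 2) {
            if (space(u) > capacity)
                break;
            best = u;
            if (fetch_rate(u) / consumption_rate(u) <= 1.0)
                break;
        }
        return best;
    }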
Experiments • Multimedia Kernels • FIR (Finite Impulse Response) • Matrix Multiply • Sobel (Edge Detection) • Pattern Matching • Jacobi (Five Point Stencil) • Methodology • Compiler Translates C to SUIF and Behavioral VHDL • Synthesis Tool Estimates Space and Computational Latency • Compiler Computes Balance and Execution Time Accounting for Memory Latency • Memory Latency • Pipelined: 1 cycle for read and write • Non-pipelined: 7 cycles for read and 3 cycles for write
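The execution-time accounting used in the methodology can be sketched as follows, with the latencies listed above; the function and its parameters are illustrative, not DEFACTO's actual model:

    /* Estimated cycles for one design point: per loop iteration, the
     * datapath latency (from behavioral synthesis estimates) plus
     * memory traffic under the stated latencies: pipelined = 1 cycle
     * per access; non-pipelined = 7 cycles per read, 3 per write. */
    long exec_cycles(long iterations, long datapath_latency,
                     long reads_per_iter, long writes_per_iter,
                     int pipelined)
    {
        long mem = pipelined
                 ? reads_per_iter + writes_per_iter
                 : 7 * reads_per_iter + 3 * writes_per_iter;
        return iterations * (datapath_latency + mem);
    }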
FIR
[Figure: design points for outer-loop unroll factors 1, 2, 4, 8, 16, 32, and 64, with the selected design marked]
• Selected Design Speedup: 17.26
Matrix Multiply
[Figure: design points for outer-loop unroll factors 1, 2, 4, 8, 16, and 32, with the selected design marked]
• Selected Design Speedup: 13.36
Efficiency of Design Space Exploration • On Average, Only 0.3% (15%) of the Design Space Searched
FIR: Estimation vs. Accurate Data • Larger Designs Lead to Degradation in Clock Rates • Compiler Can Use a Statistical Approach to Derive Confidence Intervals for Space • In Our Case, the Compiler Makes the Correct Decision Using Imperfect Data
Part 3: Challenges for Future FPGAs Heterogeneous Functional and Storage Resources Data/Computation Partitioning and Scheduling Revisited
Field-Programmable Core Arrays
[Figure: chip with multiple IP cores (ARM, DSP), S-RAM and D-RAM blocks, and customizable interconnect]
• Large Number of Transistors
• Multiple Application-Specific Cores
• Customization of Interconnect
• Other Specialized Logic
• Challenges:
  • Data Partitioning: Custom Storage Structures; Allocation, Binding and Scheduling; Replication and Reorganization
  • Computation Partitioning: Scheduling Between Cores; Coarse-Grain Pipelining
  • Revisiting These Issues with Parallelizing Compiler Technology
Related Work • Compilers for Special-Purpose Configurable Architectures • PipeRench (CMU), RaPiD (UW), RAW (MIT) • High-Level Languages Oriented Towards Hardware • Handel-C, Cameron (CSU), PICO (HP), Napa-C (LANL) • Integrated Compiler and Logic Synthesis • Babb (MIT), Nimble (Synopsys) • Compiling from MATLAB to FPGAs • Match compiler (Northwestern)
Conclusion
• Combines Behavioral Synthesis and Parallelizing Compiler Technologies
• Fast & Automated Design Space Exploration
  • Trades Space for Functional Units via Loop Unrolling
  • Uses Balance and Monotonicity Properties
  • Searches Only 0.3% of the Entire Design Space
  • Near-Optimal Performance and Smallest Space
• Future FPGAs
  • Coarser-Grained, Custom Functional and Storage Structures
  • Multiprocessor on a Chip
  • Data and Computation Partitioning and Coarse-Grain Scheduling