Explore strategies to minimize data communication in hardware descriptions, focusing on the efficient mapping of high-level language applications to hardware. Learn about control nodes, SSA algorithms, and optimization techniques for reducing data movement.
Data Communication Estimation and Reduction for Reconfigurable Systems
Adam Kaplan, Philip Brisk, Ryan Kastner
Computer Science, University of California, Los Angeles
Elec. and Computer Engineering, University of California, Santa Barbara
June 4, 2003

From Algorithm to HDL
• We focus our efforts on mapping an application written in a high-level language to a hardware description.
• We desire this mapping to have optimal characteristics (area, latency, etc.).
• In this talk, we focus on the problem of minimizing data communication in the final hardware.
[Figure: Application specified in system-level language → Compiler → HDL (behavioral, structural) → Synthesis and Physical Design]

Similar Compilation Projects
Hardware compilers
• Reconfigurable Architecture
  • PRISM project – synthesize subset of C to FPGA
  • Garp compiler (BRASS) – synthesize C to processor + FPGA platform
  • DEFACTO – synthesize SUIF to FPGA (Wildstar)
• General Architecture
  • DeepC compiler – synthesize C to HDL
  • MATCH compiler – synthesize MATLAB to HDL
  • PICO – synthesize nested loops into VLIW-like functional unit

Our Framework
• From the SUIF IR, we construct a CDFG representation.
• Each basic block of the CDFG becomes a separate synthesizable module in the hardware description.
[Figure: C Code → SUIF/MachSUIF Compiler → Control Data-Flow Graph (CDFG) with Control Nodes 1–4 → Hardware Description]

Characterizing Data Communication
• Two examples of data communication schemes
  • Distributed: data communication = wire
  • Centralized: data communication = storage access
[Figure: Control Nodes 1–4 connected directly (distributed) and through a bus to a memory (register bank, RAM) (centralized)]

Identifying Data Communication
• Global Data Communication = 5 variables
• Determine relationship between place(s) where data is defined and where data is used
• Naïve method: all use-points of a variable depend on all definitions of that variable
• Not all use points “use” a variable
• Need analysis to minimize the amount of data communication
[Figure: example CDFG whose control nodes define and use the variables a, b, and c]

Minimizing Data Communication
• Must determine relationship between where data is generated and where data is used
• Problem formulation: minimize the total number of bits communicated between all pairs of control nodes
• SSA (Static Single Assignment)
  • Changes each variable to have a unique definition point
  • Must add φ-nodes to merge definitions
[Figure: the example CDFG in SSA form, with definitions a1, b1, a2, b2, a3, c1, edges carrying b1 and c1, and the merge a4 ← φ(a2, a3)]

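The "unique definition point" property can be illustrated for the simplest case, a single basic block, where no φ-nodes are needed. The following is only a minimal sketch: the `Stmt` type, the single-letter variable names, and the function `ssa_rename_block` are hypothetical and not taken from the talk. Merging definitions at join points still requires φ-nodes, which this sketch does not insert.

```c
#define MAX_VARS 26  /* assumption: variables are single letters 'a'..'z' */

/* One three-address statement: dest = src1 op src2.
 * A source of 0 means "operand unused". */
typedef struct {
    char dest, src1, src2;
    int dest_ver, src1_ver, src2_ver;  /* filled in by ssa_rename_block */
} Stmt;

/* Rename the statements of one basic block so that every definition
 * gets its own version number. version[v] holds the latest version of
 * variable v on entry (0 = defined outside this block) and is updated
 * in place. */
void ssa_rename_block(Stmt *stmts, int n, int version[MAX_VARS]) {
    for (int i = 0; i < n; i++) {
        Stmt *s = &stmts[i];
        /* Uses refer to the most recent definition seen so far. */
        s->src1_ver = s->src1 ? version[s->src1 - 'a'] : 0;
        s->src2_ver = s->src2 ? version[s->src2 - 'a'] : 0;
        /* Each definition creates a fresh version of its variable. */
        s->dest_ver = ++version[s->dest - 'a'];
    }
}
```

For the block `a = b op c; a = a op b; b = a`, this renaming yields `a1 = b0 op c0; a2 = a1 op b0; b1 = a2`, so each use names exactly one definition.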
Using SSA to Minimize Data Communication
• SSA algorithms
  • Find location of φ-nodes
  • Rename variables
• Three main SSA algorithms
  • Minimal, Pruned – Cytron et al.
  • Semi-pruned – Briggs et al.
• Differ in number and location of φ-nodes
  • Minimal – insert φ-nodes at the iterated dominance frontier (IDF)
  • Semi-pruned – insert a φ-node at the IDF if the variable is live outside some basic block
  • Pruned – insert a φ-node at the IDF if the variable is live at that point
[Figure: the example CDFG in SSA form in three panels: Semi-Pruned, Minimal, Pruned. Across the panels, a4 ← φ(a2, a3) appears three times, b3 ← φ(b1, b2) twice, and c2 ← φ(c1) once.]

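The three insertion rules above differ only in the extra condition checked at a block in the IDF. A minimal sketch of that decision follows, assuming the dominance-frontier and liveness facts have already been computed elsewhere. The `PhiCandidate` struct, its field names, and `needs_phi` are hypothetical names for illustration, not part of any SSA library or of the authors' implementation.

```c
#include <stdbool.h>

typedef enum { SSA_MINIMAL, SSA_SEMI_PRUNED, SSA_PRUNED } SsaFlavor;

/* Precomputed facts about one variable at one candidate join block
 * (all assumed to come from earlier analyses). */
typedef struct {
    bool in_idf;     /* block is in the IDF of the variable's definitions */
    bool non_local;  /* variable is live outside some basic block */
    bool live_in;    /* variable is live on entry to this block */
} PhiCandidate;

/* Decide whether the given SSA flavor places a phi-node for the
 * variable at this block. */
bool needs_phi(SsaFlavor flavor, PhiCandidate c) {
    if (!c.in_idf)
        return false;            /* all three flavors only use IDF blocks */
    switch (flavor) {
    case SSA_MINIMAL:     return true;
    case SSA_SEMI_PRUNED: return c.non_local;
    case SSA_PRUNED:      return c.live_in;
    }
    return false;
}
```

Under this sketch, a variable that crosses block boundaries but is dead at the join gets a φ-node in minimal and semi-pruned form and none in pruned form.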
Experimental Setup
[Figure: CDFG → SSA Conversion → CDFG in SSA form → HDL Generation → Synopsys Behavioral / Design Compiler]

MediaBench Benchmark Suite
• A benchmark suite of DSP applications [Lee et al.]
• DSP applications are well suited to hardware implementation
• They tend to:
  • be parallelizable
  • be computationally intensive
  • often have large basic blocks
Sample code (excerpt): internal filter of an image convolver

    for (y_pos=ygrid_start-y_fmid-1,res_pos=0; y_pos<0; y_pos+=ygrid_step)
    {
        for (x_pos=xgrid_start-x_fmid-1; x_pos<0; x_pos+=xgrid_step,res_pos++)
        {
            (*reflect)(filt,x_fdim,y_fdim,x_pos,y_pos,temp,FILTER);
            sum=0.0;
            for (y_filt_lin=x_fdim,x_filt=y_im_lin=0;
                 y_filt_lin<=filt_size;
                 y_im_lin+=x_dim,y_filt_lin+=x_fdim)
                for (im_pos=y_im_lin; x_filt<y_filt_lin; x_filt++,im_pos++)
                    sum+=image[im_pos]*temp[x_filt];
            result[res_pos] = sum;
        }
        first_col = x_pos+1;
        (*reflect)(filt,x_fdim,y_fdim,0,y_pos,temp,FILTER);

Results: SSA for Data Comm. Minimization
• Edge weight w(i,j) – number of bits communicated from node i to node j
• Total Edge Weight (TEW) – corresponds to the amount of data communication

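As a rough illustration of the metric, TEW can be read as the sum of w(i,j) over all pairs of distinct control nodes. The sketch below assumes the weights are stored in a dense row-major matrix and that entries on the diagonal (data that stays within one node) do not count as communication. Both are assumptions for this example, and `total_edge_weight` is a hypothetical helper name.

```c
#include <stddef.h>

/* Sum the edge weights w(i,j) over all ordered pairs of distinct
 * control nodes. w is an n-by-n row-major matrix in which
 * w[i * n + j] is the number of bits sent from node i to node j. */
unsigned long total_edge_weight(const unsigned *w, size_t n) {
    unsigned long tew = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (i != j)  /* assumption: intra-node data is not communicated */
                tew += w[i * n + j];
        }
    }
    return tew;
}
```

For three nodes with w(0,1) = 32, w(1,2) = 32, and w(2,0) = 16, the function returns 80 regardless of any value stored on the diagonal.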
Further Minimizing Data Communication
• Current SSA algorithms place φ-nodes temporally
  • In software compilation, live ranges should be short.
  • Appropriate in hardware?
[Figure: the example CDFG with a4 ← φ(a2, a3) under temporal and spatial φ-node distribution; the two versions are labeled TEW = 4 and TEW = 3]

Effect of φ-node Distribution
[Figure: spatial φ-node placement compared with temporal φ-node placement]

Spatial φ-node Distribution Algorithm
• d – number of uses of the φ-node destination
• s – number of φ-node source values
• Number of temporal links
• Number of spatial links
[Figure: a3 ← φ(a0, a1, a2) with s = 3 source values and d = 2 uses of a3]

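The link-count formulas did not survive in this transcript, so the sketch below rests on an assumed cost model rather than on the slide itself. A single temporally placed φ-node is taken to receive s values and forward one result to each of d uses (s + d links). Spatial placement is taken to replicate the φ-node at every use, each copy receiving all s values (s × d links). Equal bit widths on all links are also assumed, and all function names are hypothetical.

```c
typedef enum { PHI_TEMPORAL, PHI_SPATIAL } PhiPlacement;

/* Assumed model: s links into the single phi-node, d links out of it. */
unsigned temporal_links(unsigned s, unsigned d) { return s + d; }

/* Assumed model: one phi copy per use, each fed by all s sources. */
unsigned spatial_links(unsigned s, unsigned d) { return s * d; }

/* Pick the placement with fewer links under the assumed model,
 * falling back to temporal placement on a tie. */
PhiPlacement choose_phi_placement(unsigned s, unsigned d) {
    return spatial_links(s, d) < temporal_links(s, d) ? PHI_SPATIAL
                                                      : PHI_TEMPORAL;
}
```

Under this model, the pictured case (s = 3, d = 2) would need 5 temporal links against 6 spatial ones, while a φ-node with s = 2 and d = 1 would need 3 against 2, so the better choice depends on the individual φ-node.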
Conclusion
• In this work, we demonstrate a mapping from compiler IR (CDFG) to hardware description.
• SSA binds variables to values, which is useful in reducing data communication between control nodes.
• Spatial distribution of φ-nodes can reduce data communication, modeled as total edge weight (TEW), by as much as 20%.
• However, circuit area sometimes increases…
• Future research: refine the model using information from later stages of synthesis.
• Compiler techniques applied to hardware design can greatly reduce data communication.