Codesigned On-Chip Logic Minimization. Roman Lysecky & Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.
Introduction (On-Chip Logic Minimization)
[Figure: system-on-chip containing an ARM7 processor with I$ and D$, DMA, memory, and the on-chip minimizer. Steps shown: 1. Initialize minimizer; 2. Execute minimizer; 3. Indicate completion]
On-Chip Minimization Applications (IP Routing Table Reduction)
• IP routing table reduction
  • Routing tables of large network routers have over 30,000 entries
  • Fast IP routing lookup is difficult without using large hardware resources
• Ternary CAM (McAuley & Francis, 1993)
  • TCAM can be used to perform a routing table lookup in a single cycle
  • Requires large hardware resources and high power consumption
• Mask Extension (Liu, 2002)
  • Uses two-level logic minimization to reduce the size of the routing table
  • Good results, but did not consider off-chip communication
[Figure: an incoming IP packet with destination IP 138.23.16.9 is looked up in the routing table; the longest prefix match selects Port 7]
Prefix        Next hop
138.23.16.x   Port 7
138.23.x.x    Port 5
125.x.x.x     Port 3
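The longest-prefix-match lookup pictured on this slide can be sketched in C as follows. This is a minimal illustration and not the authors' implementation: the `RouteEntry` layout and the names `prefix_mask`, `lookup_port`, and `ipv4` are assumptions made for the example, and a linear scan stands in for whatever lookup structure a real router would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routing entry (not from the slides). */
typedef struct {
    uint32_t prefix;  /* network prefix, host byte order */
    uint8_t  length;  /* number of significant leading bits, 0..32 */
    int      port;    /* next hop (output port) */
} RouteEntry;

/* Pack four octets into one 32-bit address. */
static uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

/* Netmask with `length` leading one bits; length 0 matches everything. */
static uint32_t prefix_mask(uint8_t length) {
    assert(length <= 32);
    return length == 0 ? 0u : (uint32_t)(0xFFFFFFFFu << (32 - length));
}

/* Return the port of the longest matching prefix, or -1 if nothing matches. */
static int lookup_port(const RouteEntry *table, size_t n, uint32_t dest) {
    int best_port = -1;
    int best_len = -1;
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = prefix_mask(table[i].length);
        if ((dest & mask) == (table[i].prefix & mask) &&
            (int)table[i].length > best_len) {
            best_len = table[i].length;
            best_port = table[i].port;
        }
    }
    return best_port;
}
```

With the three prefixes from the slide (138.23.16.x, 138.23.x.x, 125.x.x.x), destination 138.23.16.9 matches the first two entries, and the longer 24-bit prefix selects Port 7. Every entry removed by table reduction shortens this scan, or shrinks the TCAM that replaces it.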
On-Chip Minimization Applications (Access Control List Reduction)
• Access Control List (ACL)
  • Used to restrict IP traffic through network routers
  • ACL size can range from 300 entries (UCR CS&E Dept.) to 10,000 entries (AOL)
  • Commonly used to block a particular protocol or port number to prevent attacks such as denial-of-service attacks
• ACL Minimization
  • Similar approach to that used for IP routing table reduction
  • However, the order of the list must be preserved
ACL Input Format: Type | Protocol | In IP | In Port | Out IP | Out Port | Action
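Why list order matters can be seen in a small first-match sketch. It is illustrative only: the `AclRule` fields are a simplified subset of the input format above, and the `ANY` wildcard, the implicit final deny, and the function name `acl_check` are assumptions, not details taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define ANY (-1)  /* assumed wildcard value for a field */

typedef enum { DENY = 0, PERMIT = 1 } Action;

/* Simplified, hypothetical rule: only protocol and destination port. */
typedef struct {
    int    protocol;  /* e.g. 6 = TCP, 17 = UDP, or ANY */
    int    out_port;  /* destination port, or ANY */
    Action action;
} AclRule;

/* First matching rule decides; later rules are never consulted. */
static Action acl_check(const AclRule *rules, size_t n, int protocol, int out_port) {
    assert(rules != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int proto_ok = rules[i].protocol == ANY || rules[i].protocol == protocol;
        int port_ok  = rules[i].out_port == ANY || rules[i].out_port == out_port;
        if (proto_ok && port_ok)
            return rules[i].action;
    }
    return DENY;  /* assumption: traffic matching no rule is denied */
}
```

Placing "deny TCP port 23" before "permit all TCP" blocks Telnet traffic, while the reverse order permits it. A minimizer may therefore merge or drop rules only when the first-match result stays the same for every packet.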
On-Chip Minimization Applications (Dynamic Hardware/Software Partitioning)
• Dynamic hardware/software partitioning (JIT compilation for FPGAs)
  • Dynamically detects frequently executed software loops and re-implements them using on-chip configurable logic
  • Requires logic synthesis tools to be embedded on-chip
[Figure: Warp Processor architecture: Profiler, MIPS/ARM, I$, D$, Dynamic Partitioning Module, Configurable Logic]
ROCM
• On-chip Logic Minimization Requirements
  • Limited data and instruction memory available
  • Quality of results must still be close to optimal
  • Execution time should remain reasonable
• On-chip Logic Minimization Goal
  • Develop an on-chip logic minimization tool that produces acceptable results with reasonable increases in execution time while using limited memory resources
• ROCM: Riverside On-Chip Minimizer
  • Two-level minimization tool
  • Utilizes a combination of approaches from Espresso-II (Brayton, et al. 1984) and Presto (Svoboda & White, 1979)
  • Eliminates the need to compute the off-set, reducing memory usage
  • Utilizes a single expand phase instead of multiple iterations
  • On average, results are only 2% larger than the optimal solution
ROCM Results (Performance/Memory Usage)
• ROCM executing on a 40 MHz ARM7 requires less than 1 second
• Small code size of only 22 kilobytes
• Average data memory usage of only 1 megabyte
[Chart: results for the 40 MHz ARM7 (Triscend A7) and a 500 MHz Sun Ultra60]
Codesign ROCM (Hardware Coprocessor)
• A customized ROCM enables us to develop an efficient hardware coprocessor
• Profiled the execution of ROCM-32 and ROCM-128 using the ARM port of the SimpleScalar simulator
• Determined the critical loops/functions that are suitable for implementation in hardware
• Identified six critical kernels that account for 91% of the total execution time but only 2% of the code size
Codesign ROCM (Minimization Coprocessor)
[Figure: the ARM7 and memory (MEM) connect to the on-chip minimizer's minimization coprocessor (Min. Coproc.) through a processor/memory interface with data and addr lines. Coprocessor kernels: Tautology.1, IsCov, SetLit, Cofactor.1, DoesInter, GetLit]
Codesign ROCM (Minimization Coprocessor)
[Figure: DoesIntersect datapath inside the minimization coprocessor. Inputs: aImpl (64 bits), dImpl (64 bits), numLits (5 bits). Datapath labels: << 1, 32 (odd), 32 (even), == 0. Output: retVal. The other kernels (Tautology.1, IsCov, SetLit, Cofactor.1, GetLit) share the processor/memory interface (data, addr)]
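The figure's labels (two 64-bit implicants, odd and even 32-bit halves, a comparison with zero) are consistent with a positional-cube encoding that uses two bits per literal. The sketch below shows one common way to test intersection under that encoding. The encoding itself (01 = complemented, 10 = uncomplemented, 11 = don't care, 00 = empty), the function name `DoesIntersect64`, and the `EVEN_BITS` constant are assumptions; the slides do not confirm that the coprocessor works this way.

```c
#include <assert.h>
#include <stdint.h>

/* Mask selecting the low (even-numbered) bit of every 2-bit literal field. */
#define EVEN_BITS 0x5555555555555555ULL

/*
 * Assumed encoding: literal i occupies bits 2i and 2i+1 of the word.
 * Two implicants intersect when no literal field of (a AND d) is 00.
 */
static int DoesIntersect64(uint64_t aImpl, uint64_t dImpl, unsigned numLits) {
    assert(numLits <= 32);
    uint64_t common = aImpl & dImpl;
    /* Fold each field's odd bit onto its even bit: 1 means the field is non-empty. */
    uint64_t nonEmpty = (common | (common >> 1)) & EVEN_BITS;
    /* Only the first numLits fields are meaningful. */
    uint64_t used = (numLits == 32)
        ? EVEN_BITS
        : (EVEN_BITS & ((1ULL << (2 * numLits)) - 1));
    return (nonEmpty & used) == used;
}
```

The whole test is a handful of bitwise operations on fixed-width words with no loops or memory accesses, which is the kind of kernel that maps naturally onto a small combinational datapath.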
Codesign ROCM Results (Execution Time)
• Average speedup of 7.8
Codesign ROCM Results (Energy Consumption)
• Average energy reduction of 59.2%
Codesign ROCM (Minimization Coprocessor)
• Software modifications were required to achieve the speedup of 7.8
  • The original data structures and algorithms were not suitable for hardware implementation
  • Reorganized data structures
  • Customized the width of data items
  • Eliminated memory allocation within critical regions
• These modifications are not automated by current hardware/software partitioning tools
Codesign ROCM (Minimization Coprocessor)
Original C Code
    for(i=0; i<F->numImplicants; i++) {
        if( !DoesIntersect(implicant, xj) )
            continue;
        for(k=0; k<xj->numLiterals; k++) {
            // determine coImplicant
            ...
        }
        AddImplicant(cofactor, &coImplicant);
    }
Annotations:
• 28.5% of total exec. time
• AddImplicant(cofactor, &coImplicant): only 3.5% of total exec. time, requires dynamic memory allocation
• Move to HW
Codesign ROCM (Minimization Coprocessor)
Modified C Code
    // determine size of cofactor initially
    cofactorSize = 0;
    for(i=0; i<F->numImplicants; i++) {
        if( !DoesIntersect(implicant, xj) )
            continue;
        cofactorSize++;
    }
    // allocate all memory outside of main loop
    cofactor->implicants = malloc(…);
    for(i=0; i<F->numImplicants; i++) {
        if( !DoesIntersect(implicant, xj) )
            continue;
        // additional initialization code needed for each iteration
        coImplicant = &(cofactor->implicants[index++]);
        for(k=0; k<xj->numLiterals; k++) {
            ...
        }
    }
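The count-then-allocate pattern from the modified code can be shown as a self-contained sketch. Everything beyond the pattern itself is an assumption made so the example compiles: the `Cube` and `Cover` types, the two-bits-per-literal encoding (00 = empty field, 11 = don't care), the helper names, and the cofactor rule `c | ~xj`, which is the textbook positional-cube cofactor rather than anything stated on the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical types: one 64-bit word per implicant, 2 bits per literal. */
typedef struct { uint64_t bits; } Cube;
typedef struct { Cube *implicants; size_t numImplicants; } Cover;

static uint64_t used_mask(unsigned numLits) {
    assert(numLits <= 32);
    return numLits == 32 ? ~0ULL : ((1ULL << (2 * numLits)) - 1);
}

/* Cubes intersect when no 2-bit literal field of (a AND b) is 00. */
static int cubes_intersect(Cube a, Cube b, unsigned numLits) {
    uint64_t even = 0x5555555555555555ULL & used_mask(numLits);
    uint64_t common = a.bits & b.bits;
    return (((common | (common >> 1)) & even) == even);
}

/* Build the cofactor of F with respect to xj; returns 0 on success, -1 on failure. */
static int build_cofactor(const Cover *F, Cube xj, unsigned numLits, Cover *cofactor) {
    size_t count = 0, index = 0;

    /* Pass 1: determine the size of the cofactor. */
    for (size_t i = 0; i < F->numImplicants; i++)
        if (cubes_intersect(F->implicants[i], xj, numLits))
            count++;

    /* Allocate all memory once, outside the critical loop. */
    cofactor->numImplicants = 0;
    cofactor->implicants = count ? malloc(count * sizeof(Cube)) : NULL;
    if (count && cofactor->implicants == NULL)
        return -1;

    /* Pass 2: allocation-free loop, the part suited to hardware. */
    for (size_t i = 0; i < F->numImplicants; i++) {
        if (!cubes_intersect(F->implicants[i], xj, numLits))
            continue;
        Cube *coImplicant = &cofactor->implicants[index++];
        coImplicant->bits = (F->implicants[i].bits | ~xj.bits) & used_mask(numLits);
    }
    cofactor->numImplicants = index;
    return 0;
}
```

The intersection test runs twice per implicant, which costs extra work in software. The trade is that the second loop touches only preallocated memory, so it can be handed to a coprocessor that has no access to `malloc`. The caller releases the result with `free(cofactor->implicants)`.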
Conclusions & Future Work
• Developed codesigned on-chip logic minimization
  • Performance improvement of nearly 8X compared to the earlier software-only implementation
  • Energy reduction of almost 60%
• New directions in hardware/software partitioning
  • Designer effort was required to rewrite algorithms and fine-tune data structures
  • Could better hardware/software partitioning tools automate this?