Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs. Frank Vahid Associate Professor Dept. of Computer Science and Engineering University of California, Riverside Also with the Center for Embedded Computer Systems at UC Irvine
Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs
Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend
General Purpose vs. Special Purpose
• Standard tradeoff
[Image captions: "Oct. 14, 2002, Cincinnati, Ohio: physicians at Cincinnati Children’s Hospital Medical Center report duct tape effective at treating warts." / "Amazing to think this came from wolves."]
General Purpose vs. Single Purpose Processors
[Figure: two processors for the same task. General purpose: controller (control logic and state register, IR, PC), datapath (register file, general ALU), program memory holding the assembly code for the loop, data memory. Single purpose: controller (control logic, state register), datapath (registers i and total, adder), data memory.]
Task computed by both:
    total = 0
    for i = 1 to N loop
        total += M[i]
    end loop
[Photo caption: ENIAC, 1940’s. Its flexibility was the big deal.]
• Designers have long known that:
  • General-purpose processors are flexible
  • Single-purpose processors are fast
[Figure: general purpose OR single purpose, compared on flexibility, design cost, time-to-market, performance, power efficiency, size]
Mixing General and Single Purpose Processors
[Figure: digital camera chip with lens, CCD, CCD preprocessor, pixel coprocessor, A2D, D2A, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, display control, memory controller, ISA bus interface, UART, LCD control]
• A.k.a. hardware/software partitioning
• Hardware: single-purpose processors
  • Coprocessor, accelerator, peripheral, etc.
• Software: general-purpose processors
  • Though hardware underneath!
• Especially important for embedded systems
  • Computers embedded in devices (cameras, cars, toys, even people)
  • Speed, cost, time-to-market, power, size, … demands are tough
How is Partitioning Done for Embedded Systems?
[Figure: informal spec -> system partitioning -> sw spec and hw spec -> sw design and hw design -> processor and ASIC]
• Partitioning into hw and sw blocks is done early
  • During the conceptual stage
• Sw design is done separately from hw design
• Attempts since the late 1980s to automate partitioning have not yet been successful
  • Partitioning manually is reasonably straightforward
  • Spec is informal and not machine readable
  • Sw algorithms may differ from hw algorithms
  • No compelling need for tools
New Platforms Invite New Efforts in Hw/Sw Partitioning
[Figure: informal spec -> system partitioning -> sw spec and hw spec -> sw design and hw design -> Processor + FPGA and ASIC, with an added partitioning step after sw design]
• New single-chip platforms contain both a general-purpose processor and an FPGA
  • FPGA: field-programmable gate array
  • Programmable just like software, hence flexible
  • Intended largely to implement single-purpose processors
• Can we perform a later partitioning to improve the software too?
Commercial Single-Chip Microprocessor/FPGA Platforms
[Figure: Triscend E5 chip, showing configurable logic, 8051 processor plus other peripherals, and memory]
• Triscend E5: based on 8-bit 8051 CISC core (2000)
• 10 Dhrystone MIPS at 40 MHz
• Up to 40K logic gates
• Cost only about $4
Single-Chip Microprocessor/FPGA Platforms
• Atmel FPSLIC
  • Field-Programmable System-Level IC
  • Based on AVR 8-bit RISC core
  • 20 Dhrystone MIPS
  • 5k-40k logic gates
  • $5-$10
Courtesy of Atmel
Single-Chip Microprocessor/FPGA Platforms
• Triscend A7 chip (2001)
  • Based on ARM7 32-bit RISC processor
  • 54 Dhrystone MIPS at 60 MHz
  • Up to 40k logic gates
  • $10-$20 in volume
Courtesy of Triscend
Single-Chip Microprocessor/FPGA Platforms
• Altera’s Excalibur EPXA 10 (2002)
  • ARM (922T) hard core
  • 200 Dhrystone MIPS at 200 MHz
  • ~200k to ~2 million logic gates
Source: www.altera.com
Single-Chip Microprocessor/FPGA Platforms
[Figure: Xilinx Virtex II Pro die, with PowerPCs and configurable logic labeled]
• Xilinx Virtex II Pro (2002)
  • PowerPC based
  • 420 Dhrystone MIPS at 300 MHz
  • 1 to 4 PowerPCs
  • 4 to 16 gigabit transceivers
  • 12 to 216 multipliers
  • Millions of logic gates
  • 200k to 4M bits RAM
  • 204 to 852 I/O
  • $100-$500 (>25,000 units)
• Up to 16 serial transceivers
  • 622 Mbps to 3.125 Gbps
Courtesy of Xilinx
Single-Chip Microprocessor/FPGA Platforms
• Why wouldn’t future microprocessor chips include some amount of on-chip FPGA?
• One argument against: area
  • Lots of silicon area taken up by FPGA
  • FPGA is about 20-30 times less area efficient than custom logic
  • FPGAs used to be for prototyping, too big for final products
• But chip trends imply that FPGAs will be O.K. in final products…
How Much is Enough? Perhaps a bit small
How Much is Enough? Reasonably sized
How Much is Enough? Probably plenty big for most of us
How Much is Enough? More than typically necessary
How Much Custom Logic is Enough?
• 1993: ~1 million logic transistors. Perhaps a bit small
[Figure: IC inside its IC package. For scale: 8-bit processor: 50,000 transistors; Pentium: 3 million transistors; MPEG decoder: several million transistors]
How Much Custom Logic is Enough?
• 1996: ~5-8 million logic transistors. Reasonably sized
• 1999: ~10-50 million logic transistors. Probably plenty big for most of us
• 2002: ~100-200 million logic transistors. More than typically necessary
• 2008: >1 BILLION logic transistors (1993: 1 million). Perhaps very few people could design this
Very Few Companies Can Design High-End ICs
[Figure: IC capacity (logic transistors per chip, in millions) and designer productivity (K transistors per staff-month) on log scales, 1981 to 2009. The capacity curve follows Moore’s Law; the widening space between the two curves is the design productivity gap. Source: ITRS’99]
• Designer productivity is growing at a slower rate than IC capacity
  • 1981: 100 designer months, ~$1M
  • 2002: 30,000 designer months, ~$300M
Single-Chip Platforms with On-Chip FPGAs
[Figure annotation: high-end IC design is becoming out of reach of mainstream designers]
• So, big FPGAs on-chip are O.K., because mainstream designers couldn’t have used all that silicon area anyway
• But couldn’t designers use custom logic instead of FPGAs to make smaller chips and save costs?
Shrinking Chips
• Yes, but there’s a limit
• Chips are becoming pin limited
[Figure: a chip shrinking inside its ring of pads connecting to external pins. The area inside the pad ring will exist whether we use it all or not. Caption: a football huddle can only get so small.]
Trend Towards Pre-Fabricated Platforms: ASSPs
• ASSP: application specific standard product
  • Domain-specific pre-fabricated IC
  • e.g., digital camera IC
• ASIC: application specific IC
• ASSP revenue > ASIC
• ASSP design starts > ASIC
  • Unique IC design
  • Ignores quantity of same IC
• ASIC design starts decreasing
  • Due to strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest, September ’01
Microprocessor/FPGA Platforms
• Trends point towards such platforms increasing in popularity
• Can we automatically partition the software to utilize the FPGA?
  • For improved speed and energy
Automatic Hardware/Software Partitioning
[Figure: software “spec” -> ideal partitioner -> software (compilation, processor) and hardware (synthesis, ASIC/FPGA)]
• Since the late 1980s, the goal has been spec in, hw/sw out
• But no successful commercial tool yet. Why?
    // From MediaBench’s JPEG codec
    GLOBAL(void)
    jpeg_fdct_ifast (DCTELEM * data)
    {
      DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
      DCTELEM tmp10, tmp11, tmp12, tmp13;
      DCTELEM z1, z2, z3, z4, z5, z11, z13;
      DCTELEM *dataptr;
      int ctr;
      SHIFT_TEMPS

      /* Pass 1: process rows. */
      dataptr = data;
      for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
        tmp0 = dataptr[0] + dataptr[7];
        tmp7 = dataptr[0] - dataptr[7];
        tmp1 = dataptr[1] + dataptr[6];
        …
    // Thousands of lines like this in dozens of files
Why No Successful Tool Yet?
• Most research has focused on extensive exploration
  • Roots in VLSI CAD
  • Decompose problem into fine-grained operations
  • Apply sophisticated partitioning algorithms
• Examples
  • Min-cut, dynamic programming, simulated annealing, tabu-search, genetic evolution, etc.
• Is this overkill?
[Figure: “spec” decomposed into 1000s of nodes (like circuit partitioning) and fed to the partitioner]
We Really Only Need Consider a Few Loops, Due to the 90-10 Rule
• Recent appearance of embedded benchmark suites
  • Enables analysis and understanding of the real problem
  • We’ve examined UCLA’s MediaBench, Netbench, Motorola’s Powerstone
  • Currently examining EEMBC (embedded equivalent of SPEC)
• UCR loop analysis tools based on SimpleScalar and Simics
[Code excerpt on slide: the jpeg_fdct_ifast loop from MediaBench’s JPEG codec]
[Figure: chart over loops 1 to 10 with a vertical axis from 0 to 1. Each loop was assigned a number, sorted by fraction of contribution to total execution time.]
The 90-10 Rule Holds for Embedded Systems
• In fact, the most frequent loop alone took 50% of time, using 1% of code
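The loop ranking described above can be turned into a simple coverage count. The following is a minimal sketch, not the UCR analysis tools: the function name `loops_to_cover` and the assumption that per-loop time fractions are already available as a sorted array are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns how many of the hottest loops are needed
 * before their combined share of execution time reaches `target`
 * (e.g. 0.9 for the 90-10 rule).
 * `frac` holds each loop's fraction of total execution time and must
 * already be sorted from largest to smallest.
 * Returns n + 1 if all n loops together still fall short of the target. */
size_t loops_to_cover(const double *frac, size_t n, double target)
{
    double covered = 0.0;

    assert(frac != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        covered += frac[i];          /* running share of execution time */
        if (covered >= target)
            return i + 1;            /* loops are counted from 1 */
    }
    return n + 1;
}
```

If the hottest loop alone accounts for half the time, as reported above, a call with a target of 0.5 returns 1; how many loops a 0.9 target needs depends on the rest of the profile.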
So Need We Only Consider the First Few Loops? Not Necessarily
• What if programs were self-similar w.r.t. the 90-10 rule?
  • Remove the most frequent loop: does the 90-10 rule still hold?
  • Intuition might say yes: remove a loop, and we have another program
[Figure: two pairs of charts over loops 1 to 10. Top: % remaining execution time (0 to 1). Bottom: speedup (0 to 1000).]
• So we need only speed up the first few loops
  • After that, speedups are limited
  • Good from tool perspective!
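The relationship between remaining execution time and speedup can be sketched with an idealized bound. This assumes, purely for illustration, that a loop moved to hardware costs no time at all; the function names are hypothetical and the model is Amdahl's law, not necessarily the calculation behind the slide's charts.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of the original execution time left once the first k loops
 * (sorted hottest first) are removed from software. */
double remaining_fraction(const double *frac, size_t n, size_t k)
{
    double left = 1.0;

    assert(k <= n);
    for (size_t i = 0; i < k; i++)
        left -= frac[i];
    return left;
}

/* Upper bound on overall speedup if those k loops take zero time in
 * hardware (an idealization: real hardware time is never zero). */
double ideal_speedup(const double *frac, size_t n, size_t k)
{
    double left = remaining_fraction(frac, n, k);

    assert(left > 0.0);              /* something must remain in software */
    return 1.0 / left;
}
```

In a self-similar program where every loop takes half of whatever time remains (fractions 0.5, 0.25, 0.125), the bound keeps doubling: 2, 4, 8. When the later loops contribute much less than that, the bound flattens after the first few loops.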
Manually Partitioned Several PowerStone Benchmarks onto Triscend A7 and E5 Chips
[Photos: E5 IC; Triscend A7 development board]
• Used multimeter and timer to measure performance and power
• Obtained good speedups and energy savings by partitioning software among microprocessor and on-chip FPGA
Simulation-Based Results for More Benchmarks
(Quicker than physical implementation; results matched reasonably well)
• Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)
Looking at Multiple Loops per Benchmark
• Manually created several partitioned versions of each benchmark
• Most speedup gained with first 20,000 gates
  • Surprisingly few gates!
• Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002
• Stitt and Vahid, IEEE Design and Test, Dec. 2002
• J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear)
Ideal Speedups for Different Architectures
• Varied loop speedup ratio (sw time / hw time of loop itself) to see impact of faster microprocessor or slower FPGA: 30, 20, 10 (base case), 5 and 2
• Loop speedups of 5 or more work fine for first few loops, not hard to achieve
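The effect of the loop speedup ratio on whole-program speedup can be illustrated with Amdahl's law. This is a sketch under stated assumptions, not the model used for the slide's data: `overall_speedup` is a hypothetical name, a single loop is moved to hardware, and communication overhead is ignored.

```c
#include <assert.h>

/* f: fraction of all-software execution time spent in the loop
 * s: loop speedup ratio (sw time / hw time of the loop itself)
 * Returns overall program speedup when only that loop runs in hardware. */
double overall_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(s > 0.0);
    /* Time after partitioning = untouched software + accelerated loop. */
    return 1.0 / ((1.0 - f) + f / s);
}
```

For a loop taking half the execution time (f = 0.5), the ratios from the slide give roughly 1.33 (s = 2), 1.67 (s = 5), 1.82 (s = 10) and 1.94 (s = 30), against a limit of 2. In this simplified model most of the benefit is already present at a ratio of 5.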
Ideal Energy Savings for Different Architectures
• Varied loop power ratio (FPGA power / microprocessor power) to account for different architectures: 2.5, 2.0, 1.5 (base case), 1.0
• Energy savings quite resilient to variations
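A rough way to see why energy savings are resilient to the power ratio is to normalize energy to the all-software case. The model below is an assumption made for illustration, not the authors' measurement method: the microprocessor is taken to draw power only while it executes, the FPGA only while the loop runs in hardware, and `energy_savings` is a hypothetical name.

```c
#include <assert.h>

/* f: fraction of all-software execution time spent in the loop
 * s: loop speedup ratio (sw time / hw time of the loop itself)
 * r: loop power ratio (FPGA power / microprocessor power)
 * Returns the fraction of energy saved relative to all-software
 * execution, whose energy is normalized to 1.0. */
double energy_savings(double f, double s, double r)
{
    double e_after;

    assert(f >= 0.0 && f <= 1.0);
    assert(s > 0.0 && r >= 0.0);
    /* Software part keeps its energy; the loop runs f/s as long at r times the power. */
    e_after = (1.0 - f) + r * f / s;
    return 1.0 - e_after;
}
```

With f = 0.5 and s = 10, sweeping r over the slide's values 1.0, 1.5, 2.0 and 2.5 gives savings of 0.45, 0.425, 0.40 and 0.375. Because the hardware runs for so short a time, even a 2.5 times higher power costs only a few percentage points in this model.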
How is Automated Partitioning Done?
• Previous data obtained manually
[Figure: informal spec -> system partitioning -> sw spec and hw spec -> sw design and hw design -> partitioning -> Processor + FPGA and ASIC]
Source-Level Partitioning
[Flow: SW source -> compiler front-end -> hw/sw partitioning -> compiler back-end (assembly & object files) and hw source -> assembler & linker (binary) and synthesis (netlists) -> processor and FPGA]
• Front-end converts code into an intermediate format, such as SUIF (Stanford University Intermediate Format)
• Intermediate format is explored for hardware candidates
• Binary is generated from assembling and linking
• Hw source is generated and synthesized into a netlist
Problems with Source-Level Partitioning
• Though technically superior, source-level partitioning
  • Disrupts standard commercial tool flow significantly
  • Requires special compiler (ouch!)
  • Must cope with multiple source languages and changing source languages
  • How to deal with library code, assembly code, object code?
[Figure: C source, C++ source and Java source each need their own compiler front-end (C SUIF compiler, C++ SUIF compiler, and "?" for Java)]
Binary Partitioning
[Flow: SW source -> compilation (assembly & object files) -> assembler & linker (binary) -> hw/sw partitioning -> updated binary and hw source -> synthesis (netlists) -> processor and FPGA]
• Source code is first compiled and linked in order to create a binary
• Candidate hardware regions (a few small, frequent loops) are decompiled for partitioning
• HDL is generated and synthesized, and the binary is updated to use the hardware
Binary-Level Partitioning Results (ICCAD’02)
• Binary-Level
  • Average speedup, 1.4
  • Average energy savings, 13%
  • Large area overhead averaging 10,325 gates
• Source-Level
  • Average speedup, 1.5
  • Average energy savings, 27%
  • Average 4,361 gates
Binary Partitioning Could Eventually Lead to Dynamic Hw/Sw Partitioning
[Figure: chip with processor, I$, D$, memories, profiler, configurable logic, and a DMA processor]
• Dynamic software optimization gaining interest
  • e.g., HP’s Dynamo
• What better optimization than moving to FPGA?
• Add component on-chip that:
  • Detects most frequent sw loops
  • Decompiles a loop
  • Performs compiler optimizations
  • Synthesizes to a netlist
  • Places and routes the netlist onto (simple) FPGA
  • Updates sw to call FPGA
• Self-improving IC
  • Can be invisible to designer
  • Appears as efficient processor
• HARD! Much future work.
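The first step in the list above, detecting the most frequent software loops, might be sketched as follows. This is only one plausible approach and is not described in the slides: it assumes the profiler can observe taken branches, treats any taken branch to a lower-or-equal address as a loop back edge, and uses a small fixed table. All names (`loop_profile`, `profile_branch`, `hottest_loop`, `MAX_LOOPS`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LOOPS 16                 /* assumed size of the on-chip table */

struct loop_counter { uint32_t head; unsigned long hits; };
struct loop_profile { struct loop_counter slot[MAX_LOOPS]; size_t used; };

void profile_init(struct loop_profile *p)
{
    assert(p != NULL);
    p->used = 0;
}

/* Record one taken branch from address pc to address target. */
void profile_branch(struct loop_profile *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target > pc)
        return;                      /* forward branch: not a back edge */
    for (size_t i = 0; i < p->used; i++) {
        if (p->slot[i].head == target) {
            p->slot[i].hits++;       /* known loop: one more iteration */
            return;
        }
    }
    if (p->used < MAX_LOOPS) {       /* new loop, if the table has room */
        p->slot[p->used].head = target;
        p->slot[p->used].hits = 1;
        p->used++;
    }
}

/* Returns the start address of the most frequently iterated loop,
 * or 0 if no back edge has been seen. */
uint32_t hottest_loop(const struct loop_profile *p)
{
    uint32_t best_head = 0;
    unsigned long best_hits = 0;

    assert(p != NULL);
    for (size_t i = 0; i < p->used; i++) {
        if (p->slot[i].hits > best_hits) {
            best_hits = p->slot[i].hits;
            best_head = p->slot[i].head;
        }
    }
    return best_head;
}
```

The address returned by `hottest_loop` would be the starting point for the decompilation step; a real profiler would also need to handle table overflow and counter saturation, which this sketch ignores.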
Conclusions
• Hardware/software partitioning can significantly improve software speed and energy
  • Single-chip microprocessor/FPGA platforms, increasing in popularity, make such partitioning even more attractive
• Successful commercial tool still on the horizon
  • Binary-level partitioning may help in some cases
  • Source-level can yield massive parallelism (Profs. Najjar/Payne)
• Future dynamic hw/sw partitioning possible?
  • Distinction between sw/hw continually being blurred!
• Many people involved:
  • Greg Stitt, Roman Lysecky, Shawn Nematbakhsh, Dinesh Suresh, Walid Najjar, Jason Villarreal, Tom Payne, several others…
• Support from NSF, Triscend, and soon SRC…
• Exciting new directions!