Hardware accelerator for PPC microprocessor Final presentation By: Instructor: Kopitman Reem Fiksman Evgeny Stolberg Dmitri
Agenda • Ways to implement an algorithm • Starting with ASC • HW architecture • SW architecture • System optimization • Generic module (iDCT) • Timing results
Abstract • Problem • Some complex functions (e.g. FFT) take a lot of CPU resources • Consider the ways of implementing such functions and choose the best solution under the specified constraints • Solutions • Pure SW implementation • Pure HW implementation • Combined HW + SW - ASC technology
Abstract • SW • Low cost • Low performance • HW • High cost • High performance • Combined (HW + SW)
Project Goals • Study of ASC (A Stream Compiler) • Study of functions in the PamDC library • Implementation of an interface between a generic module and the CPU using ASC • Implementation of a specific module to test the interface • Implementation of the same module in SW, to draw conclusions about performance
ASC - A Stream Compiler • Combined (SW/HW) code • Familiar C++ syntax • Generates flexible HW • Standard netlist output (EDIF) • Supported by standard CAD tools • Provides HW optimization • UNIX oriented
ASC – code example
#include "asc.h"
int main(int argc, char **argv)
{
    printf("Hello World\n");
    STREAM_START;            // ASC code start
    // Hardware variable declarations
    HWint in(IN);
    HWint out(OUT);
    HWint tmp(TMP);
    STREAM_LOOP(16);
    tmp = (in << 1) + 55;
    out = tmp;
    STREAM_END;              // ASC code end
}
Software output: Hello World
Hardware output: 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87
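To make the hardware output of the example easier to check, here is a minimal software-only sketch of the same stream body in plain C++. It does not use ASC at all; the function name `model_stream` is hypothetical, and the input stream 0, 1, 2, ... is an assumption inferred from the output values listed on the slide.

```cpp
#include <cassert>
#include <vector>

// Software-only model of the stream body: out = (in << 1) + 55.
// Assumption: the input stream is 0, 1, 2, ... (not stated on the slide).
std::vector<int> model_stream(int count) {
    assert(count >= 0);  // a stream cannot have a negative length
    std::vector<int> out;
    out.reserve(count);
    for (int in = 0; in < count; ++in) {
        out.push_back((in << 1) + 55);  // same expression as the ASC code
    }
    return out;
}
```

Under that assumption, `model_stream(17)` yields 55, 57, ..., 87, the same sequence as the hardware output listed on the slide.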
System components • Memec evaluation board • Xilinx Virtex-II Pro FPGA with PPC405 • JTAG • LCD, serial port for debug • SW tools • Xilinx EDK • Xilinx Platform Studio • ChipScope
Design Approach - general • [Block diagram; labels: System Bus (PLB), FPGA, PPC405 Processor, Memory (EDAC, DRAM), Peripheral module, ASC, Monitor module, other peripheral]
ASC interface (General view) • [Block diagram; labels: Generic Module, PLB bus, Interrupt controller, DMA engine, DMA Buffer, Serdes, CTRL, FIFO_in, FIFO_out, Data_in, Data_out, Data, Addr, Fifo_full]
SW review – main algorithm • [Flowchart steps: Start/reset; System blocks initialization (FIFO, DMA, GPIO, LCD); DMA busy? (Yes/No); Read data packets from ASC application; Write data packets to ASC application; Calculation complete? (No/Yes)]
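The control flow of the main algorithm can be sketched as a polling loop. Everything here is hypothetical: `FakeDma` is a software stand-in for the DMA engine and ASC module, `DMA_BUSY_MASK` is an invented status bit (the slides do not give the register layout), and the write-then-read order within one iteration is an assumption about the flowchart.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical status bit; the real register layout is not given in the slides.
constexpr std::uint32_t DMA_BUSY_MASK = 1u << 0;

// Stand-in for the DMA engine + ASC module: stays busy for a few polls
// after each write and echoes the packet back unchanged.
struct FakeDma {
    std::uint32_t status = 0;
    int busy_polls = 0;
    std::vector<int> last_packet;

    std::uint32_t read_status() {
        if (busy_polls > 0 && --busy_polls == 0) status &= ~DMA_BUSY_MASK;
        return status;
    }
    void write_packet(const std::vector<int>& p) {
        assert(!p.empty());       // an empty packet is a caller error
        last_packet = p;
        status |= DMA_BUSY_MASK;  // transfer in progress
        busy_polls = 3;
    }
    std::vector<int> read_packet() const { return last_packet; }
};

// Main loop: poll while the DMA is busy, write a packet, poll again,
// read the result, and repeat until all packets are processed.
std::vector<std::vector<int>> run_packets(
        FakeDma& dma, const std::vector<std::vector<int>>& packets) {
    std::vector<std::vector<int>> results;
    for (const auto& p : packets) {
        while (dma.read_status() & DMA_BUSY_MASK) {}  // DMA busy? keep polling
        dma.write_packet(p);
        while (dma.read_status() & DMA_BUSY_MASK) {}  // wait for completion
        results.push_back(dma.read_packet());
    }
    return results;  // "calculation complete"
}
```

On the real board the busy check would read a DMA status register (or use the interrupt controller) instead of the simulated countdown.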
SW review – C code fundamentals • DMA – control and data TX/RX functions • LCD – setup and data TX functions • Data size manipulation • Timer control functions • MASK definitions – user-friendly orientation
iDCT abstract • Reconstructs an image or audio block from its discrete cosine transform • Why iDCT? It is a complex iterative algorithm that takes a lot of CPU resources
ASC design – iDCT module • Discrete Cosine Transform • This transform is used in the current standards for still-image (JPEG) and video (MPEG) compression • The principle: Xm - matrix of discrete samples (iDCT samples); Tm - cosine coefficient matrix; Fm - DCT matrix
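With an orthonormal cosine coefficient matrix, the relation between these matrices is commonly written as Fm = Tm · Xm · Tm^T, with the inverse Xm = Tm^T · Fm · Tm. The slides do not show which convention the project used, so the following is only a minimal software reference sketch under that assumption (orthonormal DCT-II). It implements the 1-D transform pair; a 2-D block transform applies it to the rows and then to the columns. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scale factor c(k) of the orthonormal DCT-II (assumed convention).
static double scale(std::size_t k, std::size_t n) {
    return std::sqrt((k == 0 ? 1.0 : 2.0) / static_cast<double>(n));
}

// cos((2i + 1) * k * pi / (2n)): one entry of the cosine coefficient matrix.
static double basis(std::size_t i, std::size_t k, std::size_t n) {
    const double pi = std::acos(-1.0);
    return std::cos((2.0 * i + 1.0) * k * pi / (2.0 * n));
}

// Forward 1-D DCT: F[k] = c(k) * sum_i x[i] * basis(i, k).
std::vector<double> dct_1d(const std::vector<double>& x) {
    assert(!x.empty());
    const std::size_t n = x.size();
    std::vector<double> F(n, 0.0);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            F[k] += scale(k, n) * x[i] * basis(i, k, n);
    return F;
}

// Inverse 1-D DCT: x[i] = sum_k c(k) * F[k] * basis(i, k).
std::vector<double> idct_1d(const std::vector<double>& F) {
    assert(!F.empty());
    const std::size_t n = F.size();
    std::vector<double> x(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            x[i] += scale(k, n) * F[k] * basis(i, k, n);
    return x;
}
```

This direct form costs n² multiply-accumulate operations per 1-D transform, which illustrates why the function is CPU-heavy; it is meant as a correctness reference, not as an optimized implementation.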
ASC design – Optimization (1) • ASC supports optimization for: • Latency • Throughput • Area • For large amounts of data: throughput (calculation time is optimized)
ASC design – Optimization (2) • Optimization… Throughput, Area, Latency?
ASC design – Optimization (3) • Optimization – Area consumption • Absolute values refer to Xilinx Virtex II Pro XC2VP7 FPGA
ASC design – Optimization (4) • Optimization – Area consumption • Optimization for latency is the choice: best throughput and latency, with average area consumption
Clock calculations • [Flowchart steps: Get time 1; Set DMA control; Tx/Rx data packet complete? (No/Yes); Get time 2; Calc_time = time2 – time1; LCD write (data + calculation time)]
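The measurement scheme (take a timestamp, run the transfer and calculation, take a second timestamp, subtract) can be sketched on a host machine as follows. This is an illustration only: `std::chrono::steady_clock` stands in for the board's hardware timer, and `measure_us` is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Returns the elapsed time of `work` in microseconds: calc_time = time2 - time1.
// std::chrono::steady_clock replaces the board timer here (an assumption).
long long measure_us(const std::function<void()>& work) {
    assert(static_cast<bool>(work));  // an empty std::function cannot be called
    const auto time1 = std::chrono::steady_clock::now();  // Get time 1
    work();                                               // transfer + calculation
    const auto time2 = std::chrono::steady_clock::now();  // Get time 2
    return std::chrono::duration_cast<std::chrono::microseconds>(time2 - time1)
        .count();
}
```

A monotonic clock is used so that the difference cannot go negative if the wall-clock time is adjusted during the measurement.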
iDCT running results – SW (1) • [Plot "SW performance": calculation time [us], 0 to 1400, vs. packet length (x*32), 1 to 120] • Calculation time grows linearly with data packet length, as expected for iDCT • Basic packet size is 32 bytes; the packet length scale is in number of basic packets
iDCT running results – SW (2) • [Plot "Exponential data increase": log(calculation time [us]), 1 to 100000000, vs. log(packet length) (x*32), 1 to 1000000] • Calculation time grows exponentially as the data length is increased exponentially
iDCT running results – HW (1) • FIFO size influence (512 bytes) • High calculation time vs. writing new data to FIFO
iDCT running results – HW (2) • FIFO size influence (512 bytes) • High calculation time vs. writing new data to FIFO • Basic packet size is 32 bytes; the packet length scale is in number of basic packets
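The two sizes quoted above imply that a 512-byte FIFO holds 512 / 32 = 16 basic packets. As a rough illustration of the FIFO size influence, the sketch below counts how many FIFO fills a transfer needs. It assumes, hypothetically, that the FIFO is filled completely each time; the slides do not describe the actual refill policy, and `fifo_fills` is an invented name.

```cpp
#include <cassert>

constexpr int kPacketBytes = 32;   // basic packet size from the slides
constexpr int kFifoBytes   = 512;  // FIFO size from the slides

// Number of FIFO fills needed to push `packets` basic packets through,
// assuming the FIFO is filled completely each time (hypothetical policy).
int fifo_fills(int packets) {
    assert(packets >= 0);
    const int bytes = packets * kPacketBytes;
    return (bytes + kFifoBytes - 1) / kFifoBytes;  // ceiling division
}
```

Under this model, 16 packets fit in one fill while 17 packets need two, so a step in transfer overhead would be expected each time the packet count crosses a multiple of 16.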
Innovations • Hard-code this generic interface and include it as part of the FPGA (IP) development package • Development is reduced to C++ coding only • The interconnection between the PPC and the generic module becomes transparent • Speed up the current design by using separate DMA channels for read and write