Hardware Optimized DCT-IDCT Implementation on Verilog HDL • RAHUL SRIKUMAR • ECE734: VLSI ARRAY STRUCTURES FOR DSP • 05/10/13
Contents • Algorithm • Implementations • Performance • Results • Conclusion • Future Work
Algorithm • 8-point DCT • 2D DCT = C * X * Transpose(C) • C – coefficient matrix • X – 8*8 input block
Algorithm (Cont’d) • 1D DCT = C * X • 2D DCT = (1D DCT) * Transpose(C) = Transpose(C * Transpose(1D DCT)) • 1D IDCT = Transpose(C) * (2D DCT) • 2D IDCT = (1D IDCT) * C = Transpose(Transpose(C) * Transpose(1D IDCT))
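The row-column decomposition can be checked numerically with a small software reference model. The sketch below is not the deck's Verilog. It assumes an orthonormal DCT-II coefficient matrix for C, and it assumes the second pass reuses the same multiply-by-C step on the transposed intermediate result and then transposes back. The names `dct_matrix`, `matmul`, `transpose`, `dct_2d` and `idct_2d` are hypothetical helpers introduced for illustration.

```python
import math

N = 8

def dct_matrix(n=N):
    """Orthonormal DCT-II coefficient matrix C (n x n); the scaling is an assumption."""
    c = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        c.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                  for j in range(n)])
    return c

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def dct_2d(x, c):
    first = matmul(c, x)                    # 1D DCT = C * X
    second = matmul(c, transpose(first))    # same multiply-by-C on transposed data
    return transpose(second)                # equals C * X * Transpose(C)

def idct_2d(z, c):
    ct = transpose(c)
    first = matmul(ct, z)                   # 1D IDCT = Transpose(C) * Z
    second = matmul(ct, transpose(first))   # same step on transposed data
    return transpose(second)                # equals Transpose(C) * Z * C

C = dct_matrix()
# Arbitrary signed test block, values in [-8, 8].
X = [[(3 * r + 5 * col) % 17 - 8 for col in range(N)] for r in range(N)]
Z = dct_2d(X, C)
direct = matmul(matmul(C, X), transpose(C))
X_back = idct_2d(Z, C)
```

In floating point, `Z` matches the direct product `C * X * Transpose(C)` and `X_back` recovers `X` up to rounding error. A fixed-point implementation with the word lengths in the deck would add quantization error that this model does not capture.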
Implementations Part 1 • Input word length – 8 bits • 1D DCT internal word length – 11 bits • 2D DCT output word length – 9 bits • 2D IDCT output word length – 8 bits • 4 implementations were evaluated: • Serial In (SI) – 1 pixel at a time • 2 Parallel In (2PI) – 2 pixels at a time • 4 Parallel In (4PI) – 4 pixels at a time • 8 Parallel In (8PI) – 8 pixels at a time
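The step from an 8-bit input to an 11-bit internal word implies a sign extension. As a minimal sketch, assuming the pixels are signed two's-complement values (the deck does not state the encoding), the extension can be modeled as follows. `sign_extend` and `to_signed` are hypothetical helper names.

```python
def sign_extend(value, from_bits=8, to_bits=11):
    """Extend a two's-complement bit pattern from from_bits to to_bits."""
    value &= (1 << from_bits) - 1                 # keep only the source bits
    if value & (1 << (from_bits - 1)):            # sign bit set: fill the upper bits with 1s
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value

def to_signed(pattern, bits):
    """Interpret a bit pattern as a signed two's-complement integer."""
    return pattern - (1 << bits) if pattern & (1 << (bits - 1)) else pattern

extended_minus_one = sign_extend(0xFF)            # 8-bit pattern for -1
extended_min = sign_extend(0x80)                  # 8-bit pattern for -128
extended_max = sign_extend(0x7F)                  # 8-bit pattern for +127
```

The extended patterns keep their numeric value: `to_signed(extended_minus_one, 11)` is still -1, and positive inputs pass through unchanged.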
Implementations Part 2 • 8 registers of 8 bits each are used for coefficient storage, far fewer than the 64 registers that a full 8*8 DCT/IDCT coefficient matrix would require. • 2 RAMs, each of 64 locations (8 bits wide), are used. • RAMs are enabled in the order en_ram1_write -> (en_ram1_read, en_ram2_write) -> en_ram2_read
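The deck does not say which values the 8 coefficient registers hold. One plausible reading, offered here only as an assumption, is that they store the distinct coefficient magnitudes, with signs handled separately. The sketch below counts the distinct magnitudes in an orthonormal 8-point DCT-II matrix; `distinct_magnitudes` is a hypothetical helper.

```python
import math

def distinct_magnitudes(n=8, digits=6):
    """Collect the distinct absolute values in the n-point DCT-II matrix."""
    mags = set()
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        for j in range(n):
            coeff = scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
            mags.add(round(abs(coeff), digits))   # rounding merges float noise
    return sorted(mags)

magnitudes = distinct_magnitudes()
```

For n = 8 this yields 7 distinct magnitudes out of 64 matrix entries, which would fit within 8 registers.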
Performance 1 • Serial In (1 pixel at a time) • Read 8 inputs = 8 cycles • Register 8 inputs + sign extension = 1 cycle • Add/Sub = 1 cycle • Absolute value = 1 cycle • Multiplication = 1 cycle • Final addition = 2 cycles • Total = 14 cycles
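The stage order above (add/sub, then multiplication, then a final addition) resembles an even/odd butterfly decomposition of the 8-point DCT, although the deck does not confirm that this is the structure used. The sketch below assumes it is, leaves out the absolute-value stage, and compares the butterfly form against the direct definition. Function names are hypothetical.

```python
import math

def dct_1d_direct(x):
    """Direct orthonormal DCT-II of a sequence x."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[j] * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                               for j in range(n)))
    return out

def dct_1d_butterfly(x):
    """8-point DCT using an add/sub stage followed by 4-term weighted sums."""
    n = 8
    sums = [x[j] + x[7 - j] for j in range(4)]    # add stage, feeds even outputs
    diffs = [x[j] - x[7 - j] for j in range(4)]   # sub stage, feeds odd outputs
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        src = sums if k % 2 == 0 else diffs
        # multiplication stage (4 products) and final addition (their sum)
        out.append(scale * sum(src[j] * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                               for j in range(4)))
    return out

sample = [12, -7, 100, 3, -128, 55, 0, 127]
direct_out = dct_1d_direct(sample)
butterfly_out = dct_1d_butterfly(sample)
```

The butterfly form needs 4 multiplications per output instead of 8, because the cosine terms for inputs j and 7 - j are equal for even k and opposite in sign for odd k.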
Performance 2 • 2 Parallel In (2 pixels at a time) • Register 8 inputs + sign extension = 4 cycles • Add/Sub = 1 cycle • Absolute value = 1 cycle • Multiplication = 1 cycle • Final addition = 2 cycles • Total = 9 cycles
Performance 3 • 4 Parallel In (4 pixels at a time) • Register 8 inputs + sign extension = 2 cycles • Add/Sub = 1 cycle • Absolute value = 1 cycle • Multiplication = 1 cycle • Final addition = 2 cycles • Total = 7 cycles
Performance 4 • 8 Parallel In (8 pixels at a time) • Register 8 inputs + sign extension = 1 cycle • Add/Sub = 1 cycle • Absolute value = 1 cycle • Multiplication = 1 cycle • Final addition = 2 cycles • Total = 6 cycles
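The four latency totals can be reproduced from the per-stage cycle counts on the Performance slides. The sketch below only tabulates those figures; the dictionary layout and the name `total_cycles` are illustrative choices, not part of the design.

```python
# Cycle counts copied from the Performance slides.
COMMON_STAGES = {
    "add_sub": 1,
    "absolute_value": 1,
    "multiplication": 1,
    "final_addition": 2,
}

# Input-side stages differ per implementation.
INPUT_STAGES = {
    "SI": {"read": 8, "register_sign_extend": 1},
    "2PI": {"register_sign_extend": 4},
    "4PI": {"register_sign_extend": 2},
    "8PI": {"register_sign_extend": 1},
}

def total_cycles(name):
    """Sum input-side and common stage latencies for one implementation."""
    return sum(INPUT_STAGES[name].values()) + sum(COMMON_STAGES.values())

totals = {name: total_cycles(name) for name in INPUT_STAGES}
```

Only the input-side stages change between implementations. The 5 cycles of arithmetic stages are common to all four, so the gain from wider inputs levels off quickly.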
Synthesis • Target Platform : ALTERA Cyclone IV GX FPGA • Tool Used : Quartus II • Language Used : Verilog
Results 1 • Serial In has the lowest synthesized combinational area because it needs the fewest wires to feed in the data.
Results 2 • Serial In has the lowest synthesized area because it requires the fewest storage elements and counters to process the data.
Results 3 • 8 Parallel In takes 236 cycles, in contrast to 246 cycles for Serial In.
Conclusion • Serial In occupies ~6% less area than 8 Parallel In, with a comparatively lower performance degradation (~4%).
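The ~4% performance figure is consistent with the 236 and 246 cycle counts reported in the results. A short check, taking 8 Parallel In as the baseline (an assumption, since the deck does not say which design the percentage is relative to):

```python
cycles_serial = 246   # Serial In, from the results
cycles_8pi = 236      # 8 Parallel In, from the results

extra_cycles = cycles_serial - cycles_8pi
slowdown_vs_8pi = extra_cycles / cycles_8pi          # relative to 8 Parallel In
slowdown_vs_serial = extra_cycles / cycles_serial    # relative to Serial In
slowdown_percent = round(100 * slowdown_vs_8pi, 1)
```

Either baseline gives a value close to 4%. The ~6% area figure cannot be recomputed here because the deck does not list the underlying area numbers.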