Tinoosh Mohsenin and Bevan M. Baas VLSI Computation Lab, ECE Department

Split-Row: A Reduced Complexity, High Throughput Low Density Parity Check (LDPC) Decoder Architecture Tinoosh Mohsenin and Bevan M. Baas VLSI Computation Lab, ECE Department University of California, Davis

Outline • Introduction to LDPC Codes • Split-Row Decoder Algorithm • Error Performance Comparison • Decoder Implementation Results • Conclusion

Error Correction in Communication Systems Error correction is widely used in most communication systems.

LDPC Codes Applications • Standards: • 10 Gigabit Ethernet (10GBASE-T): 2006 • Digital Video Broadcasting (DVB-S2): 2005 • Next generation of WiFi and WiMAX • Problems with current LDPC decoders • Insufficient memory bandwidth • High interconnect complexity [www.ieee802.org/3/an/]

LDPC Coding • Transmitter: Encoded Image • Noisy Channel • Receiver: Received Image, Decoded Image (Iteration 1, Iteration 14) • Modified images from [Maccay 2001]

LDPC Decoding: Message Passing Algorithm • Performs row and column operations iteratively.
[Figure: received information from the channel enters the decoder; row processing and column processing exchange α and β messages. Labels: Row Processing, Column Processing, Parity check, Error correction.]
Parity check matrix shown in the figure:
\[
H = \begin{bmatrix}
0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\]
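The iteration described on this slide can be sketched in a few lines of Python. This is a minimal floating-point MinSum loop under stated assumptions, not the authors' hardware design: the function name `minsum_decode`, the convention that a positive channel value means bit 0, and the use of unquantized messages are all choices made here for illustration. The matrix is the one printed on the slide.

```python
# Parity check matrix printed on the slide (6 checks x 9 bits).
H = [
    [0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1],
]

def minsum_decode(H, llr, max_iter=15):
    """Hypothetical helper: iterate row and column processing.

    llr[j] is the received channel value for bit j; llr[j] > 0 is
    assumed to mean "bit j is probably 0".
    Returns (decoded bits, number of iterations used).
    """
    rows, cols = len(H), len(H[0])
    edges = [(i, j) for i in range(rows) for j in range(cols) if H[i][j]]
    beta = {(i, j): llr[j] for (i, j) in edges}  # column-to-row messages
    bits = [1 if x < 0 else 0 for x in llr]
    for iteration in range(1, max_iter + 1):
        # Row processing: sign product and minimum magnitude of the
        # other inputs in the same row.
        alpha = {}
        for (i, j) in edges:
            others = [beta[(i, k)] for k in range(cols) if H[i][k] and k != j]
            sign = -1.0 if sum(b < 0 for b in others) % 2 else 1.0
            alpha[(i, j)] = sign * min(abs(b) for b in others)
        # Column processing: channel value plus messages from the other rows.
        for (i, j) in edges:
            beta[(i, j)] = llr[j] + sum(
                alpha[(r, j)] for r in range(rows) if H[r][j] and r != i)
        # Hard decision, then parity check on every row.
        total = [llr[j] + sum(alpha[(r, j)] for r in range(rows) if H[r][j])
                 for j in range(cols)]
        bits = [1 if t < 0 else 0 for t in total]
        if all(sum(H[i][j] * bits[j] for j in range(cols)) % 2 == 0
               for i in range(rows)):
            return bits, iteration
    return bits, max_iter

# All-zero codeword with one weakly wrong channel value on bit 0.
received = [-0.5] + [2.0] * 8
decoded, used = minsum_decode(H, received)
print(decoded, used)
```

In this example the two rows that contain bit 0 both vote against the weak channel value, so the error is corrected in the first iteration.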

Serial Decoders • One or a few row and column processing units. • Features • Simple • Small area • Small number of memories • Disadvantages • Low memory bandwidth • Low throughput: 100 Kbps to 10 Mbps

Full Parallel Decoders • Row and column processors are directly mapped according to the parity check matrix • High throughput • Disadvantages • Large circuit area • High interconnect complexity • Example: 2048-bit, 10GBASE-T • Row weight=32, Col weight=6, quantization bits=5 • 139 mm² in 0.18 µm CMOS • 122,000 long inter-processor wires • 1.3 Gbps
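A back-of-the-envelope count shows where a wire total of this order can come from. The sketch below assumes one wire per message bit in each direction on every row-to-column connection; that wiring model and the function name `interprocessor_wires` are assumptions made here, not something the slide states.

```python
def interprocessor_wires(code_length, column_weight, bits_per_message):
    """Hypothetical estimate of wires in a full-parallel decoder.

    Assumes every 1 in the parity check matrix needs one message in each
    direction (row to column and column to row), with one wire per bit.
    """
    connections = code_length * column_weight  # number of 1s in H
    return connections * 2 * bits_per_message

# Parameters from the 10GBASE-T example on the slide.
print(interprocessor_wires(2048, 6, 5))  # 122880
```

Under these assumptions the estimate lands close to the 122,000 wires quoted for the 2048-bit example.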

Outline • Introduction to LDPC Codes • Split-Row Decoder Algorithm • Error Rate Comparison • Decoder Implementation Results • Conclusion

Key Features of Split-Row Decoder • Row processing (dominates decoder complexity) • Increased parallelism • Reduced number of memory accesses • Reduced processor complexity • Results: • Smaller decoder area and higher utilization • Lower interconnect complexity • Higher throughput • Simpler hardware implementation

Standard vs. Split-Row Decoder • Standard Decoder: N columns, row weight = Wr • Split-Row Decoder: two halves of N/2 columns each, row weight = Wr/2 in each half

Split-Row Algorithm: Mathematical View • The magnitude of the row processor output α is larger for the Split-Row decoder • Normalizing the α values with a scale factor S<1 improves the error performance of the Split-Row decoder
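The effect can be seen numerically with a small sketch. It assumes the Split-Row row processor takes the minimum only over its own half of the row while the sign is still computed over the whole row; the function names and the input values are illustrative, and the scale value 0.5 is arbitrary rather than a value from the slides.

```python
def _sign_product(values):
    """Product of signs: -1.0 for an odd number of negatives, else 1.0."""
    return -1.0 if sum(v < 0 for v in values) % 2 else 1.0

def minsum_row(beta):
    """Standard MinSum row update over the full row."""
    out = []
    for j in range(len(beta)):
        others = beta[:j] + beta[j + 1:]
        out.append(_sign_product(others) * min(abs(b) for b in others))
    return out

def split_row(beta, scale=1.0):
    """Split-Row sketch: minimum over the local half only, sign over the
    full row (assumed to be exchanged between the halves), scaled by S."""
    half = len(beta) // 2
    out = []
    for j in range(len(beta)):
        lo, hi = (0, half) if j < half else (half, len(beta))
        local = [beta[k] for k in range(lo, hi) if k != j]
        others = beta[:j] + beta[j + 1:]
        out.append(scale * _sign_product(others) * min(abs(b) for b in local))
    return out

beta = [0.5, -2.0, 3.0, 1.5, -0.2, 4.0]  # illustrative inputs, row weight 6
print(minsum_row(beta))
print(split_row(beta))             # magnitudes are never smaller than MinSum
print(split_row(beta, scale=0.5))  # S < 1 pulls them back down
```

Because each half sees only a subset of the row, its local minimum can never be smaller than the full-row minimum, which is why the unscaled Split-Row magnitudes come out at least as large.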

Outline • Introduction to LDPC Codes • Split-Row Decoder Algorithm • Error Performance Comparison • Decoder Implementation Results • Conclusion

Bit Error Rate Performance Comparison • Code length: 1536 bits • Message length: 1155 bits • Row weight: 16 • Column weight: 4 • No. of iterations: 15 • Legend: MS = MinSum, MS Split-Row = MinSum Split-Row, S = scale factor • Gap marked on the plot: 0.6 dB

Bit Error Rate Performance Comparison • Code length: 2048 bits • Message length: 1723 bits • Row weight: 32 • Column weight: 6 • No. of iterations: 15 • Legend: MS = MinSum, MS Split-Row = MinSum Split-Row, S = scale factor • Gap marked on the plot: 0.3 dB

Outline • Introduction to LDPC Codes • Split-Row Decoder Algorithm • Error Rate Comparison • Decoder Implementation Results • Conclusion

A Full-Parallel Decoder Implementation • LDPC code example: • Code length=1536 bits • Message length=770 bits • Row weight=6 • Col weight=3 • In the Split-Row decoder: • The number of wires between the two halves is 3% of the total wires. • Row processors in each half are 2.7 times smaller • Each row processor in each half is connected to only 3 column processors

Full Parallel Decoder Architecture • 0.18 µm CMOS technology, 6 metal layers • Split-Row, each half includes: • 768 row processors • 768 column processors
[Figure: chip layouts; panel labels include Standard MinSum]

Split-Row vs. Standard Decoder
[Table: values not recovered; column units were (mm), (mm²), (MHz), (Gbps)]
• 1536-bit (3,6) quasi-cyclic LDPC code • Messages are quantized to 5 bits each. • For throughput computation, the number of decoding iterations is set to 15. • Reported numbers are based on chip implementation results in 0.18 µm

Conclusion • Split-Row decoder method provides a significant reduction in circuit area • Results in: • Reduced wire interconnect complexity • Increased circuit area utilization • Increased speed • Simpler implementation • A good tradeoff between hardware complexity and error performance

Acknowledgments • Intel Corporation • UC Micro • NSF Grant No. 0430090 • UCD Faculty Research Grant

MinSum: Message Passing (Row processing)
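The equation for this slide is not present in the transcript. For reference, the MinSum row update is commonly written as below, where α is the row processor output and β the column processor output as used elsewhere in these slides; the set notation V(i) for the columns that take part in row i is an assumption introduced here.

```latex
\alpha_{ij} \;=\; \prod_{j' \in V(i) \setminus j} \operatorname{sign}\!\left(\beta_{ij'}\right)
\;\times\; \min_{j' \in V(i) \setminus j} \left|\beta_{ij'}\right|
```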

Message Passing (Column processing) • λj is the received information.
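The equation for this slide is also missing from the transcript. A common form of the column update, using λj as defined on the slide, is sketched below; the set notation C(j) for the rows that take part in column j is an assumption introduced here.

```latex
\beta_{ij} \;=\; \lambda_j \;+\; \sum_{i' \in C(j) \setminus i} \alpha_{i'j}
```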

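\[
H = \begin{bmatrix}
0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\]
[Figure labels: α, y1, λ1]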

H·V^T = 0 (Stop decoding) • H·V^T ≠ 0 (Repeat decoding)

LDPC Codes • An LDPC code is defined by a binary matrix called the parity check matrix H. • Rows define parity check equations (constraints) between encoded symbols in a code word, and the number of columns defines the length of the code. • V is a valid code word if H·V^T = 0 • The decoder in the receiver checks whether the condition H·V^T = 0 holds. • Example: Parity check matrix for (9, 5) LDPC code, row weight=4, column weight=2:
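The example matrix for this slide was an image and is not in the transcript. As a sketch of the H·V^T = 0 check, the code below uses the 6×9 matrix printed on the message-passing slides (row weight 3 and column weight 2 as printed); treating it as a stand-in for this slide's example is an assumption, and the helper names are hypothetical.

```python
# 6x9 parity check matrix as printed on the message-passing slides.
H = [
    [0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1],
]

def syndrome(H, v):
    """Compute H * v^T over GF(2): one parity bit per row of H."""
    return [sum(h * b for h, b in zip(row, v)) % 2 for row in H]

def is_codeword(H, v):
    """v is a valid code word when every parity check equals 0."""
    return not any(syndrome(H, v))

valid = [1, 0, 1, 0, 1, 0, 0, 1, 0]    # satisfies all six checks
corrupt = [0, 0, 1, 0, 1, 0, 0, 1, 0]  # same word with bit 0 flipped

print(is_codeword(H, valid))   # True
print(syndrome(H, corrupt))    # [0, 1, 0, 0, 1, 0]
```

Flipping a single bit violates exactly the two checks that contain it, which matches the column weight of 2 in this matrix.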

Row and Column Processor Architecture
[Figure: block diagrams labeled Row Proc. and Col. Proc.]

[Figure: layout regions labeled Row+Col Procs. left and Row+Col Procs. right]

Throughput = Clk × Code length / Imax • P = C·f·V^2
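The throughput formula is easy to evaluate directly. In the sketch below the code length (1536 bits) and iteration count (15) come from the slides, while the 100 MHz clock is a hypothetical value chosen only to make the arithmetic concrete; the function name is also illustrative.

```python
def throughput_bps(clock_hz, code_length_bits, max_iterations):
    """Throughput = Clk * Code length / Imax, in bits per second."""
    return clock_hz * code_length_bits / max_iterations

# Hypothetical 100 MHz clock; code length and Imax taken from the slides.
print(throughput_bps(100e6, 1536, 15) / 1e9)  # 10.24 (Gbps)
```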

What is the critical path, and how do you make sure that the sign is computed correctly? • Answer: The critical path is the sign computation, which depends on the other half. The timing analysis in place and route reports the slowest path delay, so it confirms that the circuit works correctly. • Why does the decoder chip become smaller when you only split it into halves? • Answer: First, the size and total number of column processors do not change. The main benefit comes from the row processors, which shrink by more than a factor of two. The reason is that a row processor contains several stages of comparators, and their number drops by more than half when the number of inputs is halved. • You mentioned the design is power efficient, but you didn't report any power numbers. • Answer: For this paper we didn't get power numbers, but power can be estimated from the fact that most of the energy goes into the wires (P = 1/2·C·V^2·f). We can say it scales down linearly, so the reduction is about 58%. • Are there other works close to your design?

Which applications can tolerate this error performance loss? • This is a very broad question. It really depends on the power budget and how low you want to go on BER. • What is the difference between Viterbi and LDPC codes? • What is the difference between turbo and LDPC codes? • If I don't know the answer: • I was not involved in that part of the project, but from what I know … • Review the previous works • If asked why the chip figure is not square? • If somebody asks: the way you proposed didn't decrease the number of wires, so how can you say that it decreases the interconnect complexity? • You should notice that we are talking about long wires. Because when there is a large number of wires connecting one

Hard decision vs. soft: • In hard decision decoding, each received symbol is thresholded to yield a single received bit as input to the decoding algorithm, and messages are passed between variable and check nodes as single bits only. • In soft decision decoding, multiple bits are used to represent each received symbol and the messages passed between variable and check nodes. • How did you compute
