Techniques for Low Power Turbo Coding in Software Radio Joe Antoon Adam Barnett
Software Defined Radio • Single transmitter for many protocols • Protocols completely specified in memory • Implementation: • Microprocessors • Field programmable logic
Why Use Software Radio? • Wireless protocols are constantly reinvented • 5 Wi-Fi protocols • 7 Bluetooth protocols • Proprietary mouse and keyboard protocols • An alphabet soup of mobile phone protocols • Custom DSP logic for each protocol is costly
So Why Not Use Software Radio? • Requires high-performance processors • Consumes more power • Efficiency spectrum: inefficient general-purpose processors → field-programmable logic → efficient application-specific hardware
Turbo Coding • Channel coding technique • Throughput nears the theoretical (Shannon) limit • Great for bandwidth-limited applications • CDMA2000 • WiMAX • NASA's MESSENGER probe
Turbo Coding Considerations • Presents a design trade-off • Turbo coding is computationally expensive • But it reduces cost in other areas • Bandwidth • Transmission power
Reducing Power in Turbo Decoders • FPGA turbo decoders • Use dynamic reconfiguration • General processor turbo decoders • Use a logarithmic number system
Generic Turbo Encoder • The data stream s feeds a component encoder, producing parity stream p1 • An interleaved copy of s feeds a second component encoder, producing parity stream p2
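The encoder structure on this slide can be sketched in a few lines of Python. This is illustrative only: `rsc_parity` is a toy single-register accumulator standing in for a real recursive component encoder, and the interleaver is just an index permutation.

```python
def rsc_parity(bits):
    """Toy recursive component encoder: a 1-bit accumulator
    (illustrative; not the code from any real turbo standard)."""
    state = 0
    out = []
    for b in bits:
        state ^= b          # feedback: the state absorbs each input bit
        out.append(state)
    return out

def turbo_encode(bits, interleave, component_encoder):
    """Generic turbo encoder structure from the slide: systematic stream s,
    parity p1 from the data, parity p2 from an interleaved copy of the data."""
    s = list(bits)
    p1 = component_encoder(s)
    p2 = component_encoder([s[i] for i in interleave])
    return s, p1, p2
```

The decoder reverses this structure: two component decoders exchange soft estimates through the same interleaver.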
Generic Turbo Decoder • Two component decoders exchange estimates through an interleaver • The received stream r produces soft estimates q1 and q2
Decoder Design Options • Multiple algorithms can be used to decode • Maximum A-Posteriori (MAP) • Most accurate estimate possible • Complex computations required • Soft-Output Viterbi Algorithm (SOVA) • Less accurate • Simpler calculations
FPGA Design Options • Goal: make an adaptive decoder • The decoder takes received data and parity and recovers the original sequence • A tunable parameter trades low power / lower accuracy against high power / higher accuracy
Component Encoder • The M blocks are 1-bit registers • This memory provides the encoder state, which a generator function maps to the output
Encoder State • Trellis view: the encoder state (00, 01, 10, 11) evolves over time through the generator function as each input bit (0 or 1) arrives
Viterbi’s Algorithm • Determine the most likely output • Simulate encoder states s0, s1, s2, … over time, given received data/parity pairs (r0, p0), (r1, p1), (r2, p2), to recover data bits d0, d1, d2
Viterbi’s Algorithm • Write: compute the branch metric (likelihood) • Traceback: compute the path metric and output data • Update: compute the distance between paths • Rank paths by path metric and choose the best • For an encoder with N memory bits: • 2^(N-1) paths must be calculated for each state
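The steps above can be made concrete as a minimal hard-decision Viterbi decoder. The (7,5) rate-1/2 convolutional code with N = 2 memory bits is an assumption for illustration; the slides do not name a specific code.

```python
# Minimal hard-decision Viterbi decoder (illustrative sketch).
G = [0b111, 0b101]      # assumed generator polynomials, memory N = 2
N_STATES = 4            # 2^N trellis states

def encode(bits):
    state, out = 0, []
    for b in bits:
        reg = (b << 2) | state                     # shift register contents
        out.append([bin(reg & g).count('1') & 1 for g in G])
        state = reg >> 1
    return out

def viterbi(received):
    INF = float('inf')
    pm = [0.0] + [INF] * (N_STATES - 1)            # path metrics; start in state 0
    paths = [[] for _ in range(N_STATES)]
    for r in received:
        new_pm = [INF] * N_STATES
        new_paths = [None] * N_STATES
        for s in range(N_STATES):
            if pm[s] == INF:
                continue
            for b in (0, 1):                       # hypothesised input bit
                reg = (b << 2) | s
                bm = sum(((bin(reg & g).count('1') & 1) != x)
                         for g, x in zip(G, r))    # branch metric: Hamming distance
                ns = reg >> 1
                if pm[s] + bm < new_pm[ns]:        # add-compare-select
                    new_pm[ns] = pm[s] + bm
                    new_paths[ns] = paths[s] + [b]
        pm, paths = new_pm, new_paths
    return paths[pm.index(min(pm))]                # traceback from the best state
```

With a clean channel the decoder recovers the input exactly, since the true path accumulates a metric of zero.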
Adaptive SOVA • SOVA: the inflexible path system scales poorly • Adaptive SOVA is a heuristic: • Keep at most M paths • Discard a path if its metric falls below threshold T • When too many paths survive, discard all but the top M
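The pruning heuristic can be sketched directly. This sketch assumes the path metric is a likelihood (higher is better, so "below threshold" means less likely); the names `prune_paths`, `M`, and `T` mirror the slide's terms.

```python
def prune_paths(paths, M, T):
    """Adaptive SOVA pruning heuristic (sketch).

    paths: list of (metric, bit_sequence), metric = likelihood (higher is better)
    1. Discard any path whose metric falls below threshold T.
    2. If more than M paths survive, keep only the top M.
    """
    survivors = [(m, p) for m, p in paths if m >= T]
    survivors.sort(key=lambda mp: -mp[0])   # best (highest-metric) paths first
    return survivors[:M]
```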
Implementing in Hardware • Datapath: Control unit → Branch Metric Unit → Add-Compare-Select → Survivor Memory, taking received data r to decoded output q
Implementing in Hardware • Controller • Controls memory • Selects paths • Branch Metric Unit • Computes likelihoods • Considers all possible “next” states • Add, Compare, Select • Appends to the path metric • Discards paths • Survivor Memory • Stores / discards path bits
Implementing in Hardware • Add-Compare-Select unit: • Combines present-state path values with branch values to produce next-state path values • Computes path distances and compares paths against the threshold T
Dynamic Reconfiguration • Bit Error Rate (BER) • Changes with signal strength • Changes with number of paths used • Change hardware at runtime • Weak signal: use many paths, save accuracy • Strong signal: use few paths, save power • Sample SNR every 250k bits, reconfigure
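A minimal sketch of the reconfiguration policy follows. The SNR thresholds and path budgets are invented for illustration; only the "few paths when strong, many when weak" rule and the 250k-bit sampling interval come from the slide.

```python
# Illustrative SNR -> path-budget policy; thresholds and budgets are assumptions.
# In the described design, the SNR is sampled every 250k bits and the FPGA is
# reconfigured with the chosen number of paths.

def choose_num_paths(snr_db):
    """Strong signal -> few paths (save power); weak -> many (save accuracy)."""
    for threshold_db, paths in ((5.0, 2), (2.0, 4), (0.0, 8)):
        if snr_db >= threshold_db:
            return paths
    return 16  # weakest signals: maximum path budget
```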
Experimental Results • K (the number of encoder bits) is proportional to average speed and power
Experimental Results • FPGA decoding has a much higher throughput • Due to parallelism
Experimental Results • ASOVA performs worse than commercial cores • However, in other metrics it is much better • Power • Memory usage • Complexity
Future Work • Use the present reconfiguration scheme as a basis for: • Partial reconfiguration • Dynamic voltage scaling • Compare against power-efficient software methods
Power-Efficient Implementation of a Turbo Decoder in an SDR System • Turbo coding systems are built using one of three general-purpose number formats • Fixed Point (FXP) • Cheapest, simplest to implement, fastest • Floating Point (FLP) • More precision than fixed point • Logarithmic Number System (LNS) • Simplifies complex operations • Complicates simple add/subtract operations
Logarithmic Number System • A value X is stored as {s, x}, where s is a sign bit and x = log_b|X| occupies the remaining bits • Example • Let b = 2 • Then the decimal number 8 is represented as log_2(8) = 3 (sign bit = 0) • Log parts are stored in computer memory in two's-complement form
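The representation can be mimicked in software. In this sketch a Python float stands in for the fixed-point two's-complement word that real LNS hardware would use for the log part.

```python
import math

def lns_encode(value, b=2):
    """Encode a nonzero real as {sign, log_b|value|} (sketch)."""
    return (0 if value >= 0 else 1, math.log(abs(value), b))

def lns_decode(x, b=2):
    """Recover the real value from its (sign, log magnitude) pair."""
    sign, logmag = x
    return (-1) ** sign * b ** logmag
```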
Why use Logarithmic System? • Greatly simplifies multiplication, division, roots, and exponents • Multiplication simplifies to addition • E.g. 8 * 4 = 32, LNS => 3 + 2 = 5 (2^5 = 32) • Division simplifies to subtraction • E.g. 8 / 4 = 2, LNS => 3 – 2 = 1 (2^1 = 2)
Why use Logarithmic System? • Roots are done as right shifts • E.g. sqrt(16) = 4, LNS => 4 shifted right = 2 (2^2 = 4) • Exponents are done as left shifts • E.g. 8^2 = 64, LNS => 3 shifted left = 6 (2^6 = 64)
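All four simplifications above can be checked with a pair-based sketch, representing each value as (sign, log2 magnitude), e.g. 8 → (0, 3) and 4 → (0, 2).

```python
# LNS arithmetic sketches; each value is a (sign, log2 magnitude) pair.

def lns_mul(x, y):     # multiplication -> addition of the log parts
    return (x[0] ^ y[0], x[1] + y[1])

def lns_div(x, y):     # division -> subtraction of the log parts
    return (x[0] ^ y[0], x[1] - y[1])

def lns_sqrt(x):       # square root -> halve the log part (a right shift in hardware)
    return (x[0], x[1] / 2)

def lns_square(x):     # squaring -> double the log part (a left shift in hardware)
    return (x[0], x[1] * 2)
```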
So why not use LNS for all processors? • Unfortunately, addition and subtraction are greatly complicated in LNS • Addition: log_b(|X| + |Y|) = x + log_b(1 + b^z) • Subtraction: log_b(|X| - |Y|) = x + log_b(1 - b^z) • Where z = y - x • Turbo coding/decoding is computationally intensive, requiring more multiplies, divides, roots, and exponents than adds or subtracts
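The addition identity can be verified numerically. This is a sketch for same-sign operands; hardware would replace the log_b(1 + b^z) evaluation with a lookup table.

```python
import math

def lns_add(x, y, b=2):
    """Same-sign LNS addition via log_b(|X| + |Y|) = x + log_b(1 + b^z),
    with z = y - x (sketch; hardware uses a table for log_b(1 + b^z))."""
    s, lx = x
    _, ly = y
    z = ly - lx
    return (s, lx + math.log(1 + b ** z, b))
```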
Turbo Decoder block diagram • Each bit decision requires a subtraction, a table lookup, and an addition
Proposed new block diagram • As the difference between a and b grows, the correction value stored in the lookup table becomes negligible, so the error from skipping the lookup vanishes • For this simulation a difference of d > 5 was used
How it works • For d > 5, the new mux (on the right) ignores the SRAM input and simply adds 0 to the MAX result • Also for d > 5, pre-decoder circuitry disables the SRAM to conserve power
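The Max* datapath can be modelled as follows. The threshold of 5 is from the slide; here the lookup table is modelled by computing the correction term exactly.

```python
import math

def max_star(a, b, d_threshold=5.0):
    """Jacobian logarithm: max*(a, b) = ln(e^a + e^b)
                                      = max(a, b) + ln(1 + e^(-|a - b|)).

    For d = |a - b| > threshold the correction is skipped: the mux adds 0
    and the SRAM holding the ln(1 + e^(-d)) table is disabled to save power."""
    d = abs(a - b)
    m = max(a, b)
    if d > d_threshold:
        return m                                 # SRAM read blocked
    return m + math.log(1.0 + math.exp(-d))      # table lookup, modelled exactly
```

For d = 10 the dropped correction is about 4.5e-5, which motivates treating it as negligible.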
Comparing the 3 simulations • Comparisons were made between a 16-bit fixed-point microcontroller, a 16-bit floating-point processor, and a 20-bit LNS processor • 11 bits would be sufficient for FXP and FLP, but 16-bit processors are much more common • Similarly, 17 bits would suffice for the LNS processor, but 20-bit is a common width
Latency • Recall: Max*(a,b) = ln(e^a+e^b)
Power savings • The pre-decoder circuitry adds 11.4% power consumption relative to an SRAM read • So when an SRAM read is required, the modified system uses 111.4% of the unmodified system's power • However, when the SRAM read is blocked, it uses only 11.4%
Power savings • The CACTI simulations reported that the Max* operation accounts for 40% of all operations in the decoder • In the modified system, the Max* operations used 69% less power than in the unmodified system • This yields an overall power savings of 69% * 40% = 27.6%
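The arithmetic can be checked in two lines. It assumes, per the slide, that Max* accounts for 40% of decoder activity and that the modified Max* saves 69% of its power.

```python
max_star_share = 0.40    # fraction of decoder activity spent in Max* (from slide)
max_star_saving = 0.69   # fraction of Max* power saved by the modification
overall = max_star_share * max_star_saving
print(f"overall power savings: {overall:.1%}")   # → overall power savings: 27.6%
```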
Conclusion • Turbo codes are computationally intensive, requiring more complex operations than simple ones • LNS processors simplify the complex operations at the expense of making addition and subtraction more difficult
Conclusion • Using an LNS processor with slight modifications can reduce power consumption by 27.6% • Overall latency is also reduced, because the dominant complex operations are cheaper in an LNS processor than in FXP or FLP processors