
Arithmetic Acceleration Techniques for Wireless Communication Receivers

http://www.ece.rice.edu/ Arithmetic Acceleration Techniques for Wireless Communication Receivers. Suman Das, Sridhar Rajagopal, Chaitali Sengupta and Joseph R. Cavallaro, {suman,sridhar,chaitali,cavallar}@rice.edu, Rice University.


Presentation Transcript


  1. http://www.ece.rice.edu/ Arithmetic Acceleration Techniques for Wireless Communication Receivers • Suman Das, Sridhar Rajagopal, Chaitali Sengupta and Joseph R. Cavallaro, {suman,sridhar,chaitali,cavallar}@rice.edu, Rice University • This work is supported by Nokia, Texas Instruments, Texas Advanced Technology Program and NSF

  2. Objective • Next generation Wireless Base-station • Real-Time Requirements • Multiuser Channel Estimation and Detection • High Complexity Algorithms for Advanced Receiver Structures • Task Decomposition • Potential for parallelism • Application-Specific Design / Single Processor

  3. Outline • Motivation • Real-time Requirements • Joint Estimation and Detection • Task Decomposition • Results • Summary

  4. Motivation • Next Generation Wireless Systems • Higher Data Rates, up to 2 Mbps • Multimedia Capabilities • Multi-rate, QoS • High Complexity in Proposed Algorithms • Pressure on existing hardware • Time, power, size constraints • Acceleration on Hardware Needed

  5. Wireless Communication Uplink [Figure: User 1 and User 2 reach the Base Station over a Direct Path and Reflected Paths; Noise + MAI is added at the receiver] • Asynchronous CDMA System • Multiple Users • Channel Effects • Fading • Multiple paths • Multiple Access Interference

  6. Base-Station Receiver: The Physical Layer [Block diagram, labels only: Multiple Users, Antenna, Demodulator, MUX (Pilot, Data), Channel Estimation, Multiuser Detection, Decoder, Detected Bits, Delay, Decision Feedback; signals b and d]

  7. Real-Time Requirements • W-CDMA • Transmission done by multiplication of signature waveform (Spreading) • Data Transmission in 10 ms Frames • Multiple Data Rates by Varying Spreading Factors • Detection needs to be done in real-time • 1953 cycles available in a C6x DSP at 250 MHz to detect 1 bit at 128 Kbps
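The 1953-cycle figure on this slide follows from dividing the DSP clock rate by the target bit rate. A minimal sketch of that arithmetic, assuming a single processor that must finish one bit per bit period (the function name `cycles_per_bit` is illustrative and does not come from the slides):

```python
def cycles_per_bit(clock_hz: float, bit_rate_bps: float) -> int:
    """Whole clock cycles available to detect one bit in real time."""
    return int(clock_hz // bit_rate_bps)


# 250 MHz C6x clock and 128 Kbps data rate, as quoted on the slide.
budget = cycles_per_bit(250e6, 128e3)
print(budget)  # 1953
```

Any detection scheme whose per-bit cycle count exceeds this budget cannot sustain 128 Kbps on one such processor.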

  8. Joint Estimation and Detection • Algorithm to jointly estimate the channel response and detect all users' bits. • Shown to have better performance as well as reduced computational complexity. • Maximum Likelihood Based Channel Estimation • [C. Sengupta et al.: PIMRC 1998, WCNC 1999] • Differencing Multistage Detection based on Parallel Interference Cancellation • [G. Xu et al.: SPIE 1999]

  9. Computations Involved • Model • Compute Correlation Matrices [Figure: time axis showing bits b_{i-1} and b_i, the received vector r_i, and the delay] • b: bits of K async. users aligned at times i and i-1 • r: received bits of spreading length N for K users

  10. Multishot Detection • Solve for the channel estimate, A_i

  11. Differencing Multistage Detection • Stage 0 • Stage 1 • Successive Stages • S = diag(A^H A) • y: soft decision • d: detected bits (hard decision)
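The stage equations did not survive in this transcript, so the following is only a hedged sketch of a generic multistage parallel interference cancellation detector and a differencing variant of it, not necessarily the authors' exact formulation. It assumes real-valued BPSK, a single symbol window, a matrix `R` standing in for A^H A and a vector `y0` standing in for the matched-filter output A^H r. All function names and the numeric example are illustrative.

```python
def sign(x):
    """Hard decision; zero is mapped to +1 by convention."""
    return 1 if x >= 0 else -1


def off_diagonal(R):
    """R - S, where S = diag(R): the cross-correlation (interference) terms."""
    K = len(R)
    return [[0.0 if i == j else R[i][j] for j in range(K)] for i in range(K)]


def multistage_pic(R, y0, stages):
    """Direct form: y(l) = y(0) - (R - S) d(l-1), d(l) = sign(y(l))."""
    L = off_diagonal(R)
    d = [sign(v) for v in y0]  # stage 0: matched-filter decisions
    for _ in range(stages):
        interference = [sum(l * b for l, b in zip(row, d)) for row in L]
        y = [a - b for a, b in zip(y0, interference)]
        d = [sign(v) for v in y]
    return d


def differencing_pic(R, y0, stages):
    """Differencing form: y(l) = y(l-1) - (R - S) (d(l-1) - d(l-2)).

    Only bits that changed between stages contribute, so later
    stages do less work as the decisions settle.
    """
    L = off_diagonal(R)
    K = len(y0)
    d_prev = [0] * K           # d(-1) = 0 makes stage 1 equal the direct form
    d = [sign(v) for v in y0]  # stage 0
    y = list(y0)
    for _ in range(stages):
        diff = [a - b for a, b in zip(d, d_prev)]
        for j, x in enumerate(diff):
            if x != 0:         # skip users whose decision did not change
                for i in range(K):
                    y[i] -= L[i][j] * x
        d_prev, d = d, [sign(v) for v in y]
    return d


# Illustrative 3-user example: transmitted bits [1, -1, 1], noisy y0.
R = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]]
y0 = [0.875, 0.125, 0.875]
print(multistage_pic(R, y0, 0))    # matched filter alone: [1, 1, 1]
print(multistage_pic(R, y0, 2))    # [1, -1, 1]
print(differencing_pic(R, y0, 2))  # [1, -1, 1]
```

In this example the matched filter misdetects user 2, one cancellation stage corrects it, and in the second stage the differencing form only touches the single bit that changed.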

  12. Structure of A^H A • Block Bi-Diagonal Matrix

  13. Bottlenecks • Identify using C6x DSP Implementation • Channel Estimation • Can be done less frequently • Depends on BER needed • Multiuser Detection • Needs to be done all the time • Differencing Multistage • Fewer computations on successive stages • Analysis at Various Levels of Optimization for Detection

  14. Task Decomposition [Block diagram, labels only] • Channel Estimation (Blocks I, II, III; Pilot input via MUX): Correlation Matrices (Per Bit), Inverse, Matrix Products • Correlation matrices: Rbb O(K^2), Rbr[R] O(KN), Rbr[I] O(KN) • Inverse: Rbb A^H = Rbr[R] O(K^2 N), Rbb A^H = Rbr[I] O(K^2 N) • Matrix products: A0^H A0 O(K^2 N), A0^H A1 O(K^2 N), A1^H A1 O(K^2 N) • Multistage Detection (Block IV, Per Window; Data input via MUX) • Task A: A^H r O(KND) • Task B: d O(D K^2 Me) • Signals: b, d, Data'

  15. Sequential / Pipeline [Figure: Block IV split into Task A (A^H r, O(KND), 13272 cycles) and Task B (d, O(D K^2 Me), 3367*Me cycles), fed by Data] • Real-time: 1953 cycles, 128 Kbps • (Single PE) Sequential: A + B: 13272 + 3367*Me cycles: 10.7 Kbps • (2 PE) Pipeline: A, B: max(13272, 3367*Me) cycles: 18.8 Kbps • Me = 3

  16. (Parallel A) B [Figure: Block IV with Task A spread over PEs 1 to K (A^H r, O(ND), 885 cycles) and Task B (d, O(D K^2 Me), 3367*Me cycles), fed by Data] • Real-time: 1953 cycles, 128 Kbps • (K+1 PE) Parallel A, B: 3367*Me cycles: 24.75 Kbps

  17. Parallel A Pipeline B / Parallel A Parallel + Pipeline B [Figure: Task A spread over PEs 1 to K, followed by Task B; cycle counts shown: 885 cycles O(N), 3367 cycles O(K^2), 225 cycles O(K)] • Real-time: 1953 cycles, 128 Kbps • (K+3 PE) Parallel A, Pipeline B: 3367 cycles: 74.25 Kbps • ((Me+1)K PE) Parallel A, Parallel + Pipeline B: 885 cycles: 282.5 Kbps
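The data rates quoted for the sequential, pipelined and parallel configurations can be reproduced from the cycle counts on the slides. The sketch below assumes the rate is simply the 250 MHz clock divided by the bottleneck cycle count per bit, with no communication overhead. The constant and dictionary names are illustrative, and the bottleneck values are the ones the slides state for each configuration.

```python
CLOCK_HZ = 250e6       # C6x clock rate quoted on the real-time slide
TASK_A = 13272         # cycles for Task A (A^H r) on one PE
TASK_A_PARALLEL = 885  # cycles for Task A spread over K PEs
TASK_B_STAGE = 3367    # cycles per stage of Task B
ME = 3                 # number of detection stages (Me)

# Bottleneck cycles per bit for each configuration, as given on the slides.
bottleneck_cycles = {
    "Sequential A + B": TASK_A + TASK_B_STAGE * ME,
    "Pipeline A, B": max(TASK_A, TASK_B_STAGE * ME),
    "(Parallel A) B": TASK_B_STAGE * ME,
    "(Parallel A) (Pipeline B)": TASK_B_STAGE,
    "(Parallel A) (Parallel + Pipeline B)": TASK_A_PARALLEL,
}

# Achievable data rate in Kbps: one bit per bottleneck period.
rates = {name: CLOCK_HZ / cycles / 1e3
         for name, cycles in bottleneck_cycles.items()}

for name, kbps in rates.items():
    verdict = "meets" if kbps >= 128 else "misses"
    print(f"{name}: {kbps:.2f} Kbps ({verdict} 128 Kbps)")
```

Under these assumptions only the last configuration, whose 885-cycle bottleneck is below the 1953-cycle budget, reaches the 128 Kbps requirement.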

  18. At this step [Figure, labels only: Multistage Detection (Block IV) with Task A on PEs 1 to K fed by Data, and Task B as Stage 1, Stage 2, Stage 3, ...; Blocks I & II and Block III shown alongside]

  19. Achieved Data Rates [Plot: Data Rates for Different Levels of Pipelining and Parallelism; x-axis: Number of Users (9 to 15); y-axis: Data Rates (0 to 3 x 10^5); curves: (Parallel A) (Parallel+Pipe B), (Parallel A) (Pipe B), (Parallel A) B, A B, Sequential A + B; reference line: Data Rate Requirement = 128 Kbps]

  20. Mapping to Hardware • Analysis independent of hardware • DSP with coprocessors • Multiple Processors • Combination of a processor with ASIC/FPGA • Single ASIC • Minimize Idle time in processing elements • Some computations can be shared • Assumptions • Critical processing elements have functional units similar to C6x • No communication overhead between processors • Number of elements dependent on number of users

  21. Summary • Acceleration Techniques for Multiuser Estimation and Detection : computationally intensive algorithm • Task Decomposition • C6x DSP Simulator • Real-time Analysis • Hardware Mapping Issues • Application Specific Design more effective than a single processor solution

  22. Future Work • Fixed Point Implementation • LU Decomposition • Other Algorithms for decomposition • Matrix Oriented Architectures • Vector Processor with SIMD • 2 Levels of Parallelism • Complex Arithmetic

  23. DSP Implementation • Texas Instruments C6x Simulator • TI TMS320C6701 Floating Point DSP • Code and Program optimized to fit in internal memory • 32-bit VLIW Architecture • 8 Functional Units • 2 Multipliers • 4 Adders • 2 Load/Store • TI C Compiler
