Overview of Implementation Issues for Multitier Networks on DSPs. Joseph R. Cavallaro, Electrical & Computer Engineering Dept., Rice University. August 17, 1999.
Outline • Overview of Multitier Networks • DSP Rapid Prototyping Tools • Channel Estimation and Multistage Detection • DSP implementation and Real-time Issues • ASIC Implementation of Algorithm Modules • Conclusions and Future Directions
Multitier Overlay Networks • Home Area Wireless LAN • Outdoor CDMA Cellular Network • High Speed Office Wireless LAN
Multitier Network Interface Card • Multiple Radio Interfaces • Reconfigurability and Commonality of Modules • Time Scales in Multitier Networks
[Figure: mNIC system architecture diagram. Labels: Mobile Platform, Server, Application, Awareness, Proxy, Proxy File System, File System, Transcoders, Network Protocols, mNIC, BS (base stations).]
Current Group • Suman Das - Universal Baseline Software System • Vishwas Sundaramurthy - System Design Issues • Sridhar Rajagopal - Channel Estimation Algorithms • Oscar Pan – Real Time Workshop Implementation • Recent Graduates: • Chaitali Sengupta - ML Synchronization • Gang Xu - Differencing Multistage Detector
W-CDMA Simulation Testbed Overview • Development of an integrated software testbed • Unified framework to evaluate new algorithms for coding, synchronization, detection, etc. • Construction of a faster, more efficient, and possibly hardware-accelerated simulation testbed • TI TMS320C6201 / TMS320C6701 based system - Base Station • TI TMS320C54 and FPGA / ASIC - Mobile
Software Rapid Prototyping Methodology • Communication and Signal Processing Algorithms in MATLAB and “C” • Faster Execution of “C” Code • Acceleration on DSP Boards • Multiple DSP Boards [Figure: prototyping flow diagram. Labels: MATLAB code, MATLAB Compiler, C code, C MEX code, wrapper (C code or Simulink), host, DSP code generation tools, DSP code, DSP hardware.]
Simulink • Good system for algorithm evaluation in communication systems and signal processing • Ties in well with MATLAB environment and functions • More intuitive than code-based evaluation in C or MATLAB • Used in software version of wireless testbed
RTW • Real-Time Workshop • Generates ANSI C-code for Simulink block diagrams • Tool for DSP rapid prototyping • Quick but inefficient/non-optimized C-code • RTW support for C67x generation boards • Hardware (DSP)-in-the-loop simulations
CDMA Wireless System Testbed: Simulink Version [Figure: Simulink block diagram. Blocks: User Data, Wireless Channel (AWGN Channel), Chip Matched Filter, Channel Estimation (Max. Likelihood Channel Est.), Multiuser Detector (Decorrelating Detector), Error Rate Calculation (Error Counter), Parameters (Update Parameters), Statistics (Show Stats).]
Hardware Platform Issues • Current System • TI TMS320C6201 and TMS320C6701 EVM boards • Multiple DSP Processor Configuration Issues and Task Decomposition • Planned Upgrade to BlueWave, Spectrum
DSPs in Simulink based Wireless testbed • Use of C67 based boards for simulations • Useful for study of individual algorithms on C67 generation processors • Multiprocessing issues • Need block diagram partitioning and code generation support from Simulink/RTW • Need cleaner external communication mechanisms in the C67x DSP • Need support for controlling multiple DSPs
Architectural Issues • Memory • More internal memory for large temporary matrices • Prefetch Buffers • Matrices stored as arrays in memory • ASIC / FPGA glue support • To explore HW acceleration of critical parts of the code • Specialized instructions: square roots, reciprocals, rotations?
Compiler Support • Compilers for VLIW • Scheduling and tracking the functional units is difficult in hand-written assembly • Generating code that keeps all units busy is a challenge • Small Operating System Support • Architectural improvements require coordinated advances in compiler support
W-CDMA Software Testbed Experiments • Third generation wireless communication systems • Multimedia capabilities • Multirate services • Quality of service • Higher Data Rates: 2 Mbps, 384 Kbps, 144 Kbps.
The Wireless Channel: Multiuser, Multipath • The signal faces attenuation, delays, and Doppler effects: unknown channel parameters [Figure: the desired user reaches the antenna over a direct path and reflected paths, in the presence of noise + MAI.]
W-CDMA Base-Station Receiver [Figure: receiver block diagram. Labels: Antenna, Demodulator, Channel Estimator (Pilot; Estimated Amplitudes & Delays), Multiuser Detector, Demux, Decoder, Data.]
CDMA Uplink System [Figure: block diagram for users 1 to K. Labels: user data d1 ... dK, Channel Encoder, Spreading, summation with AWGN giving R(t), Matched Filter outputs y1 ... yK, Channel Estimator, Multiuser Detector, Demux, Channel Decoder, detected data d1' ... dK'.]
Maximum Likelihood - Channel Estimation • Send a time-multiplexed Preamble (Pilot). • Channel properties extracted from received signal. • Compare received signal with known pilot and estimate channel parameters. • Keep estimate for remaining data bits (static). • Repeat preamble every frame, if no tracking.
The Maximum Likelihood Algorithm • Compute the correlation matrices • Compute the channel estimate • Calculate the noise covariance matrix K. • Calculate the channel impulse response vector z. • Extract the amplitudes and delays from the channel impulse response vector using a least squares fit.
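The first two steps above can be illustrated with a minimal sketch. It is not the testbed implementation: it assumes a single user, real-valued samples, and unit (white) noise covariance, in which case the maximum likelihood estimate reduces to a least-squares fit of the received samples to the known pilot. The names `convolution_matrix`, `solve`, `estimate_channel`, `pilot`, and `num_taps` are hypothetical. In this toy model the index of a non-zero tap plays the role of a path delay and its value the path amplitude.

```python
def convolution_matrix(pilot, num_taps):
    """Rows hold the pilot samples that overlap each received sample."""
    rows = []
    for n in range(len(pilot) + num_taps - 1):
        rows.append([pilot[n - k] if 0 <= n - k < len(pilot) else 0.0
                     for k in range(num_taps)])
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def estimate_channel(pilot, received, num_taps):
    """Least-squares channel taps (assumption: unit noise covariance)."""
    p = convolution_matrix(pilot, num_taps)
    # Correlation matrices: R_pp = P^T P and r_pr = P^T r.
    r_pp = [[sum(row[i] * row[j] for row in p) for j in range(num_taps)]
            for i in range(num_taps)]
    r_pr = [sum(row[i] * y for row, y in zip(p, received))
            for i in range(num_taps)]
    return solve(r_pp, r_pr)


# Noise-free toy example: two paths at delays 0 and 2.
true_taps = [1.0, 0.0, 0.5]
pilot = [1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, -1.0]
received = [sum(h * x for h, x in zip(true_taps, row))
            for row in convolution_matrix(pilot, len(true_taps))]
estimate = estimate_channel(pilot, received, len(true_taps))
```

Almost all of the arithmetic sits in the two correlation products, which is consistent with the slides naming matrix-vector multiplication and the dot product as the critical code.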
The ML Algorithm Complexity • Complex-Real Dot Product. • Complex-Real Matrix Product. • Complex-Real Product. • Real Square roots. • Solving quadratic equation for least squares fit. • Critical code: Matrix-vector multiplications / Dot Product [Slide annotations: Offline; Assuming Unity Noise Covariance]
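A complex-real dot product is cheaper than a general complex one: each element needs two real multiplies rather than four. The sketch below shows this under the assumption that the complex vector is stored as separate real and imaginary arrays. The function name `complex_real_dot` and that storage layout are hypothetical, not taken from the slides.

```python
def complex_real_dot(re, im, w):
    """Dot product of a complex vector (re + j*im) with a real vector w."""
    acc_re = 0.0
    acc_im = 0.0
    for a, b, c in zip(re, im, w):
        acc_re += a * c  # one real multiply for the real part
        acc_im += b * c  # one real multiply for the imaginary part
    return acc_re, acc_im


re = [1.0, -2.0, 0.5]
im = [0.5, 1.0, -1.5]
w = [2.0, 1.0, 4.0]
result = complex_real_dot(re, im, w)

# Reference value computed with Python's built-in complex type.
reference = sum(complex(a, b) * c for a, b, c in zip(re, im, w))
```

With this layout the operation becomes two independent real dot products, which is the kernel the slides identify as critical.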
Differencing Multistage - Multiuser Detection • Based on the principle of Parallel Interference Cancellation (PIC) • Cross-correlation information used to remove interference of other users from desired user • Repeated iterations for convergence • Differencing techniques applied for improving the performance of the algorithm
The Differencing Multistage Detector • Split the cross-correlation matrix into lower triangular, upper triangular, and diagonal parts. • Calculate the channel impulse response iteratively using [equation not preserved in this transcript] • x is called the differencing vector.
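The differencing idea can be sketched as follows. The slide's own update equation is not preserved in this transcript, so the version below is an assumption obtained by subtracting two successive parallel interference cancellation (PIC) stages. It further assumes synchronous users, real-valued BPSK bits, and a single bit interval. Here `y` is the matched filter output, `r` is the cross-correlation matrix, and `x` is the difference between the hard decisions of consecutive stages; all function and variable names are hypothetical.

```python
def sign(v):
    return 1 if v >= 0 else -1


def conventional_pic(y, r, stages):
    """Each stage recomputes the full interference term from all users."""
    k = len(y)
    d = [sign(v) for v in y]  # stage 0: matched-filter decisions
    for _ in range(stages):
        soft = [y[i] - sum(r[i][j] * d[j] for j in range(k) if j != i)
                for i in range(k)]
        d = [sign(v) for v in soft]
    return d


def differencing_pic(y, r, stages):
    """Each stage updates the soft decisions using only the changed bits."""
    k = len(y)
    d_prev = [0] * k  # makes the first difference equal to d itself
    d = [sign(v) for v in y]
    soft = list(y)
    for _ in range(stages):
        x = [d[j] - d_prev[j] for j in range(k)]  # differencing vector
        for j in range(k):
            if x[j] == 0:
                continue  # unchanged bit: this column costs nothing
            for i in range(k):
                if i != j:
                    soft[i] -= r[i][j] * x[j]
        d_prev, d = d, [sign(v) for v in soft]
    return d


# Toy example with three users; user 1 is weak and initially misdetected.
r = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.25],
     [0.0, 0.25, 1.0]]
y = [1.0, 0.125, -1.0]
bits_conventional = conventional_pic(y, r, 3)
bits_differencing = differencing_pic(y, r, 3)
```

After the first stage the entries of `x` are 0 or ±2, and most are 0 once the decisions settle, so later stages skip most columns. Both functions produce the same decisions in this example, since the differencing form only reorganizes the same arithmetic.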
Multistage Detector Complexity • Matrix Multiplication: • Computed only once for one frame • Dot Product: • Computed iteratively • Critical code: Dot Product
TI Tools Used • Evaluation Modules (EVM) for C6201 and C6701 fixed and floating point DSPs • 64 KB each internal program & data memory • 256 KB SBSRAM, 8 MB SDRAM (external) • C Compiler ver 3.0 from Code Generation Tools • Code Composer ver 4.02 for profiling the code
DSP Implementation: Channel Estimation • Floating point implementation found more feasible due to matrix inversions and square-roots. • Code optimized for the DSP • Use of Specialized approximate instructions • Approximate reciprocal square roots • Approximate reciprocals • Use of Assembly Code for critical part. • TI's C67 floating point benchmarks for Matrix-Vector Multiplication & Dot Product • Data Memory requirements for Channel Estimation
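The approximate reciprocal square root mentioned above is typically used as a seed that a few Newton-Raphson steps refine to the required accuracy. The sketch below illustrates that pattern only. The helper `coarse_rsqrt` is a hypothetical stand-in for the hardware instruction, and its 8-bit mantissa precision is an assumption, not a figure from the slides.

```python
import math


def coarse_rsqrt(a, bits=8):
    """Stand-in for a hardware seed: 1/sqrt(a) truncated to `bits` mantissa bits."""
    mantissa, exponent = math.frexp(1.0 / math.sqrt(a))
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def refined_rsqrt(a, iterations=2):
    """Refine the coarse seed with Newton-Raphson steps for 1/sqrt(a)."""
    x = coarse_rsqrt(a)
    for _ in range(iterations):
        x = x * (1.5 - 0.5 * a * x * x)  # needs only multiplies and subtracts
    return x


a = 10.0
seed_error = abs(coarse_rsqrt(a) * math.sqrt(a) - 1.0)
final_error = abs(refined_rsqrt(a) * math.sqrt(a) - 1.0)
```

Each step roughly squares the relative error, so two steps take a seed that is good to a few parts in a thousand well below one part in a million, without a division or a library square root call.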
Use of specialized instructions and assembly code on C6701 DSP [Figure: execution time in milliseconds (0 to 140) versus number of users (0 to 15) for three versions: C6701 original, C6701 with intrinsics, C6701 with assembly. Annotations: 10% improvement; 100% improvement. Caption: Use of Approximate Instructions. L = 150, P = 3, N = 31, SNR = 5 dB, SINR = -10 dB.]
Effect of optimizations for Channel Estimation on C6701 [Figure: normalized execution time (0 to 100) for three builds: Base (-o3 -pm), Approx. (-o3 -pm with intrinsics), Assembly opt. (-o3 -pm with asm). Annotations: 1.08X improvement; 2.34X improvement. Caption: Optimization Effects for Channel Estimation.]
Data Memory Requirements • Data to be placed in external memory [Table not recoverable from this transcript; surviving entries: 6, 130.]
DSP Implementation: Multistage Detection • 16-bit Fixed Point C Code • Code optimized for the DSP • Use of Assembly Code for critical part • TI's C62 fixed point assembly benchmarks for Dot Product • Data memory requirements for Multistage Detection
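A 16-bit fixed-point dot product of the kind listed above can be sketched as follows. This is an illustration only: it assumes the Q15 format (one sign bit, 15 fractional bits), a wide accumulator, and saturation of the final result. The names `to_q15` and `q15_dot` are hypothetical, and the actual scaling used in the detector is not stated in the slides.

```python
Q15_ONE = 1 << 15
INT16_MIN, INT16_MAX = -32768, 32767


def to_q15(value):
    """Convert a float in [-1, 1) to a saturated 16-bit Q15 integer."""
    return max(INT16_MIN, min(INT16_MAX, int(round(value * Q15_ONE))))


def q15_dot(a, b):
    """Q15 dot product: wide accumulator, one final shift, saturated result."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y  # each product is Q30; Python integers do not overflow
    result = acc >> 15  # back to Q15 (arithmetic shift, rounds toward -inf)
    return max(INT16_MIN, min(INT16_MAX, result))


half = to_q15(0.5)
dot_of_halves = q15_dot([half, half], [half, half])      # 0.25 + 0.25
saturated = q15_dot([INT16_MIN], [INT16_MIN])            # (-1) * (-1) clips
```

Shifting once after the accumulation, rather than after every product, keeps the rounding error of the sum to a single step. In C on a DSP the accumulator would need to be wider than 16 bits for the same reason.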
Effect of optimizations for Multistage Detection on C6201 [Figure: normalized execution time (0 to 100) for three builds: Global opt. (-o3 -pm -mu), Software Pipelining (-o3 -pm), Assembly opt. (-o3 -pm with asm). Annotations: 5.22X improvement; 7.47X improvement. Caption: Optimization Effects for Multistage Detector.]
Data Memory Requirements Data can be placed completely in Internal memory
Flops Count • 2X speedup for a three-stage detector [Figure: number of flops (×10^4, 0 to 14) versus total number of iterations (1 to 8) for the conventional and differencing methods. K = 15 users, SNR = 6 dB.]
Real-Time Requirements • Real-time capability of the C6201 DSP: 12 users at 150 kb/s [Figure: maximum bit rate per user in kb/s (50 to 350) versus number of users (8 to 14) for the conventional and differencing methods. SNR = 10 dB, window size = 12.]
Trends in Recent DSPs • More internal memory and higher clock speeds • C6203: 512 KB data, 384 KB program, 250 MHz • Useful for uplink channel estimation algorithms • Specialized Blocks in the DSP Core • Viterbi decoding in C54 • Lower Voltage operation • 1.2 V in C5402, useful for reducing power consumption in the mobile
ASIC Implementation • Differencing Multistage Detector Block • MOSIS Tiny-Chip (40-pin DIP) • 8 synchronous users • 12-bit fixed point implementation • 6000 transistors • 1.2 µm CMOS technology • 190 kb/s for each user (@ 12.5 MHz) • 3-stage cascade delay < 15 µs
Chip (Single Stage) Architecture [Figure: block diagram. Labels: REG, ALU, SHIFT, RECODER, Control Logic, (L+L')A; internal signals and external signals.]
Chip Layout [Figure: chip layout, 2.0 mm. Regions: Soft Decisions, Recoding logic, Cross-Correlation, 12-bit ALU.]
3-stage Cascade Mode [Figure: three detector stages between the Matched Filter Output and the Detector Output. Per-stage signals: Sin/Sout, Hin/Hout (Hand Shaking), Fin/Fout, Load, CLK (1/2). Other labels: Output Valid, Load, R, Clock.]
Current Work – GPP vs. DSP • Joint work with Prof. Sarita Adve, Praful Kaul, and Parthasarathy Ranganathan • Performance of general-purpose systems • Comparing GPP and DSP performance • Complete 3G benchmark suite with all components • Identification of key performance bottlenecks
Preliminary Results (1 of 4) • (4 algorithms: channel estimation, multi-stage detection, FIR filter, dot product) • Performance of general-purpose processors • Instruction-level parallelism features help (3.4X to 4.4X) • Media ISA extensions help (1.2X to 5.4X) • New extensions for packing/multiplication useful • Comparing GPP and DSP performance • GPPs outperform DSPs • UltraSPARC-II+VIS 2-4X better than TI TMS320C6701 • Caveat: compiler issues with DSP
Preliminary Results (2 of 4) • Important to study complete system including all components • Need for complete benchmark suite [Figure: system block diagram for K users. Transmitter (mobile user): user's bits, source coding, channel coding, spreading, modulation. Receiver (base station): demodulation, channel estimation, detector, decoder, detected bits of all K users.]
Preliminary Results (3 of 4) • Complete 3G benchmark suite with all components • Source coding • Channel coding • Spreading • Modulation/De-modulation • Multi-stage detection • Channel estimation • Channel decoding • Source decoding • Used either public-domain or in-house “C” code • Optimized with ISA extensions
Preliminary Results (4 of 4) • Choice of source coding standard makes big difference • G728 system: source coding/decoding dominant • GSM system: channel estimation/detection dominant
Conclusions • Implementation issues : Estimation & Detection Algorithms • Channel Estimation - Floating Point / External Memory • Multistage Detection - Fixed Point / Internal Memory • Specialized instructions : square root/reciprocals. • Additional support for complex arithmetic useful. • Recent trends in GPP / DSPs highly encouraging for next generation wireless communication applications.
Future Work • FPGA / ASIC Implementation via VHDL models and SPW • Program & DSP implementations for W-CDMA uplink and downlink • Blind Algorithms • Adaptive Algorithms • Architectural bottlenecks and compiler issues in DSPs to enhance suitability for next generation W-CDMA systems • Multiple DSPs – mixed DSP / FPGA for mNIC