Development of Parallel Simulator for Wireless WCDMA Network
Hong Zhang
Communication Lab of HUT
Outline
1. Overview
1.1 The Requirement for Computational Speed of Simulation for Wireless WCDMA System
1.2 Parallel Programming
2. Types of Parallel Computers
2.1 Shared Memory Multiprocessor System
2.2 Message Passing Multiprocessor with Local Memory
3. Parallel Programming Scenarios
3.1 Ideal Parallel Computations
3.2 Partitioning and Divide-and-Conquer Strategies
3.3 Pipelined Computation
3.4 Synchronous Computation
3.5 Load Balancing
3.6 Multiprocessor with Shared Memory
4. Progress of the Project
1. Overview
1.1 The Requirement for Computational Speed of Wireless WCDMA Network Simulation
• In mobile communication, advanced signal processing techniques such as smart antennas and multiuser detection (MUD) can improve system performance, but evaluating them requires signal-level or system-level simulation.
• Simulation is an important tool for gaining insight into the problem. However, simulating the signal processing algorithms is often very time consuming.
• The simulation therefore needs to be sped up. Parallel programming is one of the most effective techniques for doing so.
1.2 Parallel Programming
Parallel programming can speed up the execution of a program by dividing it into multiple fragments that execute simultaneously, each on its own processor.
Parallel programming involves:
♦ Decomposing an algorithm or data into parts
♦ Distributing the sub-tasks so that multiple processors work on them simultaneously
♦ Coordinating work and communication between those processors
1.2 Parallel Programming (cont.)
The Requirements for Parallel Programming
♦ Parallel architecture being used
♦ Multiple processors
♦ Network
♦ Environment to create and manage parallel processing
♦ A parallel algorithm and parallel program
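The three steps named above (decompose, distribute, coordinate) can be sketched as follows. This is a minimal illustration, not the simulator's code: the function names and the choice of "signal energy" as the workload are assumptions, and Python threads stand in for processors (in CPython they show the structure rather than a real speedup).

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(data, n_parts):
    """Decompose: split data into at most n_parts contiguous pieces."""
    size = max(1, (len(data) + n_parts - 1) // n_parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_energy(samples):
    """Sub-task: energy of one piece (hypothetical workload)."""
    return sum(x * x for x in samples)


def parallel_energy(samples, n_workers=4):
    parts = chunk(samples, n_workers)                     # decompose
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_energy, parts))  # distribute
    return sum(partials)                                  # coordinate / combine
```

For example, `parallel_energy(list(range(10)))` gives the same value as the sequential sum of squares.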
2. Types of Parallel Computers
2.1 Shared Memory Multiprocessor System
[Figure: four CPUs connected to one shared memory]
♦ Multiple processors operate independently but share the same memory resources.
♦ Only one processor can access a given shared memory location at a time.
♦ Synchronisation is achieved by controlling reads from and writes to the shared memory.
2.1 Shared Memory Multiprocessor System (cont.)
♦ Advantages
• Easy for the user to use efficiently
• Data sharing among tasks is fast (fast memory access)
♦ Disadvantages
• The size of the memory might be a limiting factor. Increasing the number of processors without increasing the size of the memory can cause severe bottlenecks.
• The user is responsible for establishing synchronization.
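The points about one-at-a-time access and user-managed synchronization can be illustrated with a lock around a shared variable. This is a sketch under assumptions: `SharedAccumulator` and `run_workers` are hypothetical names, and threads of one process play the role of processors sharing memory.

```python
import threading


class SharedAccumulator:
    """A value in shared memory; the lock admits one writer at a time."""

    def __init__(self):
        self.total = 0
        self._lock = threading.Lock()

    def add(self, value):
        with self._lock:  # synchronization is the user's responsibility
            self.total += value


def run_workers(n_workers=4, n_updates=1000):
    acc = SharedAccumulator()

    def worker():
        for _ in range(n_updates):
            acc.add(1)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return acc.total
```

Without the lock, concurrent read-modify-write updates could be lost; with it, the result is always `n_workers * n_updates`.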
2.2 Message Passing Multiprocessor with Local Memory
[Figure: four CPUs, each with its own local memory, connected by a network]
♦ Multiple processors operate independently but each has its own local memory.
♦ Data is shared across the communication network using message passing.
♦ The user is responsible for synchronization using message passing.
2.2 Message Passing Multiprocessor with Local Memory (cont.)
♦ Advantages
• Memory scales with the number of processors. Each added processor brings its own memory, so the total memory grows, unlike in the shared memory multiprocessor system.
• Each processor can rapidly access its own memory without contention from other processors.
♦ Disadvantages
• It is difficult to map existing data structures onto the distributed memory.
• The user is responsible for sending and receiving data among processors.
• To minimize overhead and latency, data should be batched into large blocks and sent before the receiving nodes need it.
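A small sketch of explicit send/receive with block batching, as described above. The names (`sender`, `receiver`, `run_link`) are hypothetical, and a `queue.Queue` between two threads stands in for the network between two nodes; a real cluster would use a message passing library instead.

```python
import queue
import threading


def sender(outbox, samples, block_size):
    """Send the data as blocks: one message per block, then an end marker."""
    for i in range(0, len(samples), block_size):
        outbox.put(samples[i:i + block_size])
    outbox.put(None)


def receiver(inbox, result):
    """Receive blocks until the end marker and accumulate locally."""
    total, n_messages = 0, 0
    while True:
        block = inbox.get()
        if block is None:
            break
        total += sum(block)
        n_messages += 1
    result["total"] = total
    result["messages"] = n_messages


def run_link(samples, block_size):
    channel = queue.Queue()  # stands in for the network
    result = {}
    rx = threading.Thread(target=receiver, args=(channel, result))
    tx = threading.Thread(target=sender, args=(channel, samples, block_size))
    rx.start()
    tx.start()
    tx.join()
    rx.join()
    return result
```

Sending 100 samples with `block_size=25` takes 4 messages, while `block_size=1` takes 100; the per-message overhead is what batching reduces.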
3. Parallel Programming Scenarios
3.1 Ideal Parallel Computations
• A computation can be readily divided into completely independent parts that can be executed simultaneously.
• Example: In the simulation of the WCDMA uplink (single user), the signal processing at the transmitter and the receiver is divided into smaller parts, each executed by a separate processor.
3.1 Ideal Parallel Computations (cont.)
Example: simulation of wireless communication with Ideal Parallel Computation
[Figure: block diagram of the simulated link, one processor per block]
Transmitter
CPU 1: Source data generation (traffic/packet)
CPU 2: Channel coding and data matching
CPU 3: Modulation
CPU 4: Spreading and scrambling
CPU 5: Pulse shaping filtering
Radio channel (AWGN)
Receiver
CPU 6: Reconstruction of the composite signal (signal, channel, AWGN)
CPU 7: Matched filtering
CPU 8: Rake combining
CPU 9: Demodulation
CPU 10: Channel decoding
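As a hedged illustration of completely independent parts, the sketch below runs one Monte Carlo bit error rate estimate per Eb/N0 point, with no data exchanged between runs. This case is an assumption chosen for illustration and is not the slide's example: it uses plain BPSK over AWGN, and `bpsk_ber` and `sweep` are hypothetical names.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def bpsk_ber(ebn0_db, n_bits=20000, seed=0):
    """Estimate the BER of BPSK over AWGN at one Eb/N0 point."""
    rng = random.Random(seed)              # private generator per run
    ebn0 = 10 ** (ebn0_db / 10)
    sigma = math.sqrt(1 / (2 * ebn0))      # noise std for unit-energy bits
    errors = 0
    for _ in range(n_bits):
        bit = rng.getrandbits(1)
        symbol = 1.0 if bit else -1.0
        received = symbol + rng.gauss(0.0, sigma)
        if (received > 0) != bool(bit):
            errors += 1
    return errors / n_bits


def sweep(ebn0_points_db, n_bits=20000):
    """Run every Eb/N0 point as an independent task."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(bpsk_ber, point, n_bits, seed)
                   for seed, point in enumerate(ebn0_points_db)]
        return [f.result() for f in futures]
```

Because the runs share nothing, they need no synchronization beyond collecting the results at the end.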
3.2 Task Partitioning and Divide-and-Conquer Strategies
• Partitioning: the problem is simply divided into separate parts and each part is computed separately.
• Divide-and-Conquer: the task is divided repeatedly into smaller and smaller subtasks, the smallest parts are solved, and the results are combined.
• Example: In the simulation of the Rake combining technique in WCDMA, the problem can be divided repeatedly among the different fingers. Within each finger, the problem can be further divided into correlating, delay equalizing, and MRC/EGC combining.
3.2 Partitioning and Divide-and-Conquer Strategies (cont.)
Example: the simulation of wireless communication with the Divide-and-Conquer Strategy
[Figure: Rake combining split into Finger 1, Finger 2, ..., Finger K]
CPU 1: Correlating
CPU 2: Modification with the channel estimate
CPU 3: Combining with MRC/EGC
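The Rake example can be sketched as a recursive split over fingers. This is a simplified illustration under assumptions: each finger is a hypothetical tuple (samples, spreading code, channel estimate), delays are taken as already aligned, combining is MRC, and a thread is spawned only for the first `depth` levels of the recursion.

```python
import threading

CODE = [1, -1, 1, -1]  # hypothetical 4-chip spreading code


def finger_output(samples, code, channel_estimate):
    """Correlate with the code, then weight by the conjugate estimate (MRC)."""
    corr = sum(s * c for s, c in zip(samples, code)) / len(code)
    return corr * channel_estimate.conjugate()


def combine_fingers(fingers, depth=1):
    """Divide the finger list in half until single fingers remain, then add."""
    if not fingers:
        raise ValueError("at least one finger is required")
    if len(fingers) == 1:
        return finger_output(*fingers[0])
    mid = len(fingers) // 2
    if depth <= 0:  # deep enough: finish this branch sequentially
        return (combine_fingers(fingers[:mid], 0)
                + combine_fingers(fingers[mid:], 0))
    left = []
    worker = threading.Thread(
        target=lambda: left.append(combine_fingers(fingers[:mid], depth - 1)))
    worker.start()
    right = combine_fingers(fingers[mid:], depth - 1)
    worker.join()
    return left[0] + right


def make_finger(h, symbol=1):
    """Noise-free test finger: symbol spread by CODE and scaled by gain h."""
    return ([h * symbol * c for c in CODE], CODE, h)


example_fingers = [make_finger(h) for h in (1 + 0j, 0.5 + 0.5j, 0.5j)]
```

With the noise-free `example_fingers`, the combined output is the sum of the squared magnitudes of the three gains.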
3.3 Pipelined Computation
• The problem is divided into a series of tasks that have to be completed one after the other.
• Each task is executed by a separate processor.
• The computation is partially sequential in nature.
• Example: In the simulation of the WCDMA transmitter and receiver, each signal processing block needs the output of the previous block as its input. In this case, the pipelining technique is adopted to parallelize the sequential source code.
3.3 Pipelined Computation (cont.)
Example: the simulation of wireless communication with Pipelined Computation
[Figure: block diagram of the simulated link, one processor per pipeline stage]
Transmitter
CPU 1: Source data generation (traffic/packet)
CPU 2: Channel coding and data matching
CPU 3: Modulation
CPU 4: Spreading and scrambling
CPU 5: Pulse shaping filtering
Radio channel (AWGN)
Receiver
CPU 6: Reconstruction of the composite signal (signal, channel, AWGN)
CPU 7: Matched filtering
CPU 8: Rake combining
CPU 9: Demodulation
CPU 10: Channel decoding
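A pipeline of this kind can be sketched with one worker per stage and a queue between neighbours, so that stage k works on frame n while stage k+1 works on frame n-1. The four stages here are deliberately simplified stand-ins (assumptions, not the simulator's blocks), and `run_pipeline` is a hypothetical helper.

```python
import queue
import threading

CODE = [1, -1, 1, 1]   # hypothetical spreading code
_END = object()        # end-of-stream marker passed down the pipeline


def modulate(bits):
    return [1 if b else -1 for b in bits]


def spread(symbols):
    return [s * c for s in symbols for c in CODE]


def despread(chips):
    n = len(CODE)
    return [sum(chips[i + k] * CODE[k] for k in range(n)) / n
            for i in range(0, len(chips), n)]


def demodulate(symbols):
    return [1 if s > 0 else 0 for s in symbols]


def stage(func, inbox, outbox):
    """One pipeline stage: take a frame, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)
            break
        outbox.put(func(item))


def run_pipeline(funcs, frames):
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage,
                                args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    for frame in frames:
        queues[0].put(frame)
    queues[0].put(_END)
    results = []
    while True:
        out = queues[-1].get()
        if out is _END:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

With no noise in the chain, `run_pipeline([modulate, spread, despread, demodulate], frames)` returns the input frames unchanged and in order.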
3.4 Synchronous Computation
• Processors need to exchange data among themselves.
• All the processes start at the same time in a lock-step manner.
• Each process must wait until all processes have reached a particular reference point (barrier) in their computation.
• Example: WCDMA system
Smart Antenna (SA): the signal processing in each antenna element branch must be finished before the branches are combined.
Rake Combining: the signal processing in each finger must be finished before the fingers are combined.
Multiuser Detection (MUD): because MUD for each user's signal needs information about the other users' signals, the processing of all users' signals must be finished before MUD.
3.4 Synchronous Computation (cont.)
Example: the simulation of wireless communication with Synchronous Computation
[Figure: receiver chains for User 1 ... User N. Each chain: received signal reconstruction (with AWGN), matched filtering, Rake fingers 1 ... K with one CPU each (correlating, modification with the channel estimate), Rake combining, beamforming/combining with weights w, and MUD]
3.4 Synchronous Computation (cont.)
Example: the simulation of wireless communication with Synchronous Computation
[Figure: Multiuser Detection. For each user 1, 2, ..., N, a CPU takes the output of that user's beamforming/combining together with that user's signature waveform]
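The barrier idea can be sketched with `threading.Barrier`: every branch computes its own output, then waits until all branches have arrived, and only then is the combining step run. The branch processing here is a placeholder multiplication by a weight, and `combine_after_barrier` is a hypothetical name.

```python
import threading


def combine_after_barrier(branch_inputs, weights):
    """Process each branch in its own thread and combine after the barrier."""
    n = len(branch_inputs)
    outputs = [None] * n
    combined = []

    def do_combine():
        # Runs once, in one thread, after all n branches reached the barrier.
        combined.append(sum(outputs))

    barrier = threading.Barrier(n, action=do_combine)

    def branch(i):
        outputs[i] = weights[i] * branch_inputs[i]  # placeholder processing
        barrier.wait()                              # wait for the others

    threads = [threading.Thread(target=branch, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return combined[0]
```

The same structure applies to antenna branches before beamforming, fingers before Rake combining, and users before MUD.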
3.5 Load Balancing
• Load balancing distributes the computational load fairly across processors in order to obtain the highest possible execution speed.
• Example: WCDMA system
Smart Antenna (SA): the direction of arrival (DOA) can vary at a different speed for each user's signal, so the beamforming processors for different users may need different numbers of operations. The load can be balanced fairly across all processors by detecting whether each processor has reached its solution.
Rake Combining: the number of multipath signals can differ from user to user. The load can be balanced fairly across all processors by detecting whether each processor has reached its solution.
3.5 Load Balancing (cont.)
Example: the simulation of wireless communication with Load Balancing
[Figure: computation time per processor]
Rake Combining: CPU 1 (user 1), CPU 2 (user 2 has more multipath signals than the other users), ..., CPU N (user N)
Beamforming: CPU N+1 (user 1), CPU N+2 (the channel parameters of user 2 vary faster than those of the other users), ..., CPU 2N (user N)
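The effect of uneven per-user cost can be sketched by comparing a fixed assignment with a dynamic one in which a processor takes the next pending task as soon as it finishes its current one. This is a small deterministic model rather than a real scheduler: the costs are hypothetical operation counts (for example, proportional to the number of multipath components per user), and the function names are assumptions.

```python
import heapq


def static_makespan(costs, n_cpus):
    """Fixed assignment: task i always goes to CPU i mod n_cpus."""
    loads = [0] * n_cpus
    for i, cost in enumerate(costs):
        loads[i % n_cpus] += cost
    return max(loads)


def dynamic_makespan(costs, n_cpus):
    """Dynamic assignment: the CPU that becomes free first takes the next task."""
    free_at = [0] * n_cpus          # time at which each CPU becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)
```

For costs `[8, 1, 1, 1, 8, 1, 1, 1]` on two CPUs, the fixed assignment finishes at time 18 and the dynamic one at time 11, which equals half of the total work.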
3.6 Multiprocessor with Shared Memory
• A multiprocessor with shared memory can speed up the program by storing the executable code and data in shared memory, where every processor can reach them.
• Example: In the simulation of WCDMA with multiple users, each part of the signal processing model can have a number of alternative algorithms, for example:
Adaptive Beamforming: RLS, LMS, CMA, Conjugate Gradient Method
Multiuser Detection: Decorrelating Detector, MMSE Detector, Adaptive MMSE Detection, etc.
The code for all of these algorithms is stored in the shared memory, and the processing for each user shares it. The processor for each user accesses this executable code in the shared memory, which speeds up the program.
3.6 Multiprocessor with Shared Memory (cont.)
Example: the simulation of wireless communication by Multiprocessor with Shared Memory
[Figure: for Beamforming and for Multiuser Detection, CPU 1 (user 1) ... CPU N (user N), each with its own cache, access shared memory modules holding the algorithms: RLS, CMA, MMSE, decorrelating detector, ...]
4. Progress of the Project
• The following models of the WCDMA system have been developed and integrated into the simulator:
♦ Spreader/despreader
♦ Spatial processing
♦ RAKE receiver
♦ Fading radio channel
• Some simulation results have been obtained for verification of the models.
• Interactions with SARG at Stanford on verification of the Rake receiver model.
• Work on translation from MATLAB into C, with further parallelization, is accomplished at UCLA.