A Functional Network Simulator for IBM Cell based Multiprocessor Systems
Presented By: Vishakha Gupta
Advised By: Prof. Sudhakar Yalamanchili, School of Electrical and Computer Engineering
Website: http://www.cc.gatech.edu/~vishakha/projects.php
Agenda
• Cell Broadband Engine (CBE) architecture
• Motivation
• Design of the Multi-Cell Simulator (MCS)
  • Programming Model
  • Execution Model
  • API
  • Implementation
• Benchmarks
• Analysis of benchmark performance
• Conclusion
CBE Architecture – Overview
• 64-bit Power architecture forms the foundation
• Dual-thread Power Processor Element (PPE)
  • In-order, two-issue superscalar design
  • Support for simultaneous multithreading (up to 2 threads)
• Eight Synergistic Processor Elements (SPEs)
  • Based on a SIMD RISC instruction set
  • 128-entry, 128-bit unified register file for all data types
CBE Architecture – Overview [2]
• On-chip Rambus XDR controller with support for two banks of Rambus XDR memory
• Cell processor production die has 235 million transistors and measures 235 mm²
  • Excludes networking peripherals or large memory arrays on chip
• Reaches high performance through high clock speed and the high-performance XDR DRAM interface
CBE Architecture – Memory Model
• Power core
  • 32KB 2-way set-associative instruction cache and 32KB 4-way set-associative data cache
• 256KB local store on each SPE, 6-cycle load latency
  • Software must manage data movement in and out of the local store
  • Controlled by the Memory Flow Controller (MFC)
  • Does not participate in hardware cache coherency
  • Aliased into the memory map of the processor
    • PPE can load and store from a memory location mapped to the local store (slow)
    • SPE can use its DMA controller to move data to its own or another SPE's local store
• The MFC on an SPE can begin transferring the data set of the next task while the present one is running – double buffering
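The double-buffering pattern above can be sketched in portable C. On a real SPE this would use MFC DMA intrinsics (e.g. `mfc_get` plus a tag-group wait); here a synchronous `memcpy` stands in for the DMA transfer so the sketch runs anywhere, and all names are illustrative rather than the real MCS or SDK API.

```c
#include <string.h>

#define CHUNK   64
#define NCHUNKS 8

/* Stand-in for an asynchronous MFC DMA transfer into the local store;
 * memcpy keeps the sketch portable and synchronous. */
static void dma_get(char *ls_buf, const char *main_mem, int n) {
    memcpy(ls_buf, main_mem, n);
}

/* Toy computation on one resident chunk. */
static int process(const char *buf, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += buf[i];
    return sum;
}

/* Double buffering: start fetching chunk i+1 into one buffer while
 * chunk i, already resident in the other buffer, is processed. */
int process_stream(const char *main_mem, int nchunks) {
    char buf[2][CHUNK];
    int total = 0;
    dma_get(buf[0], main_mem, CHUNK);            /* prefetch first chunk */
    for (int i = 0; i < nchunks; i++) {
        int cur = i & 1;
        if (i + 1 < nchunks)                     /* "transfer" next chunk */
            dma_get(buf[(i + 1) & 1], main_mem + (i + 1) * CHUNK, CHUNK);
        total += process(buf[cur], CHUNK);       /* compute on current chunk */
    }
    return total;
}
```

With real asynchronous DMA, the transfer of chunk i+1 overlaps the `process` call on chunk i, which is the point of keeping two buffers.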
Multi-Cell Simulator – Motivation
• Cell architecture is well suited to advanced visualization, streaming, and scientific applications
• Example of a heterogeneous multi-core architecture – widely seen as the future
• Feasibility of generating and running parallel code on multiple interconnected Cell processors
  • See Roadrunner (supercomputer being built at LANL with 64K AMD Opteron and 16K IBM Cell processors)!
• Great advantage to various research groups, e.g., in compilers
  • Simulate different programming techniques
  • Test their effectiveness on these heterogeneous architectures
• Goal: adapt the parallel computing world to work the heterogeneous multi-core way
Design Goals
• Ease of use by programmers
  • Convenient APIs for faster and more efficient parallel programming
• Performance
  • Less time should be spent in MCS library functions
• Scalability
  • For massively parallel application simulations
Implementation Goals
• Extensibility
  • Ease of plugging in different interconnects and programming models
• Reliability
  • Applications are easier to debug if the middleware can be assumed stable
• More than just a functional simulator
  • Latency estimations for different interconnects
Programming Model
• Create a “platform” consisting of n PPEs and m SPEs
  • Programmer can write code as if everything were on one machine
• Point-to-point communication between different elements (PPE/SPE) in the system
• Group communication
  • Form groups of SPEs, PPEs, or a mix for collective communication
  • Broadcast to all, or multicast to an existing group
• Communication units between elements (PPEs/SPEs)
  • Packet – send/receive data in one call
  • Stream – send/receive data at a specified rate, or split into multiple buffers
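The packet/stream distinction above can be sketched as two send calls over a toy in-memory "network". All names (`mcs_addr_t`, `mcs_send_packet`, `mcs_send_stream`) and the mailbox implementation are hypothetical illustrations, not the actual MCS library API.

```c
#include <string.h>

/* Hypothetical element address: (host, element) pair. */
typedef struct { int host; int elem; } mcs_addr_t;

#define MAX_MSG 1024

/* Toy in-memory "network": one mailbox per destination element. */
static char mailbox[4][MAX_MSG];
static int  mailbox_len[4];

/* Packet mode: the whole buffer is handed over in one call. */
int mcs_send_packet(mcs_addr_t dst, const void *buf, int len) {
    if (len > MAX_MSG) return -1;
    memcpy(mailbox[dst.elem], buf, len);
    mailbox_len[dst.elem] = len;
    return len;
}

/* Stream mode: the buffer is split into sub-buffers of at most `chunk`
 * bytes, modelling rate-limited / multi-buffer delivery.
 * Returns the number of chunks delivered. */
int mcs_send_stream(mcs_addr_t dst, const void *buf, int len, int chunk) {
    const char *p = buf;
    int off = 0, nchunks = 0;
    while (off < len) {
        int n = (len - off < chunk) ? len - off : chunk;
        memcpy(mailbox[dst.elem] + off, p + off, n);  /* one sub-buffer per step */
        off += n;
        nchunks++;
    }
    mailbox_len[dst.elem] = len;
    return nchunks;
}
```

In the real library the per-chunk step would be a socket or DMA operation, which is where a rate limit or buffer budget would be enforced.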
Execution Model
• Communication possibilities
  • PPE to local SPE and vice versa – DMA/mailboxes/channels/memory-mapped I/O
  • PPE to remote PPE – network API
  • PPE to remote SPE –
    • PPE to the remote PPE responsible for the given SPE
    • That PPE to its local SPE
  • SPE to remote SPE –
    • SPEs are not expected to make MCS library calls directly – this would bloat the SPE local store (likely, yet to be tested)
    • Copy data over to the controlling PPE
    • Then same as PPE to remote SPE
Execution Model [2]
• Communication combinations
  • Element-to-element or group send/receive, which can be
    • Blocking or non-blocking
    • Reliable or unreliable
    • In-order or out-of-order delivery of data
  • Application can request more parallelism by specifying the number of threads that should handle the send/receive
• Programmer can use common APIs for local as well as remote communication
  • Location of a PPE or SPE is transparent to the application
  • But local send and receive are optimized
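One natural way to expose the blocking/reliable/ordered combinations above is a flags bitmask on the send call. The constant names and the transport mapping below are assumptions for illustration; the slide does not document the real MCS flag values.

```c
#include <string.h>

/* Hypothetical send-mode flags; the real MCS constants are not shown
 * in the slides, so these names are illustrative. */
enum {
    MCS_BLOCKING = 1 << 0,   /* wait for completion          */
    MCS_RELIABLE = 1 << 1,   /* retransmit on loss           */
    MCS_IN_ORDER = 1 << 2    /* preserve send order          */
};

/* Each combination maps to a transport strategy: reliable + ordered
 * behaves like TCP; reliable but unordered needs only per-message acks;
 * everything else can go over a bare datagram path. */
static const char *pick_transport(unsigned flags) {
    if ((flags & MCS_RELIABLE) && (flags & MCS_IN_ORDER)) return "tcp";
    if (flags & MCS_RELIABLE) return "udp+ack";
    return "udp";
}
```

A call such as `mcs_send(dst, buf, len, MCS_BLOCKING | MCS_RELIABLE | MCS_IN_ORDER)` would then pick the TCP-like path, while a non-blocking unreliable send avoids acknowledgement overhead entirely.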
Design of the Multi-Cell Simulator – Implementation Units
[Figure: two hosts shown side by side, each running the stack Application + MCS Library → Simulated Linux OS → IBM Cell Simulator → TUN Device Interface → Host Linux OS → External Network Interface, joined by the physical interconnect]
Software Stack
[Figure: Cell applications sit on the Multi-Cell API (MCS/Pooled Accelerator Library), which selects among TCP/IP, Infiniband, and shared-memory backends, each with its own latency analysis; beneath run the simulated Linux OS (Fedora Core 5), the simulated Cell hardware, the TUN device interface, and the host Linux OS (Fedora Core 5/6)]
Software – Current Multi-Cell Simulator
[Figure: on the master host, the parallel application links against the MCS library; MCS boot reads a config file, configures the network (socket-based communication) and Mambo, and distributes configuration information to all simulator instances; each hosted OS connects through Mambo's TUN interface to the host OS]
Sample config file:
NumHosts=4
NumPPE=6
NumSPE=24
…
(more hosts)
API
• Network
  • Connection establishment
  • Send, receive, query, wait – for both point-to-point and group communication
• Group
  • Create, modify, delete groups of Cell elements
• Startup and cleanup
  • Create and remove data structures for the library
• Timing
  • Synchronize groups
• Memory
  • Allocate and de-allocate memory for buffers needed by the application code
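The group portion of the API above can be sketched as a small handle type plus a broadcast that fans out over the point-to-point path. The structure layout, function names, and the injected `send_fn` callback are all assumptions for illustration, not the real MCS prototypes.

```c
#define MAX_GROUP 32

/* Hypothetical group handle: a flat set of element ids. */
typedef struct {
    int members[MAX_GROUP];
    int count;
} mcs_group_t;

void mcs_group_create(mcs_group_t *g) { g->count = 0; }

int mcs_group_add(mcs_group_t *g, int elem) {
    if (g->count >= MAX_GROUP) return -1;
    g->members[g->count++] = elem;
    return 0;
}

/* Broadcast = one logical call fanning out to every member; send_fn
 * stands in for the library's point-to-point send. Returns the number
 * of members reached, or -1 on the first failure. */
int mcs_group_bcast(const mcs_group_t *g, const void *buf, int len,
                    int (*send_fn)(int elem, const void *buf, int len)) {
    for (int i = 0; i < g->count; i++)
        if (send_fn(g->members[i], buf, len) != 0) return -1;
    return g->count;
}

/* Counting stub used below in place of a real network send. */
static int sent_count;
static int count_send(int elem, const void *buf, int len) {
    (void)elem; (void)buf; (void)len;
    sent_count++;
    return 0;
}
```

Keeping the point-to-point send behind a function pointer is also how different backends (TCP/IP, Infiniband, shared memory) could be swapped in without changing the group logic.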
Implementation – Setup
• Use Mambo to simulate the Cell processor
• Use TUN devices to simulate a general network of Cells
  • Configure all simulator instances to fall in the same subnet
• Use bridging support with TUN devices for automatic message redirection between the networked Cell processors
  • Enable TUN–Ethernet forwarding
  • Configure routing tables on the simulator as well as the host
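The host-side plumbing above might look roughly like the following. This is a config sketch, not the project's actual scripts: device names, the bridge name, and the subnet are illustrative, and the commands require root. (On Fedora Core 5-era systems `tunctl` from tunctl/uml-utilities was the usual way to create persistent TUN devices; modern `iproute2` offers `ip tuntap`.)

```shell
# Create one TUN device per simulator instance (names are illustrative)
ip tuntap add dev tun0 mode tun
ip tuntap add dev tun1 mode tun

# Put all TUN devices on one bridge so frames are forwarded automatically
brctl addbr mcsbr0
brctl addif mcsbr0 tun0
brctl addif mcsbr0 tun1

# Same subnet for every simulator instance, as the setup requires
ip addr add 192.168.10.1/24 dev mcsbr0
ip link set mcsbr0 up
ip link set tun0 up
ip link set tun1 up

# Allow forwarding between the TUN devices and the external Ethernet interface
sysctl -w net.ipv4.ip_forward=1
```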
Implementation – Basics
• Library implementing the API, linked with the parallel application
  • Code written completely in C
  • Headers contain all the available function prototypes
• Data exchange between local elements (PPE and SPE on the same simulator instance) goes through a fast path
  • No need to make socket calls
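The local fast path can be sketched as a dispatch inside a single send entry point: if the destination lives on the same simulator instance, data is enqueued directly instead of going through a socket. Again, the types and function names are hypothetical; the stubs below stand in for the real local queue and socket paths.

```c
/* Hypothetical destination descriptor. */
typedef struct { int host; int elem; } mcs_addr_t;

static int local_host;                 /* id of this simulator instance */
static int fastpath_sends, socket_sends;

/* Local fast path: enqueue directly into the peer's buffer; no socket. */
static int send_local(const void *buf, int len) {
    (void)buf; (void)len;
    fastpath_sends++;
    return 0;
}

/* Remote path: would go through the TUN/TCP socket in the real library. */
static int send_socket(mcs_addr_t dst, const void *buf, int len) {
    (void)dst; (void)buf; (void)len;
    socket_sends++;
    return 0;
}

/* Location transparency: the application calls one function; the
 * library picks the fast path when source and destination share a
 * simulator instance. */
int mcs_send(mcs_addr_t dst, const void *buf, int len) {
    if (dst.host == local_host)
        return send_local(buf, len);
    return send_socket(dst, buf, len);
}
```

This is also what makes "location of a PPE or SPE transparent to the application" cheap: the branch is taken once per send, and only remote traffic pays the network cost.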
Implementation [2]
• Thread implementation using the pthread library
• Carefully managed thread pools
  • Increase performance while preserving scalability
• Multiple queues for handling data send, receive, and re-ordering
  • Necessary to avoid contention in heavily threaded programs
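A minimal version of the pthread pool-plus-queue structure above, under the assumption that each of the send/receive/re-order queues is one instance of a mutex-protected ring drained by a fixed set of workers. This is a sketch of the pattern, not the library's actual code.

```c
#include <pthread.h>

#define NWORKERS 4
#define QCAP     64

/* One work queue protected by a mutex; the send, receive, and re-order
 * queues would each be a separate instance to reduce contention. */
typedef struct {
    int jobs[QCAP];
    int head, tail, closed;
    pthread_mutex_t mu;
    pthread_cond_t  nonempty;
} queue_t;

static queue_t q = { .mu = PTHREAD_MUTEX_INITIALIZER,
                     .nonempty = PTHREAD_COND_INITIALIZER };
static int processed;                      /* guarded by q.mu */

static void enqueue(int job) {
    pthread_mutex_lock(&q.mu);
    q.jobs[q.tail++ % QCAP] = job;
    pthread_cond_signal(&q.nonempty);
    pthread_mutex_unlock(&q.mu);
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q.mu);
        while (q.head == q.tail && !q.closed)
            pthread_cond_wait(&q.nonempty, &q.mu);
        if (q.head == q.tail && q.closed) { /* drained and shut down */
            pthread_mutex_unlock(&q.mu);
            return NULL;
        }
        int job = q.jobs[q.head++ % QCAP];
        processed++;                        /* "handle" the job under lock */
        (void)job;
        pthread_mutex_unlock(&q.mu);
    }
}

/* Run the pool to completion over njobs (njobs <= QCAP here) and
 * return how many jobs were handled. */
int run_pool(int njobs) {
    pthread_t tids[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int j = 0; j < njobs; j++)
        enqueue(j);
    pthread_mutex_lock(&q.mu);
    q.closed = 1;
    pthread_cond_broadcast(&q.nonempty);
    pthread_mutex_unlock(&q.mu);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tids[i], NULL);
    return processed;
}
```

In a real send path the job payload would be processed outside the lock; it is held here only to keep the counter update in the sketch trivially correct.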
Benchmarks • To be filled in soon
Results • To be filled in soon
Analysis of Results • To be filled in soon
Conclusion • To be filled in soon
Future Work
• Implement communication APIs for interconnects other than Ethernet
• Add latency calculation based on the interconnect
• Automate the complete startup of the simulator from user input
• Add additional communication models, such as block and pipe, for tightly coupled multi-Cell systems
References
[1] Michael Kistler, Michael Perrone, Fabrizio Petrini. “Cell Multiprocessor Communication Network: Built for Speed”. IEEE Micro, 26(3), May/June 2006.
[2] Kevin Krewell. “Cell Moves into the Limelight”. Microprocessor Report, Feb. 14, 2005.
[3] Maxim Krasnyansky. “Universal TUN/TAP device driver”. http://www.kernel.org/pub/linux/kernel/people/marcelo/linux-2.4/Documentation/networking/tuntap.txt
[4] Cell Broadband Engine resource center. http://www-128.ibm.com/developerworks/power/cell/
[5] H. Peter Hofstee. “Introduction to the Cell Broadband Engine”.