A Functional Network Simulator for IBM Cell based Multiprocessor Systems
Presented By: Vishakha Gupta
Advised By: Prof. Sudhakar Yalamanchili, School of Electrical and Computer Engineering
Website: http://www.cc.gatech.edu/~vishakha/projects.php
Agenda
• Cell Broadband Engine (CBE) architecture
• Motivation
• Design of the Multi-Cell Simulator (MCS)
  • Programming Model
  • Execution Model
  • API
  • Implementation
• Benchmarks
• Analysis of benchmark performance
• Conclusion
CBE Architecture – Overview
• 64-bit Power architecture forms the foundation
• Dual-thread Power Processor Element (PPE)
  • In-order, two-issue superscalar design
  • Support for simultaneous multithreading (up to 2 threads)
• Eight Synergistic Processor Elements (SPEs)
  • Based on a SIMD RISC instruction set
  • 128-entry, 128-bit unified register file for all data types
CBE Architecture – Overview [2]
• On-chip Rambus XDR controller with support for two banks of Rambus XDR memory
• Cell processor production die has 235 million transistors and measures 235 mm²
  • Excludes networking peripherals or large memory arrays on chip
• Reaches high performance through high clock speed and the high-performance XDR DRAM interface
CBE Architecture – Memory Model
• Power core
  • 32KB 2-way set-associative instruction cache and 32KB 4-way set-associative data cache
• 256KB local store on each SPE, 6-cycle load latency
  • Software must manage data movement in and out of the local store
  • Controlled by the Memory Flow Controller (MFC)
  • Does not participate in hardware cache coherency
  • Aliased into the memory map of the processor
    • PPE can load and store from a memory location mapped to the local store (slow)
    • SPE can use its DMA controller to move data to its own or another SPE's local store
• The MFC on an SPE can begin transferring the data set of the next task while the present one is running – double buffering
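The double-buffering pattern above can be sketched in portable C. On a real SPE this would use MFC DMA intrinsics (e.g. `mfc_get` plus a tag-group wait); here a synchronous `memcpy` stands in for the DMA transfer so the sketch runs anywhere, and all names are illustrative rather than the real MCS or SDK API.

```c
#include <string.h>

#define CHUNK   64
#define NCHUNKS 8

/* Stand-in for an asynchronous MFC DMA transfer into the local store;
 * memcpy keeps the sketch portable and synchronous. */
static void dma_get(char *ls_buf, const char *main_mem, int n) {
    memcpy(ls_buf, main_mem, n);
}

/* Toy computation on one resident chunk. */
static int process(const char *buf, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += buf[i];
    return sum;
}

/* Double buffering: start fetching chunk i+1 into one buffer while
 * chunk i, already resident in the other buffer, is processed. */
int process_stream(const char *main_mem, int nchunks) {
    char buf[2][CHUNK];
    int total = 0;
    dma_get(buf[0], main_mem, CHUNK);            /* prefetch first chunk */
    for (int i = 0; i < nchunks; i++) {
        int cur = i & 1;
        if (i + 1 < nchunks)                     /* "transfer" next chunk */
            dma_get(buf[(i + 1) & 1], main_mem + (i + 1) * CHUNK, CHUNK);
        total += process(buf[cur], CHUNK);       /* compute on current chunk */
    }
    return total;
}
```

With real asynchronous DMA, the transfer of chunk i+1 overlaps the `process` call on chunk i, which is the point of keeping two buffers.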
Multi-Cell Simulator – Motivation
• Cell architecture is well suited to advanced visualization, streaming, and scientific applications
• Example of a heterogeneous multi-core architecture – widely seen as the future
• Feasibility of generating and running parallel code on multiple interconnected Cell processors
  • See Roadrunner (supercomputer being built at LANL with 64K AMD Opteron and 16K IBM Cell processors)!
• Great advantage to various research groups, e.g., in compilers
  • Simulate different programming techniques
  • Test their effectiveness on these heterogeneous architectures
• Goal: adapt the parallel computing world to work the heterogeneous multi-core way
Design Goals
• Ease of use by programmers
  • Convenient APIs for faster and more efficient parallel programming
• Performance
  • Less time should be spent in MCS library functions
• Scalability
  • For massively parallel application simulations
Implementation Goals
• Extensibility
  • Ease of plugging in different interconnects and programming models
• Reliability
  • Applications are easier to debug if the middleware can be assumed stable
• More than just a functional simulator
  • Latency estimations for different interconnects
Programming Model
• Create a “platform” consisting of n PPEs and m SPEs
  • Programmer can write code as if everything were on one machine
• Point-to-point communication between different elements (PPE/SPE) in the system
• Group communication
  • Form groups of SPEs, PPEs, or a mix for collective communication
  • Broadcast to all, or multicast to an existing group
• Communication units between elements (PPEs/SPEs)
  • Packet – send/receive data in one call
  • Stream – send/receive data at a specified rate, or split into multiple buffers
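The packet/stream distinction above can be sketched as two send calls over a toy in-memory "network". All names (`mcs_addr_t`, `mcs_send_packet`, `mcs_send_stream`) and the mailbox implementation are hypothetical illustrations, not the actual MCS library API.

```c
#include <string.h>

/* Hypothetical element address: (host, element) pair. */
typedef struct { int host; int elem; } mcs_addr_t;

#define MAX_MSG 1024

/* Toy in-memory "network": one mailbox per destination element. */
static char mailbox[4][MAX_MSG];
static int  mailbox_len[4];

/* Packet mode: the whole buffer is handed over in one call. */
int mcs_send_packet(mcs_addr_t dst, const void *buf, int len) {
    if (len > MAX_MSG) return -1;
    memcpy(mailbox[dst.elem], buf, len);
    mailbox_len[dst.elem] = len;
    return len;
}

/* Stream mode: the buffer is split into sub-buffers of at most `chunk`
 * bytes, modelling rate-limited / multi-buffer delivery.
 * Returns the number of chunks delivered. */
int mcs_send_stream(mcs_addr_t dst, const void *buf, int len, int chunk) {
    const char *p = buf;
    int off = 0, nchunks = 0;
    while (off < len) {
        int n = (len - off < chunk) ? len - off : chunk;
        memcpy(mailbox[dst.elem] + off, p + off, n);  /* one sub-buffer per step */
        off += n;
        nchunks++;
    }
    mailbox_len[dst.elem] = len;
    return nchunks;
}
```

In the real library the per-chunk step would be a socket or DMA operation, which is where a rate limit or buffer budget would be enforced.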
Execution Model
• Communication possibilities
  • PPE to local SPE and vice versa – DMA/mailboxes/channels/memory-mapped I/O
  • PPE to remote PPE – network API
  • PPE to remote SPE –
    • PPE to the remote PPE responsible for the given SPE
    • That PPE to its local SPE
  • SPE to remote SPE –
    • SPEs are not expected to make MCS library calls directly – this would bloat the SPE local store (likely, yet to be tested)
    • Copy data over to the controlling PPE
    • Then same as PPE to remote SPE
Execution Model [2]
• Communication combinations
  • Element-to-element or group send/receive, which can be
    • Blocking or non-blocking
    • Reliable or unreliable
    • In-order or out-of-order delivery of data
  • Application can request more parallelism by specifying the number of threads that should handle the send/receive
• Programmer can use common APIs for local as well as remote communication
  • Location of a PPE or SPE is transparent to the application
  • But local send and receive are optimized
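One natural way to expose the blocking/reliable/ordered combinations above is a flags bitmask on the send call. The constant names and the transport mapping below are assumptions for illustration; the slide does not document the real MCS flag values.

```c
#include <string.h>

/* Hypothetical send-mode flags; the real MCS constants are not shown
 * in the slides, so these names are illustrative. */
enum {
    MCS_BLOCKING = 1 << 0,   /* wait for completion          */
    MCS_RELIABLE = 1 << 1,   /* retransmit on loss           */
    MCS_IN_ORDER = 1 << 2    /* preserve send order          */
};

/* Each combination maps to a transport strategy: reliable + ordered
 * behaves like TCP; reliable but unordered needs only per-message acks;
 * everything else can go over a bare datagram path. */
static const char *pick_transport(unsigned flags) {
    if ((flags & MCS_RELIABLE) && (flags & MCS_IN_ORDER)) return "tcp";
    if (flags & MCS_RELIABLE) return "udp+ack";
    return "udp";
}
```

A call such as `mcs_send(dst, buf, len, MCS_BLOCKING | MCS_RELIABLE | MCS_IN_ORDER)` would then pick the TCP-like path, while a non-blocking unreliable send avoids acknowledgement overhead entirely.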
Design of the Multi-Cell Simulator – Implementation Units
[Figure: two hosts shown side by side, each running the stack Application + MCS Library → Simulated Linux OS → IBM Cell Simulator → TUN Device Interface → Host Linux OS → External Network Interface, joined by the physical interconnect]
Software Stack
[Figure: Cell applications sit on the Multi-Cell API (MCS/Pooled Accelerator Library), which selects among TCP/IP, Infiniband, and shared-memory backends, each with its own latency analysis; beneath run the simulated Linux OS (Fedora Core 5), the simulated Cell hardware, the TUN device interface, and the host Linux OS (Fedora Core 5/6)]
Software – Current Multi-Cell Simulator
[Figure: on the master host, the parallel application links against the MCS library; MCS boot reads a config file, configures the network (socket-based communication) and Mambo, and distributes configuration information to all simulator instances; each hosted OS connects through Mambo's TUN interface to the host OS]
Sample config file:
NumHosts=4
NumPPE=6
NumSPE=24
…
(more hosts)
API
• Network
  • Connection establishment
  • Send, receive, query, wait – for both point-to-point and group communication
• Group
  • Create, modify, delete groups of Cell elements
• Startup and cleanup
  • Create and remove data structures for the library
• Timing
  • Synchronize groups
• Memory
  • Allocate and de-allocate memory for buffers needed by the application code
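The group portion of the API above can be sketched as a small handle type plus a broadcast that fans out over the point-to-point path. The structure layout, function names, and the injected `send_fn` callback are all assumptions for illustration, not the real MCS prototypes.

```c
#define MAX_GROUP 32

/* Hypothetical group handle: a flat set of element ids. */
typedef struct {
    int members[MAX_GROUP];
    int count;
} mcs_group_t;

void mcs_group_create(mcs_group_t *g) { g->count = 0; }

int mcs_group_add(mcs_group_t *g, int elem) {
    if (g->count >= MAX_GROUP) return -1;
    g->members[g->count++] = elem;
    return 0;
}

/* Broadcast = one logical call fanning out to every member; send_fn
 * stands in for the library's point-to-point send. Returns the number
 * of members reached, or -1 on the first failure. */
int mcs_group_bcast(const mcs_group_t *g, const void *buf, int len,
                    int (*send_fn)(int elem, const void *buf, int len)) {
    for (int i = 0; i < g->count; i++)
        if (send_fn(g->members[i], buf, len) != 0) return -1;
    return g->count;
}

/* Counting stub used below in place of a real network send. */
static int sent_count;
static int count_send(int elem, const void *buf, int len) {
    (void)elem; (void)buf; (void)len;
    sent_count++;
    return 0;
}
```

Keeping the point-to-point send behind a function pointer is also how different backends (TCP/IP, Infiniband, shared memory) could be swapped in without changing the group logic.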
Implementation – Setup
• Use Mambo to simulate the Cell processor
• Use TUN devices to simulate a general network of Cells
  • Configure all simulator instances to fall in the same subnet
• Use bridging support with TUN devices for automatic message redirection between the networked Cell processors
  • Enable TUN–Ethernet forwarding
  • Configure routing tables on the simulator as well as the host
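The host-side plumbing above might look roughly like the following. This is a config sketch, not the project's actual scripts: device names, the bridge name, and the subnet are illustrative, and the commands require root. (On Fedora Core 5-era systems `tunctl` from tunctl/uml-utilities was the usual way to create persistent TUN devices; modern `iproute2` offers `ip tuntap`.)

```shell
# Create one TUN device per simulator instance (names are illustrative)
ip tuntap add dev tun0 mode tun
ip tuntap add dev tun1 mode tun

# Put all TUN devices on one bridge so frames are forwarded automatically
brctl addbr mcsbr0
brctl addif mcsbr0 tun0
brctl addif mcsbr0 tun1

# Same subnet for every simulator instance, as the setup requires
ip addr add 192.168.10.1/24 dev mcsbr0
ip link set mcsbr0 up
ip link set tun0 up
ip link set tun1 up

# Allow forwarding between the TUN devices and the external Ethernet interface
sysctl -w net.ipv4.ip_forward=1
```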
Implementation – Basics
• Library implementing the API, linked with the parallel application
  • Code written completely in C
  • Headers contain all the available function prototypes
• Data exchange between local elements (PPE and SPE on the same simulator instance) goes through a fast path
  • No need to make socket calls
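The local fast path can be sketched as a dispatch inside a single send entry point: if the destination lives on the same simulator instance, data is enqueued directly instead of going through a socket. Again, the types and function names are hypothetical; the stubs below stand in for the real local queue and socket paths.

```c
/* Hypothetical destination descriptor. */
typedef struct { int host; int elem; } mcs_addr_t;

static int local_host;                 /* id of this simulator instance */
static int fastpath_sends, socket_sends;

/* Local fast path: enqueue directly into the peer's buffer; no socket. */
static int send_local(const void *buf, int len) {
    (void)buf; (void)len;
    fastpath_sends++;
    return 0;
}

/* Remote path: would go through the TUN/TCP socket in the real library. */
static int send_socket(mcs_addr_t dst, const void *buf, int len) {
    (void)dst; (void)buf; (void)len;
    socket_sends++;
    return 0;
}

/* Location transparency: the application calls one function; the
 * library picks the fast path when source and destination share a
 * simulator instance. */
int mcs_send(mcs_addr_t dst, const void *buf, int len) {
    if (dst.host == local_host)
        return send_local(buf, len);
    return send_socket(dst, buf, len);
}
```

This is also what makes "location of a PPE or SPE transparent to the application" cheap: the branch is taken once per send, and only remote traffic pays the network cost.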
Implementation [2]
• Thread implementation using the pthread library
• Carefully managed thread pools
  • Increase performance while preserving scalability
• Multiple queues for handling data send, receive, and re-ordering
  • Necessary to avoid contention in heavily threaded programs
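A minimal version of the pthread pool-plus-queue structure above, under the assumption that each of the send/receive/re-order queues is one instance of a mutex-protected ring drained by a fixed set of workers. This is a sketch of the pattern, not the library's actual code.

```c
#include <pthread.h>

#define NWORKERS 4
#define QCAP     64

/* One work queue protected by a mutex; the send, receive, and re-order
 * queues would each be a separate instance to reduce contention. */
typedef struct {
    int jobs[QCAP];
    int head, tail, closed;
    pthread_mutex_t mu;
    pthread_cond_t  nonempty;
} queue_t;

static queue_t q = { .mu = PTHREAD_MUTEX_INITIALIZER,
                     .nonempty = PTHREAD_COND_INITIALIZER };
static int processed;                      /* guarded by q.mu */

static void enqueue(int job) {
    pthread_mutex_lock(&q.mu);
    q.jobs[q.tail++ % QCAP] = job;
    pthread_cond_signal(&q.nonempty);
    pthread_mutex_unlock(&q.mu);
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q.mu);
        while (q.head == q.tail && !q.closed)
            pthread_cond_wait(&q.nonempty, &q.mu);
        if (q.head == q.tail && q.closed) { /* drained and shut down */
            pthread_mutex_unlock(&q.mu);
            return NULL;
        }
        int job = q.jobs[q.head++ % QCAP];
        processed++;                        /* "handle" the job under lock */
        (void)job;
        pthread_mutex_unlock(&q.mu);
    }
}

/* Run the pool to completion over njobs (njobs <= QCAP here) and
 * return how many jobs were handled. */
int run_pool(int njobs) {
    pthread_t tids[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int j = 0; j < njobs; j++)
        enqueue(j);
    pthread_mutex_lock(&q.mu);
    q.closed = 1;
    pthread_cond_broadcast(&q.nonempty);
    pthread_mutex_unlock(&q.mu);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tids[i], NULL);
    return processed;
}
```

In a real send path the job payload would be processed outside the lock; it is held here only to keep the counter update in the sketch trivially correct.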
Benchmarks • To be filled in soon
Results • To be filled in soon
Analysis of Results • To be filled in soon
Conclusion • To be filled in soon
Future Work
• Implement communication APIs for interconnects other than Ethernet
• Add latency calculation based on the interconnect
• Automate the complete startup of the simulator from user input
• Add additional communication models, such as block and pipe, for tightly coupled multi-Cell systems
References
[1] Michael Kistler, Michael Perrone, Fabrizio Petrini. “Cell Multiprocessor Communication Network: Built for Speed”. IEEE Micro, 26(3), May/June 2006.
[2] Kevin Krewell. “Cell Moves into the Limelight”. Microprocessor Report, Feb. 14, 2005.
[3] Maxim Krasnyansky. “Universal TUN/TAP device driver”. http://www.kernel.org/pub/linux/kernel/people/marcelo/linux-2.4/Documentation/networking/tuntap.txt
[4] Cell Broadband Engine resource center. http://www-128.ibm.com/developerworks/power/cell/
[5] H. Peter Hofstee. “Introduction to the Cell Broadband Engine”.