
A Functional Network Simulator for IBM Cell based Multiprocessor Systems



  1. A Functional Network Simulator for IBM Cell based Multiprocessor Systems Presented By: Vishakha Gupta Advised By: Prof. Sudhakar Yalamanchili, School of Electrical and Computer Engineering Website: http://www.cc.gatech.edu/~vishakha/projects.php

  2. Agenda • Cell Broadband Engine (CBE) architecture • Motivation • Design of the Multi-Cell Simulator (MCS) • Programming Model • Execution Model • API • Implementation • Benchmarks • Analysis of benchmark performance • Conclusion

  3. CBE Architecture

  4. CBE Architecture - Overview • 64-bit Power architecture forms the foundation • Dual-thread Power Processor Element (PPE) • In-order, two-issue superscalar design • Support for simultaneous multithreading (up to 2 threads) • Eight Synergistic Processor Elements (SPEs) • Based on a SIMD-RISC instruction set • 128-entry, 128-bit unified register file for all data types

  5. CBE Architecture – Overview [2] • On-chip Rambus XDR controller with support for two banks of Rambus XDR memory • Cell processor production die has 235M transistors and measures 235 mm² • Excludes networking peripherals and large memory arrays on chip • Reaches high performance through its high clock speed and high-performance XDR DRAM interface

  6. CBE Architecture – Memory Model • Power core • 32 KB 2-way set-associative instruction cache and 32 KB 4-way set-associative data cache • 256 KB local store on each SPE, 6-cycle load latency • Software must manage data movement in and out of the local store • Controlled by the memory flow controller • Does not participate in hardware cache coherency • Aliased in the memory map of the processor • PPE can load and store from a memory location mapped to the local store (slow) • SPE can use the DMA controller to move data to its own or another SPE's local store • The memory flow controller on an SPE can begin transferring the data set of the next task while the present one is running – double buffering (sketched below)
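The double-buffering point can be made concrete with a minimal SPE-side sketch using the standard spu_mfcio.h DMA intrinsics; CHUNK, buf, and process_chunk() are illustrative names, and the total size is assumed to be a non-zero multiple of CHUNK:

    #include <spu_mfcio.h>

    #define CHUNK 4096  /* bytes per DMA transfer; must be <= 16 KB */

    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process_chunk(char *data, unsigned size);

    void stream_in(unsigned long long ea, unsigned long long total)
    {
        int cur = 0;
        mfc_get(buf[0], ea, CHUNK, 0, 0, 0);   /* prefetch first chunk, tag 0 */

        for (unsigned long long off = CHUNK; off < total; off += CHUNK) {
            int nxt = cur ^ 1;
            mfc_get(buf[nxt], ea + off, CHUNK, nxt, 0, 0); /* start next DMA */
            mfc_write_tag_mask(1 << cur);      /* wait only on current tag */
            mfc_read_tag_status_all();
            process_chunk(buf[cur], CHUNK);    /* compute overlaps the DMA */
            cur = nxt;
        }
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);        /* last chunk */
    }

Issuing the mfc_get for the next buffer before waiting on the current tag is what lets the transfer of the next data set overlap the present task's computation.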

  7. Multi-Cell Simulator - Motivation • Cell architecture is well suited to advanced visualization, streaming, and scientific applications • An example of a heterogeneous multi-core architecture – widely seen as the way of the future • Feasibility of generating and running parallel code on multiple interconnected Cell processors • See Roadrunner (supercomputer being built at LANL with 64K AMD Opteron and 16K IBM Cell processors)! • Great advantage to various research groups, such as compiler groups • Simulate different programming techniques • Test their effectiveness on these heterogeneous architectures • Goal: adapt the parallel computing world to work the heterogeneous multi-core way

  8. Design Goals • Ease of use by programmers • Convenient APIs for faster and more efficient parallel programming • Performance • Less time should be spent in MCS library functions • Scalability • For massively parallel application simulations

  9. Implementation Goals • Extensibility • Ease of plugging in different interconnects and programming models • Reliability • Easy to debug application if middleware can be assumed stable • More than being just a functional simulator • Latency estimations for different interconnects

  10. Programming Model • Create a “platform” consisting of ‘n’ PPEs and ‘m’ SPEs • Programmer can write code as if everything were on one machine • Point-to-point communication between different elements (PPE/SPE) in the system • Group communication • Form groups of SPEs/PPEs/mixed for collective communication • Broadcast to all or multicast to an existing group • Communication units between elements (PPEs/SPEs): • Packet – send/receive data in one call • Stream – send/receive data at a specified rate or split into multiple buffers (usage sketched below)
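A packet-model usage sketch follows. Every name in it (mcs.h, mcs_init, mcs_elem, mcs_send_packet, mcs_finalize, the MCS_* flags) is a hypothetical illustration of the API style the slides describe, not the library's actual interface:

    #include "mcs.h"   /* hypothetical MCS header */

    int main(int argc, char **argv)
    {
        mcs_init(&argc, &argv);              /* create library data structures */

        int data[256] = {0};
        mcs_elem_t peer = mcs_elem(1, 0);    /* illustrative: PPE 0 on host 1 */

        /* Packet model: the whole buffer moves in a single call, whether
           the peer is a local SPE or a PPE on another simulated Cell. */
        mcs_send_packet(peer, data, sizeof data, MCS_BLOCKING | MCS_RELIABLE);

        mcs_finalize();                      /* remove library data structures */
        return 0;
    }

A stream would instead bind the same buffer to a rate or a list of sub-buffers, but the call site would look much the same.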

  11. Execution Model • Communication possibilities • PPE to local SPE and vice versa – DMA/mailbox/channels/memory-mapped IO (mailbox path sketched below) • PPE to remote PPE – network API • PPE to remote SPE – • PPE to remote PPE responsible for the given SPE, then • PPE to local SPE • SPE to remote SPE – • SPEs are not expected to make MCS library calls directly – code bloat in the SPE local store (likely, yet to be tested) • Copy data over to the controlling PPE • Then same as PPE to remote SPE
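For the PPE-to-local-SPE mailbox hop, one plausible shape is the following, assuming the libspe2 interface rather than anything MCS-specific; ctx and msg are illustrative:

    #include <libspe2.h>

    /* PPE side: push one 32-bit control word to a local SPE. */
    void notify_spe(spe_context_ptr_t ctx, unsigned int msg)
    {
        /* Blocks until the SPE's 4-entry inbound mailbox has room. */
        spe_in_mbox_write(ctx, &msg, 1, SPE_MBOX_ALL_BLOCKING);
    }

    /* SPE side (spu_mfcio.h): the matching blocking read would be
     *     unsigned int msg = spu_read_in_mbox();
     */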

  12. Execution Model [2] • Communication combinations • Element-to-element or group send/receive, which can be • Blocking or non-blocking • Reliable or unreliable • Application can request more parallelism by specifying the number of threads that should handle the send/receive • In-order or out-of-order delivery of data • Programmer can use common APIs for local as well as remote communication • Location of a PPE or SPE is transparent to the application • But local send and receive are optimized

  13. Platform View

  14. MPI-style Communication

  15. Design of the Multi-Cell Simulator - Implementation Units [Figure: two hosts shown side by side, each stacked as Application + MCS Library / Simulated Linux OS / IBM Cell Simulator / TUN Device Interface / Host Linux OS / External Network Interface, connected by the physical interconnect]

  16. Software Stack - MCS/Pooled Accelerator Library [Figure: Cell applications run on the Multi-Cell API, which dispatches to TCP/IP, Infiniband, shared-memory, and other backends, each with its own latency analysis; beneath sit the simulated Linux operating system (Fedora Core 5), the simulated Cell hardware, the TUN device interface, and the host Linux operating system (Fedora Core 5/6)]

  17. Software - Current Multi-Cell Simulator [Figure: on the master host, MCS Boot reads a config file holding configuration information for all simulator instances, configures the network and Mambo, and starts the parallel application linked with the MCS library; the hosted OS connects through Mambo's TUN interface to the host OS, and hosts exchange data via socket-based network communication. Example config file: NumHosts=4, NumPPE=6, NumSPE=24, ... (entries for more hosts follow)]

  18. API • Network • Connection establishment • Send, receive, query, and wait for both point-to-point and group communication • Group • Create, modify, and delete groups of Cell elements • Startup and cleanup • Create and remove data structures for the library • Timing • Synchronize groups • Memory • Allocate and de-allocate memory for buffers needed by the application code (possible prototypes sketched below)
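A header-style sketch of what these API groups could look like in C; all type names and signatures below are assumptions, not the library's real prototypes:

    #include <stddef.h>

    typedef int mcs_elem_t;      /* placeholder handle types */
    typedef int mcs_group_t;
    typedef int mcs_request_t;

    /* Startup and cleanup */
    int   mcs_init(int *argc, char ***argv);
    int   mcs_finalize(void);

    /* Network: point-to-point and group communication */
    int   mcs_send(mcs_elem_t dst, const void *buf, size_t len, int flags);
    int   mcs_recv(mcs_elem_t src, void *buf, size_t len, int flags);
    int   mcs_query(mcs_request_t req);       /* poll a non-blocking op */
    int   mcs_wait(mcs_request_t req);        /* block on a non-blocking op */

    /* Groups of Cell elements */
    int   mcs_group_create(const mcs_elem_t *members, int n, mcs_group_t *out);
    int   mcs_group_modify(mcs_group_t grp, const mcs_elem_t *members, int n);
    int   mcs_group_delete(mcs_group_t grp);
    int   mcs_bcast(mcs_group_t grp, void *buf, size_t len, int flags);

    /* Timing */
    int   mcs_group_sync(mcs_group_t grp);    /* barrier-style group sync */

    /* Memory */
    void *mcs_alloc(size_t len);              /* buffers for application use */
    void  mcs_free(void *p);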

  19. Implementation - Setup • Use Mambo to simulate the Cell processor • Use TUN devices to simulate a general network of Cells (attach sketch below) • Configure all simulator instances to fall in the same subnet • Use bridging support with TUN devices for automatic message redirection between networked Cell processors • Enable TUN-Ethernet forwarding • Configure routing tables on the simulator as well as the host
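Attaching to a TUN device from C goes through the standard Linux TUN/TAP interface [3]; this minimal sketch is generic Linux code, not taken from MCS, and real code would report errors rather than just returning -1:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <linux/if_tun.h>

    int tun_open(const char *name)
    {
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0)
            return -1;

        memset(&ifr, 0, sizeof ifr);
        ifr.ifr_flags = IFF_TUN | IFF_NO_PI;   /* raw IP packets, no header */
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {  /* attach to (or create) it */
            close(fd);
            return -1;
        }
        return fd;  /* read()/write() on fd now carry simulated traffic */
    }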

  20. Implementation - Basics • Library of API implementations to be linked with the parallel application • Code written completely in C • Headers contain all the available function prototypes • Data exchange between local elements (PPE and SPE on the same simulator instance) goes through a fast path • No need to make socket calls (dispatch sketched below)
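The fast path amounts to a dispatch on element locality; mcs_elem_is_local, fastpath_enqueue, and socket_send are assumed helper names used only for illustration:

    #include <stddef.h>

    typedef int mcs_elem_t;                        /* placeholder handle */

    extern int mcs_elem_is_local(mcs_elem_t e);    /* same simulator instance? */
    extern int fastpath_enqueue(mcs_elem_t e, const void *buf, size_t len);
    extern int socket_send(mcs_elem_t e, const void *buf, size_t len, int flags);

    int mcs_send_dispatch(mcs_elem_t dst, const void *buf, size_t len, int flags)
    {
        if (mcs_elem_is_local(dst))                /* PPE/SPE on this instance */
            return fastpath_enqueue(dst, buf, len);    /* in-process queue */
        return socket_send(dst, buf, len, flags);  /* TCP/IP over the TUN path */
    }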

  21. Implementation [2] • Thread implementation using the pthread library • Carefully managed thread pools • Increase performance while taking care of scalability • Multiple queues for handling data send, receive, and re-ordering • Necessary to avoid contention in heavily threaded programs (see the worker-pool sketch below)
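A minimal pthreads sketch of the per-queue worker-pool idea; the queue layout and all names are assumptions, and production code would add shutdown handling and bounded capacity:

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct job {
        struct job *next;
        void (*run)(void *);
        void *arg;
    } job_t;

    typedef struct {
        job_t *head, *tail;
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
    } queue_t;

    /* Separate queues (e.g., send, receive, re-order) keep pool threads
       from contending on one lock. */
    static queue_t send_q = { NULL, NULL, PTHREAD_MUTEX_INITIALIZER,
                              PTHREAD_COND_INITIALIZER };

    static void queue_push(queue_t *q, job_t *j)
    {
        pthread_mutex_lock(&q->lock);
        j->next = NULL;
        if (q->tail) q->tail->next = j; else q->head = j;
        q->tail = j;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
    }

    static job_t *queue_pop(queue_t *q)
    {
        pthread_mutex_lock(&q->lock);
        while (!q->head)
            pthread_cond_wait(&q->nonempty, &q->lock);
        job_t *j = q->head;
        q->head = j->next;
        if (!q->head) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        return j;
    }

    static void *worker(void *arg)     /* each pool thread drains one queue */
    {
        queue_t *q = arg;
        for (;;) {
            job_t *j = queue_pop(q);
            j->run(j->arg);
            free(j);
        }
        return NULL;
    }

    /* Pool startup: pthread_create(&tid, NULL, worker, &send_q) per thread. */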

  22. Implementation - Current

  23. Benchmarks • To be filled in soon

  24. Results • To be filled in soon

  25. Analysis of Results • To be filled in soon

  26. Conclusion • To be filled in soon

  27. Future Work • Implement communication APIs for interconnects other than Ethernet • Add latency calculation based on the interconnect • Automate the complete startup of the simulator depending on user input • Add additional communication models, like block and pipe, for tightly coupled multi-Cell systems

  28. References [1] Michael Kistler, Michael Perrone, Fabrizio Petrini. "Cell Multiprocessor Communication Network: Built for Speed". IEEE Micro, 26(3), May/June 2006. [2] Kevin Krewell. "Cell Moves into the Limelight". Microprocessor Report, 2/14/05-01. [3] Maxim Krasnyansky. "Universal TUN/TAP device driver". http://www.kernel.org/pub/linux/kernel/people/marcelo/linux-2.4/Documentation/networking/tuntap.txt [4] Cell Broadband Engine resource center. http://www-128.ibm.com/developerworks/power/cell/ [5] H. Peter Hofstee. "Introduction to Cell Broadband Engine".
