An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments
G. Narayanaswamy, P. Balaji and W. Feng
Dept. of Computer Science, Virginia Tech
Mathematics and Computer Science, Argonne National Laboratory
High-end Computing Trends • High-end Computing (HEC) Systems • Continue to increase in scale and capability • Multicore architectures • A significant driving force for this trend • Quad-core processors from Intel/AMD • IBM Cell, Sun Niagara, Intel Terascale processor • High-speed Network Interconnects • 10-Gigabit Ethernet (10GE), InfiniBand, Myrinet, Quadrics • Different stacks use different amounts of hardware support • How do these two components interact with each other?
Multicore Architectures • Multi-processor vs. Multicore systems • Not all of the processor hardware is replicated for multicore systems • Hardware units such as the cache might be shared between the different cores • Multiple processing units are embedded on the same processor die, making inter-core communication faster than inter-processor communication • On most architectures (Intel, AMD, Sun), all cores are equally powerful, which makes scheduling easier
Interactions of Protocols with Multicores • Depending on how the stack works, different protocols have different interactions with multicore systems • Study based on host-based TCP/IP and iWARP • TCP/IP has significant interaction with multicore systems • Large impacts on application performance • The iWARP stack itself does not interact directly with multicore systems • Software libraries built on top of iWARP DO interact (buffering of data, copies) • Interaction similar to other high-performance protocols (InfiniBand, Myrinet MX, QLogic PSM)
TCP/IP Interaction vs. iWARP Interaction • [Figure: with host-based TCP/IP, packet processing runs in the host TCP/IP stack, independent of the application process and statically tied to a single core; with iWARP, packet processing is offloaded to the network adapter, and the remaining host-side library processing is closely tied to the application process] • TCP/IP is in some ways more asynchronous or “centralized” with respect to host-processing as compared to iWARP (or other high-performance software stacks)
Presentation Layout • Introduction and Motivation • Treachery of Multicore Architectures • Application Process to Core Mapping Techniques • Conclusions and Future Work
Application Behavior Pre-analysis • A four-core system is effectively a 3.5 core system • A part of a core has to be dedicated to communication • Interrupts, Cache misses • How do we schedule 4 application processes on 3.5 cores? • If the application is exactly synchronized, there is not much we can do • Otherwise, we have an opportunity! • Study with GROMACS and LAMMPS
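One way to experiment with such a scheduling is to bind each process to a chosen core at startup. The following is a minimal sketch on Linux, assuming an MPI application and a purely illustrative rank-to-core table (the study's actual mappings are derived from application communication behavior), not the tool used in the paper:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Illustrative rank-to-core table for a 4-core node; the mappings used
     * in the study depend on the application's communication behavior. */
    static const int core_map[4] = {1, 2, 3, 0};
    int core = core_map[rank % 4];   /* rank % 4 stands in for the node-local rank */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);

    /* Bind this process (pid 0 = self) to its chosen core. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    printf("rank %d bound to core %d\n", rank, core);

    /* ... application work would follow here ... */

    MPI_Finalize();
    return 0;
}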
GROMACS Overview • Developed at the University of Groningen • Simulates the molecular dynamics of biochemical particles • The root process distributes a “topology” file corresponding to the molecular structure • Simulation time is broken down into a number of steps • Processes synchronize at each step • Performance is reported as the number of nanoseconds of molecular interactions that can be simulated each day
GROMACS: Random Scheduling • [Chart: performance across Machine 1 cores and Machine 2 cores with random scheduling]
GROMACS: Selective Scheduling • [Chart: performance across Machine 1 cores and Machine 2 cores with selective scheduling]
LAMMPS Overview • Molecular dynamics simulator developed at Sandia • Uses spatial decomposition techniques to partition the simulation domain into smaller 3-D subdomains • Each subdomain is allotted to a different process (a decomposition sketch follows below) • Interaction is required only between neighboring subdomains, which improves scalability • Used the Lennard-Jones liquid simulation within LAMMPS • [Figure: two machines, each with cores 0–3, connected over the network]
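As a rough illustration of this kind of decomposition (a generic sketch, not LAMMPS source code), MPI's Cartesian topology routines can split the processes into a 3-D grid so that each rank owns one subdomain and knows its neighbors:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let MPI factor the process count into a 3-D grid, e.g. 8 -> 2x2x2. */
    int dims[3] = {0, 0, 0};
    MPI_Dims_create(nprocs, 3, dims);

    int periods[3] = {1, 1, 1};               /* periodic boundaries */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);

    /* Each rank learns its subdomain coordinates and its x-direction neighbors. */
    int coords[3], xlo, xhi;
    MPI_Cart_coords(cart, rank, 3, coords);
    MPI_Cart_shift(cart, 0, 1, &xlo, &xhi);

    printf("rank %d owns subdomain (%d,%d,%d); x-neighbors: %d and %d\n",
           rank, coords[0], coords[1], coords[2], xlo, xhi);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}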
LAMMPS: Random Scheduling • [Chart: performance across Machine 1 cores and Machine 2 cores with random scheduling]
LAMMPS: Intended Communication Pattern • [Diagram: in each step, each process posts MPI_Irecv() for its neighbors, issues MPI_Send(), calls MPI_Wait(), performs its computation, and then begins the next step's MPI_Irecv()/MPI_Send(); a code sketch of this pattern follows]
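A minimal MPI sketch of this intended pattern, assuming a simple ring of neighbors and illustrative buffer sizes (not the actual LAMMPS code): each step pre-posts the receives, sends the boundary data, waits, and then computes.

#include <mpi.h>

#define N      1024   /* elements exchanged per neighbor (illustrative) */
#define NSTEPS 10     /* simulation steps (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    double sendbuf[N], recv_l[N], recv_r[N];
    for (int i = 0; i < N; i++) sendbuf[i] = rank;

    for (int step = 0; step < NSTEPS; step++) {
        MPI_Request reqs[2];

        /* Pre-post receives from both neighbors (MPI_Irecv in the diagram). */
        MPI_Irecv(recv_l, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recv_r, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Send boundary data to both neighbors (MPI_Send in the diagram). */
        MPI_Send(sendbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD);
        MPI_Send(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);

        /* Wait for both receives before computing (MPI_Wait in the diagram). */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* ... computation on the subdomain would follow here ... */
    }

    MPI_Finalize();
    return 0;
}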
LAMMPS: Actual Communication Pattern • [Diagram: MPI_Send() data travels from the sender's MPI buffer through the socket send and receive buffers into the receiver's application receive buffer; because one process runs on a “slower” core and its peer on a faster core, their MPI_Wait()/computation phases drift apart, leading to “out-of-sync” communication between processes]
LAMMPS: Selective Scheduling • [Chart: performance across Machine 1 cores and Machine 2 cores with selective scheduling]
Concluding Remarks and Future Work • Multicore architectures and high-speed networks are becoming prominent in high-end computing systems • The interaction of these components is important and interesting! • For TCP/IP, the scheduling order drastically impacts performance • For iWARP, the scheduling order has no overhead • Scheduling processes in a more intelligent manner allows significantly improved application performance • Does not impact iWARP and other high-performance stacks, making the approach portable as well as efficient • Future work: dynamic process-to-core scheduling!
Thank You Contacts: Ganesh Narayanaswamy: cnganesh@cs.vt.edu Pavan Balaji: balaji@mcs.anl.gov Wu-chun Feng: feng@cs.vt.edu For More Information: http://synergy.cs.vt.edu http://www.mcs.anl.gov/~balaji