Virtualizing Modern High-Speed Interconnection Networks with Performance and Scalability
Bo Li, Zhigang Huo, Panyong Zhang, Dan Meng
{leo, zghuo, zhangpanyong, md}@ncic.ac.cn
Presenter: Xiang Zhang (zhangxiang@ncic.ac.cn)
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Introduction
• Virtualization is now one of the enabling technologies of cloud computing
• Many HPC providers now use their systems as platforms for cloud/utility computing; these HPC-on-Demand offerings include:
  • Penguin's POD
  • IBM's Computing on Demand service
  • R Systems' dedicated hosting service
  • Amazon's EC2
Introduction: Virtualizing HPC Clouds?
• Pros:
  • good manageability
  • proactive fault tolerance
  • performance isolation
  • online system maintenance
• Cons:
  • performance gap
  • lack of low-latency interconnects, which are important for tightly coupled MPI applications
• VMM-bypass I/O has been proposed to address this concern
Introduction: VMM-Bypass I/O Virtualization
• The Xen split device driver model is used only to set up the necessary user access points
• Data communication on the critical path bypasses both the guest OS and the VMM
[Figure: VMM-bypass I/O (courtesy [7])]
Introduction: InfiniBand Overview
• InfiniBand is a popular high-speed interconnect
  • OS-bypass/RDMA
  • latency: ~1 µs
  • bandwidth: 3300 MB/s
• ~41.4% of Top500 systems (June 2010) use InfiniBand as the primary interconnect
[Figure: Top500 interconnect family share, June 2010. Source: http://www.top500.org]
Introduction: InfiniBand Scalability Problem
• Reliable Connection (RC)
  • Queue Pair (QP): each QP consists of a send queue (SQ) and a receive queue (RQ)
  • QPs require memory
  • connections per process: (N-1)×C, where N is the node count and C is the number of cores per node
• Shared Receive Queue (SRQ): many QPs share a single receive queue (see the sketch below)
• eXtensible Reliable Connection (XRC)
  • XRC domain and SRQ-based addressing
  • connections per process: (N-1)
[Figure: RC vs. XRC connection setup; with XRC, each remote process is reached through its SRQ]
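To make the SRQ mechanism concrete, here is a minimal sketch using standard libibverbs calls (not code from the paper; device selection, queue sizes, and error handling are simplified) that creates one shared receive queue which any number of QPs could then attach to:

```c
/* Sketch: create one Shared Receive Queue (SRQ) that many QPs can attach to.
 * Standard libibverbs calls; queue sizes are arbitrary example values. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0)
        return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx)
        return 1;
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd)
        return 1;

    struct ibv_srq_init_attr srq_attr = {
        .attr = { .max_wr = 4096, .max_sge = 1 }  /* receive buffers shared by all QPs */
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);
    if (!srq) {
        fprintf(stderr, "ibv_create_srq failed\n");
        return 1;
    }
    /* Each RC QP would set ibv_qp_init_attr.srq = srq at creation time,
     * so receive memory no longer grows with the number of connections. */
    printf("SRQ created\n");

    ibv_destroy_srq(srq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

Because the receive buffers live in the SRQ rather than in each QP's own RQ, receive-side memory stops growing with the connection count; XRC goes further and also collapses the connections themselves.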
Problem Statement
• Does a scalability gap exist between native and virtualized environments?
• CV: number of cores per VM
• A scalability gap exists!
Presentation Outline
• Introduction
• Problem Statement
• Proposed Design
• Evaluation
• Conclusions and Future Work
Proposed Design: VM-Proof XRC Design
• The design goal is to eliminate the scalability gap
• Connections per process are reduced from (N-1)×(C/CV) to (N-1) (see the sketch below)
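To make the formulas concrete, the small calculation below (example values only, not measurements from the paper) prints the per-process connection counts for plain RC, for VM-XRC at several CV settings, and for the VM-proof design:

```c
/* Illustrative per-process connection counts, following the formulas on
 * the preceding slides. N, C, and CV are example values, not measurements. */
#include <stdio.h>

int main(void)
{
    const long N = 4096;  /* nodes in the cluster (example) */
    const long C = 16;    /* cores per node (example)       */

    printf("RC:            %ld connections/process\n", (N - 1) * C);
    for (long CV = 1; CV <= C; CV *= 2)   /* cores per VM */
        printf("VM-XRC CV=%-2ld: %ld connections/process\n", CV, (N - 1) * (C / CV));
    printf("VM-proof XRC:  %ld connections/process\n", N - 1);
    return 0;
}
```

With CV=1 every VM looks like a separate node, so VM-XRC degenerates to the RC count; only when a single VM spans the whole node (CV=C) does it match the native XRC figure, which is exactly the gap the VM-proof design removes.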
Proposed Design: Design Challenges
• VM-proof sharing of the XRC domain
  • a single XRC domain must be shared among different VMs within a physical node
• VM-proof connection management
  • with a single XRC connection, P1 is able to send data to all the processes on another physical node (P5~P8), no matter which VMs those processes reside in (see the sketch below)
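As a rough illustration of what VM-proof connection management implies, the sketch below keeps one XRC connection per remote physical node, keyed by the HCA's LID, and resolves any destination process to that single connection; all structures and names are hypothetical and only stand in for the paper's actual implementation:

```c
/* Hypothetical connection table for VM-proof connection management:
 * one XRC connection per remote physical node, keyed by the HCA's LID;
 * each destination process is reached through its SRQ on that node.
 * All types and names here are illustrative, not the paper's code. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct xrc_conn {           /* one per remote physical node          */
    uint16_t node_lid;      /* LID of the remote HCA = node identity */
    int      qp_handle;     /* the single shared XRC QP to that node (placeholder) */
};

struct peer {               /* one per remote MPI process            */
    uint16_t node_lid;      /* physical node the process lives on    */
    uint32_t srq_num;       /* its SRQ number inside the XRC domain  */
};

/* Whether the destination process runs in VM1 or VM2 on the remote node,
 * the lookup returns the same per-node connection. */
static struct xrc_conn *conn_for_peer(struct xrc_conn *conns, size_t n,
                                      const struct peer *dst)
{
    for (size_t i = 0; i < n; i++)
        if (conns[i].node_lid == dst->node_lid)
            return &conns[i];
    return NULL;            /* not connected yet: establish lazily */
}

int main(void)
{
    struct xrc_conn conns[] = { { .node_lid = 0x12, .qp_handle = 7 } };
    /* P5 and P8 live in different VMs on the node with LID 0x12 ... */
    struct peer p5 = { .node_lid = 0x12, .srq_num = 1005 };
    struct peer p8 = { .node_lid = 0x12, .srq_num = 1008 };
    /* ... yet both resolve to the same XRC connection. */
    printf("P5 -> QP %d, P8 -> QP %d\n",
           conn_for_peer(conns, 1, &p5)->qp_handle,
           conn_for_peer(conns, 1, &p8)->qp_handle);
    return 0;
}
```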
Proposed Design: Implementation
• VM-proof sharing of the XRCD
  • the XRCD is shared by opening the same XRCD file
  • guest domains and the IDD have dedicated, non-shared filesystems
  • a pseudo XRCD file in the guest stands in for the real XRCD file in the IDD
• VM-proof CM
  • traditionally, the IP address/hostname was used to identify a node
  • the LID of the HCA is used instead (see the sketch below)
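For context, the OFED-era XRC verbs extension (the software stack listed on the evaluation slide) ties an XRC domain to an open file descriptor, so processes that open the same file end up sharing the domain. The sketch below assumes that legacy ibv_open_xrc_domain API rather than the later upstream ibv_alloc_xrcd interface; the file path, port number, and error handling are illustrative:

```c
/* Sketch: share an XRC domain via a common file, and identify the node by
 * the HCA's LID instead of IP/hostname. Assumes the legacy OFED XRC verbs
 * (ibv_open_xrc_domain); the XRCD file path is illustrative. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0)
        return 1;
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx)
        return 1;

    /* Every process (in any VM or in the IDD) that opens this same file
     * joins the same XRC domain. In the paper's design the guest opens a
     * pseudo XRCD file that stands in for the real one in the IDD. */
    int fd = open("/dev/shm/mpi_job_42.xrcd", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return 1;
    struct ibv_xrc_domain *xrcd = ibv_open_xrc_domain(ctx, fd, O_CREAT);
    if (!xrcd)
        return 1;

    /* The LID of the HCA port identifies the physical node for connection
     * management, independent of the VM's hostname or IP address. */
    struct ibv_port_attr pattr;
    if (ibv_query_port(ctx, 1, &pattr) == 0)
        printf("node id (LID) = 0x%x\n", pattr.lid);

    ibv_close_xrc_domain(xrcd);
    close(fd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

Using a per-job file name (and restrictive permissions enforced by the IDD) is one way the different-XRCD-per-job isolation mentioned on the next slide could be realized.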
Proposed Design: Discussions
• Safe XRCD sharing
  • unauthorized applications from other VMs may attempt to share the XRCD
  • isolation of XRCD sharing can be guaranteed by the IDD
• Isolation between VMs running different MPI jobs
  • by using different XRCD files, different jobs (or VMs) share different XRCDs and run without interfering with each other
• XRC migration
  • main challenge: an XRC connection is a process-to-node communication channel
  • left as future work
Presentation Outline
• Introduction
• Problem Statement
• Proposed Design
• Evaluation
• Conclusions and Future Work
Evaluation: Platform
• Cluster configuration:
  • 128-core InfiniBand cluster
  • quad-socket, quad-core Barcelona 1.9 GHz nodes
  • Mellanox DDR ConnectX HCAs, 24-port MT47396 InfiniScale-III switch
• Implementation:
  • Xen 3.4 with Linux 2.6.18.8
  • OpenFabrics Enterprise Distribution (OFED) 1.4.2
  • MVAPICH 1.1.0
Evaluation: Microbenchmarks
• The bandwidth results are nearly the same
• Virtualized IB performs ~0.1 µs worse when using the blueframe mechanism
  • caused by the memory copy of the send data to the HCA's blueframe page
  • explanation: memory copy operations in the virtualized case involve interactions between the guest domain and the IDD
[Figure: IB verbs latency using doorbell; IB verbs latency using blueframe; MPI latency using blueframe]
Evaluation: VM-Proof XRC
• Configurations:
  • Native-XRC: native environment running XRC-based MVAPICH
  • VM-XRC (CV=n): VM-based environment running unmodified XRC-based MVAPICH; the parameter CV denotes the number of cores per VM
  • VM-proof XRC: VM-based environment running MVAPICH with our VM-proof XRC design
Evaluation: Memory Usage
• 16-cores-per-node cluster, fully connected; the X-axis denotes the process count
• ~12 KB of memory for each QP
• At 64K processes, the VM-XRC (CV=1) configuration consumes ~13 GB of QP memory per node
• The VM-proof XRC design reduces this to only ~800 MB per node, i.e. ~16x less memory (see the calculation below)
[Figure: QP memory per node vs. process count (lower is better)]
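The reported figures follow from the per-QP cost and the connection formulas; the small calculation below (example code, using the slide's ~12 KB per QP, which is an approximation) lands in the same ballpark as the 13 GB and 800 MB per node quoted above:

```c
/* Rough reproduction of the per-node QP memory estimate from the slide's
 * numbers: 64K processes, 16 processes per node, ~12 KB of memory per QP. */
#include <stdio.h>

int main(void)
{
    const long procs    = 64 * 1024;   /* total MPI processes         */
    const long cores    = 16;          /* processes per physical node */
    const long nodes    = procs / cores;
    const long qp_bytes = 12 * 1024;   /* ~12 KB per QP (approximate) */

    /* VM-XRC with CV=1: every VM looks like a node, so each process keeps
     * (nodes - 1) * cores connections; VM-proof XRC keeps (nodes - 1). */
    double vm_xrc_node_gb  = (double)(nodes - 1) * cores * qp_bytes * cores
                             / (1024.0 * 1024.0 * 1024.0);
    double vmproof_node_mb = (double)(nodes - 1) * qp_bytes * cores
                             / (1024.0 * 1024.0);

    printf("VM-XRC (CV=1): ~%.1f GB of QP memory per node\n", vm_xrc_node_gb);
    printf("VM-proof XRC:  ~%.0f MB of QP memory per node\n", vmproof_node_mb);
    return 0;
}
```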
Evaluation: MPI Alltoall
• A total of 32 processes
• VM-proof XRC shows a 10%~25% improvement for messages smaller than 256 B
[Figure: MPI_Alltoall latency (lower is better)]
Evaluation: Application Benchmarks
• VM-proof XRC performs nearly the same as Native-XRC, except for BT and EP
• Both are better than VM-XRC
• Little variation across different CV values
  • CV=8 is an exception: memory allocation is not guaranteed to be NUMA-aware
[Figure: application benchmark performance for Native-XRC, VM-XRC, and VM-proof XRC (lower is better)]
Evaluation: Application Benchmarks (Cont'd)
• The VM-proof XRC design establishes ~15.9x and ~14.7x fewer connections than the VM-XRC configuration
[Figure: connection counts per configuration]
Conclusions and Future Work
• The VM-proof XRC design converges two technologies:
  • VMM-bypass I/O virtualization
  • eXtensible Reliable Connection (XRC) in modern high-speed interconnects (InfiniBand)
• With the VM-proof XRC design, virtualized environments achieve the same raw performance and scalability as the native, non-virtualized environment
  • ~16x scalability improvement is seen on 16-core/node clusters
• Future work:
  • evaluations on different platforms at larger scale
  • add VM migration support to the VM-proof XRC design
  • extend the work to the new SR-IOV-enabled ConnectX-2 HCAs
Questions? {leo, zghuo, zhangpanyong, md}@ncic.ac.cn
OS-Bypass of InfiniBand
[Figure: OpenIB Gen2 stack]