
An Efficient Shared Memory Based Virtual Communication System for Embedded SMP Cluster

This research proposes a design that extends the shared memory mechanism into inter-node communications for embedded SMP clusters, aiming to improve performance and efficiency while reducing costs and power consumption.


Presentation Transcript


  1. An Efficient Shared Memory Based Virtual Communication System for Embedded SMP Cluster. Wenxuan Yin, Institute of Computing Technology, Chinese Academy of Sciences. Joint work with Xiang Gao and Xiaojing Zhu (ICT, CAS) and Deyuan Guo (Tsinghua University). NAS 2011

  2. Background • Dilemma in embedded systems • High performance • Cost, power consumption, size, etc. • Examples: video/media processing, space-borne satellites Wenxuan Yin-NAS 2011

  3. Background • Why are SMP clusters popular in general computing? • High scalability • Good cost-performance ratio • Convenient for MPI programming • They can also benefit the embedded domain • Embedded cluster • Embedded processor nodes • Commodity networks • Tradeoff: moderate performance vs. cost/power efficiency

  4. Motivations • Challenges posed by SMP nodes • Two levels of communication, with a performance gap between them • Inter-node: high-speed network • Intra-node: shared memory/cache • Memory management • Memory hierarchy: local vs. remote • Coherency maintenance • MPI Inter-Process Communication (IPC) • Process allocation at different levels of parallelism • Mutual exclusion and synchronization

  5. Motivations • Opportunities in SMP nodes • More computation capacity • High-speed chip-to-chip interconnect fabrics • PCI-E: ARM Cortex A9 MPCore • Serial RapidIO: Freescale 8641D • HyperTransport: ICT Godson-3A • Can we use the fabrics directly to replace traditional NIC-based networks? • Get rid of NICs, switches, cables • How can this be done?

  6. Proposed Design: Extending the Shared Memory Mechanism into Inter-Node Communications

  7. Objectives • Compatibility • Software-virtualized network → TCP/IP protocol • Efficiency • Remote memory → logical shared memory • Narrow the gap between the two levels • Economization • Compact interconnect → space- and cost-effective

  8. Comparison • Chip-to-chip interconnection changes the network topology • [Figure: star topology of uniprocessor nodes around an Ethernet switch vs. mesh of Godson-3A SMP nodes linked by HT as a virtual Ethernet] • Legend: UN = Uniprocessor Node, G = Godson-3A SMP

  9. Architecture • Godson-3A SMP nodes • Shared Memory Virtual Network • HT0: for interconnection, configured into 2 parts • HT1: for I/O extension, omitted here • Memory in each node is divided into 2 parts

  10. SMP Nodes • Godson-3A CPU: MIPS64-compatible, 4-core, superscalar • For high performance and low power consumption • [Figure: Godson-3A] July 2011

  11. More Details • Cache coherency • Directory-based cache coherency • HT maintains coherency across the whole interconnection system, with global addressing for remote accesses • Transparent to programmers • Reconfigurable memory pool • Each node can tune the size of the shared memory it contributes to the memory pool • Extreme case: only the master node contributes its shared part

  12. X-Y Transmission • Built-in routing mechanism in HT • Eliminates switches • Examples: G0 → G3, G3 → G0 • [Figure: 2×2 mesh of G0, G1, G2, G3 linked by HT as a virtual Ethernet]
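
The two example routes can be traced with a small model of X-Y (dimension-order) routing. This is only a sketch: the coordinate assignment (G0 at (0,0), G1 at (1,0), G2 at (0,1), G3 at (1,1)) is read off the figure layout, and the X-before-Y order and the name `xy_route` are assumptions for illustration, not the actual HT routing configuration.

```python
from typing import Dict, List, Tuple

# Hypothetical model of X-Y routing on the 2x2 mesh; not the real HT routing logic.
# Assumed layout: G0=(0,0), G1=(1,0), G2=(0,1), G3=(1,1).
COORDS: Dict[int, Tuple[int, int]] = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}
NODE_AT = {xy: node for node, xy in COORDS.items()}


def xy_route(src: int, dst: int) -> List[int]:
    """Return the nodes visited from src to dst, moving along X first, then Y."""
    x, y = COORDS[src]
    dst_x, dst_y = COORDS[dst]
    path = [src]
    while x != dst_x:  # step along the X dimension
        x += 1 if dst_x > x else -1
        path.append(NODE_AT[(x, y)])
    while y != dst_y:  # then step along the Y dimension
        y += 1 if dst_y > y else -1
        path.append(NODE_AT[(x, y)])
    return path
```

Under the assumed layout, `xy_route(0, 3)` passes through G1 and `xy_route(3, 0)` passes through G2, so the two directions use different intermediate nodes and no switch is involved.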

  13. SMVN Driver • Hierarchical design • Virtual physical layer • Memory copy & optimization • Virtual data link layer • Function and hardware abstraction • Packet encapsulation matches the frame format expected by TCP/IP • Driver management layer • Treats SMVN as a common NIC-class device • The OS queries such devices recurrently to load and start them • Splices SMVN and TCP/IP together!
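
As an illustration of what the virtual data link layer's encapsulation step might look like, the sketch below wraps a payload in a 14-byte Ethernet II header. The slides do not state the exact frame layout, so the Ethernet II format, the IPv4 EtherType, and the names `encapsulate` and `decapsulate` are all assumptions.

```python
import struct
from typing import Tuple

# Assumed frame layout: Ethernet II header (not confirmed by the slides).
ETH_HEADER = struct.Struct("!6s6sH")  # destination MAC, source MAC, EtherType
ETHERTYPE_IPV4 = 0x0800


def encapsulate(dst_mac: bytes, src_mac: bytes, payload: bytes) -> bytes:
    """Prefix an IP payload with a 14-byte Ethernet II header."""
    return ETH_HEADER.pack(dst_mac, src_mac, ETHERTYPE_IPV4) + payload


def decapsulate(frame: bytes) -> Tuple[bytes, bytes, int, bytes]:
    """Split a frame back into (dst_mac, src_mac, ethertype, payload)."""
    dst_mac, src_mac, ethertype = ETH_HEADER.unpack_from(frame)
    return dst_mac, src_mac, ethertype, frame[ETH_HEADER.size:]
```

Presenting frames in a format the upper stack already understands is what would let the OS treat the shared memory link like any other NIC.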

  14. SMVN Driver • [Figure: SMVN beneath the TCP/IP upper protocol]

  15. Communication • How can communication across networks (Ethernet or others) be implemented?

  16. Memory Management • Data structures on the SMVN buffer • Singly Linked List (SLL) over the shared memory pool (L) • FreeList: global, unique • InputList: each node maintains one • No extra memory allocation! • [Figure: packets chained from head to tail]

  17. Packet Transmission • Example: Node 0 as sender, Node 1 as receiver • Initially, FreeList holds all packets and InputList is NULL • Sending: fetch (FreeList), copy, insert (InputList), trigger an interrupt • Receiving: fetch (InputList), copy, insert (FreeList)
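
The send and receive sequences can be modeled with a fixed pool of slots chained by index. This is a single-threaded sketch under stated assumptions: the class `PacketPool`, its method names, and the interrupt counter are hypothetical, and the locking that real concurrent access would need is left out.

```python
from typing import List, Optional

NIL = -1  # end-of-list marker (plays the role of NULL)


class PacketPool:
    """Fixed pool of packet slots chained into singly linked lists by index."""

    def __init__(self, num_packets: int, num_nodes: int) -> None:
        self.data: List[bytes] = [b""] * num_packets
        self.next: List[int] = [NIL] * num_packets
        # One global FreeList plus one InputList per node, each as [head, tail].
        self.free = [NIL, NIL]
        self.inputs = [[NIL, NIL] for _ in range(num_nodes)]
        self.interrupts = [0] * num_nodes  # stand-in for the interrupt line
        for slot in range(num_packets):
            self._insert(self.free, slot)  # initially FreeList holds every slot

    def _insert(self, lst: List[int], slot: int) -> None:
        """Append slot at the tail of lst."""
        self.next[slot] = NIL
        if lst[0] == NIL:
            lst[0] = slot
        else:
            self.next[lst[1]] = slot
        lst[1] = slot

    def _fetch(self, lst: List[int]) -> int:
        """Remove and return the head slot of lst, or NIL if it is empty."""
        slot = lst[0]
        if slot != NIL:
            lst[0] = self.next[slot]
            if lst[0] == NIL:
                lst[1] = NIL
        return slot

    def send(self, dst: int, payload: bytes) -> bool:
        slot = self._fetch(self.free)          # fetch (FreeList)
        if slot == NIL:
            return False                       # pool exhausted
        self.data[slot] = bytes(payload)       # copy into the shared slot
        self._insert(self.inputs[dst], slot)   # insert (InputList)
        self.interrupts[dst] += 1              # trigger an interrupt
        return True

    def receive(self, node: int) -> Optional[bytes]:
        slot = self._fetch(self.inputs[node])  # fetch (InputList)
        if slot == NIL:
            return None
        payload = bytes(self.data[slot])       # copy out of the shared slot
        self._insert(self.free, slot)          # insert (FreeList)
        return payload
```

Because every slot is created once and only relinked afterwards, sending and receiving never allocate memory, which matches the "no extra memory allocation" point of the design.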

  18. Optimization • Essentially an optimization of memory operations! • Increase the concurrency • Pipelining effect • Minimize the number of memory accesses • Zero-copy scheme • Reduce memory access time • Instruction-level optimization

  19. Concurrency • Overlap SEND/RECV operations! • Pipelining effect! • [Figure: serial vs. concurrent timelines]
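
The benefit of overlapping SEND and RECV can be estimated with an idealized two-stage pipeline model. The per-packet costs below are placeholder parameters, not measurements from the talk, and the function names are hypothetical.

```python
def serial_time(n: int, t_send: float, t_recv: float) -> float:
    """Total time when each packet is fully sent and received before the next."""
    return n * (t_send + t_recv)


def pipelined_time(n: int, t_send: float, t_recv: float) -> float:
    """Total time when sending packet k+1 overlaps with receiving packet k.

    Idealized model: the slower stage sets the steady-state rate, and only
    the first send and the last receive are not overlapped.
    """
    if n == 0:
        return 0.0
    return t_send + (n - 1) * max(t_send, t_recv) + t_recv
```

For example, with 4 packets, a send cost of 3 units and a receive cost of 2 units, the serial schedule takes 20 units while the overlapped one takes 14.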

  20. Zero-Copy • Change the head/tail pointers • Change which list the packets belong to • Packets migrate between FreeList and InputList inside the shared memory pool (L) • The only data-copy scenario: between the network memory pool and the SMVN memory pool • Extra benefit: reduced power consumption!
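
A minimal sketch of the zero-copy idea follows: the payload is written into the pool once, and afterwards only slot indices move between the two lists. For brevity it uses `collections.deque` in place of the head/tail-linked lists, and the class `ZeroCopyPool`, the slot size, and the method names are assumptions made for illustration.

```python
from collections import deque
from typing import Tuple


class ZeroCopyPool:
    """Packet slots in one shared buffer; the lists hold slot indices, not data."""

    def __init__(self, num_slots: int = 4, slot_size: int = 64) -> None:
        self.slot_size = slot_size
        self.buffer = bytearray(num_slots * slot_size)  # stand-in for pool (L)
        self.free_list = deque(range(num_slots))
        self.input_list: deque = deque()

    def view(self, slot: int) -> memoryview:
        """Return a window onto the slot's bytes without copying them."""
        start = slot * self.slot_size
        return memoryview(self.buffer)[start:start + self.slot_size]

    def produce(self, payload: bytes) -> int:
        """Write the payload in place, then migrate the slot to InputList."""
        if len(payload) > self.slot_size:
            raise ValueError("payload larger than a slot")
        slot = self.free_list.popleft()
        self.view(slot)[:len(payload)] = payload  # the only copy into the pool
        self.input_list.append(slot)              # migration: an index moves
        return slot

    def consume(self) -> Tuple[int, memoryview]:
        """Hand the receiver the slot index and an in-place view of its data."""
        slot = self.input_list.popleft()
        return slot, self.view(slot)

    def release(self, slot: int) -> None:
        """Return a consumed slot to FreeList; again only the index moves."""
        self.free_list.append(slot)
```

The view returned by `consume` refers to the same underlying buffer the sender wrote to, so no bytes are duplicated when a packet changes lists.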

  21. Bottom-Level Optimization • To accelerate memcpy • Use cache coherency maintained by hardware • Use the cached address space • No flush/invalidate needed from programmers • Godson-3A double-word (64-bit) reads/writes • Unaligned memory access

  22. Mutual Exclusion • Why do we need this? • Concurrency leads to unpredictable outcomes • Solution: spinlock • Keeps operations on shared resources atomic • Test-And-Set (TAS) primitive • On Godson-3A nodes • ll (load-linked) & sc (store-conditional) instruction pair

  23. Simple Lock: TAS primitive • ll records the address while loading • sc checks whether the address has been modified by competing accesses • If NO, the store succeeds • If YES, a failure status is implicitly marked in a register
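
The ll/sc behaviour described here can be imitated with a toy software model. This is a single-threaded simulation for illustration only, not Godson-3A code: the class `LLSCWord`, the per-CPU link set, and the helpers `try_lock` and `unlock` are all hypothetical.

```python
class LLSCWord:
    """Toy model of one memory word supporting ll/sc from several CPUs."""

    def __init__(self) -> None:
        self.value = 0
        self._linked = set()  # CPUs whose link on this word is still valid

    def ll(self, cpu: int) -> int:
        """Load the value and record a link for this CPU."""
        self._linked.add(cpu)
        return self.value

    def sc(self, cpu: int, new_value: int) -> bool:
        """Store only if this CPU's link is intact; report success or failure."""
        if cpu not in self._linked:
            return False  # a competing store happened since our ll
        self.store(new_value)
        return True

    def store(self, new_value: int) -> None:
        """Any store to the word breaks every outstanding link."""
        self.value = new_value
        self._linked.clear()


def try_lock(word: LLSCWord, cpu: int) -> bool:
    """One test-and-set attempt: True if this CPU acquired the lock."""
    if word.ll(cpu) != 0:
        return False           # lock already held
    return word.sc(cpu, 1)     # fails if another CPU stored in between


def unlock(word: LLSCWord) -> None:
    """Release the lock with a plain store of 0."""
    word.store(0)
```

A spinlock would simply repeat `try_lock` until it returns True. In the model, if two CPUs both execute `ll` and one then succeeds with `sc`, the other's `sc` fails and it must retry.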

  24. Synchronization • Occurs between nodes during SMVN initialization • The master node initializes the shared memory pool; the others must wait until the pool is available • When the master is ready • Broadcast the ready status • Activate a timer • SMVN needs to restart on timeout • [Figure: 2×2 mesh of G0 to G3 linked by HT as a virtual Ethernet]
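
The wait-with-timeout handshake might be sketched as follows, using a thread event as a stand-in for the broadcast ready status. The slides do not say which side runs the timer or how many restarts are allowed, so placing the timer on the waiting nodes, the `max_restarts` parameter, and the names `InitBarrier` and `wait_for_master` are assumptions.

```python
import threading


class InitBarrier:
    """Stand-in for the ready status the master broadcasts to other nodes."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def master_ready(self) -> None:
        """Called by the master once the shared memory pool is initialized."""
        self._ready.set()

    def wait(self, timeout_s: float) -> bool:
        """Return True if the pool became ready before the timer expired."""
        return self._ready.wait(timeout_s)


def wait_for_master(barrier: InitBarrier, timeout_s: float, max_restarts: int) -> int:
    """Wait for the master, restarting on timeout; return the restart count."""
    for restarts in range(max_restarts + 1):
        if barrier.wait(timeout_s):
            return restarts
        # Timer expired: a real node would restart SMVN here before retrying.
    raise TimeoutError("shared memory pool never became available")
```

A non-master node would call `wait_for_master` before touching the pool, while the master calls `master_ready` at the end of its initialization.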

  25. MPI Processes • Worker Process (WP) • The number of WPs determines the degree of parallelism • The process that does the real work • Daemon Process (DP) • The DP mapping determines WP allocation, which reflects the parallel granularity • Intra-node or inter-node • At most one DP starts on each node • At least one DP resides in the cluster

  26. Mapping & Allocation • DPs are mapped into a binary-tree connection • WPs are allocated to nodes with DPs in breadth-first traversal order • m DPs, 1 ≤ m ≤ 4 • n WPs, n ≥ 1 • Node(i): number of WPs on node i, 0 ≤ i ≤ 3 • More than one WP on a node: OS SMP scheduling!
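
One possible reading of the breadth-first allocation is sketched below. It assumes DP i runs on node i, that the binary tree uses heap-style numbering (children of DP i are 2i+1 and 2i+2), and that WPs are dealt out round-robin in breadth-first order; none of these details is stated on the slide, and the function names are hypothetical.

```python
from collections import deque
from typing import List


def bfs_order(m: int) -> List[int]:
    """Breadth-first order of m DPs arranged as a binary tree rooted at DP 0."""
    order: List[int] = []
    queue = deque([0])
    while queue:
        i = queue.popleft()
        order.append(i)
        for child in (2 * i + 1, 2 * i + 2):  # assumed heap-style children
            if child < m:
                queue.append(child)
    return order


def allocate_wps(m: int, n: int) -> List[int]:
    """Return Node(i) for each DP node after placing n WPs in BFS order."""
    if not (1 <= m <= 4) or n < 1:
        raise ValueError("expected 1 <= m <= 4 and n >= 1")
    counts = [0] * m
    order = bfs_order(m)
    for wp in range(n):
        counts[order[wp % m]] += 1  # assumed round-robin over the BFS order
    return counts
```

With 4 DPs and 6 WPs this model places 2 WPs on each of the first two nodes and 1 on each of the others; nodes holding more than one WP rely on the OS's SMP scheduler to spread them over the cores.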

  27. Real Platform • MPICH2 library ported to our real system • Based on the socket interface supported by SMVN • [Figure: Godson-3A SMP nodes connected by the Shared Memory Virtual Network]

  28. Performance Tests • Benchmark • OSU Micro-Benchmarks (OMB) for MPI IPC evaluation • We choose two metrics • Ping-pong latency • Unidirectional bandwidth • Performance comparison between • Inter-node vs. intra-node • Cached vs. uncached

  29. Testbed Setup • Towards the embedded environment • Frequency: 525 MHz • Cache size • L1: 64 KB × 2 (instruction and data) • L2: 4 MB • Memory size • Local memory for the real-time OS kernel: 256 MB • Shared memory for the SMVN buffer: 2 MB • DDR2 working at 200 MHz • HT frequency: 800 MHz

  30. Results: Latency • [Figure: latency curves, annotated "cliffy" and "smooth", with the basic latency marked]

  31. Results: Bandwidth • [Figure: bandwidth curves, annotated 32.5 MB/s, 27.3 MB/s, and 84%]

  32. Observations • Much better than the Fast Ethernet (100 Mb/s) typically used in traditional embedded clusters (more than twice) • Cache is helpful! Avoids flush/invalidate by software • Tradeoff between performance and embedded constraints • Narrows the gap between the two levels (84% approximation) • Even superior to some high-end systems, although our absolute performance is lower • Introduces shared memory in both intra- and inter-node communications • Compact mesh topology in the system
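
The "84%" and "more than twice" figures can be checked with simple arithmetic from the bandwidth numbers on the results slide. The transcript does not label which curve each number belongs to, so reading 32.5 MB/s as the higher (presumably intra-node) peak and 27.3 MB/s as the lower (presumably inter-node) peak is an assumption, as is using Fast Ethernet's theoretical 100 Mb/s as the comparison point.

```python
def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert a rate in megabits per second to megabytes per second."""
    return mbit_per_s / 8


HIGHER_MB_S = 32.5  # assumed: the higher of the two reported peaks
LOWER_MB_S = 27.3   # assumed: the lower of the two reported peaks

# How close the lower peak comes to the higher one.
gap_ratio = LOWER_MB_S / HIGHER_MB_S

# How the lower peak compares with Fast Ethernet's theoretical 12.5 MB/s.
vs_fast_ethernet = LOWER_MB_S / mbit_to_mbyte(100)
```

Here `gap_ratio` comes out at about 0.84 and `vs_fast_ethernet` at about 2.18, consistent with the two annotations on the slide.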

  33. Related Work • Comparison of data transfer methods • User/kernel level shared memory [Buntinas et al.] • High-speed NIC based copy • MPI communication system (shared memory) • Nemesis [Buntinas et al.] • High-performance and good scalability system [Chai et al.] • RDMA system • InfiniBand [Mamidala et al.] • Quadrics QsNetII [Qian et al.]

  34. Conclusion • Proposed a novel shared memory based virtual communication system: SMVN • Goal: build a uniform infrastructure across the different communication levels to implement efficient MPI IPC under embedded constraints • Adequate performance • Compact size, low power consumption, low cost (no NICs, no switches, no cables) • Direction: scalability for large system expansion

  35. Thanks for your attention! Questions?
