Madeleine. Olivier Aumage. Runtime Project INRIA – LaBRI Bordeaux, France. Application. Model. Programming environment. Abstraction. Middle level interface. Software stack. Hardware control. Low level interface. Network. Objective.
Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France
Application Model Programming environment Abstraction Middle-level interface Software stack Hardware control Low-level interface Network Objective Rational task assignment in high-performance communication stacks
Madeleine A communication support for clusters and multi-clusters
Features Abstract interface • Programming by contract • Specification of constraints • Freedom for optimization • Active software support • Dynamic optimization • Adaptivity • Transparency
Interface Definitions • Connection • Uni-directional point-to-point link • FIFO ordering • Channel • Graph of connections • Multiplexing unit • Network virtualization Process Connection Channel
Communication model Characteristics • Model • Message passing • Incremental message building • Expressiveness • Control of data blocks by flags • Contract between the programmer and the interface Express
Primitives Main commands • Send • mad_begin_packing • mad_pack • … • mad_pack • mad_end_packing • Receive • mad_begin_unpacking • mad_unpack • … • mad_unpack • mad_end_unpacking
Message building • Commands • mad_pack(cnx, buffer, len, pack_mode, unpack_mode) • mad_unpack(cnx, buffer, len, pack_mode, unpack_mode) • Send contract options (send modes) • Send_CHEAPER • Send_SAFER • Send_LATER • Receive contract options (receive modes) • Receive_CHEAPER • Receive_EXPRESS • Constraints • Strictly symmetrical pack/unpack sequences • Triplets (len, pack_mode, unpack_mode) identical for send and for receive • Data consistency
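The symmetry constraint above can be illustrated with a small in-memory model. This is only a sketch in Python, not the Madeleine C API: the class `ToyConnection` and its method names are hypothetical, and the toy records each (len, pack_mode, unpack_mode) triplet purely so that it can check the receiver's sequence against the sender's.

```python
from collections import deque

# Mode names come from the slide; everything else is a hypothetical toy model.
SEND_MODES = {"Send_SAFER", "Send_LATER", "Send_CHEAPER"}
RECEIVE_MODES = {"Receive_EXPRESS", "Receive_CHEAPER"}


class ToyConnection:
    """In-memory stand-in for a uni-directional FIFO connection."""

    def __init__(self):
        self._wire = deque()    # complete messages in flight, FIFO order
        self._outgoing = None   # blocks of the message being packed
        self._incoming = None   # blocks of the message being unpacked

    def begin_packing(self):
        self._outgoing = []

    def pack(self, data, pack_mode, unpack_mode):
        if pack_mode not in SEND_MODES or unpack_mode not in RECEIVE_MODES:
            raise ValueError("unknown mode")
        self._outgoing.append((len(data), pack_mode, unpack_mode, bytes(data)))

    def end_packing(self):
        self._wire.append(self._outgoing)
        self._outgoing = None

    def begin_unpacking(self):
        self._incoming = deque(self._wire.popleft())

    def unpack(self, length, pack_mode, unpack_mode):
        sent_len, sent_pack, sent_unpack, payload = self._incoming.popleft()
        # The receiver must repeat exactly the triplet used by the sender.
        if (sent_len, sent_pack, sent_unpack) != (length, pack_mode, unpack_mode):
            raise ValueError("pack/unpack sequences are not symmetrical")
        return payload

    def end_unpacking(self):
        if self._incoming:
            raise ValueError("some packed blocks were never unpacked")
        self._incoming = None


# Usage: a 1-byte header (EXPRESS) tells the receiver how long the body is.
cnx = ToyConnection()
cnx.begin_packing()
cnx.pack(b"\x05", "Send_CHEAPER", "Receive_EXPRESS")
cnx.pack(b"hello", "Send_CHEAPER", "Receive_CHEAPER")
cnx.end_packing()

cnx.begin_unpacking()
size = cnx.unpack(1, "Send_CHEAPER", "Receive_EXPRESS")[0]
body = cnx.unpack(size, "Send_CHEAPER", "Receive_CHEAPER")
cnx.end_unpacking()
```

The usage part mirrors a common pattern: the header is unpacked in express mode because its value is needed to issue the next unpack call.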
Send Send_SAFER Send_LATER Send_CHEAPER Pack Modification ? End_packing
Contract between the programmer and the interface Send_SAFER / Send_LATER / Send_CHEAPER • Control of data transfer • Amount of optimization • Promises made by the programmer • Data consistency • Special services • Delayed send • Buffer reuse • Specification at the semantic level • Independence of request and implementation
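One way to read the three send modes is in terms of what the receiver sees when the application touches the buffer between pack and end_packing. The sketch below is a Python toy under that reading, not Madeleine code: `toy_send` is a hypothetical function, and the explicit check for Send_CHEAPER only exists to make the broken promise visible (a real implementation would not be expected to detect it).

```python
def toy_send(buffer, mode, modify):
    """Return the bytes a receiver would see for one packed block.

    buffer: bytearray owned by the application
    mode:   "Send_SAFER", "Send_LATER" or "Send_CHEAPER"
    modify: callable run on the buffer between pack and end_packing
    """
    # pack: SAFER captures the value now, so the buffer can be reused at once.
    snapshot = bytes(buffer) if mode == "Send_SAFER" else None
    at_pack_time = bytes(buffer)

    # Application code running between pack and end_packing.
    modify(buffer)

    # end_packing: the data has left the application by the time this returns.
    if mode == "Send_SAFER":
        return snapshot
    if mode == "Send_LATER":
        # Delayed send: the value is read as late as possible.
        return bytes(buffer)
    if mode == "Send_CHEAPER":
        # The programmer promised not to modify the buffer; the toy checks it.
        if bytes(buffer) != at_pack_time:
            raise RuntimeError("contract broken: buffer modified under Send_CHEAPER")
        return bytes(buffer)
    raise ValueError(mode)


def overwrite(buf):
    buf[:] = b"new"
```

With a buffer holding `b"old"` and `overwrite` as the modification, the toy returns the old value for Send_SAFER and the new value for Send_LATER, and raises for Send_CHEAPER.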
Data available Receive Receive_EXPRESS Receive_CHEAPER Unpack After Unpack Data available Availability? End_unpacking
Message structuring Receive_CHEAPER / Receive_EXPRESS • Receive_EXPRESS • Mandatory immediate receive • Interpretation/extraction of message • Receive_CHEAPER • Free reception of block • Message contents Express
Organization Two-layered model • Buffer management • Data processing code reuse • Hardware abstraction Modular approach • Buffer management modules • Drivers • Transmission modules Interface BMM BMM Buffer management Driver Driver TM TM TM Network management Network
Drivers Network management layer • Data transfers • Send, receive • Group transfers • Transfer method selection • Choice function
Transmission modules • Depend on the network • One module per transfer method • GM driver: 2 TM • BIP driver: 2 TM • SCI driver: 3 TM • VIA driver: 3 TM • Each associated with a buffer management module
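The driver's choice function picks one transmission module (TM) per packed block. A minimal Python sketch of the idea follows; the module names, the 512-byte threshold, and the `choose_tm` signature are all assumptions made for illustration, and the real selection criteria are driver-specific.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransmissionModule:
    name: str
    max_len: int  # largest block (bytes) this module is meant for; hypothetical


# Hypothetical driver exposing two transfer methods, ordered by preference.
TOY_DRIVER = (
    TransmissionModule("short-message", max_len=512),
    TransmissionModule("bulk", max_len=2**31),
)


def choose_tm(modules, length, pack_mode, unpack_mode):
    """Return the first module that accepts a block of `length` bytes.

    The modes are part of the request, but this toy only looks at the
    length; a real choice function could also take the modes into account.
    """
    for tm in modules:
        if length <= tm.max_len:
            return tm
    raise ValueError("no transmission module accepts this block")
```

Since each TM is associated with a buffer management module, selecting the TM also determines how the block is buffered.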
Pack Transmission modules Madeleine Interface BMM TM BMM TM Thread Network Process
Buffers Generic management layer • Virtual buffers • Static • Dynamic • Groups • Aggregations • Splitting
Buffer management modules • Buffer type • Static/dynamic • Aggregation mode • Without • Sequential aggregation • Half-sequential aggregation • Aggregation shape • Symmetrical/non-symmetrical
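As an illustration of aggregation and splitting with static buffers, the Python sketch below copies user blocks, in order, into fixed-size buffers and emits a buffer each time one fills up. The function name `aggregate_into_static_buffers` and the capacity are hypothetical; this is one plausible reading of sequential aggregation, not the library's algorithm.

```python
def aggregate_into_static_buffers(blocks, capacity):
    """Copy `blocks` (bytes objects) in order into buffers of `capacity` bytes.

    Small blocks are aggregated into the same buffer; a block that does not
    fit in the remaining room is split across consecutive buffers.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    buffers = []
    current = bytearray()
    for block in blocks:
        offset = 0
        while offset < len(block):
            room = capacity - len(current)
            chunk = block[offset:offset + room]
            current.extend(chunk)
            offset += len(chunk)
            if len(current) == capacity:
                # Buffer full: hand it over and start a new one.
                buffers.append(bytes(current))
                current = bytearray()
    if current:
        # Flush the partially filled last buffer.
        buffers.append(bytes(current))
    return buffers
```

For example, the blocks `b"ab"`, `b"cde"`, `b"f"` with 4-byte buffers end up as `b"abcd"` followed by `b"ef"`.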
Status Implementation • Network drivers • Quadrics, MX, GM, SISCI, MPI, TCP, VRP • VIA, UDP, SBP, BIP • Distribution • Licence: GPL • Availability • Linux: IA32, IA64, x86-64, Alpha, Sparc, PowerPC • MacOS/X: G4 • Solaris: IA32, Sparc • Aix: PowerPC • Windows NT: IA32
Tests – current platform Test environment • Cluster of dual-Pentium IV HT 2.66 GHz PCs, 1 GB • Gigabit Ethernet • SISCI/SCI • MX & GM / Myrinet • Quadrics Elan4 Testing procedure • Test: 1000 x (send + receive) • Result: ½ x average of 5 tests
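The testing procedure can be written out as arithmetic. The Python sketch below assumes that each test times 1000 round trips (send + receive) and that the ½ factor turns the mean round-trip time into a one-way latency; the function names and the sample durations are made up for illustration.

```python
def one_way_latency_us(test_durations_s, round_trips=1000):
    """Half the mean round-trip time, in microseconds.

    test_durations_s: wall-clock duration (seconds) of each test, where one
    test is `round_trips` iterations of (send + receive).
    """
    per_round_trip = [d / round_trips for d in test_durations_s]
    mean_round_trip_s = sum(per_round_trip) / len(per_round_trip)
    return 0.5 * mean_round_trip_s * 1e6


def bandwidth_mb_per_s(packet_size_bytes, latency_us):
    """Bytes per microsecond equals MB/s (with 1 MB = 10**6 bytes)."""
    return packet_size_bytes / latency_us


# Made-up example: five tests of 10 ms each, i.e. 10 µs per round trip.
example_latency = one_way_latency_us([0.010] * 5)
```

With these made-up durations the one-way latency comes out at about 5 µs; a 1,000,000-byte packet with a 5000 µs one-way time would correspond to 200 MB/s.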
Latency Latency (µs) Packet size (bytes)
Bandwidth Bandwidth (MB/s) Packet size (bytes)
Tests – older platform Testing environment • Cluster of dual-Pentium II 450 MHz PCs, 128 MB • Fast Ethernet • SISCI/SCI • BIP/Myrinet Testing procedure • Test: 1000 x (send + receive) • Result: ½ x average of 5 tests
SISCI/SCI – latency Latency (µs) Packet size (bytes)
SISCI/SCI – bandwidth Bandwidth (MB/s) Packet size (bytes)
SISCI/SCI – latency Packs/messages Latency (µs) Packet size (bytes)
SISCI/SCI – bandwidth Packs/messages Bandwidth (MB/s) Packet size (bytes)
Users – MPICH/Madeleine • MPI API • Generic interface: point-to-point communication, collective communication, group building • Abstract Device Interface (ADI) • Generic interface: data type management, request queue management • CH_MAD, SMP_PLUG, CH_SELF • Communication, local communication, local loops • Polling loops • Internal MPICH protocols • Madeleine • Communication • Multi-protocol support • TCP, UDP, BIP, SISCI, GM, MX, QSNET
MPICH/Mad/SCI – Latency Latency (µs) Packet size (bytes)
MPICH/Mad/SCI – bandwidth Bandwidth (MB/s) Packet size (bytes)
Users – Padico Application MPI ORB JVM Circuit VSock Padico Net Access Thread Padico manager micro-kernel Padico Core Padico Task Manager Madeleine Marcel Communication TCP UDP BIP SISCI GM MX QSNET Multi-protocol support
Padico – latency Latency (µs) Packet size (bytes)
Padico – bandwidth Bandwidth (MB/s) Packet size (bytes)
Conclusion Unified communication support • Abstract interface • Contract-based programming • Modular/adaptive architecture • Dynamic optimization • Transparent multi-cluster support
On-going/future work Programming interface • Message structuring • Exploitation of near-future information • Reduction of pathological cases • Fault tolerance Communication sequence processing • Code specialization, compilation Session management • Deployment • Dynamicity • Fault tolerance • Scaling
? Madeleine I Madeleine II Madeleine III Madeleine IV
Some limitations of Madeleine (version III) Objectives for a new Madeleine • Some optimizations are out of reach for Madeleine • The optimization range is too narrow • Need information about what is coming in the near future • Need to be more liberal in allowing permutations in the packet flow • Optimization strategies involve too much work from the driver programmer • Need to share more of the strategic code • Need to easily evaluate and even mix various strategies • Optimization sequences are synchronous with the application program • Need to synchronize optimization sequences with the NIC
Proposal: Madeleine IV Constraints Tracks Tactics Optimizer thread Sender thread Hardware-specific parameters Driver Strategies Network Optimizer thread
Concepts Definitions • Tracks • Mapping onto hardware multiplexing units (tags) • Main track • Control packets, small packets, … • Optional auxiliary tracks • Other traffic (large messages, …) • Tactics • Basic optimization operations • Permutation, aggregation, piggybacking, association, splitting, track change • Strategies • Set of tactics towards one optimization goal • Constraints • Tactics compatibility • Send/receive modes
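To make the relationship between tactics, strategies and constraints concrete, here is a Python sketch of a single permutation tactic. The tactic (`small_first`), the `Packet` type and the `apply_strategy` helper are hypothetical; the only constraint modelled is FIFO ordering inside each flow, taken from the connection definition earlier in the deck.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    flow: int       # identifies the connection the packet belongs to
    payload: bytes


def small_first(queue):
    """Permutation tactic: send the smallest available packet first.

    Only the head of each flow is a candidate, so the order of packets
    inside a flow is never changed.
    """
    flows = {}
    for index, pkt in enumerate(queue):
        flows.setdefault(pkt.flow, deque()).append((index, pkt))
    reordered = []
    while flows:
        # Pick the flow whose head packet is smallest (ties: oldest first).
        best = min(flows, key=lambda f: (len(flows[f][0][1].payload), flows[f][0][0]))
        reordered.append(flows[best].popleft()[1])
        if not flows[best]:
            del flows[best]
    return reordered


def apply_strategy(queue, tactics):
    """A strategy, in this toy, is just a sequence of tactics applied in order."""
    for tactic in tactics:
        queue = tactic(queue)
    return queue
```

In this model a small packet from one flow can overtake a large packet from another flow, but never an earlier packet of its own flow.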
Packet headers Giving up a little bit of raw efficiency to get much more flexibility • Opportunistic packet aggregation/permutation • Inside a single packet flow • Across multiple packet flows • Side effects • Control packets • Rendezvous • ACKs • Piggybacking • Multiplexing
Concurrent communication progression Communication scheduling • The NIC is responsible for requesting work • Packets are built when the NIC is ready • The optimizer gets more time to gather up-to-date optimization clues
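The scheduling idea, where packets are only built once the NIC asks for work, can be sketched as a queue that is drained on demand. This Python toy is an assumption-laden illustration: `ToyOptimizer`, `submit`, `next_frame` and the 64-byte frame limit are hypothetical names and values, and a real optimizer would apply richer tactics than plain aggregation.

```python
from collections import deque


class ToyOptimizer:
    """Collects payloads and builds a frame only when the NIC is ready."""

    def __init__(self, max_frame):
        self.pending = deque()
        self.max_frame = max_frame  # hypothetical frame size limit, in bytes

    def submit(self, payload):
        """Called from the application side; nothing is sent yet."""
        self.pending.append(payload)

    def next_frame(self):
        """Called when the NIC requests work; aggregates what is pending now."""
        frame = []
        used = 0
        while self.pending:
            size = len(self.pending[0])
            # Always take at least one payload, then only what still fits.
            if frame and used + size > self.max_frame:
                break
            frame.append(self.pending.popleft())
            used += size
        return frame


opt = ToyOptimizer(max_frame=64)
opt.submit(b"a" * 10)
first = opt.next_frame()            # NIC idle: the lone packet goes out alone
for n in (20, 30, 40):              # NIC busy: packets pile up meanwhile
    opt.submit(b"b" * n)
second = opt.next_frame()           # 20 + 30 bytes fit together, 40 does not
third = opt.next_frame()
```

The longer the NIC stays busy, the more packets accumulate, and the more material the optimizer has to work with when the next request arrives.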
Tests Test environment • Cluster of dual-Pentium IV HT 2.66 GHz PCs, 1 GB • MX / Myrinet Testing procedure • Test: 1000 x (send + receive) • Result: ½ x average of 5 tests
Test – Latency Latency (µs) Packet size (bytes)
Test – Bandwidth Bandwidth (MB/s) Packet size (bytes)
Test – Latency when aggregating short packets Latency (µs) Packet size (bytes)
Opportunistic aggregation on rendezvous (RDV) Aggregating a short packet with an RDV request for a long packet • No gain with MX/Myrinet • Madeleine III • Latency: 310 µs • Bandwidth: 201 MB/s • Madeleine IV • Latency: 314 µs • Bandwidth: 200 MB/s • MX flow control gets in the way
Conclusion • A new architecture for optimizing communication • Wider optimization spectrum • Better interactions between software and hardware • A platform for experimenting with optimizations • Optimization tactics • A prototype implemented on top of MX/Myrinet • Proof of concept
On-going and future work • Optimization • Tactic combinations • Automatic strategy selection • External strategies (plug-ins) • Interface expressiveness • Extended packs • One-sided communication • Load-balancing, multi-rail • Benefit from all available links