Madeleine

Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Objective Rational task assignment in high-performance communication stacks Software stack (diagram): Application • Programming environment • Middle-level interface • Low-level interface • Network Roles (diagram): Model • Abstraction • Hardware control

A communication support for clusters and multi-clusters Madeleine

Features Abstract interface • Programming by contract • Specification of constraints • Freedom for optimization • Active software support • Dynamic optimization • Adaptivity • Transparency

Interface Definitions • Connection • Uni-directional point-to-point link • FIFO ordering • Channel • Graph of connections • Multiplexing unit • Network virtualization Process Connection Channel

Communication model Characteristics • Model • Message passing • Incremental message building • Expressiveness • Control of data blocks by flags • Contract between the programmer and the interface Express

Primitives Main commands • Send • mad_begin_packing • mad_pack • … • mad_pack • mad_end_packing • Receive • mad_begin_unpacking • mad_unpack • … • mad_unpack • mad_end_unpacking

Message building • Commands • mad_pack(cnx, buffer, len, pack_mode, unpack_mode) • mad_unpack(cnx, buffer, len, pack_mode, unpack_mode) • Send contract options (send modes) • Send_CHEAPER • Send_SAFER • Send_LATER • Receive contract options (receive modes) • Receive_CHEAPER • Receive_EXPRESS • Constraints • Strictly symmetrical pack/unpack sequences • Triplets (len, pack_mode, unpack_mode) identical for send and for receive • Data consistency
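The symmetry constraint above can be illustrated with a toy in-memory model. This is only a sketch in Python, not the Madeleine C API: the `Connection` class, its `pack`/`unpack` methods and the explicit mismatch check are assumptions made for illustration (the slides do not say whether the real library detects asymmetric sequences).

```python
from collections import deque


class Connection:
    """Toy stand-in for a connection: a FIFO of packed blocks (hypothetical)."""

    def __init__(self):
        # Each entry is (len, pack_mode, unpack_mode, payload).
        self._wire = deque()

    def pack(self, data, pack_mode, unpack_mode):
        self._wire.append((len(data), pack_mode, unpack_mode, bytes(data)))

    def unpack(self, length, pack_mode, unpack_mode):
        sent_len, sent_pack, sent_unpack, payload = self._wire.popleft()
        # The receiver must present the same triplet as the sender.
        if (sent_len, sent_pack, sent_unpack) != (length, pack_mode, unpack_mode):
            raise ValueError("asymmetric pack/unpack sequence")
        return payload


cnx = Connection()
# Sender: a one-byte size header, then the payload.
cnx.pack(b"\x05", "Send_CHEAPER", "Receive_EXPRESS")
cnx.pack(b"hello", "Send_CHEAPER", "Receive_CHEAPER")

# Receiver: the same sequence of triplets, in the same order.
size = cnx.unpack(1, "Send_CHEAPER", "Receive_EXPRESS")[0]
body = cnx.unpack(size, "Send_CHEAPER", "Receive_CHEAPER")
```

Because no length or mode travels with the data in this model, the receiver can only interpret the stream if it mirrors the sender exactly, which is the point of the constraint.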

Send (diagram): Send_SAFER • Send_LATER • Send_CHEAPER • Timeline: Pack, Modification?, End_packing

Contract between the programmer and the interface Send_SAFER / Send_LATER / Send_CHEAPER • Control of data transfer • Optimization amount • Promises made by the programmer • Data consistency • Special services • Delayed send • Buffer reuse • Specification at the semantic level • Independence: request / implementation
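One way to read the three send modes is as different answers to "when is the buffer read?". The sketch below is a toy Python model under that assumption: `Sender`, `pack` and `end_packing` are hypothetical names, and treating Send_CHEAPER like Send_LATER is a simplification (under Send_CHEAPER the programmer promises not to modify the buffer, so a real implementation may read it at any point before the end of packing).

```python
class Sender:
    """Toy model of send-side contracts (hypothetical, not the C API)."""

    def __init__(self):
        self._pending = []

    def pack(self, buf, mode):
        if mode == "Send_SAFER":
            # Snapshot now: the caller may reuse the buffer immediately.
            self._pending.append(bytes(buf))
        elif mode in ("Send_LATER", "Send_CHEAPER"):
            # Keep a reference: the contents are read at end_packing.
            self._pending.append(buf)
        else:
            raise ValueError(f"unknown send mode: {mode}")

    def end_packing(self):
        message = [bytes(b) for b in self._pending]
        self._pending.clear()
        return message


safer = bytearray(b"old")
later = bytearray(b"old")

s = Sender()
s.pack(safer, "Send_SAFER")
s.pack(later, "Send_LATER")

# Both buffers are modified between pack and end_packing.
safer[:] = b"new"
later[:] = b"new"

message = s.end_packing()
```

In this model the Send_SAFER block still carries the value it had at pack time, while the Send_LATER block carries the value present at the end of packing.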

Receive (diagram): Receive_EXPRESS: data available after Unpack • Receive_CHEAPER: availability? after Unpack, data available after End_unpacking

Message structuring Receive_CHEAPER / Receive_EXPRESS • Receive_EXPRESS • Mandatory immediate receive • Interpretation/extraction of message • Receive_CHEAPER • Free reception of block • Message contents Express
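The difference between the two receive modes can be sketched with a toy receiver. This is an illustrative Python model, not the Madeleine API: `Receiver`, `unpack` and `end_unpacking` are hypothetical, and deferring every Receive_CHEAPER block to the very end is just one behaviour the contract would allow.

```python
class Receiver:
    """Toy model of receive-side contracts (hypothetical, not the C API)."""

    def __init__(self, incoming):
        self._incoming = list(incoming)  # payloads in arrival order
        self._deferred = []              # (destination, payload) pairs

    def unpack(self, dest, mode):
        payload = self._incoming.pop(0)
        if mode == "Receive_EXPRESS":
            # Data must be usable as soon as unpack returns.
            dest[:] = payload
        elif mode == "Receive_CHEAPER":
            # Delivery may be postponed until end_unpacking.
            self._deferred.append((dest, payload))
        else:
            raise ValueError(f"unknown receive mode: {mode}")

    def end_unpacking(self):
        for dest, payload in self._deferred:
            dest[:] = payload
        self._deferred.clear()


rx = Receiver([b"\x05", b"hello"])

header = bytearray(1)
rx.unpack(header, "Receive_EXPRESS")   # needed now, to size the body buffer

body = bytearray(header[0])
rx.unpack(body, "Receive_CHEAPER")     # contents not guaranteed yet
body_before = bytes(body)

rx.end_unpacking()                     # from here on, body is complete
```

The express header is what lets the receiver interpret the rest of the message; the body can then be received in whatever way is cheapest.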

Organization Two-layered model • Buffer management • Data processing code reuse • Hardware abstraction Modular approach • Buffer management modules • Drivers • Transmission modules Diagram: Interface • Buffer management (BMM, BMM) • Network management (Driver, Driver; TM, TM, TM) • Network

Drivers Network management layer • Data transfers • Send, receive • Group transfers • Transfer method selection • Choice function

Transmission modules • Depend on the network • One module per transfer method • GM driver: 2 TM • BIP driver: 2 TM • SCI driver: 3 TM • VIA driver: 3 TM • Associated with a buffer management module
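A driver's choice function might look like the following sketch. The slides do not describe the selection criteria, so everything here is assumed: selection by block length only, the threshold values, and the module names `tm_short`, `tm_regular` and `tm_bulk` (three modules, loosely mirroring the count given for the SCI driver).

```python
def choose_tm(length, thresholds=((64, "tm_short"), (8192, "tm_regular"))):
    """Pick a transmission module for a block of `length` bytes.

    Thresholds and module names are hypothetical placeholders.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    for limit, tm in thresholds:
        if length <= limit:
            return tm
    return "tm_bulk"  # fallback for anything above the last threshold


small = choose_tm(16)
medium = choose_tm(4096)
large = choose_tm(1 << 20)
```

Each returned module would then be paired with the buffer management module it is associated with.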

Pack Transmission modules Madeleine Interface BMM TM BMM TM Thread Network Process

Buffers Generic management layer • Virtual buffers • Static • Dynamic • Groups • Aggregations • Splitting

Buffer management modules • Buffer type • Static/dynamic • Aggregation mode • Without • Sequential aggregation • Half-sequential aggregation • Aggregation shape • Symmetrical/non-symmetrical
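As a rough illustration of aggregation into static buffers, the sketch below copies successive blocks, in order, into fixed-capacity buffers and starts a new buffer when the next block no longer fits. The function name, the capacity parameter and this particular reading of "sequential aggregation" are assumptions, not taken from the slides.

```python
def aggregate_sequentially(blocks, capacity):
    """Group blocks, in order, into buffers of at most `capacity` bytes."""
    buffers = []
    current = bytearray()
    for block in blocks:
        if len(block) > capacity:
            raise ValueError("block larger than a static buffer")
        if len(current) + len(block) > capacity:
            # The next block does not fit: flush and start a new buffer.
            buffers.append(bytes(current))
            current = bytearray()
        current += block
    if current:
        buffers.append(bytes(current))
    return buffers


buffers = aggregate_sequentially([b"ab", b"cd", b"efg"], capacity=4)
```

Here the first two blocks share one buffer and the third goes into a second one, so three packs cost two transfers.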

Status Implementation • Network drivers: Quadrics, MX, GM, SISCI, MPI, TCP, VRP, VIA, UDP, SBP, BIP • Distribution: GPL licence • Availability: Linux (IA32, IA64, x86-64, Alpha, Sparc, PowerPC), MacOS/X (G4), Solaris (IA32, Sparc), AIX (PowerPC), Windows NT (IA32)

Tests – current platform Test environment • Cluster of PC bi-Pentium IV HT 2.66 GHz, 1 GB • Giga-Ethernet • SISCI/SCI • MX & GM /Myrinet • Quadrics Elan4 Testing procedure • Test: 1000 x (send + receive) • Result: ½ x average of 5 tests
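The testing procedure can be written out as a small calculation. This sketch assumes that each test measures the total time of 1000 send + receive round trips and that the ½ factor converts round-trip time into one-way latency; the function name and the sample timings are made up for illustration.

```python
def one_way_latency_us(test_times_s, iterations=1000):
    """Half the average per-iteration round-trip time, in microseconds."""
    per_round_trip = [t / iterations for t in test_times_s]
    average = sum(per_round_trip) / len(per_round_trip)
    return 0.5 * average * 1e6


# Five hypothetical tests, each taking 20 ms for 1000 round trips.
latency = one_way_latency_us([0.020, 0.020, 0.020, 0.020, 0.020])
```

With these sample numbers each round trip takes 20 µs, so the reported latency would be 10 µs.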

Latency Latency (µs) Packet size (bytes)

Bandwidth Bandwidth (MB/s) Packet size (bytes)

Tests – older platform Testing environments • Cluster of PC bi-Pentium II 450 MHz, 128 MB • Fast-Ethernet • SISCI/SCI • BIP/Myrinet Testing procedure • Test: 1000 x (send + receive) • Result: ½ x average of 5 tests

SISCI/SCI – latency Latency (µs) Packet size (bytes)

SISCI/SCI – bandwidth Bandwidth (MB/s) Packet size (bytes)

SISCI/SCI – latency, packs/messages Latency (µs) Packet size (bytes)

SISCI/SCI – bandwidth, packs/messages Bandwidth (MB/s) Packet size (bytes)

Users – MPICH/Madeleine API MPI Generic interface: point-to-point communication, collective communication, group building Abstract Device Interface (ADI) Generic interface: data type management, request queue management CH_MAD SMP_PLUG CH_SELF Communication Local communication Local loops Polling loops Internal MPICH protocols Madeleine Communication TCP UDP BIP SISCI GM MX QSNET Multi-protocol support

MPICH/Mad/SCI – Latency Latency (µs) Packet size (bytes)

MPICH/Mad/SCI – bandwidth Bandwidth (MB/s) Packet size (bytes)

Users – Padico Application MPI ORB JVM Circuit VSock Padico Net Access Thread Padico manager micro-kernel Padico Core Padico Task Manager Madeleine Marcel Communication TCP UDP BIP SISCI GM MX QSNET Multi-protocol support

Padico – latency Latency (µs) Packet size (bytes)

Padico – bandwidth Bandwidth (MB/s) Packet size (bytes)

Conclusion Unified communication support • Abstract interface • Contract-based programming • Modular/adaptive architecture • Dynamic optimization • Transparent multi-cluster support

On-going/future work Programming interface • Message structuring • Near-future information exploitation • Pathological cases reduction • Fault tolerance Communication sequences processing • Code specialization, compilation Session management • Deployment • Dynamicity • Fault tolerance • Scaling

? Madeleine I Madeleine II Madeleine III Madeleine IV

Some limitations of Madeleine (version III) Objectives for a new Madeleine • Some optimizations are out of reach for Madeleine • The optimization range is too narrow • Need information about what is coming in the near future • Need to be more liberal in allowing permutations in the packet flow • Optimization strategies involve too much work from the driver programmer • Need to share more of the strategic code • Need to easily evaluate and even mix various strategies • Optimization sequences are synchronous with the application program • Need to synchronize optimization sequences with the NIC

Proposal: Madeleine IV Diagram: Constraints • Tracks • Tactics • Strategies • Optimizer thread • Sender thread • Hardware-specific parameters • Driver • Network

Concepts Definitions • Tracks • Hardware multiplexing units mapping (tags) • Main track • Control packets, small packets, … • Optional auxiliary tracks • Other traffic (large messages, …) • Tactics • Basic optimization operations • Permutation, aggregation, piggybacking, association, splitting, track change • Strategies • Set of tactics towards one optimization goal • Constraints • Tactics compatibility • Send/receive modes
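The relationship between tactics and strategies can be sketched as small packet-list transformations composed in order. This is an illustrative Python model only: the `Packet` fields, the two tactics shown, the size limits and `apply_strategy` are all hypothetical and do not describe the Madeleine IV implementation.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dest: int
    size: int
    track: str = "main"


def track_change(packets, large=32768):
    """Tactic: move large packets to an auxiliary track (assumed limit)."""
    return [Packet(p.dest, p.size, "aux" if p.size >= large else p.track)
            for p in packets]


def aggregate_small(packets, limit=1024):
    """Tactic: merge adjacent main-track packets bound for the same peer."""
    out = []
    for p in packets:
        last = out[-1] if out else None
        if (last is not None and last.dest == p.dest
                and last.track == "main" and p.track == "main"
                and last.size + p.size <= limit):
            out[-1] = Packet(last.dest, last.size + p.size, "main")
        else:
            out.append(p)
    return out


def apply_strategy(packets, tactics):
    """A strategy here is simply an ordered list of tactics."""
    for tactic in tactics:
        packets = tactic(packets)
    return packets


flow = [Packet(1, 100), Packet(1, 200), Packet(1, 65536), Packet(2, 50)]
result = apply_strategy(flow, [track_change, aggregate_small])
```

In this run the two small packets for peer 1 are merged on the main track, the large packet moves to the auxiliary track, and the packet for peer 2 is left alone. Constraints would restrict which tactics may be applied to which packets; they are not modelled here.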

Packet headers Giving up a little bit of raw efficiency to get much more flexibility • Opportunistic packet aggregation/permutation • Inside a single packet flow • Across multiple packet flows • Side effects • Control packets • Rendez-vous • ACKs • Piggybacking • Multiplexing

Concurrent communication progression Communication scheduling • The NIC is responsible for requesting work • Packets are built when the NIC is ready • The optimizer gets more time to gather up-to-date optimization clues
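NIC-driven scheduling can be sketched as a backlog that is only turned into a wire packet when the NIC asks for work. The Python model below is a simplification under stated assumptions: `Optimizer`, `submit`, `on_nic_ready` and the `mtu` limit are hypothetical names, and plain concatenation stands in for whatever the optimizer would actually do.

```python
from collections import deque


class Optimizer:
    """Toy model: packets are built when the NIC is ready, not at submit time."""

    def __init__(self, mtu):
        self.mtu = mtu
        self.backlog = deque()

    def submit(self, payload):
        # Application side: enqueue only, nothing is sent yet.
        self.backlog.append(payload)

    def on_nic_ready(self):
        """Called when the NIC requests work; returns one wire packet."""
        wire = bytearray()
        while self.backlog and len(wire) + len(self.backlog[0]) <= self.mtu:
            wire += self.backlog.popleft()
        if not wire and self.backlog:
            # An oversized payload goes out alone, unsplit, in this sketch.
            wire += self.backlog.popleft()
        return bytes(wire)


opt = Optimizer(mtu=8)
# Three submissions arrive while the NIC is still busy.
opt.submit(b"ab")
opt.submit(b"cd")
opt.submit(b"efghi")

first = opt.on_nic_ready()
second = opt.on_nic_ready()
third = opt.on_nic_ready()
```

The longer the NIC stays busy, the more submissions accumulate in the backlog, which is what gives the optimizer more information to work with when the next packet is built.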

Tests Test environment • Cluster of PC bi-Pentium IV HT 2.66 GHz, 1 GB • MX / Myrinet Testing procedure • Test: 1000 x (send + receive) • Result: ½ x average of 5 tests

Test – Latency Latency (µs) Packet size (bytes)

Test – Bandwidth Bandwidth (MB/s) Packet size (bytes)

Test – Latency when aggregating short packets Latency (µs) Packet size (bytes)

Opportunistic aggregation on RDV Aggregating a short packet with a RDV request for a long packet • No gain with MX/Myrinet • Madeleine III • Latency: 310 µs • Bandwidth: 201 MB/s • Madeleine IV • Latency: 314 µs • Bandwidth: 200 MB/s • MX flow control gets in the way

Conclusion • A new architecture for optimizing communication • Wider optimization spectrum • Better interactions between software and hardware • A platform for experimenting with optimizations • Optimization tactics • A prototype implemented on top of MX/Myrinet • Proof of concept

On-going and future work • Optimization • Tactic combinations • Automatic strategy selection • External strategies (plug-ins) • Interface expressiveness • Extended packs • One-sided communication • Load-balancing, multi-rail • Benefit from all available links
