Interconnects for more than MPI David Greenberg (special thanks to Duncan Roweth of Quadrics and Bill Carlson of CCS)
Requirements • Last five years • Low latency - aim for 10 µs • High bandwidth - aim for 100 MB/s • Support MPI - ad hoc (e.g. collective support) • Need to start aiming for • Support for low-overhead protocols (e.g. UPC) • Low latency + overhead - aim for 1 µs + 100 ns • Scalable bandwidth - match the memory bus on the node, plus full bisection bandwidth
Design suggestions • NIC does VM translation (possibly aided by connection table) • Load-store model: • extend write-queue • expand TLB span • support split transaction loads • Error check network, retry once, signal app • Source routing
Available today • T3E: meets most needs, vendor specific, discontinued, some scaling issues • ASCI Red: good for MPI, scales well, no RDMA, vendor specific, discontinued • Myrinet: good bandwidth, vendor neutral, potential for custom mods, some scaling issues, overhead and latency so-so with GM • VIA: not ready for HPC • Quadrics: see next slide
Quadrics in my humble opinion • Design is on the right track • Bandwidth fine • Latency limited by the PCI bus • Overhead very good • Scalability currently suspect but fixable • Currently tied to Sun and Compaq, but a Linux port is coming soon • Reliability, availability, serviceability good • Price remains to be seen
Quadrics: Performance Overview • Line rate 100 MB/s • Peak data rate (adapter memory) 340 MB/s • Compaq DS20, 33 MHz/64-bit PCI: 200 MB/s • Sun E450, 66 MHz/64-bit PCI: 165 MB/s • DMA write, 32 bytes: 2.5 µs • MPI send: 5 µs
UPC on Quadrics: early numbers • Saturation bandwidth = 169 MB/s one-way, 68 MB/s two-way (one-way/two-way traffic-pattern diagram omitted)
UPC on T3E as comparison • Production speeds on a large (>100-node) machine. Times are in ns per word transferred.
Breaking down the latency • Reducing latency • Architectural improvements • Main clock speed • Adapter clock speed • PCI performance - don't expect much • Direct connect • Network performance
Bi-Sectional Bandwidth • Full fat tree: O(N) bi-sectional bandwidth • 340 MB/s per link in each direction • PCI bridge is critical - your mileage may vary • Headroom in the network reduces contention
Quadrics: Software Overview • Operating systems: Solaris 2.6, Digital Unix 5.0 • Applications development: standard compilers, MPI, Cray SHMEM, TotalView, Vampir
Quadrics: Linux Status & Plans • Under development with a key customer • Elan driver and libraries running, IP running, RMS port in progress • Will GPL the base Elan driver and kernel changes • URL of Linux driver coming soon. • Quadrics' list of necessary Linux modifications • Elan-specific: VM hooks to map the Elan command port, callbacks notifying Elan of PTE load/unload • Generic: fork/exec/exit callbacks, system calls in loadable modules