Advanced Parallel Processing in Low-Cost Platforms

EFFICIENT PARALLEL PROCESSING, PROGRAM DEVELOPMENT AND COMMUNICATION IN LOW-COST HIGH PERFORMANCE PLATFORMS
Anguita, M.; Cañas, A.; Díaz, A.F.; Fernández, F.J.; Ortega, J.; Prieto, A.
Department of Computer Architecture and Technology, University of Granada (Spain)
Workshop on Distributed Processing, Transfer, Retrieval, Fusion and Display of Images and Signals: High Resolution and Low Resolution in Data and Information Grids (21-22 February 2003, Granada, Spain)

Outline of the talk
• Introduction: Grid computing
• Communication performance improvement
• CLIC on Fast Ethernet
• CLIC on Gigabit Ethernet
• Grid- and cluster-aware program development environments
• PVMTB and MPITB performance
• Application examples: wavelets, pH control, nanoelectronics

Efficient parallel processing, program development and communication in low-cost high performance platforms
Workshop on Distributed Processing, Transfer, Retrieval, Fusion and Display of Images.. (21-22 February 2003, Granada, Spain)

Introduction: Grid Computing
The goal (from a platform research point of view): to provide transparent access to the available computing resources (including supercomputers, storage systems, ...) and to other geographically distributed devices and scientific instruments via a networked environment.
As the available network bandwidth increases, the location of the computing power becomes less relevant.
It would be possible to use networks of computers as a single computing resource for large-scale applications.

Introduction: our goals in this context
• Efficient exploitation of parallelism (at different levels) in low-cost platforms based on clusters of computers:
  • Improvement of the communication bandwidth available to applications
  • High-level programming environments for parallel program development


Improving communication in Clusters (II)
CLIC (Communication in Linux Clusters) protocol
• Reliable transport system suited for cluster computing
• Developed on Linux (kernel module)
• Optimizes OS support for communication (scheduler, NIC drivers, kernel functions)
• Upper-layer systems (PVM, MPI, …) can be efficiently used on top of CLIC
• CLIC improves communication performance so that user-level applications can take advantage of network features (better latency and bandwidth, broadcast, channel bonding)

Improving communication in Clusters (III)
• CLIC avoids the TCP/IP stack
• LAM-MPI has been efficiently implemented on CLIC

Improving communication in Clusters (V)
• Large improvement with respect to MPI/TCP and PVM/TCP
• MPI/CLIC provides performance similar to that of CLIC itself

Network technology trends: Gigabit networks
Datacenters and networks are moving towards 1-10 Gigabit technologies
[Figure: diagram with clusters, servers and hard disks, Ethernet, Fibre Channel and InfiniBand switches, Fibre Channel and InfiniBand arrays, and links labelled 10 Gigabit/s]

CLIC on Gigabit Ethernet (I)
• Techniques implemented in CLIC to take advantage of gigabit network bandwidths:
  • Jumbo frames: use MTUs longer (up to MTU=9000 bytes) than the Ethernet standard (MTU=1500 bytes)
    • Reduces the number of interrupts and the overhead of communication protocol processing
  • Coalesced interrupts: the NIC interrupts the processor only after a given time interval or a given number of packet arrivals
    • Reduces the number of generated interrupts (at the cost of a delay in reception)
  • 0-copy: data to be sent are copied directly from user memory space to the NIC (to receive data, only one copy is needed)
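
To make the effect of jumbo frames and interrupt coalescing concrete, the sketch below counts the frames and interrupts needed for one message. It is a back-of-the-envelope illustration, not CLIC code: the function names, the header_bytes parameter (protocol header size, unknown here and defaulted to 0), the 1,000,000-byte message and the coalescing factor of 8 frames per interrupt are all assumptions chosen for the example.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def frames_needed(message_bytes: int, mtu: int, header_bytes: int = 0) -> int:
    """Frames required to carry a message when each frame holds
    (mtu - header_bytes) bytes of payload. header_bytes is a
    hypothetical per-frame protocol header size."""
    payload = mtu - header_bytes
    if payload <= 0:
        raise ValueError("header_bytes must be smaller than mtu")
    # Even an empty message needs one frame.
    return max(1, ceil_div(message_bytes, payload))


def interrupts_needed(frames: int, frames_per_interrupt: int = 1) -> int:
    """Interrupts raised if the NIC coalesces a fixed number of frame
    arrivals per interrupt (time-based coalescing is ignored here)."""
    if frames_per_interrupt < 1:
        raise ValueError("frames_per_interrupt must be at least 1")
    return ceil_div(frames, frames_per_interrupt)


if __name__ == "__main__":
    message = 1_000_000  # hypothetical message size in bytes
    for mtu in (1500, 9000):
        frames = frames_needed(message, mtu)
        print(mtu, frames, interrupts_needed(frames, 8))
```

Under these assumptions, moving from MTU=1500 to MTU=9000 cuts the frame count for the same message by roughly a factor of six, and coalescing divides the interrupt count again by the chosen factor.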

CLIC on Gigabit Ethernet (II)
• 2 copies: data to be sent go from user memory to system memory, then another copy is done to build the packets, and then to the NIC
• 1 copy: data to be sent go from user memory to system memory, in order to build the packets, and then to the NIC
• CLIC takes advantage of the new drivers for Gigabit network cards: data to be sent can go directly from user memory to the NIC

CLIC on Gigabit Ethernet (III)
• Latency = 36 μs (messages of 0 bytes)
• 50% of maximum bandwidth: 4 KBytes
Comparison of the bandwidths provided by CLIC on Gigabit Ethernet with 0-copy/1-copy and MTU=9000/1500: using MTU=9000 bytes has more impact than using 0-copy
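
The two figures above can be related through a simple linear cost model. The sketch assumes, and the slide does not state this, that the transfer time of an n-byte message is t0 + n/B_max (the model often attributed to Hockney), that "4 KBytes" means 4096 bytes, and that this is the message size at which half of the maximum bandwidth is reached. The function names are made up for the example, and the resulting asymptotic bandwidth is only what the model implies, not a measured value.

```python
T0_SECONDS = 36e-6    # zero-byte latency quoted on the slide
N_HALF_BYTES = 4096   # assumed reading of "50% of maximum bandwidth: 4 KBytes"


def asymptotic_bandwidth(t0: float, n_half: float) -> float:
    """In the linear model T(n) = t0 + n / B_max, half of B_max is
    reached when n / B_max == t0, hence B_max = n_half / t0 (bytes/s)."""
    return n_half / t0


def effective_bandwidth(n_bytes: float, t0: float, n_half: float) -> float:
    """Bandwidth seen by a message of n_bytes: n / T(n), in bytes/s."""
    b_max = asymptotic_bandwidth(t0, n_half)
    return n_bytes / (t0 + n_bytes / b_max)


if __name__ == "__main__":
    b_max = asymptotic_bandwidth(T0_SECONDS, N_HALF_BYTES)
    print(f"model B_max = {b_max * 8 / 1e6:.1f} Mbit/s")
    for n in (1024, 4096, 65536):
        frac = effective_bandwidth(n, T0_SECONDS, N_HALF_BYTES) / b_max
        print(f"{n:6d} bytes -> {100 * frac:.1f}% of B_max")
```

The model also shows why short messages are latency-bound: below the half-bandwidth size, most of the transfer time is the fixed 36 μs start-up cost.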

Further Improvements: Intelligent NIC
The emergence of fast, cheap embedded processors allows the use of Intelligent Network Interface Cards (INICs), which include one or more processors, to assist communication by offloading protocol processing: the entire communication protocol is configured and moved to the INIC.
• Consequences:
  • The load that communication places on the CPU is reduced
  • Applications can take advantage of overlapping communication and computation
  • The card becomes protocol-aware and can interact with the network without CPU intervention
  • The overall protocol latency is reduced, as short messages do not have to cross the peripheral (PCI) bus and the CPU does not have to service an interrupt and perform a context switch for each one
  • The INIC can transfer data to/from the CPU more efficiently (small messages can be reassembled in the INIC and block-transferred to main memory rather than sent as a sequence of short DMA transfers)
  • Implementing the communication protocols in the INIC helps reduce the effect of the I/O (PCI) bus bottleneck


MATLAB interpreted environment
• M-files
  • interpreted
  • fast prototyping
  • save & run
  • interactive trial and error, integrated debugger
• MEX-files
  • compiled
  • lower-level
  • computing-intensive
  • access to libraries
  • data export/import
  • hardware control
  • intermediate compile/link step, involved configuration, normal debugger, involved breakpoints
[Figure labels: M-file, P-code; C source, C MEX]

Parallel Toolboxes
• Toolbox of MEX files: each PVM/MPI routine has its own MEX file
[Figure: layer diagram with MATLAB application, PVM app and MPI app on top; PVMTB and MPITB; MATLAB, PVM and MPI; Operating System; Network]
PVMTB: 93 cmds (interfaces 86 PVM calls)
MPITB: 153 cmds (interfaces 135 MPI calls)
• Have been used for:
  • signal processing: wavelet transform (UGR, Spain)
  • automatic control: real-time pH control (UNED, Spain)
  • chemical engineering: chemical manufacturing simulation (Carnegie Mellon, USA)
  • nanoelectronics: nanoscale device simulation (CELAB, Purdue, USA)

Performance results
“Performance of Parallel MATLAB Toolboxes”, VecPar’02, Porto, Portugal
• Overhead 20% @ 1500B
• Latency 1.8x

Application: pH control
S. Dormido (UNED, Spain), ASCC’02, Suntec, Singapore: “Dynamic Programming on clusters for solving Control problems”
Cluster “smaug”: 16 Athlon K7 500MHz, 128MB, 7GB HD; server with 20GB HD, 2 NICs, KVM

pH control: results

Application: nanoelectronics
S. Goasguen (Purdue, USA), IEEE-NANO’02, Arlington-VA, USA: “Parallelization of nanoMOS2.0 using a 100-nodes Linux cluster”

nanoMOS: cluster “superman”

nanoMOS: results
[Figure annotations: efficiency = 98.0%, e ~ 95.3%, e ~ 88.8%, e ~ 84.2%]
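
For readers unfamiliar with the metric, the sketch below shows how a parallel efficiency figure is usually computed, assuming the common definition e = T_serial / (P × T_parallel). The slide does not state which definition was used or how many CPUs correspond to each value, so the function names and the timings in the example are purely hypothetical.

```python
def parallel_efficiency(t_serial: float, t_parallel: float, n_procs: int) -> float:
    """Efficiency e = speedup / P, with speedup = t_serial / t_parallel."""
    if n_procs < 1 or t_parallel <= 0:
        raise ValueError("n_procs must be >= 1 and t_parallel > 0")
    return t_serial / (n_procs * t_parallel)


def implied_speedup(efficiency: float, n_procs: int) -> float:
    """Speedup implied by a given efficiency on n_procs processors."""
    return efficiency * n_procs


if __name__ == "__main__":
    # Hypothetical timings, not taken from the talk.
    e = parallel_efficiency(t_serial=100.0, t_parallel=12.5, n_procs=10)
    print(f"efficiency = {100 * e:.1f}%")
    print(f"speedup    = {implied_speedup(e, 10):.1f}")
```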

Conclusions
• Lightweight protocol CLIC:
  • Portable to any Linux machine; benefits from OS resources and Gbps driver features
  • Makes it easier to track technology advances: it is not NIC/CPU dependent
  • Exposes latency/bandwidth improvements at application level (MPI-PVM/CLIC)
  • Foreseen: INICs with embedded processors to offload protocol processing from the CPU and avoid the I/O bus (PCI) bottleneck
• Parallel Toolboxes PVMTB-MPITB:
  • Fast learning and fast prototyping of parallel applications on clusters
  • Useful for research: small overhead, acceptable efficiency even with 120 CPUs
  • Foreseen: efficiency improvement by compiling application M-files (MATLAB compiler) and by linking them against MPI/CLIC and PVM/CLIC
• Grids are made of clusters and other resources. Inside them, we want/need:
  • Communication performance that promptly tracks technology advances (Gbps, INICs)
  • Parallel application development environments that benefit from those improvements
