Building Scalable, High Performance Cluster and Grid Networks: The Role Of Ethernet

Thriveni Movva, CMPS 5433

Overview • About Grids/Clusters • Uses of Grid • Differences between Grids/Clusters • Benefits of Grid • Grid Architecture • Building Ethernet Network for Grids/Clusters • Examples of Ethernet Grids/Clusters • Conclusion/Summary

What Is A Grid Computer? • Hardware and software system • Integrates a collection of distributed system components • Computer systems • Storage, etc. • Solves large-scale computation problems • Appears to the user as a single, large “virtualized” computing system • Consists of geographically dispersed computers

What is a Cluster? • Multiprocessor system consisting of co-located computers and storage • Viewed as though it were a single computer • Connected through fast local area networks (localized within a room or building) • Provides more speed and/or reliability than a single computer • More cost-effective than a single computer of comparable speed or reliability

Uses of Grid Computing • Computer systems and other resources • Need not be dedicated to individual users or applications • Can be made available for dynamic pooling/sharing according to changing needs • Over the Internet, Grid-based resource sharing and collaborative problem solving can be extended to multi-institutional “Virtual Organizations”

Differences between Grids/Clusters • Grids: • dispersed over a local, metropolitan, or wide area network (LAN/MAN/WAN) • span administrative boundaries • focus on problems in distributed computing and resource sharing • distribute workloads among different machine types and operating systems • Clusters: • localized within a room or building • single administration • focus on compute-intensive problems and HPC • homogeneous (single type of processor and OS)

Benefits Of The Grid • Grid computing offers a number of potential uses and benefits, which can be broadly categorized as follows: • High Performance Computing (HPC) • Data Federation and Collaboration • Resource Allocation and Optimization

High Performance Computing (HPC) • Computationally intensive, parallelizable applications benefit the most • Uses an array of numerous commodity or specialized computer systems • Most applications of the Grid fall into the HPC classification • Advantages of HPC: • Cost-effective solutions to critical problems • High return on investment • Solves problems that were previously unsolvable within the given time and cost • Solves problems too large for conventional supercomputers • Fields in which the HPC Grid has successfully addressed a wide range of computational problems include: • Climate/weather/ocean modeling and simulation, Internet search engines, signal/image processing, pharmaceutical research, military forces simulation

Data Federation and Collaboration • Consolidates data from different sources in a single data service • Hides data location, local ownership and infrastructure from the application • Does not disrupt local users, applications or data management policies • Facilitates a wide range of integrated applications such as: • Corporate performance dashboards • Marketing analysis tools • Customer service applications • Data mining applications

Resource Allocation and Optimization • Sharing of computing and storage to improve resource utilization • For example, applications and batch jobs can be transferred to an idle server • Benefits of resource optimization • Reclaims much of the stranded capacity of the computing infrastructure • Reduces the level of capital investment • Requires no modification of existing applications

Grid Computing Architecture • Basic architecture of Grid consists of • User Interface • Applications • Grid Middleware • Computing Resources • Grid Network

Applications • Classification of parallel applications • Embarrassingly Parallel Computations (EPC) • Divided into independent parts • Allocated to multiple processors for simultaneous execution • No communication is required between the processors • Example: Testing large integers to determine whether they are prime • Parametric and Data Parallel Computations • Also referred to as Nearly Embarrassingly Parallel Computations (NEPC) • Each processor works on an independent subset of the data • Results are later gathered by a single process • Example: Internet search engines • Loosely Coupled Synchronous Parallel Computations • Require inter-process communication among a small subset of processors before the computation can be completed
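The embarrassingly parallel case above (testing integers for primality) can be sketched in a few lines. This is a minimal illustration, not a Grid implementation: the names `is_prime` and `find_primes` are hypothetical, and a local thread pool stands in for the separate processors or Grid nodes that would each receive a share of the work in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n: int) -> bool:
    """Trial division; adequate for a small illustration."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def find_primes(candidates, workers=4):
    """Test every candidate independently; no task communicates with another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs
        flags = list(pool.map(is_prime, candidates))
    return [n for n, ok in zip(candidates, flags) if ok]


if __name__ == "__main__":
    print(find_primes([97, 91, 7919, 7917, 104729]))
```

Because each test depends only on its own input, the work can be split across any number of workers without changing the result. The final filtering step is the only point where results are brought together.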

Grid Middleware • Gives the Grid the semblance of a single computer system • Provides coordination among computing resources of the Grid • Provides location transparency • Allows the applications to run over a virtualized layer of networked resources • Available from system vendors and independent software vendors • Example: Globus Toolkit

Functions of Middleware • Discovery and monitoring • Discovers what resources or services are available • Monitors their status • Resource allocation and management • Matches application requirements to the available computing resources • Creates and schedules remote jobs as required • Ensures optimum load balancing and resource utilization • Security • Shared resources may contain sensitive information • Secures communications and authenticates user identities using SSL/TLS, etc. • Message Passing System • Used by compute-intensive parallel applications for inter-process communication • Examples: MPI (Message Passing Interface) and PVM (Parallel Virtual Machine)
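The resource allocation function described above (matching application requirements to available resources) might look roughly like the following. This is a toy sketch under stated assumptions: `Node`, `Job`, and `allocate` are hypothetical names, only CPU and memory are modeled, and the "most free CPUs first" rule is just one simple load-balancing heuristic. It is not how Globus Toolkit or any particular middleware schedules jobs.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpus: int
    free_mem_gb: int


@dataclass
class Job:
    name: str
    cpus: int
    mem_gb: int


def allocate(jobs, nodes):
    """Place each job on the feasible node that has the most free CPUs."""
    placement = {}
    for job in jobs:
        fits = [n for n in nodes
                if n.free_cpus >= job.cpus and n.free_mem_gb >= job.mem_gb]
        if not fits:
            placement[job.name] = None  # no resource currently matches
            continue
        target = max(fits, key=lambda n: n.free_cpus)
        # Reserve the resources so later jobs see the updated availability
        target.free_cpus -= job.cpus
        target.free_mem_gb -= job.mem_gb
        placement[job.name] = target.name
    return placement


if __name__ == "__main__":
    nodes = [Node("a", 8, 32), Node("b", 4, 64)]
    jobs = [Job("j1", 4, 16), Job("j2", 4, 48), Job("j3", 8, 8)]
    print(allocate(jobs, nodes))
```

A job that no node can satisfy is reported as unplaced rather than queued. Real middleware would also handle discovery, monitoring, and retries.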

Ethernet Networks for Clusters and Grids • Single-switch Clusters • Large Clusters • Ethernet Grid Networks

Single-switch Clusters • Built using a single high-availability Gigabit Ethernet (GbE) switch/router as the cluster interconnect • The maximum size of a single-switch Ethernet cluster is determined by the non-blocking port capacity of the switch • Current switch/routers provide the interconnect for over 600 GbE-connected servers • All server ports are configured to be in the same subnet

Large Clusters • Built using meshes of federated Ethernet switches • The switches are arranged in non-blocking, Constant Bisectional Bandwidth (CBB) topologies • CBB • Provides scalability to support thousands of cluster nodes • Provides high-bandwidth connectivity to the network • The core of the cluster gives each node switch an equal share of the load to avoid blocking of ports
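To give a feel for how a non-blocking mesh scales, the arithmetic below sizes one common layout: a two-tier design built from identical switches, where each node switch splits its ports evenly between servers and uplinks to the core. The slides do not specify a layout, so this arrangement, the function name `two_tier_capacity`, and the assumption that uplinks run at the same speed as server links are all illustrative.

```python
def two_tier_capacity(ports_per_switch: int, link_gbps: float = 1.0):
    """Size a non-blocking two-tier mesh built from identical switches."""
    if ports_per_switch < 2 or ports_per_switch % 2:
        raise ValueError("ports_per_switch must be an even number >= 2")
    down = ports_per_switch // 2      # node-switch ports facing servers
    up = ports_per_switch - down      # node-switch ports facing the core
    node_switches = ports_per_switch  # each core switch reaches every node switch once
    core_switches = up                # one core switch per node-switch uplink
    nodes = node_switches * down
    # Half of the nodes can send to the other half at full link rate
    bisection_gbps = nodes / 2 * link_gbps
    return {
        "node_switches": node_switches,
        "core_switches": core_switches,
        "nodes": nodes,
        "bisection_gbps": bisection_gbps,
    }


if __name__ == "__main__":
    print(two_tier_capacity(48))
```

Under these assumptions, 48-port switches support 1,152 servers. Because uplink capacity equals server-facing capacity on every node switch, bisectional bandwidth grows in proportion to the number of nodes.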

Ethernet Grid Networks (Campus Grid network based on Ethernet switching) • Ethernet allows the cluster to participate in a broader campus or enterprise Grid structure • Desktop computers and workstations are connected to the campus Grid network using GbE • Server farms outside of the cluster are connected to site switches using GbE • Goals of campus LANs • Give high priority to general Grid traffic • Ensure critical Grid traffic does not incur any added latency

Grid Tools • Tools used to prioritize critical Grid traffic • Priority queuing • The forwarding capacity of a congested port is immediately allocated to any high priority traffic that enters the queue • Rate limiting and policing • Limits the amount of lower priority traffic that enters the network • Weighted Random Early Discard (WRED) • Packet loss can be eliminated if buffers are never allowed to fill to capacity and overflow • Overflows can be avoided by applying WRED to the lower priority traffic • WRED eliminates the possibility of high priority packets arriving at a buffer that is already overflowing with lower priority packets
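The interaction of priority queuing and WRED can be sketched as a two-queue egress port. This is a simplified model, not a vendor implementation: `PriorityPort`, `wred_drop_probability`, and the threshold values are hypothetical. The drop curve follows the usual random early discard shape (no drops below a minimum threshold, a linear ramp up to `max_p`, and all packets dropped at the maximum threshold), and the instantaneous queue depth is used where real switches typically use a moving average.

```python
import random
from collections import deque


def wred_drop_probability(avg_depth, min_th, max_th, max_p):
    """Drop probability as a function of (average) queue depth."""
    if avg_depth < min_th:
        return 0.0
    if avg_depth >= max_th:
        return 1.0
    return max_p * (avg_depth - min_th) / (max_th - min_th)


class PriorityPort:
    """Egress port with strict priority and WRED on the low priority queue."""

    def __init__(self, min_th=20, max_th=40, max_p=0.1, rng=None):
        self.high = deque()
        self.low = deque()
        self.min_th, self.max_th, self.max_p = min_th, max_th, max_p
        self.rng = rng if rng is not None else random.Random()

    def enqueue(self, packet, high_priority):
        """Return True if the packet was queued, False if WRED discarded it."""
        if high_priority:
            self.high.append(packet)
            return True
        # Simplification: instantaneous depth instead of a moving average
        p = wred_drop_probability(len(self.low), self.min_th,
                                  self.max_th, self.max_p)
        if self.rng.random() < p:
            return False
        self.low.append(packet)
        return True

    def dequeue(self):
        """Strict priority: serve the high queue before the low queue."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```

Discarding low priority packets early keeps that queue from filling, while `dequeue` always serves the high priority queue first, so critical traffic is forwarded ahead of any lower priority backlog.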

Examples of Ethernet Clusters/Grids • TeraGrid • A multi-institutional effort to build and deploy the world’s most comprehensive computing infrastructure for open scientific research • NASA • NASA uses the ESDCD “Grid of clusters” to help scientists increase their understanding of the Earth, the solar system and the universe through computational modeling and processing of space-borne observations

Conclusion/Summary • Ethernet continues to evolve as a highly cost-effective and flexible technology • The majority of parallel and general Grid applications are very well served by the performance characteristics of Ethernet as the cluster/Grid interconnect • In the future, Ethernet end-to-end data transfer bandwidths, message latencies and CPU utilization are expected to improve dramatically due to NIC enhancements • Volume production is expected to lead to declining prices • These developments are expected to improve the overall performance of existing Ethernet clusters/Grids and to extend the use of cluster/Grid technology to a broader range of commercial enterprises
