Scalable Systems and Technology • Einar Rustad, Scali AS • einar@scali.com • http://www.scali.com
Definition of Cluster • The Widest Definition: • Any number of computers communicating at any distance • The Common Definition: • A relatively small number of computers (<1000) communicating at a relatively small distance (within the same room) and used as a single, shared computing resource
Increasing Performance • Faster Processors • Frequency • Instruction Level Parallelism (ILP) • Better Algorithms • Compilers • Manpower • Parallel Processing • Compilers • Tools (Profilers, Debuggers) • More Manpower
Use of Clusters • Capacity Servers • Databases • Client/Server Computing • Throughput Servers • Numerical Applications • Simulation and Modelling • High Availability Servers • Transaction Processing
Why Clustering • Scaling of Resources • Sharing of Resources • Best Price/Performance Ratio (PPR) • PPR is Constant with Growing System Size • Flexibility • High Availability • Fault Resilience
Clusters vs SMPs (1) • Programming • A Program written for Cluster Parallelism can run on an SMP right away • A Program written for an SMP can NOT run on a Cluster right away • Scalability • Clusters are Scalable • SMPs are NOT Scalable above a Small Number of Processors
Why SMPs don't scale • When CPUs cycle at 1GHz and Memory latency is >100ns, a 1% Cache Miss rate implies <50% CPU Efficiency (see the worked example below) • [Diagram: a bus-based SMP with CPUs sharing Memory ("This is an SMP") vs. nodes with L3C and Links joined by an Interconnect ("This is NOT an SMP...")] • But, You can make all the Memory Equally SLOW… (X-bar complexity grows with # of ports squared)
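A rough way to arrive at the 50% figure, assuming a 1ns cycle time, a 100ns (100-cycle) miss penalty and one memory reference per cycle (assumptions not stated on the slide):

\[
\text{cycles per reference} \approx 1 + p_{\text{miss}} \cdot t_{\text{miss}} = 1 + 0.01 \times 100 = 2
\qquad\Rightarrow\qquad
\text{CPU efficiency} \approx \tfrac{1}{2} = 50\%
\]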
Clusters vs SMPs (2) • Use of SMPs: • Common Access to Shared Resources (Processors, Memory, Storage Devices) • Running Multiple Applications • Running Multiple Instances of the Same Application • Running Parallel Applications • Use of Clusters: • Common Access to Shared Resources (Processors, Distributed Memory, Storage Devices) • Running Multiple Applications • Running Multiple Instances of the Same Application • Running Parallel Applications
Single System Image • One big advantage of SMPs is the Single System Image • Easier Administration and Support • But, Single Point of Failure • Scali's ”Universe” offers Single System Image to the Administrators and Users • As Easy to Use and Support as an SMP • No Single Point of Failure (N copies of the same OS) • Redundancy in ”Universe” Architecture
Clustering makes Mo(o)re Sense • Microprocessor Performance Increases 50-60% per Year • 1 year lag: 1.0 WS = 1.6 Proprietary Units • 2 year lag: 1.0 WS = 2.6 Proprietary Units • Volume Disadvantage • When Volume Doubles, Cost is reduced to 90% • 1,000 Proprietary Units vs 1,000,000 SHV Units => Proprietary Unit 3 X more Expensive • 2 years lag and 1:1,000 Volume Disadvantage => 7 X Worse Price/Performance (see the arithmetic sketch below)
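One way to reproduce these figures, assuming the 90%-per-doubling cost rule and the 1:1,000 volume ratio above:

\[
0.9^{\,\log_2 1000} \approx 0.35
\;\Rightarrow\;
\frac{\text{proprietary unit cost}}{\text{SHV unit cost}} \approx \frac{1}{0.35} \approx 2.9 \approx 3\times,
\qquad
2.6 \times 2.9 \approx 7.4 \approx 7\times \text{ worse price/performance}
\]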
Why Do We Need SMPs? • Small SMPs make Great Nodes for building Clusters! • The most Cost-Effective Cluster Node is a Dual Processor SMP
Mission • Scali is dedicated to making State-of-the-art Middleware and System Management Software • The key enabling SW technologies for building Scalable Systems
Basic Technologies and Application Areas • [Diagram: Basic Technologies (PC Technology, Interconnect, Linux OS) plus Scali Software build Scalable Systems, which serve Application Areas such as ASPs, ISPs, Departmental Servers and E-commerce/Databases]
Platform Attraction • [Diagram: application areas drawn to the platform: Seismic, Database, CFD, FEM, Web Servers, ASPs]
Technology • [Diagram: software stack: Sys Adm GUI and Application on top of Conf. server, System Monitor, MPI and ICM, running on the Operating System and Hardware] • High Performance implementation of MPI • ICM - InterConnect Manager for SCI • Parallel Systems configuration server • Parallel Systems monitoring • Expert knowledge in: • Computer Architecture • Processor and Communication hardware • Software design and development • Parallelization • System integration and packaging
Key Factors • High Performance Systems Need • High Processor Speed • High Bandwidth Interconnect • Low latency Communication • Balanced Resources • Economy of Scale Components • Establishes a new Standard for Price/Performance
Software Design Strategy • Client - Server Architecture • Implemented as • Application level modules • Libraries • Daemons • Scripts • No OS modifications
Advantages • Industry Standard Programming Model - MPI • MPICH Compatible • Lower Cost • COTS based Hardware = lower system price • Lower Total Cost of Ownership • Better Performance • Always ”Latest & Greatest” Processors • Superior Standard Interconnect - SCI • Scalability • Scalable to hundreds of Processors • Redundancy • Single System Image to users and administrator • Choice of OS • Linux • Solaris • Windows NT
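To make the programming model concrete, here is a minimal MPI program in C; it is generic MPI-1 code (not taken from Scali material) and should build against any MPICH-compatible implementation:

/* hello_mpi.c - minimal MPI example (generic MPI-1 code, not Scali-specific).
   Build and launch commands vary by installation, e.g. mpicc / mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process' rank         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes  */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down the MPI runtime  */
    return 0;
}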
Scali MPI - Unique Features • Fault Tolerant • High Bandwidth • Low Latency • Multi-Thread safe • Simultaneous Inter-/Intra-node operation • UNIX command line replicated • Exact message size option • Manual/debugger mode for selected processes • Explicit host specification • Job queuing: PBS, DQS, LSF, CCS, NQS, Maui • Conformance to MPI-1.1 verified through 1665 MPI tests
Parallel Processing Constraints • [Diagram: timeline for processes P1-P4 with Initialization, Processing and Storing Results phases; Communication and Computation overlap during Processing] • Overlaps in Processing (see the non-blocking MPI sketch below)
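One common way to realise such overlap is MPI's non-blocking calls. The sketch below is a generic illustration only (the buffer size and ring neighbours are invented, and how much overlap actually occurs depends on the MPI implementation and interconnect):

/* overlap.c - sketch of overlapping communication with computation
   using non-blocking MPI calls; illustrative values only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000   /* hypothetical message length */

int main(int argc, char **argv)
{
    int rank, size, i, left, right;
    double *send_buf, *recv_buf, local_sum = 0.0;
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    send_buf = malloc(N * sizeof(double));
    recv_buf = malloc(N * sizeof(double));
    for (i = 0; i < N; i++) send_buf[i] = rank;

    right = (rank + 1) % size;              /* neighbour to send to      */
    left  = (rank - 1 + size) % size;       /* neighbour to receive from */

    /* Post the transfers first ... */
    MPI_Irecv(recv_buf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... then do computation that does not touch either buffer. */
    for (i = 0; i < N; i++) local_sum += (double)i * (double)i;

    /* Wait for completion before using recv_buf (or reusing send_buf). */
    MPI_Waitall(2, reqs, stats);

    printf("rank %d: local_sum = %g, first received value = %g\n",
           rank, local_sum, recv_buf[0]);

    free(send_buf);
    free(recv_buf);
    MPI_Finalize();
    return 0;
}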
System Interconnect • Main Interconnect: • Torus Topology • SCI - IEEE/ANSI std. 1596 • 667MB/s/segment/ring • Shared Address Space • Maintenance and LAN Interconnect: • 100Mbit/s Ethernet
2-D Torus Topology • Distributed Switching: • [Diagram: on each node, the PCI-bus connects to the PSB, which attaches via the B-Link to two LC3 link controllers, one serving the Horizontal SCI Ring and one the Vertical SCI Ring]
Paderborn • PSC2: 12 x 8 Torus, 192 Processors @ 450MHz, 86.4GFlops • PSC1: 8 x 4 Torus, 64 Processors @ 300MHz, 19.2GFlops • (see the peak-performance arithmetic below)
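The GFlops figures correspond to peak arithmetic, assuming one floating-point result per processor per clock cycle (an assumption, not stated on the slide):

\[
192 \times 450\,\text{MHz} \times 1\,\tfrac{\text{flop}}{\text{cycle}} = 86.4\ \text{GFlops},
\qquad
64 \times 300\,\text{MHz} \times 1\,\tfrac{\text{flop}}{\text{cycle}} = 19.2\ \text{GFlops}
\]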
System Architecture • [Diagram: GUIs on a Remote Workstation and on the Control Node (Front-end) talk to the Server daemon over TCP/IP Sockets; the Server daemon manages the Node daemons of a 4x4 2D Torus SCI cluster, with SCI as the cluster interconnect]
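As a generic illustration of the daemon-over-sockets pattern this architecture describes (the port number and the one-line status reply are invented; this is not Scali's protocol), a minimal node daemon in C might look like:

/* node_daemon.c - generic sketch of a node daemon answering a front-end
   over a TCP/IP socket; port and reply text are hypothetical. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define PORT 5150   /* hypothetical port */

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(PORT);

    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }
    listen(srv, 8);

    for (;;) {                                    /* serve one request at a time */
        int client = accept(srv, NULL, NULL);
        if (client < 0) continue;
        const char *reply = "node: ok\n";         /* placeholder status message  */
        write(client, reply, strlen(reply));
        close(client);
    }
}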
Fault Tolerance • [Diagram: 4x4 torus with nodes numbered 11-44; node 33 has failed] • 2D Torus topology: more routing options • XY routing algorithm (see the sketch below) • Node 33 fails • Nodes on 33's ringlets become unavailable • Cluster fractured with current routing setting
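For reference, a minimal sketch of dimension-order (XY) routing on a 2D torus; this is the generic textbook algorithm, not Scali's routing code, and the coordinates are illustrative:

/* xy_route.c - sketch of dimension-order (XY) routing on a DIM x DIM torus.
   A packet travels along X until the column matches the destination,
   then along Y. Generic illustration only. */
#include <stdio.h>

#define DIM 4   /* 4x4 torus as in the example */

/* Signed step (-1, 0 or +1) taking the shorter direction around one ring. */
static int ring_step(int from, int to)
{
    int d = (to - from + DIM) % DIM;     /* forward distance on the ring */
    if (d == 0) return 0;
    return (d <= DIM / 2) ? +1 : -1;
}

/* Print the hop sequence from (sx,sy) to (dx,dy) under XY routing. */
static void xy_route(int sx, int sy, int dx, int dy)
{
    int x = sx, y = sy;
    printf("(%d,%d)", x, y);
    while (x != dx) {                    /* route along X first */
        x = (x + ring_step(x, dx) + DIM) % DIM;
        printf(" -> (%d,%d)", x, y);
    }
    while (y != dy) {                    /* then along Y        */
        y = (y + ring_step(y, dy) + DIM) % DIM;
        printf(" -> (%d,%d)", x, y);
    }
    printf("\n");
}

int main(void)
{
    xy_route(0, 0, 2, 3);                /* example hop sequence on the 4x4 torus */
    return 0;
}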
Fault Tolerance • [Diagram: the torus rerouted, with failed node 33 remapped to a corner position] • Rerouting with XY • Failed node logically remapped to a corner • End-point IDs unchanged • Applications can continue • Problem: • Too many working nodes unused
Fault Tolerance • [Diagram: the torus with node 33 failed and the remaining 15 nodes forming a single partition] • Scali advanced routing algorithm: • From the Turn Model family of routing algorithms • All nodes but the failed one can be utilised as one big partition
Software Configuration Management • Nodes are categorised once; from then on, new software is installed with one mouse click or a single command.
Products (1) • Platforms • Intel IA-32/Linux • Intel IA-32/Solaris • Alpha/Linux • SPARC/Solaris • IA-64/Linux • Middleware • MPI 1.1 • MPI 2 • IP • SAN • VIA • Cray shmem
Products (2) • ”TeraRack” Pentium • Each Rack: • 36 x 1U Units • Dual PIII 800MHz • 57.6GFlops • 144GBytes SDRAM • 8.1TBytes Disk • Power Switches • Console Routers • 2-D Torus SCI