Scalable Systems and Technology Einar Rustad Scali AS einar@scali.com http://www.scali.com
Definition of Cluster • The Widest Definition: • Any number of computers communicating at any distance • The Common Definition: • A relatively small number of computers (<1000) communicating at a relatively small distance (within the same room) and used as a single, shared computing resource
Increasing Performance • Faster Processors • Frequency • Instruction Level Parallelism (ILP) • Better Algorithms • Compilers • Manpower • Parallel Processing • Compilers • Tools (Profilers, Debuggers) • More Manpower
Use of Clusters • Capacity Servers • Data Bases • Client/Server Computing • Throughput Servers • Numerical Applications • Simulation and Modelling • High Availability Servers • Transaction Processing
Why Clustering • Scaling of Resources • Sharing of Resources • Best Price/Performance Ratio (PPR) • PPR is Constant with Growing System Size • Flexibility • High Availability • Fault Resilience
Clusters vs SMPs (1) • Programming • A Program written for Cluster Parallelism can run on an SMP right away • A Program written for an SMP can NOT run on a Cluster right away • Scalability • Clusters are Scalable • SMPs are NOT Scalable above a Small Number of Processors
Why SMPs don't scale • When CPUs cycle at 1 GHz and memory latency is >100 ns, a 1% cache miss rate implies <50% CPU efficiency • [Figure: CPUs, L3C Links, Memory, I/O and Interconnect, with one diagram labelled "This is an SMP" and another labelled "This is NOT an SMP..."] • But you can make all the memory equally SLOW... (X-bar complexity grows with the number of ports squared)
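The efficiency figure on this slide can be reproduced with a back-of-the-envelope model. The sketch below is an illustration only: it assumes a base rate of one instruction per cycle and one memory reference per instruction, neither of which is stated on the slide, and the function name `cpu_efficiency` is hypothetical.

```python
def cpu_efficiency(clock_hz, mem_latency_s, miss_rate,
                   base_cpi=1.0, refs_per_instr=1.0):
    """Fraction of peak throughput left once cache-miss stalls are counted.

    Assumptions (not from the slide): base_cpi cycles per instruction
    without misses, and refs_per_instr memory references per instruction.
    """
    # A miss costs the full memory latency, expressed in CPU cycles.
    miss_penalty_cycles = clock_hz * mem_latency_s
    # Average stall cycles added to every instruction.
    stall_cpi = refs_per_instr * miss_rate * miss_penalty_cycles
    return base_cpi / (base_cpi + stall_cpi)


# 1 GHz clock, 100 ns latency, 1% miss rate: roughly half the peak rate.
at_100ns = cpu_efficiency(1e9, 100e-9, 0.01)
# Any latency above 100 ns pushes the efficiency below one half.
at_150ns = cpu_efficiency(1e9, 150e-9, 0.01)
print(f"100 ns: {at_100ns:.2f}, 150 ns: {at_150ns:.2f}")
```

Under these assumptions a 100 ns miss costs 100 cycles, so a 1% miss rate adds on average one stall cycle to every instruction and halves the throughput.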
Clusters vs SMPs (2) • Use of SMPs • Common Access to Shared Resources • Processors • Memory • Storage Devices • Running Multiple Applications • Running Multiple Instances of the Same Application • Running Parallel Applications • Use of Clusters • Common Access to Shared Resources • Processors • Distributed Memory • Storage Devices • Running Multiple Applications • Running Multiple Instances of the Same Application • Running Parallel Applications
Single System Image • One big advantage of SMPs is the Single System Image • Easier Administration and Support • But, Single Point of Failure • Scali's ”Universe” offers Single System Image to the Administrators and Users • As Easy to Use and Support as an SMP • No Single Point of Failure (N-copies of the same OS) • Redundancy in ”Universe” Architecture
Clustering makes Mo(o)re Sense • Microprocessor Performance Increases 50-60% per Year • 1 year lag: 1.0 WS = 1.6 Proprietary Units • 2 year lag: 1.0 WS = 2.6 Proprietary Units • Volume Disadvantage • When Volume Doubles, Cost is reduced to 90% • 1,000 Proprietary Units vs 1,000,000 SHV units => Proprietary Unit 3 X more Expensive • 2 years lag and 1:1000 Volume Disadvantage => 7 X Worse Price/Performance
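The arithmetic behind these figures can be checked with a short calculation. This is a sketch under stated assumptions: it uses the upper 60% annual growth rate, treats the "cost is reduced to 90% per doubling" rule as a standard learning curve, and uses the 1,000 vs 1,000,000 unit volumes quoted on the slide. The function names are illustrative, not from the original.

```python
import math


def performance_lag(years, annual_growth=0.60):
    """Performance advantage of current hardware over hardware `years` old."""
    return (1.0 + annual_growth) ** years


def cost_ratio(volume_ratio, cost_per_doubling=0.90):
    """How much more a low-volume unit costs, given the volume ratio.

    Each doubling of volume multiplies unit cost by cost_per_doubling,
    so the low-volume part misses out on log2(volume_ratio) doublings.
    """
    doublings = math.log2(volume_ratio)
    return (1.0 / cost_per_doubling) ** doublings


lag_1y = performance_lag(1)              # 1.6
lag_2y = performance_lag(2)              # 2.56, rounded to 2.6 on the slide
volume = cost_ratio(1_000_000 / 1_000)   # a little under 3
combined = lag_2y * volume               # a little over 7

print(f"1 year lag: {lag_1y:.2f}x, 2 year lag: {lag_2y:.2f}x")
print(f"volume disadvantage: {volume:.2f}x, combined: {combined:.2f}x")
```

A 1:1000 volume ratio is about ten doublings, which is where the factor of roughly three comes from; multiplying it by the two-year performance lag gives the factor of about seven.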
Why Do We Need SMPs? • Small SMPs make Great Nodes for building Clusters! • The most Cost-Effective Cluster Node is a Dual Processor SMP
Mission • Scali is dedicated to making State-of-the-art Middleware and System Management Software • The key enabling SW technologies for building Scalable Systems
Scalable Systems • Application Areas: ASPs, ISPs, Departmental Servers, E-commerce/Databases • Scali Software • Basic Technologies: PC Technology, Interconnect, Linux OS
Platform Attraction • Seismic • Database • CFD • ASPs • FEM • Web Servers
Technology • [Diagram: software stack with Sys Adm GUI, Application, Conf. server, System Monitor, MPI, ICM, Operating System, Hardware] • High Performance implementation of MPI • ICM - InterConnect Manager for SCI • Parallel Systems configuration server • Parallel Systems monitoring • Expert knowledge in • Computer Architecture • Processor and Communication hardware • Software design and development • Parallelization • System integration and packaging
Key Factors • High Performance Systems Need • High Processor Speed • High Bandwidth Interconnect • Low latency Communication • Balanced Resources • Economy of Scale Components • Establishes a new Standard for Price/Performance
Software Design Strategy • Client - Server Architecture • Implemented as • Application level modules • Libraries • Daemons • Scripts • No OS modifications
Advantages • Industry Standard Programming Model - MPI • MPICH Compatible • Lower Cost • COTS based Hardware = lower system price • Lower Total Cost of Ownership • Better Performance • Always ”Latest & Greatest” Processors • Superior Standard Interconnect - SCI • Scalability • Scalable to hundreds of Processors • Redundancy • Single System Image to users and administrator • Choice of OS • Linux • Solaris • Windows NT
Scali MPI - Unique Features • Fault Tolerant • High Bandwidth • Low Latency • Multi-Thread safe • Simultaneous Inter-/Intra-node operation • UNIX command line replicated • Exact message size option • Manual/debugger mode for selected processes • Explicit host specification • Job queuing • PBS, DQS, LSF, CCS, NQS, Maui • Conformance to MPI-1.1 verified through 1665 MPI tests
Parallel Processing Constraints • [Figure: timeline for processes P1, P2, P3, P4 showing Initialization, Processing and Storing Results phases, with Communication and Computation marked] • Overlaps in Processing
System Interconnect • Main Interconnect: • Torus Topology • SCI - IEEE/ANSI std. 1596 • 667MB/s/segment/ring • Shared Address Space • Maintenance and LAN Interconnect: • 100Mbit/s Ethernet
2-D Torus Topology • Distributed Switching: • [Diagram labels: PCI-bus, PSB, B-Link, LC3, LC3, Horizontal SCI Ring, Vertical SCI Ring]
Paderborn • PSC2: 12 x 8 Torus, 192 Processors, 450 MHz, 86.4 GFlops • PSC1: 8 x 4 Torus, 64 Processors, 300 MHz, 19.2 GFlops
System Architecture • [Diagram labels: Remote Workstation, Control Node (Front-end), 4x4 2D Torus SCI cluster, GUI, Server daemon, Node daemon, SCI, TCP/IP Socket]
Fault Tolerance • [Figure: 4x4 torus with nodes numbered 11 to 44] • 2D Torus topology • more routing options • XY routing algorithm • Node 33 fails (3) • Nodes on 33’s ringlets become unavailable • Cluster fractured with current routing setting
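The effect of a single node failure under XY routing can be illustrated with a small model. This is a simplified sketch, not Scali's routing code: it assumes unidirectional ringlets, node IDs read as (x, y) pairs so that node 33 is `(3, 3)`, and that a ringlet becomes unusable as a whole when any node on it fails. All function names are hypothetical.

```python
N = 4  # 4x4 torus; node "xy" is modelled as the tuple (x, y), x and y in 1..N


def ringlets_of(node):
    """Return the horizontal and vertical ringlets a node sits on."""
    x, y = node
    horizontal = {(i, y) for i in range(1, N + 1)}
    vertical = {(x, j) for j in range(1, N + 1)}
    return horizontal, vertical


def xy_route(src, dst):
    """Dimension-order route: travel the X ringlet first, then the Y ringlet."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x = x % N + 1          # next node on the unidirectional X ring
        path.append((x, y))
    while y != dst[1]:
        y = y % N + 1          # next node on the unidirectional Y ring
        path.append((x, y))
    return path


def route_survives(src, dst, failed):
    """True if no hop of the XY route uses a ringlet containing `failed`."""
    path = xy_route(src, dst)
    for a, b in zip(path, path[1:]):
        horizontal_hop = a[1] == b[1]
        if horizontal_hop and a[1] == failed[1]:
            return False       # hop uses the failed node's horizontal ringlet
        if not horizontal_hop and a[0] == failed[0]:
            return False       # hop uses the failed node's vertical ringlet
    return True


failed = (3, 3)
h, v = ringlets_of(failed)
affected = (h | v) - {failed}  # working nodes that share a ringlet with 33
print(sorted(affected))
print(route_survives((1, 1), (2, 2), failed))  # avoids row 3 and column 3
print(route_survives((1, 1), (3, 2), failed))  # needs the column-3 ringlet
```

In this model six healthy nodes share a ringlet with node 33, and any route whose X or Y leg runs on one of those two ringlets is lost even though its endpoints are working.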
Fault Tolerance • [Figure: 4x4 torus node layout with failed node 33 set apart] • Rerouting with XY • Failed node logically remapped to a corner • End-point IDs unchanged • Applications can continue • Problem: • Too many working nodes unused
Fault Tolerance • [Figure: 4x4 torus node layout with failed node 33 set apart] • Scali advanced routing algorithm: • From the Turn Model family of routing algorithms • All nodes but the failed one can be utilised as one big partition
Software Configuration Management • Nodes are categorised once; from then on, new software is installed with one mouse click or a single command.
Products (1) • Platforms • Intel Ia32/Linux • Intel Ia32/Solaris • Alpha/Linux • SPARC/Solaris • Ia64/Linux • Middleware • MPI 1.1 • MPI 2 • IP • SAN • VIA • Cray shmem
Products (2) • ”TeraRack” Pentium • Each Rack: • 36 x 1U Units • Dual PIII 800MHz • 57.6GFlops • 144GBytes SDRAM • 8.1TBytes Disk • Power Switches • Console Routers • 2-D Torus SCI
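The GFlops figures quoted for the TeraRack and for the Paderborn PSC1 and PSC2 systems are all consistent with a simple peak-rate formula. The sketch below assumes one floating-point operation per processor per clock cycle, which is not stated on the slides but reproduces every quoted number; `peak_gflops` is an illustrative name.

```python
def peak_gflops(processors, clock_mhz, flops_per_cycle=1):
    """Theoretical peak in GFlops.

    flops_per_cycle=1 is an assumption chosen because it matches the
    figures quoted on the slides.
    """
    return processors * clock_mhz * flops_per_cycle / 1000.0


# TeraRack: 36 x 1U units, each a dual PIII at 800 MHz.
units = 36
terarack = peak_gflops(units * 2, 800)

# Paderborn systems from the earlier slide.
psc2 = peak_gflops(192, 450)
psc1 = peak_gflops(64, 300)

# Per-unit resources implied by the rack totals.
sdram_gb_per_unit = 144 / units          # GBytes of SDRAM per 1U unit
disk_gb_per_unit = 8.1 * 1000 / units    # GBytes of disk per 1U unit

print(f"TeraRack {terarack:.1f}, PSC2 {psc2:.1f}, PSC1 {psc1:.1f} GFlops")
print(f"{sdram_gb_per_unit:.0f} GB SDRAM and {disk_gb_per_unit:.0f} GB disk per unit")
```

The same formula also shows that both Paderborn tori use dual-processor nodes: a 12 x 8 torus has 96 nodes for 192 processors, and an 8 x 4 torus has 32 nodes for 64 processors.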