Commodity Computing Clusters - next generation supercomputers?

Paweł Pisarczyk, ATM S. A. pawel.pisarczyk@atm.com.pl

Agenda • Introduction • Supercomputer classification • Architecture and implementations • Commodity clusters • Processors • Operating systems • Summary

Supercomputer • "A supercomputer is a device for turning compute-bound problems into I/O-bound problems." - Seymour Cray • A supercomputer is a computer system that leads the world in processing capacity, particularly speed of calculation, at the time of its introduction. source: http://en.wikipedia.org

Supercomputer History (1) • 1945-50 - Manchester Mark I • 1950-55 - MIT Whirlwind • 1955-60 - IBM 7090 - 210 KFLOPS • 1960-65 - CDC 6600 - 10.24 MFLOPS • 1965-70 - CDC 7600 - 32.27 MFLOPS • 1970-75 - CDC Cyber 76

Supercomputer History (2) • 1975-80 - Cray-1 - 160 MFLOPS • 1980-85 - Cray X-MP - 500 MFLOPS • 1985-90 - Cray Y-MP - 1.3 GFLOPS • 1990-95 - Fujitsu Numerical Wind Tunnel - 236 GFLOPS • 1995-00 - Intel ASCI Red - 2.150 TFLOPS • 2000-02 - IBM ASCI White, SP Power3 375 MHz - 7.226 TFLOPS • 2002-03 - NEC Earth Simulator - 35 TFLOPS

Supercomputer Classes (1) • General-purpose supercomputers: • vector processing machines - the same operation is carried out on a large amount of data simultaneously • tightly connected cluster computers (NUMA) - communication-oriented architectures engineered from the ground up, based on high-speed interconnects and a large number of processors • commodity clusters - a large number of commercial off-the-shelf (COTS) PCs interconnected by a high-bandwidth, low-latency network

Supercomputer Classes (2) • Special-purpose supercomputers - high-performance computing devices whose hardware architecture is dedicated to solving a single problem (equipped with custom ASICs or FPGA chips) • Examples: • Deep Blue (chess) • GRAPE (astrophysics)

Flynn taxonomy - 1972 (1) • SISD - Single Instruction Single Data (conventional uniprocessors: DEC, Sun Microsystems, PC) • SIMD - Single Instruction Multiple Data • computers with a large number of processing units (e.g. ALUs) - CPP DAP Gamma II, Quadrics APEmille • vector processing machines - NEC SX-6, IA32 MMX • MISD - Multiple Instruction Single Data • theoretical model, no practical implementation

Flynn taxonomy - 1972 (2) • MIMD - Multiple Instruction Multiple Data • SM-MIMD - Shared Memory MIMD • global address space • SMP systems and ccNUMA systems • DM-MIMD - Distributed Memory MIMD • many nodes with local address spaces • high-bandwidth, low-latency communication • common NUMA architectures (Non-Uniform Memory Access) • operating systems have to be communication-oriented (e.g. the Mach project)

SM-MIMD implementations • S-COMA - Simple Cache-Only Memory Architecture • common SMP systems • ccNUMA - Cache Coherent NUMA • SGI Origin 3000 • SGI Altix 3000 • HP SuperDome

S-COMA (SMP) [diagram: CPU 0 to CPU N, each with its own L2 cache, all sharing one RAM]

ccNUMA [diagram: nodes with local memory (RAM 0 to RAM K); in each node two CPUs with private L2 caches share an L3 cache (CPU 0 and CPU 1 on node 0, CPU N-1 and CPU N on node K)]

ccNUMA implementation: SGI Altix 3000 • 64 Itanium 2 (IA64) processors • C-brick modules with 2 CPUs and a SHUB ASIC • NUMAflex, NUMAlink interconnects (6.4 GB/s, 2.4 GB/s) • Modified Linux kernel (2.6 NUMA support)

DM-MIMD implementations • Massively parallel systems (NUMA) • communication-oriented architecture • low-latency, high-bandwidth interconnects • topologies: hypercube, torus, tree, butterfly networks, Omega networks • communication subsystem engineered from the ground up

DM-MIMD implementations • Commodity clusters • a cluster is a collection of connected, independent computers working in unison to solve a problem • COTS technology • nodes are interconnected by Ethernet LAN, Myrinet, QsNet ELAN, etc. • computations can be programmed with popular toolkits and frameworks: MPI (message passing between nodes) and OpenMP (shared memory within a multiprocessor node) • clusters require dedicated management software

NUMA implementations Cray T3E-1350 • Processor: Alpha 21164 675 MHz • Number of CPUs: 40 - 2176 • 3-D torus topology • Operating system: UNICOS/mk (microkernel-based) • Peak performance: 3 TFLOPS

Commodity cluster implementation (1) Linux Networx/Quadrics • Processor: Intel Xeon 2.4 GHz • CPUs: 2304 • Interconnect: QsNet ELAN3 • Operating system: Linux + management tools + Lustre cluster file system • Linpack performance: 7.6 TFLOPS • 3rd on the TOP500 list (June 2003) • Developed for Lawrence Livermore National Laboratory in 2002

Commodity cluster implementation (2) HP XC6000 Cluster (XC3000 Cluster) • Processor: Intel Itanium 2 6M 1.5 GHz (Intel Xeon 3 GHz) • Node: HP Integrity rx2600 (HP ProLiant DL380) • Number of processors: 34-512 • Interconnections: QsNet ELAN3 (Myricom Myrinet XP) • Operating system: Linux + SSI Middleware + management tools + Lustre Cluster File System • Peak performance: 34 CPUs - 204 GFLOPS, 512 CPUs - 3 TFLOPS

Commodity Clusters - software • Operating system: Linux or SSI (Single System Image) Linux • Platform for specialized science, engineering and business applications (simulation, modeling, data mining) • Parallel programming environments (OpenMP, MPI) are used for application development • Existing supercomputer applications have to be ported to clusters

Performance Scaling [diagram "Scale Right": Scale-Up (SMP, ccNUMA) vs. Scale-Out (Cluster)]

Processors (1) • Many types of processors are used in supercomputers • Microprocessor development directions: • increasing clock frequency and instruction-stream processing speed • processing a large collection of data with a single instruction - SIMD • multiplying control paths - multithreading

Processors (2) • Vector processors • NEC SX-6 • Cray (Cray X1) • RISC processors • MIPS • IBM Power4 • Alpha • CISC processors • IA32 • AMD x86-64 • VLIW processors • IA64

Intel Itanium 2 features • State-of-the-art, unconventional 64-bit architecture • New programming model based on the VLIW paradigm • EPIC technology (Explicitly Parallel Instruction Computing): the compiler determines instruction dependencies and tells the processor which instructions in the stream can be executed in parallel • Many registers (128 64-bit general registers), register stack management • 6 GFLOPS peak performance • The processor's full potential can be exploited only by a dedicated compiler

Operating systems • Monolithic-kernel OSs - UNIX (modifications of existing systems) • BSD • Solaris • IRIX • Linux • Microkernel-based OSs • Mach

Microkernel architecture [diagram: Task A, Task B and Task C running on top of microkernel instances, each kernel running on its own hardware]

Summary • Today there are many supercomputer architectures • Both vector processors and mainstream RISC, CISC and VLIW chips are used in supercomputers • Commodity clusters running Linux are an attractive way to build a supercomputer

TOP500 list (1) 1. Earth Simulator, NEC - 35.86 TFLOPS 2. HP AlphaServer SC, HP - 13.88 TFLOPS 3. Linux Networx / Quadrics IA32 - 7.634 TFLOPS

TOP500 list (2) Source: http://www.top500.org/list/2003/06/
