Introduction to Parallel Computing

Presentation Outline • Doing science and engineering using HPC • Basic concepts of parallel computing • Discussion of HPC hardware • Programming approaches (HPC software): • Library-based approaches • Language-based approaches • HPC facilities at NIIT

High Performance Computing (HPC) • The prime focus of HPC is performance: the ability to solve the biggest possible problems in the least possible time • Also called “Parallel Computing”: • The use of multiple processors, working in parallel, to solve a single problem • Normally such computing is used to solve challenging scientific problems through simulation: • For this reason, it is also called “Scientific Computing”: • Computational science • HPC is a highly specialized area: • Probably our best chance to work for the world’s top research and commercial organizations: • NASA, European Space Agency (ESA) … • Google is known to have immense computational power, although the exact quantity remains unknown!

Doing science and engineering using HPC • HPC is helping to solve some of the most important problems in science today by pushing software and hardware technology to its limits • Scientific Computing (or computational science) is the field of study concerned with: • Constructing mathematical models and numerical solution techniques • Using computers to analyze and solve scientific and engineering problems • Application areas: • Computer-aided Engineering • Weather forecast simulations • Animated movies (Hollywood!) • Image processing • Cryptography • Hurricane forecasts: • Path as well as intensity (Katrina)

HPC driving science? • The Millennium Simulation: • Computational Astrophysics • Heralded as “the” largest ever model of the Universe • Follows the evolution of ten billion “dark matter” particles • The simulation ran on a supercomputer for almost a month • The Blue Brain Project: • Computational Neuroscience • An effort to simulate the working of a mammalian brain • One of the fastest supercomputers in the world is used for the simulations • Arguably, these projects could not be done without HPC

PAM CRASH: A Case Study from the Automobile Industry • PAM CRASH is a parallel application for studying structural deformation, employed in simulations of automotive crashes and other situations: • An effective alternative to physical crash tests, which are expensive and time-consuming • Modern simulations take into account millions of elements: • Such compute-intensive simulations can only be run on parallel hardware • Automobile giants including Audi, BMW, Volkswagen and others are conducting crash simulations using PAM CRASH

Serial Computation • Traditionally, software has been written for serial computation: • To be run on a single computer having a single Central Processing Unit (CPU) • A problem is broken into a discrete series of instructions

Parallel Computation • Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem: • To be run using multiple CPUs • A problem is broken into discrete parts that can be solved concurrently
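The idea of breaking a problem into discrete parts that are solved concurrently can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`split_into_parts`, `parallel_sum_of_squares`) are hypothetical, and Python threads stand in for CPUs. In CPython the threads will not actually speed up this arithmetic; the point is the decomposition, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(data, n_parts):
    """Split data into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def sum_of_squares(part):
    """Work done independently on one part of the problem."""
    return sum(x * x for x in part)


def parallel_sum_of_squares(data, n_workers=4):
    """Solve the parts concurrently, then combine the partial results."""
    parts = split_into_parts(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, parts))
    return sum(partial_sums)
```

Calling `parallel_sum_of_squares(list(range(10)), 3)` splits the ten numbers into three chunks, sums the squares of each chunk separately, and adds the three partial sums.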

Flynn’s Taxonomy • There is no authoritative classification of parallel computers! • Flynn’s taxonomy is one such classification, based on the number of instruction and data streams processed by a parallel computer: • Single Instruction Single Data (SISD) • Multiple Instruction Single Data (MISD) • Single Instruction Multiple Data (SIMD) • Multiple Instruction Multiple Data (MIMD): • Almost all modern computers fall in this category

Flynn’s Taxonomy • Extensions to Flynn’s taxonomy: • Single Program Multiple Data (SPMD)—a programming model • This classification is largely outdated!
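The SPMD model can be illustrated with a small Python sketch: every participant runs the same program and uses its own rank to decide which data it works on. The names `spmd_program` and `run_spmd` are hypothetical, and threads are assumed here as stand-ins for the processes of a real parallel machine.

```python
import threading


def spmd_program(rank, size, data, results):
    """The single program. Each rank selects its own share of the data."""
    my_items = data[rank::size]  # cyclic distribution: rank, rank+size, ...
    results[rank] = max(my_items) if my_items else None


def run_spmd(data, size=4):
    """Launch `size` copies of the same program and combine their results."""
    results = [None] * size
    threads = [
        threading.Thread(target=spmd_program, args=(rank, size, data, results))
        for rank in range(size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Ranks that received no data report None and are skipped.
    local_maxima = [m for m in results if m is not None]
    return max(local_maxima) if local_maxima else None
```

With `run_spmd([3, 9, 1, 7, 12, 5], size=4)`, rank 0 sees `[3, 12]`, rank 1 sees `[9, 5]`, rank 2 sees `[1]` and rank 3 sees `[7]`; the code is identical for all four, only the data differs.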

HPC Hardware • Traditionally HPC has adopted expensive parallel hardware: • Massively Parallel Processors (MPP) • Symmetric Multi-Processors (SMP) • Cluster Computers: • A group of PCs connected through a fast (private) network • Other classifications: • Distributed Memory Machines • Shared Memory Machines

Massively Parallel Processors (MPP) • A large parallel processing computer with a shared-nothing approach: • The term signifies that each computer has its own cache and memory • Examples include Cray XT3, T3E, T3D, IBM SP/2

Symmetric Multi-Processors (SMP) • An SMP is a parallel processing system with a shared-everything approach: • The term signifies that all processors share the main memory and possibly the cache • Typically an SMP can have 2 to 256 processors • Examples include AMD Athlon, AMD Opteron 200 and 2000 series, Intel Xeon, etc.

Cluster Computers • A group of PCs or workstations or Macs (called nodes) connected to each other via a fast (and private) interconnect: • Each node is an independent computer • Each cluster has one head-node and multiple compute-nodes: • Users log on to the head-node and start parallel jobs on the compute-nodes • Such clusters can be built with Commodity-Off-The-Shelf (COTS) components: • A major breakthrough in HPC was the adoption of commodity clusters: • Economics • Fast interconnects like Myrinet, Infiniband, Quadrics • Two popular cluster classifications: • Beowulf Clusters (http://www.beowulf.org) • Rocks Clusters (http://www.rocksclusters.org)

Cluster Computer • [Figure: processes Proc 0 to Proc 7, each with its own CPU and memory, exchanging messages over a LAN (Ethernet, Myrinet, Infiniband, etc.)]

Beowulf History • At the most fundamental level, when two or more computers are used together to solve a problem, it is considered a cluster • In 1993, Donald Becker and Thomas Sterling started sketching the details of a commodity-based cluster system: • The aim was to come up with a cost-effective alternative to large supercomputers • The initial prototype was a cluster computer consisting of 16 DX4 processors connected by channel-bonded Ethernet • The idea was an instant success! • Largely due to economics • Open-source software like Linux, GNU compilers, PVM, and MPI was a major factor

Thomas Sterling with Naegling, Caltech's Beowulf Cluster

SMP and Multi-core clusters • Most modern commodity clusters have SMP and/or multi-core nodes: • Processors not only communicate via interconnect, but shared memory programming is also required • This trend is likely to continue: • Even a new name “constellations” has been proposed

Distributed Memory • Each processor has its own local memory • Processors communicate with each other via an interconnect

Shared Memory • All processors have access to shared memory: • Notion of “Global Address Space”

Hybrid • Modern clusters have hybrid architecture: • Distributed memory for inter-node (between nodes) communications • Shared memory for intra-node (within a node) communications

The TOP500 • The TOP500 project was started in 1993: • The aim is to provide a reliable basis for tracking and detecting trends in HPC • Twice a year, a list of the sites operating the 500 most powerful computer systems is assembled and released • The best performance on the Linpack benchmark is used as the performance measure for ranking the computer systems • The latest list was released at Supercomputing 2006, held in Tampa, Florida • The fastest supercomputer is IBM Blue Gene/L at Lawrence Livermore National Lab (LLNL): • Linpack performance (Rmax): 280.6 TeraFLOPS • Number of Processors: 131072 • Main memory: 32768 GB
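To put the Blue Gene/L numbers in perspective, the quoted totals can be divided by the processor count. The short calculation below is a back-of-the-envelope sketch; it assumes 1 TeraFLOPS = 1000 GigaFLOPS and 1 GB = 1024 MB, and the variable names are chosen here for illustration only.

```python
# Figures quoted on the slide for IBM Blue Gene/L.
quoted_tflops = 280.6
processors = 131072
memory_gb = 32768

# Assumption: decimal prefix for FLOPS (1 TFLOPS = 1000 GFLOPS).
gflops_per_processor = quoted_tflops * 1000 / processors

# Assumption: binary prefix for memory (1 GB = 1024 MB).
memory_mb_per_processor = memory_gb * 1024 / processors

print(f"{gflops_per_processor:.2f} GFLOPS per processor")
print(f"{memory_mb_per_processor:.0f} MB of memory per processor")
```

Each processor therefore contributes roughly 2 GigaFLOPS and has 256 MB of memory: the machine's power comes from the number of processors, not from the strength of any single one.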

The Top 5 • 1. DOE/NNSA/LLNL, United States: BlueGene/L - eServer Blue Gene Solution (IBM) • 2. NNSA/Sandia National Laboratories, United States: Red Storm - Sandia/Cray Red Storm, Opteron 2.4 GHz dual core (Cray Inc.) • 3. IBM Thomas J. Watson Research Center, United States: BGW - eServer Blue Gene Solution (IBM) • 4. DOE/NNSA/LLNL, United States: ASC Purple - eServer pSeries p5 575 1.9 GHz (IBM) • 5. Barcelona Supercomputing Center, Spain: MareNostrum - BladeCenter JS21 Cluster, PPC 970, 2.3 GHz, Myrinet (IBM)

The Top 100 on Google Maps

Writing Parallel Software • There are mainly two approaches for writing parallel software: • Software that can be executed on parallel hardware to exploit computational and memory resources • The first approach is to use libraries (packages) written in already existing languages like C, Fortran, and Java: • Economical • These libraries provide primitives (methods) like send() and recv() for communicating data • The second and more radical approach is to provide new languages: • HPC has a history of novel parallel languages • These languages provide high level parallelism constructs: • What is a construct?

Library-based Approach • One school of thought is to provide parallelism through message passing between processors • Such libraries are based on the idea of supporting parallelism in traditional languages like C and Fortran: • Obvious social advantages • Two popular messaging approaches: • Parallel Virtual Machine (PVM) • Message Passing Interface (MPI) • Other messaging libraries: • Message Passing Toolkit (MPT) • SHared MEMory (SHMEM) … • The Message Passing Interface (MPI) has become a de facto standard for writing HPC applications
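The flavour of message passing can be sketched with Python's standard library. This is not MPI or PVM: the `Mailboxes` class and its `send` and `recv` methods are hypothetical stand-ins invented for this example, and threads are assumed in place of separate processes. Each worker keeps its data private and shares a result only by sending a message to rank 0.

```python
import queue
import threading


class Mailboxes:
    """Hypothetical stand-in for a messaging library: one inbox per rank."""

    def __init__(self, size):
        self.inboxes = [queue.Queue() for _ in range(size)]

    def send(self, dest, message):
        """Deliver a message to the inbox of rank `dest`."""
        self.inboxes[dest].put(message)

    def recv(self, rank):
        """Block until a message arrives for `rank`, then return it."""
        return self.inboxes[rank].get()


def worker(rank, comm, numbers):
    """Compute on private data and send only the result to rank 0."""
    comm.send(0, (rank, sum(numbers)))


def run(chunks):
    """Rank 0 coordinates; ranks 1..n each own one chunk of the data."""
    size = len(chunks) + 1
    comm = Mailboxes(size)
    threads = [
        threading.Thread(target=worker, args=(rank, comm, chunks[rank - 1]))
        for rank in range(1, size)
    ]
    for t in threads:
        t.start()
    total = 0
    for _ in range(1, size):
        _sender, partial = comm.recv(0)  # one message per worker
        total += partial
    for t in threads:
        t.join()
    return total
```

For example, `run([[1, 2, 3], [4, 5], [6]])` starts three workers and rank 0 receives three messages carrying the partial sums 6, 9 and 6, in whatever order they happen to arrive.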

Message Passing Interface (MPI) • MPI is a standard (an interface or an API): • It defines a set of methods that are used by application developers to write their applications • MPI libraries implement these methods • MPI itself is not a library: it is a specification document that implementations follow! • Reasons for popularity: • Software and hardware vendors were involved • Significant contribution from academia • MPICH served as an early reference implementation • MPI compilers are simply wrappers around widely used C and Fortran compilers • MPI is a success story: • It is the most widely adopted programming paradigm on IBM Blue Gene systems • At least two production-quality MPI libraries: • MPICH2 (http://www-unix.mcs.anl.gov/mpi/mpich2/) • OpenMPI (http://open-mpi.org) • There’s even a Java library: • MPJ Express (http://mpj-express.org)

Language-based Approach • There is a long history of novel parallel programming languages: • The central idea is to support parallelism by providing easy-to-use constructs • Social aspects to HPC languages: • Dialect or superset of existing languages • Completely new HPC languages - an ambitious approach • What happens to legacy code? • Conceptually most HPC languages can be categorized as: • Shared memory languages: • Mainly for programming on shared memory platforms like SMP • Partitioned Global Address Space (PGAS) languages: • Mainly for distributed memory HPC platforms • Distributed memory languages: • Mainly for distributed memory HPC platforms

Shared Memory Languages • Designed to support parallel programming on shared memory platforms: • OpenMP: • Consists of a set of compiler directives, library routines, and environment variables • The runtime uses fork-join model of parallel execution • Cilk: • A design goal was to support asynchronous parallelism • A set of keywords: • cilk, spawn, sync … • POSIX Threads (PThreads)
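The fork-join model and the use of shared memory can be sketched with plain Python threads. This is not OpenMP, Cilk, or PThreads code; `parallel_histogram` is a hypothetical example in which a team of threads is forked, each thread counts into a private array, and the results are merged into one shared array under a lock before the team is joined.

```python
import threading


def parallel_histogram(values, n_bins, n_threads=4):
    """Count values into bins (by value modulo n_bins) with a thread team."""
    histogram = [0] * n_bins  # shared: visible to every thread
    lock = threading.Lock()   # protects updates to the shared histogram

    def count(my_values):
        local = [0] * n_bins  # private: owned by this thread only
        for v in my_values:
            local[v % n_bins] += 1
        with lock:            # one thread at a time merges its counts
            for b in range(n_bins):
                histogram[b] += local[b]

    # Fork: start a team of threads, each with a share of the input.
    team = [
        threading.Thread(target=count, args=(values[i::n_threads],))
        for i in range(n_threads)
    ]
    for t in team:
        t.start()
    # Join: wait for the whole team before continuing serially.
    for t in team:
        t.join()
    return histogram
```

Counting privately and merging once keeps the time spent holding the lock short, which is the usual way to limit contention on shared data.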

Partitioned Global Address Space (PGAS) Languages • A PGAS is an abstraction that logically divides a process’ address space into two parts: • Private • Shared • Follow the so-called Distributed Shared Memory (DSM) model • Unified Parallel C (UPC): • We discuss it in detail later • Titanium: • A Java dialect • Co-Array Fortran: • Support for co-arrays
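A toy model may help to picture the "partitioned" part of PGAS. The class below is a hypothetical sketch, not the API of UPC, Titanium, or Co-Array Fortran: it assumes a block distribution, in which a shared array is cut into equal blocks and each block is stored by one owner, yet every element can still be reached through a single global index. Only the shared part is modelled; private data would simply be ordinary local variables.

```python
class PartitionedArray:
    """Toy shared array, block-distributed over `n_places` partitions."""

    def __init__(self, length, n_places):
        if length <= 0 or n_places <= 0:
            raise ValueError("length and n_places must be positive")
        self.length = length
        self.block = -(-length // n_places)  # ceiling division
        # Each place stores only its own block; the last ones may be shorter.
        self.partitions = [
            [0] * max(0, min(self.block, length - p * self.block))
            for p in range(n_places)
        ]

    def owner(self, i):
        """Return the place that stores global index i."""
        return i // self.block

    def is_local(self, i, place):
        """True if `place` can access index i without remote communication."""
        return self.owner(i) == place

    def __getitem__(self, i):
        return self.partitions[self.owner(i)][i % self.block]

    def __setitem__(self, i, value):
        self.partitions[self.owner(i)][i % self.block] = value
```

With `a = PartitionedArray(10, 4)`, the blocks hold 3, 3, 3 and 1 elements; `a[9] = 42` is written with a global index but lands in the partition owned by place 3. In a real PGAS language the same access would be cheap for the owner and would involve communication for everyone else.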

Distributed Memory Languages • These purely DM languages support HPC on distributed memory platforms • High Performance Fortran (HPF): • Data parallelism • An effort to standardize a family of data parallel Fortran languages • Fortran M: • Ensured deterministic execution • Added message passing extensions to Fortran 77 • HPJava: • Motivated by HPF

A Different Aspect • [Figure: parallel programming approaches built on C, Fortran and Java, ranging from runtime level libraries to language extensions. Languages based on directives: HPF, OpenMP. Languages based on Global Address Space: UPC, CoArray Fortran, Titanium. Languages based on libraries: SHMEM, GPMEM, PVM, MPI. Languages driven by HPCS: Chapel, X10, Fortress. Credit: Hong Ong, Oak Ridge National Laboratory]

US High Productivity Computing Systems • Aims: • To produce systems that double in productivity and value every 18 months • Decrease time-to-solution: • Development time • Execution time • Research: • In SW and HW technology: • New Programming Languages • Quantifying productivity • Funding stages: • Three vendors are involved: Sun, IBM, and Cray • Three new programming languages: • X10, Chapel, and Fortress
