Parallel Programming on the SGI Origin2000 • Taub Computer Center, Technion • Anne Weill-Zrahia • With thanks to Moshe Goldberg (TCC) and Igor Zacharov (SGI) • March 2005
Parallel Programming on the SGI Origin2000 • Parallelization Concepts • SGI Computer Design • Efficient Scalar Design • Parallel Programming - OpenMP • Parallel Programming - MPI
Academic Press 2001 ISBN 1-55860-671-8
Introduction to Parallel Computing • Parallel computer: a set of processors that work cooperatively to solve a computational problem • Distributed computing: a number of processors communicating over a network • Metacomputing: use of several parallel computers
Parallel classification • Parallel architectures: Shared Memory / Distributed Memory • Programming paradigms: Data parallel / Message passing
Why parallel computing? • Single processor performance – limited by physics • Multiple processors – break the problem down into simpler tasks or domains • Plus – obtain the same results as the sequential program, faster • Minus – the code must be rewritten
Three HPC Architectures • Shared memory • Cluster • Vector processor
Shared Memory • Each processor can access any part of the memory • Access times are uniform (in principle) • Easier to program (no explicit message passing) • Bottleneck when several tasks access same location
Symmetric Multiple Processors • [diagram: several CPUs sharing a single memory over a common memory bus] • Examples: SGI Power Challenge, Cray J90/T90
Data-parallel programming • Single program defining the operations • Single (shared) memory • Loosely synchronous (synchronization at loop completion) • Parallel operations on array elements
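To make the data-parallel style concrete, here is a minimal OpenMP sketch in C that parallelizes a loop over array elements; the array names and size are illustrative assumptions, not taken from the course material.

```c
#include <stdio.h>

#define N 1000000

static double a[N], b[N];   /* single, shared memory */

int main(void)
{
    /* One program: each thread executes a slice of the iterations,
       and all threads synchronize when the loop completes
       ("loosely synchronous"). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("a[0] = %f\n", a[0]);
    return 0;
}
```

On the Origin2000 such a loop would typically be compiled with the MIPSpro compilers' OpenMP switch (-mp); with GCC the equivalent is -fopenmp.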
Distributed Parallel Computing • [diagram: each CPU with its own local memory, connected to the others by a network] • Examples: IBM SP2, Beowulf clusters
Message Passing Programming • Separate program on each processor • Local Memory • Control over distribution and transfer of data • Additional complexity of debugging due to communications
Distributed Memory • Processor can only access local memory • Access times depend on location • Processors must communicate via explicit message passing
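The two slides above (a separate program with local memory on each processor, and explicit messages to move data between them) can be illustrated with a minimal MPI sketch in C; the value and message tag used here are purely illustrative.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* separate copy of the program per processor */

    if (rank == 0) {
        value = 42;                          /* data in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* rank 1 cannot read rank 0's memory; it must receive a message */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with, for example, mpirun -np 2 ./a.out: each rank executes its own copy of the program, and rank 1 only sees the value after the explicit receive.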
Message Passing or Shared Memory?
Message Passing: • Takes longer to implement • More details to worry about • Increases source lines • Complex to debug and time • Increases total memory used • Scalability limited by: communications overhead, process synchronization • Parallelism is explicit (visible to the programmer)
Shared Memory: • Easier to implement • System handles many details • Little increase in source lines • Easier to debug and time • Efficient memory use • Scalability limited by: serial portion of code, process synchronization • Parallelism is compiler-based
Performance issues • Concurrency – ability to perform many actions simultaneously • Scalability – performance keeps improving as the number of processors increases • Locality – high ratio of local to remote memory accesses (i.e., low communication)
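The scalability limits noted above can be quantified with Amdahl's law: if a fraction s of the work is inherently serial, the speedup on p processors is at most 1 / (s + (1 - s)/p). For example, with s = 0.05 the speedup on 22 processors is at most about 10.7, and no number of processors can push it beyond 20.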
Objectives of HPC in the Technion • Maintain leading position in science/engineering • Production: sophisticated calculations • Required: high speed • Required: large memory • Teach techniques of parallel computing • In research projects • As part of courses
HPC in the Technion
SGI Origin2000: • 22 CPUs (R10000, 250 MHz), total memory 9 GB • 32 CPUs (R12000, 300 MHz), total memory 9 GB
PC cluster (Linux Red Hat 9.0): • 6 CPUs (Pentium II, 866 MHz), 500 MB memory per CPU
PC cluster (Linux Red Hat 9.0): • 16 CPUs (Pentium III, 800 MHz), 500 MB memory per CPU
Origin2000 (SGI) 128 processors
Origin2000 (SGI) 22 processors
PC clusters (Intel) • 6 processors • 16 processors
Data Grids for High Energy Physics
[diagram: the tiered LHC computing model. The online system at CERN sees a bunch crossing every 25 ns and records ~100 triggers per second, each triggered event ~1 MByte. Data flow from the Tier 0 CERN Computer Centre (offline processor farm, ~20 TIPS) over ~622 Mbit/s links to Tier 1 regional centres (FermiLab ~4 TIPS, France, Germany, Italy), then to Tier 2 centres (~1 TIPS each, e.g. Caltech), to institute servers (~0.25 TIPS), and finally to Tier 4 physicist workstations. 1 TIPS is approximately 25,000 SpecInt95. Physicists work on analysis "channels"; each institute has ~10 physicists working on one or more channels, and data for these channels should be cached on the institute server.]
Image courtesy Harvey Newman, Caltech
GRIDS: Globus Toolkit • Grid Security Infrastructure (GSI) • Globus Resource Allocation Manager (GRAM) • Monitoring and Discovery Service (MDS) • Global Access to Secondary Storage (GASS)
A Recent Example: Matrix multiply
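The slides do not reproduce the code for this example; as a sketch only, a matrix multiply might be parallelized with OpenMP in C as below. The matrix size, names, and row-wise parallelization are assumptions for illustration, not the course's actual example.

```c
#include <stdio.h>

#define N 512

static double a[N][N], b[N][N], c[N][N];

int main(void)
{
    /* Parallelize over rows of the result: each thread computes a
       disjoint block of rows of c = a * b. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }

    printf("c[0][0] = %f\n", c[0][0]);
    return 0;
}
```

Parallelizing the outermost loop gives each thread a contiguous block of rows of the result, which keeps the threads independent and the memory access pattern simple.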