Parallel Programming on the SGI Origin2000 • Taub Computer Center, Technion • Anne Weill-Zrahia • With thanks to Moshe Goldberg (TCC) and Igor Zacharov (SGI) • March 2005
Parallel Programming on the SGI Origin2000 • Parallelization Concepts • SGI Computer Design • Efficient Scalar Design • Parallel Programming - OpenMP • Parallel Programming - MPI
Academic Press 2001 ISBN 1-55860-671-8
Introduction to Parallel Computing • Parallel computer: a set of processors that work cooperatively to solve a computational problem • Distributed computing: a number of processors communicating over a network • Metacomputing: use of several parallel computers
Parallel classification • Parallel architectures: Shared Memory / Distributed Memory • Programming paradigms: Data parallel / Message passing
Why parallel computing? • Single processor performance – limited by physics • Multiple processors – break the problem down into simpler tasks or domains • Plus – obtain the same results as the sequential program, faster • Minus – the code must be rewritten
Three HPC Architectures • Shared memory • Cluster • Vector processor
Shared Memory • Each processor can access any part of the memory • Access times are uniform (in principle) • Easier to program (no explicit message passing) • Bottleneck when several tasks access same location
Symmetric Multiple Processors • [diagram: several CPUs sharing a single memory over a common memory bus] • Examples: SGI Power Challenge, Cray J90/T90
Data-parallel programming • Single program defining the operations • Single (shared) memory • Loosely synchronous (synchronization at loop completion) • Parallel operations on array elements
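To make the data-parallel style concrete, here is a minimal OpenMP sketch in C that parallelizes a loop over array elements; the array names and size are illustrative assumptions, not taken from the course material.

```c
#include <stdio.h>

#define N 1000000

static double a[N], b[N];   /* single, shared memory */

int main(void)
{
    /* One program: each thread executes a slice of the iterations,
       and all threads synchronize when the loop completes
       ("loosely synchronous"). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("a[0] = %f\n", a[0]);
    return 0;
}
```

On the Origin2000 such a loop would typically be compiled with the MIPSpro compilers' OpenMP switch (-mp); with GCC the equivalent is -fopenmp.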
Distributed Parallel Computing • [diagram: each CPU with its own local memory, connected to the others by a network] • Examples: IBM SP2, Beowulf clusters
Message Passing Programming • Separate program on each processor • Local Memory • Control over distribution and transfer of data • Additional complexity of debugging due to communications
Distributed Memory • Processor can only access local memory • Access times depend on location • Processors must communicate via explicit message passing
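The two slides above (a separate program with local memory on each processor, and explicit messages to move data between them) can be illustrated with a minimal MPI sketch in C; the value and message tag used here are purely illustrative.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* separate copy of the program per processor */

    if (rank == 0) {
        value = 42;                          /* data in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* rank 1 cannot read rank 0's memory; it must receive a message */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with, for example, mpirun -np 2 ./a.out: each rank executes its own copy of the program, and rank 1 only sees the value after the explicit receive.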
Message Passing or Shared Memory?
Message Passing: • Takes longer to implement • More details to worry about • Increases source lines • Complex to debug and time • Increases total memory used • Scalability limited by: communications overhead, process synchronization • Parallelism is explicit (visible to the programmer)
Shared Memory: • Easier to implement • System handles many details • Little increase in source lines • Easier to debug and time • Efficient memory use • Scalability limited by: serial portion of code, process synchronization • Parallelism is compiler-based
Performance issues • Concurrency – ability to perform many actions simultaneously • Scalability – performance keeps improving as the number of processors increases • Locality – high ratio of local to remote memory accesses (i.e., low communication)
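The scalability limits noted above can be quantified with Amdahl's law: if a fraction s of the work is inherently serial, the speedup on p processors is at most 1 / (s + (1 - s)/p). For example, with s = 0.05 the speedup on 22 processors is at most about 10.7, and no number of processors can push it beyond 20.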
Objectives of HPC in the Technion • Maintain leading position in science/engineering • Production: sophisticated calculations • Required: high speed • Required: large memory • Teach techniques of parallel computing • In research projects • As part of courses
HPC in the Technion
SGI Origin2000: • 22 CPUs (R10000, 250 MHz), total memory 9 GB • 32 CPUs (R12000, 300 MHz), total memory 9 GB
PC cluster (Linux Red Hat 9.0): • 6 CPUs (Pentium II, 866 MHz), 500 MB memory per CPU
PC cluster (Linux Red Hat 9.0): • 16 CPUs (Pentium III, 800 MHz), 500 MB memory per CPU
Origin2000 (SGI) 128 processors
Origin2000 (SGI) 22 processors
PC clusters (Intel) • 6 processors • 16 processors
Data Grids for High Energy Physics
[diagram: the tiered LHC computing model. The online system at CERN sees a bunch crossing every 25 ns and records ~100 triggers per second, each triggered event ~1 MByte. Data flow from the Tier 0 CERN Computer Centre (offline processor farm, ~20 TIPS) over ~622 Mbit/s links to Tier 1 regional centres (FermiLab ~4 TIPS, France, Germany, Italy), then to Tier 2 centres (~1 TIPS each, e.g. Caltech), to institute servers (~0.25 TIPS), and finally to Tier 4 physicist workstations. 1 TIPS is approximately 25,000 SpecInt95. Physicists work on analysis "channels"; each institute has ~10 physicists working on one or more channels, and data for these channels should be cached on the institute server.]
Image courtesy Harvey Newman, Caltech
GRIDS: Globus Toolkit • Grid Security Infrastructure (GSI) • Globus Resource Allocation Manager (GRAM) • Monitoring and Discovery Service (MDS) • Global Access to Secondary Storage (GASS)
A Recent Example: Matrix multiply
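The slides do not reproduce the code for this example; as a sketch only, a matrix multiply might be parallelized with OpenMP in C as below. The matrix size, names, and row-wise parallelization are assumptions for illustration, not the course's actual example.

```c
#include <stdio.h>

#define N 512

static double a[N][N], b[N][N], c[N][N];

int main(void)
{
    /* Parallelize over rows of the result: each thread computes a
       disjoint block of rows of c = a * b. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }

    printf("c[0][0] = %f\n", c[0][0]);
    return 0;
}
```

Parallelizing the outermost loop gives each thread a contiguous block of rows of the result, which keeps the threads independent and the memory access pattern simple.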