Seminar on parallel computing
• Goal: provide an environment for exploration of parallel computing
• Driven by participants
• Weekly hour for discussion, show & tell
• Focus primarily on distributed memory computing on Linux PC clusters
• Target audience:
  • Experience with Linux computing & Fortran/C
  • Requires parallel computing for own studies
• 1 credit possible for completion of a ‘proportional’ project
Main idea
• Distribute a job over multiple processing units
• Do bigger jobs than are possible on a single machine
• Solve bigger problems faster
• Resources: e.g., www-jics.cs.utk.edu
Sequential limits
• Moore’s law
• Clock speed physically limited
  • Speed of light
  • Miniaturization; dissipation; quantum effects
• Memory addressing
  • 32-bit addresses in PCs: 4 Gbyte RAM max. (2^32 bytes = 4 Gbyte)
Machine architecture: serial
• Single processor
• Hierarchical memory:
  • Small number of registers on the CPU
  • Cache (L1/L2)
  • RAM
  • Disk (swap space)
• Operations require multiple steps:
  • Fetch two floating point numbers from main memory
  • Add and store the result
  • Put the result back into main memory
(The cost of the memory hierarchy is easy to demonstrate; see the sketch below.)
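As an illustration of why the memory hierarchy matters, the following hedged C sketch (array size and timing method are arbitrary choices, not from the slides) sums the same matrix twice: once in the order it is laid out in memory, once with a large stride. The arithmetic is identical, but the strided version causes far more cache misses and typically runs several times slower.

    #include <stdio.h>
    #include <time.h>

    #define N 2048

    /* Sum a large matrix twice: row by row (contiguous, cache friendly)
       and column by column (stride of N doubles, many cache misses). */
    int main(void)
    {
        static double a[N][N];
        double sum;
        clock_t t0;
        int i, j;

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] = 1.0;

        t0 = clock();
        sum = 0.0;
        for (i = 0; i < N; i++)          /* contiguous access */
            for (j = 0; j < N; j++)
                sum += a[i][j];
        printf("row-wise:    %.3f s (sum = %.0f)\n",
               (double)(clock() - t0) / CLOCKS_PER_SEC, sum);

        t0 = clock();
        sum = 0.0;
        for (j = 0; j < N; j++)          /* strided access */
            for (i = 0; i < N; i++)
                sum += a[i][j];
        printf("column-wise: %.3f s (sum = %.0f)\n",
               (double)(clock() - t0) / CLOCKS_PER_SEC, sum);

        return 0;
    }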
Vector processing
• Speed up single instructions on vectors
  • E.g., while adding two floating point numbers, fetch two new ones from main memory
  • Pushing vectors through the pipeline
• Useful in particular for long vectors
• Requires good memory control:
  • Bigger cache is better
• Common on most modern CPUs
• Implemented in both hardware and software (see the loop sketched below)
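The kind of loop a vectorizing compiler (or vector hardware) can pipeline is a simple, independent element-wise operation. A hedged C sketch; the function name is illustrative:

    /* saxpy-style loop: y = a*x + y.  Every iteration is independent,
       so new elements of x and y can be fetched while the current
       multiply-add is still in the pipeline; compilers such as gcc or
       the Intel compiler can vectorize this automatically at high
       optimization levels. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }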
SIMD
• Same instruction works simultaneously on different data sets
• Extension of vector computing
• Example (an SSE version in C is sketched below):
  DO IN PARALLEL
    for i = 1, n
      x(i) = a(i)*b(i)
    end
  DONE PARALLEL
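On commodity CPUs the same idea appears as short-vector SIMD instructions (SSE on the Pentium III/4). A hedged C sketch of the loop above using SSE intrinsics; it assumes n is a multiple of 4 and 16-byte aligned arrays:

    #include <xmmintrin.h>   /* SSE intrinsics, Pentium III and later */

    /* x(i) = a(i)*b(i), four single-precision elements per instruction */
    void multiply(int n, const float *a, const float *b, float *x)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);            /* load 4 floats */
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&x[i], _mm_mul_ps(va, vb));   /* 4 products at once */
        }
    }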
MIMD
• Multiple instruction, multiple data
• Most flexible; encompasses SIMD/serial
• Often best for ‘coarse-grained’ parallelism
• Message passing
• Example: domain decomposition
  • Divide the computational grid into equal chunks
  • Work on each domain with one CPU
  • Communicate boundary values when necessary (see the sketch below)
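A hedged sketch of the message-passing pattern behind domain decomposition, written with MPI in C: each process owns a strip of a 1-D grid plus two ‘ghost’ points and swaps boundary values with its neighbours. Grid size, tags and the dummy data are illustrative, not from the slides.

    #include <mpi.h>

    #define NLOCAL 100   /* interior points owned by each process */

    int main(int argc, char **argv)
    {
        double u[NLOCAL + 2];   /* u[0] and u[NLOCAL+1] are ghost points */
        int rank, size, left, right, i;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        for (i = 0; i <= NLOCAL + 1; i++)
            u[i] = (double)rank;   /* dummy initial data */

        /* Send my first interior point left and receive the right
           neighbour's first interior point into my right ghost cell;
           then the same in the other direction.  MPI_Sendrecv avoids
           deadlock. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, &status);
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, &status);

        /* ... update interior points u[1..NLOCAL] using the ghost values ... */

        MPI_Finalize();
        return 0;
    }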
Historical machines
• 1976 Cray-1 at Los Alamos (vector)
• 1980s Control Data Cyber 205 (vector)
• 1980s Cray X-MP
  • 4 coupled Cray-1s
• 1985 Thinking Machines Connection Machine
  • SIMD, up to 64k processors
• 1984+ NEC/Fujitsu/Hitachi
  • Automatic vectorization
Sun and SGI (1990s)
• Scaling between desktops and compute servers
• Use of both vectorization and large-scale parallelization
• RISC processors
  • SPARC for Sun
  • MIPS for SGI: PowerChallenge/Origin
Happy developments
• High Performance Fortran / Fortran 90
• Definitions of message-passing standards and libraries:
  • PVM
  • MPI
• Linux
• Performance increase of commodity CPUs
• Combination leads to affordable cluster computing (a minimal MPI program is sketched below)
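To give a flavour of MPI, a minimal C program in the style of “Using MPI” (Gropp et al.): every process runs the same executable and learns its own rank and the total number of processes.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);               /* start up MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* my process number */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
        printf("Hello from process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }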
Who’s the biggest?
• www.top500.org
• Ranked by the Linpack benchmark (solution of a dense linear system)
• June 2003:
  • Earth Simulator, Yokohama, NEC, 36 Tflops
  • ASCI Q, Los Alamos, HP, 14 Tflops
  • Linux cluster, Livermore, 8 Tflops
Parallel approaches
• Embarrassingly parallel
  • “Monte Carlo” searches (see the sketch below)
  • SETI@home
  • Analyze lots of small time series
• Parallelize DO-loops in dominantly serial code
• Domain decomposition
• Fully parallel
  • Requires complete rewrite/rethinking
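A hedged C/MPI sketch of an embarrassingly parallel “Monte Carlo” computation, here estimating pi: every process works independently and the only communication is a single reduction at the end. The sample count and seeding are illustrative.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        long n = 1000000, i, hits = 0, total;
        int rank, size;
        double x, y;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        srand(rank + 1);              /* a different stream per process */
        for (i = 0; i < n; i++) {     /* no communication in this loop */
            x = rand() / (double)RAND_MAX;
            y = rand() / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }

        /* the only message passing: sum the hit counts on process 0 */
        MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("pi estimate: %f\n", 4.0 * total / ((double)n * size));

        MPI_Finalize();
        return 0;
    }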
Example: seismic wave propagation
• 3D spherical wave propagation modeled with a high-order finite element technique (Komatitsch and Tromp, GJI, 2002)
• Massively parallel computation on Linux PC clusters
• Approx. 34 Gbyte RAM needed for 10 km average resolution
• www.geo.lsa.umich.edu/~keken/waves
Resolution
• Spectral elements: 10 km average resolution
  • 4th order interpolation functions
• Reasonable graphics resolution: 10 km or better
  • 12 km: 1024³ grid points = 1 GB
  • 6 km: 2048³ grid points = 8 GB
[Figure: simulated earthquake (depth 15 km), wavefield snapshot after 17 minutes; 512x512 image, 256 colors, positive amplitudes only, truncated maximum, log10 scale of particle velocity; labeled phases: P, PPP, PP, PKPab, SK, PKP, PKIKP]
[Figure: same display settings (512x512, 256 colors, positive only, truncated maximum, log10 scale of particle velocity), now showing some S component; labeled phases: PcSS, SS, R, S, PcS, PKS]
Resources at UM
• Various Linux clusters in Geology
  • Agassiz (Ehlers): 8 Pentium 4 nodes @ 2 Gbyte each
  • Panoramix (van Keken): 10 P3 nodes @ 512 Mbyte each
  • Trans (van Keken, Ehlers): 24 P4 nodes @ 2 Gbyte each
• SGIs
  • Origin 2000 (Stixrude, Lithgow, van Keken)
• Center for Advanced Computing @ UM
  • Athlon clusters (384 nodes @ 1 Gbyte each)
  • Opteron cluster (to be installed)
• NPACI
Software resources
• GNU and Intel compilers
  • Fortran/Fortran 90/C/C++
• MPICH (www-fp.mcs.anl.gov)
  • Primary implementation of MPI
  • “Using MPI”, 2nd edition, Gropp et al., 1999
• Sun Grid Engine (batch queueing)
• PETSc (www-fp.mcs.anl.gov)
  • Toolbox for parallel scientific computing
(A typical compile-and-run example follows below.)
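A hedged example of the typical workflow with MPICH (the file name is illustrative; on a cluster the job would normally be submitted through Sun Grid Engine rather than started by hand):

    mpicc -O2 hello.c -o hello    # compile with the MPICH compiler wrapper
    mpirun -np 4 ./hello          # start 4 processes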