SGI Altix ICE Architecture
Rev. 1.2a
Kevin Nolte, SGI, Professional Services
Altix ICE 8400

Altix ICE 8400 Rack:
• 42U rack (30” W x 40” D)
• 4 blade enclosures, each up to 16 two-socket nodes
• Single- or dual-plane IB 4 interconnect
• Minimal switch topology scales to 1000s of nodes

SGI® Altix® ICE Compute Blade:
• Up to two 4-core sockets, 96 GB, 2 IB
Overview of IRU
• The basic building block is an 18U-high IRU that contains the following:
  • Sixteen IP93 compute blades
  • Four network extender blades
  • One or two CMC blades
  • Six 2837-watt 12V power supplies and two 2837-watt 48V power supplies
• IRU = Individual Rack Unit
• CMC = Chassis Management Controller
Terminology: Socket/Processor

socket = processor = node
SGI Altix ICE Application Environment Primer
Rev. 1.2a
Ken Taylor, SGI, Professional Services
Agenda
• Application porting
• Code optimization
• Programming environment and libraries
• Pinning for OpenMP and MPI
• SGI-provided software tools
Application Porting
• Intel Xeon X5690 (x86_64)
  • 64-bit compiler and lib64
  • -g -traceback -fpe0 (sets -ftz)
• Data representation: little endian
  • -convert big_endian|ibm
  • env F_UFMTENDIAN=big
  • env FORT_CONVERTn=big_endian
  • OPEN (UNIT=n, CONVERT=…); see the sketch below
  • Conversion has a performance impact
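A minimal sketch of per-unit endian conversion with the CONVERT= specifier on OPEN (an Intel Fortran extension rather than standard Fortran); the unit number and file name here are illustrative:

  ! Read big-endian unformatted data on the little-endian Xeon without
  ! converting every unit in the program, as -convert or F_UFMTENDIAN would.
  program read_big_endian
    implicit none
    real(8) :: buf(100)
    open (unit=12, file='legacy_big.dat', form='unformatted', &
          convert='big_endian', action='read', status='old')
    read (12) buf
    close (12)
  end program read_big_endian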
Application Porting
• Basic I/O architecture considerations
  • No local disk drive (NFS and Lustre file systems)
  • /tmp is tmpfs, limited to 150 MB
  • Torque standard out and standard error go to /var/spool, limited to 2 GB
Application Porting
• Fortran I/O
  • Fortran record length (RECL) unit is 4 bytes by default
  • -assume byterecl makes RECL a byte count
  • Fortran-standard, portable RECL specification using the INQUIRE statement; a sketch follows below:
    INQUIRE (IOLENGTH=iol) I, A, B, J
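A short sketch of the portable RECL idiom, using the variable names from the INQUIRE line above (the file name is illustrative):

  ! IOLENGTH returns the record length in whatever unit the compiler uses
  ! (bytes with -assume byterecl, 4-byte words otherwise), so the OPEN
  ! below is correct under either setting.
  program portable_recl
    implicit none
    integer :: i, j, iol
    real    :: a, b
    i = 1; j = 2; a = 3.0; b = 4.0
    inquire (iolength=iol) i, a, b, j
    open (unit=11, file='direct.dat', access='direct', &
          form='unformatted', recl=iol, status='replace')
    write (11, rec=1) i, a, b, j
    close (11)
  end program portable_recl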
Code Optimization
• Compute
• I/O
• Communication
Code Optimization
• Key parallel programming models
  • MPI-2.2 standard
  • OpenMP 3.1 standard
• New parallel programming models
  • SGI UPC
  • Fortran 2008 coarrays (Intel ifort 12.1); see the sketch below
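As a flavor of the coarray model, a minimal sketch in standard Fortran 2008 (built with ifort's -coarray option; the program itself is illustrative, not from the slides):

  ! Each image prints its own index; images are the coarray analogue of
  ! MPI ranks and are started automatically at program launch.
  program coarray_hello
    implicit none
    integer :: me, n
    me = this_image()
    n  = num_images()
    print '(a,i0,a,i0)', 'Hello from image ', me, ' of ', n
  end program coarray_hello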
Code Optimization
• Code vectorization (see the example loop below)
  • Intel SIMD
  • -xSSE4.2 (Westmere-EP processor)
  • -opt-report=3
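A minimal example of the kind of loop the vectorizer targets (the subroutine is illustrative); compiling with ifort -xSSE4.2 -opt-report=3 should show it reported as vectorized:

  ! Unit-stride, dependence-free loop: a straightforward SIMD candidate.
  subroutine saxpy(n, a, x, y)
    implicit none
    integer, intent(in)    :: n
    real,    intent(in)    :: a, x(n)
    real,    intent(inout) :: y(n)
    integer :: i
    do i = 1, n
       y(i) = a*x(i) + y(i)
    end do
  end subroutine saxpy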
Code Optimization
• I/O
  • Well-formed I/O
  • Lustre file system
    • Big I/O: striping with lfs setstripe
    • Lustre caching and direct I/O
    • MPI I/O Lustre accelerator (SGI, Intel, MVAPICH2)
  • NFS
    • Better for small, random I/O (e.g., code compilations)
  • Parallel I/O issues
    • Shared file
    • Read all versus read one then broadcast (see the sketch below)
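A sketch of the read-one-then-broadcast pattern with standard MPI calls (the file name, unit number, and buffer size are illustrative); one rank touches the file and a single collective distributes the data, instead of every rank opening the same file:

  program read_then_bcast
    use mpi
    implicit none
    integer, parameter :: NVALS = 1024
    real(8) :: buf(NVALS)
    integer :: rank, ierr

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    if (rank == 0) then
       ! Only rank 0 hits the file system.
       open (unit=10, file='input.dat', form='unformatted', &
             action='read', status='old')
       read (10) buf
       close (10)
    end if

    ! One collective replaces a separate read on every rank.
    call MPI_Bcast(buf, NVALS, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

    call MPI_Finalize(ierr)
  end program read_then_bcast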
Code Optimization
• I/O
  • Intel Fortran I/O library
    • FORT_BUFFERED, FORT_BLOCKSIZE, FORT_BUFFERCOUNT
    • Disable buffering for small, random I/O
    • Fortran 2003 ASYNCHRONOUS='YES' (see the sketch below)
  • Linux
    • Page-cache scaling: "cached" can grow too large
    • Direct I/O
    • st_blksize (stat command)
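A minimal sketch of Fortran 2003 asynchronous output (file name, unit number, and array size are illustrative); the WRITE returns while the transfer is still pending and the WAIT statement blocks until it completes, so computation can be overlapped in between:

  program async_write
    implicit none
    integer :: id
    real(8), asynchronous :: a(100000)

    a = 1.0d0
    open (unit=20, file='async.dat', form='unformatted', &
          asynchronous='yes', action='write', status='replace')
    write (20, asynchronous='yes', id=id) a

    ! ... overlap computation with the pending write here ...

    wait (unit=20, id=id)
    close (20)
  end program async_write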
Code Optimization
• Communication
  • SGI MPT
    • MPI_BUFS_PER_PROC
    • MPI_STATS
    • MPInside 3.5.4
    • MPI_BUFFER_MAX (single-copy)
    • MPI_IB_RAILS 2
    • MPI_COLL_
    • MPI_FASTSTART
  • IB failover
    • MPI_IB_RAILS 2|1+
Code Optimization
• Communication
  • SGI MPT: always set
    • MPI_VERBOSE
    • MPI_DISPLAY_SETTINGS
    • MPI_DSM_VERBOSE
Programming Environment and Libraries
• Module environment
  • csh:  source /usr/share/Modules/init/csh
  • bash: . /usr/share/Modules/init/bash

    (module purge)
    module load modules      # RHEL error
    module avail (prefix)
    module load mpt-2.06
    module load intel-fc intel-cc intel-mkl
Programming Environment and Libraries
• SGI libraries
  • SGI MPI 1.4
  • SGI MPT 2.05
  • SGI perfboost, perfcatcher, test
  • SGI omplace
  • SGI MPInside
  • SGI PerfSuite
  • SGI FFIO
• Upcoming MPT 2.06: IB fail-over fixes and others
SGI-Provided Software Tools
• SGI tools
  • SGI perfboost, perfcatcher, test
  • SGI omplace
  • SGI MPInside
  • SGI PerfSuite
  • SGI FFIO
• NUMA tools
  • cpumap, dplace, dlook
  • Linux /sys/devices/system
Pinning for OpenMP and MPI: SGI MPT
• Placement control for a mix of MPI and OpenMP
  • MPI_OPENMP_INTEROP
  • Preferred SGI MPT method:
    mpirun -np ranks omplace [OPTIONS] program args
  • [OPTIONS]
    • -b basecpu: base CPU at which to begin allocating threads [default 0]; relative to the current cpuset
    • -c cpulist: defines the effective cpulist
    • -nt threads: number of threads per MPI process [defaults to 1 or OMP_NUM_THREADS]
    • -vv: shows the created dplace placement file
  • Distribute evenly between processors and LLC
  • Check topology
Pinning for OpenMP and MPI: SGI MPT

% mpirun -np 2 omplace -nt 4 -vv ./testmpiomp.x
omplace information:
  MPI type is SGI MPI, 4 threads, thread model is intel
  placement file /tmp/omplace.file.13498:
    fork skip=0 exact cpu=0-23:4
    thread oncpu=0 cpu=1-3 noplace=1 exact
    thread oncpu=4 cpu=5-7 noplace=1 exact
    thread oncpu=8 cpu=9-11 noplace=1 exact
    thread oncpu=12 cpu=13-15 noplace=1 exact
    thread oncpu=16 cpu=17-19 noplace=1 exact
    thread oncpu=20 cpu=21-23 noplace=1 exact
MPI: dplace use detected, MPI_DSM_... environment variables ignored
rank 0 name cam
rank 1 name cam
rank 0 np 2 nt 4 thread 0 i 1 cpu 0
rank 0 np 2 nt 4 thread 3 i 4 cpu 3
rank 0 np 2 nt 4 thread 1 i 2 cpu 1
rank 0 np 2 nt 4 thread 2 i 3 cpu 2
rank 1 np 2 nt 4 thread 0 i 1 cpu 4
rank 1 np 2 nt 4 thread 2 i 3 cpu 6
rank 1 np 2 nt 4 thread 3 i 4 cpu 7
rank 1 np 2 nt 4 thread 1 i 2 cpu 5
Pinning for OpenMP and MPI: Intel MPI
• Placement control for a mix of MPI and OpenMP
• Intel MPI and Intel OpenMP abstract specifications

% mpirun -genv I_MPI_PIN_DOMAIN=cache -np 2 ./testmpiomp-impi.x
rank 0 name cam
rank 1 name cam
rank 0 np 2 nt 4 thread 0 i 1 cpu 17
rank 0 np 2 nt 4 thread 3 i 4 cpu 16
rank 0 np 2 nt 4 thread 1 i 2 cpu 14
rank 0 np 2 nt 4 thread 2 i 3 cpu 15
rank 1 np 2 nt 4 thread 0 i 1 cpu 23
rank 1 np 2 nt 4 thread 2 i 3 cpu 21
rank 1 np 2 nt 4 thread 1 i 2 cpu 20
rank 1 np 2 nt 4 thread 3 i 4 cpu 22

• Add KMP_AFFINITY for Intel OpenMP thread placement and pinning