Overview of UCSD’s Triton Resource
A cost-effective, high-performance shared resource for research computing
What is the Triton Resource?
• A medium-scale high performance computing (HPC) and data storage system
• Designed to serve the needs of UC researchers:
• Turn-key, cost-competitive access to a robust computing resource
• Supports computing research, scientific & engineering computing, and large-scale data analysis
• No lengthy proposals or long waits for access
• Supports short- or long-term projects
• Accommodates flexible usage models
• Free of the equipment headaches and staffing costs associated with maintaining a dedicated cluster
Triton Resource Components
• Data Oasis: high-performance file system with 2,000 – 4,000 terabytes of disk storage for research data
• Triton Compute Cluster (TCC): medium-scale cluster system for general-purpose HPC. 256 nodes, each with 24 GB of memory and 2 quad-core Nehalem processors (8 cores/node)
• Petascale Data Analysis Facility (PDAF): unique SMP system for analyzing very large datasets. 28 nodes, each with 256/512 GB of memory and 8 quad-core AMD Shanghai processors (32 cores/node)
• High-performance network connecting to high-bandwidth research networks & the Internet
Flexible Usage Models
• Shared-queue access
• Compute nodes are shared with other users
• Jobs are submitted to the queue and wait to run
• Batch and interactive jobs are supported
• User accounts are debited by the actual service units (SUs) consumed by each job
• Dedicated compute nodes
• Users can reserve a fixed number of compute nodes for exclusive access
• Users are charged for 24x7 use of the nodes at 70% utilization; any utilization over 70% is a “bonus” (see the worked example below)
• Nodes may be reserved on a monthly basis
• Hybrid
• Dedicated nodes for core computing tasks, plus shared-queue access for overflow, jobs that are not time-critical, or jobs requiring higher core counts
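As a rough illustration of the dedicated-node model (assuming, purely for this example, that 1 SU corresponds to one core-hour): reserving a single 8-core TCC node for a 30-day month would be charged at about 8 cores × 24 hours × 30 days × 0.70 ≈ 4,032 SUs, whether the node runs at 70% utilization or is driven well above it.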
Triton Resource Benefits
• Short lead time for project start-up
• Low waits in queue
• No lengthy proposal process
• Flexible usage models
• Access to HPC experts for setup, software optimization, and troubleshooting
• Avoid using research staff for sysadmin tasks
• Avoid headaches with maintenance, aging equipment, and project wind-down
• Access to a parallel high-performance, high-capacity storage system
• Access to high-bandwidth research networks
Triton Affiliates & Partners Program (TAPP)
• TAPP is SDSC’s program for accessing the Triton Resource
• Two components:
• Central campus purchase: block purchase made by central campus, then allocated out to individual faculty/researchers
• Individual/department purchase: faculty, researchers, or departments purchase cycles from grants or other funding
• Startup accounts: 1,000-SU accounts for evaluation are granted upon request
Contact for access/allocations:
Ron Hawkins, TAPP Manager
rhawkins@sdsc.edu
(858) 534-5045
Numerical Libraries on Triton
Mahidhar Tatineni, 04/22/2010
AMD Core Math Library (ACML)
• Installed on Triton as part of the PGI compiler installation directory
• Covers BLAS, LAPACK, and FFT routines
• The ACML user guide is at: /opt/pgi/linux86-64/8.0-6/doc/acml.pdf
• Example BLAS, LAPACK, and FFT codes are in: /home/diag/examples/ACML
BLAS Example Using ACML
• Compile and link as follows:
pgcc -L/opt/pgi/linux86-64/8.0-6/lib blas_cdotu.c -lacml -lm -lpgftnrtl -lrt
• Output:
-bash-3.2$ ./a.out
ACML example: dot product of two complex vectors using cdotu
------------------------------------------------------------
Vector x: ( 1.0000, 2.0000) ( 2.0000, 1.0000) ( 1.0000, 3.0000)
Vector y: ( 3.0000, 1.0000) ( 1.0000, 4.0000) ( 1.0000, 2.0000)
r = x.y = ( -6.000, 21.000)
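For orientation, a minimal sketch of what an ACML BLAS call looks like from C is shown below. It uses ddot rather than the cdotu routine above, and it assumes the simple C interface declared in acml.h (scalar arguments passed by value), as described in the acml.pdf user guide; the file name is made up for illustration.

/* ddot_sketch.c -- hypothetical file name: a minimal ACML BLAS call from C.
   Assumes ACML's simple C interface from acml.h (scalar arguments by value). */
#include <stdio.h>
#include <acml.h>

int main(void)
{
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};

    /* ddot(n, x, incx, y, incy) returns the dot product of x and y */
    double r = ddot(3, x, 1, y, 1);

    printf("x.y = %f\n", r);   /* 1*4 + 2*5 + 3*6 = 32 */
    return 0;
}

It should compile and link with the same flags as the cdotu example above, e.g.:
pgcc -L/opt/pgi/linux86-64/8.0-6/lib ddot_sketch.c -lacml -lm -lpgftnrtl -lrt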
LAPACK Example Using ACML
• Compile and link as follows:
pgcc -L/opt/pgi/linux86-64/8.0-6/lib lapack_dgesdd.c -lacml -lm -lpgftnrtl -lrt
• Output:
-bash-3.2$ ./a.out
ACML example: SVD of a matrix A using dgesdd
--------------------------------------------
Matrix A:
-0.5700 -1.2800 -0.3900  0.2500
-1.9300  1.0800 -0.3100 -2.1400
 2.3000  0.2400  0.4000 -0.3500
-1.9300  0.6400 -0.6600  0.0800
Singular values of matrix A:
3.9147 2.2959 1.1184 0.3237
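Along the same lines, the sketch below calls a simpler LAPACK routine (dgetrf, LU factorization with partial pivoting) through ACML's simplified C interface, where workspace handling is hidden and the result code comes back through a pointer. Check acml.pdf for the exact prototype; this is a sketch rather than a verified drop-in example, and the file name is invented.

/* getrf_sketch.c -- hypothetical file name: LU factorization via ACML's C interface. */
#include <stdio.h>
#include <acml.h>

int main(void)
{
    /* 2x2 matrix stored column-major, as LAPACK expects: A = [4 3; 6 3] */
    double a[4] = {4.0, 6.0, 3.0, 3.0};
    int ipiv[2];
    int info;

    /* dgetrf(m, n, a, lda, ipiv, &info) overwrites a with its LU factors */
    dgetrf(2, 2, a, 2, ipiv, &info);

    printf("info = %d, U diagonal = %f %f\n", info, a[0], a[3]);
    return 0;
}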
FFT Example Using ACML
• Compile and link as follows:
pgf90 dzfft_example.f -L/opt/pgi/linux86-64/8.0-6/lib -lacml
• Output:
-bash-3.2$ ./a.out
ACML example: FFT of a real sequence using ZFFT1D
--------------------------------------------------
Components of discrete Fourier transform:
1  2.4836
2 -0.2660
3 -0.2577
4 -0.2564
5  0.0581
6  0.2030
7  0.5309
Original sequence as restored by inverse transform:
   Original  Restored
1  0.3491    0.3491
2  0.5489    0.5489
3  0.7478    0.7478
4  0.9446    0.9446
5  1.1385    1.1385
6  1.3285    1.3285
7  1.5137    1.5137
Intel Math Kernel Libraries (MKL)
• Installed on Triton as part of the Intel compiler directory
• Covers BLAS, LAPACK, FFT, BLACS, and ScaLAPACK libraries
• Most useful link: the Intel link advisor! http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/
• Examples in: /home/diag/examples/MKL
CBLAS Example Using MKL
• Compile as follows:
> export MKLPATH=/opt/intel/Compiler/11.1/046/mkl
> icc cblas_cdotu_subx.c common_func.c -I$MKLPATH/include $MKLPATH/lib/em64t/libmkl_solver_lp64_sequential.a -Wl,--start-group $MKLPATH/lib/em64t/libmkl_intel_lp64.a $MKLPATH/lib/em64t/libmkl_sequential.a $MKLPATH/lib/em64t/libmkl_core.a -Wl,--end-group -lpthread
• Run as follows:
[mtatineni@login-4-0 MKL]$ ./a.out cblas_cdotu_subx.d
C B L A S _ C D O T U _ S U B EXAMPLE PROGRAM
INPUT DATA
N=4
VECTOR X INCX=1
( 1.00, 1.00) ( 2.00, -1.00) ( 3.00, 1.00) ( 4.00, -1.00)
VECTOR Y INCY=1
( 3.50, 0.00) ( 7.10, 0.00) ( 1.20, 0.00) ( 4.70, 0.00)
OUTPUT DATA
CDOTU_SUB = ( 40.100, -7.100)
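For comparison, here is a minimal self-contained CBLAS call against MKL. It uses cblas_ddot, a simpler routine than the cdotu_sub example above; the file name is made up, and it can be linked with the same line shown above (or whatever the link advisor suggests).

/* cblas_ddot_sketch.c -- hypothetical file name: a minimal CBLAS call against MKL. */
#include <stdio.h>
#include <mkl.h>   /* declares the standard CBLAS interface */

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {3.5, 7.1, 1.2, 4.7};

    /* cblas_ddot(n, x, incx, y, incy): double-precision dot product */
    double r = cblas_ddot(4, x, 1, y, 1);

    printf("x.y = %f\n", r);   /* 1*3.5 + 2*7.1 + 3*1.2 + 4*4.7 = 40.1 */
    return 0;
}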
LAPACK Example Using MKL
• Compile as follows:
ifort dgebrdx.f -I$MKLPATH/include $MKLPATH/lib/em64t/libmkl_solver_lp64_sequential.a -Wl,--start-group $MKLPATH/lib/em64t/libmkl_intel_lp64.a $MKLPATH/lib/em64t/libmkl_sequential.a $MKLPATH/lib/em64t/libmkl_core.a -Wl,--end-group libaux_em64t_intel.a -lpthread
• Output:
[mtatineni@login-4-0 MKL]$ ./a.out < dgebrdx.d
DGEBRD Example Program Results
Diagonal
3.6177 2.4161 -1.9213 -1.4265
Super-diagonal
1.2587 1.5262 -1.1895
ScaLAPACK Example Using MKL
• A sample test case (from the MKL examples) is in: /home/diag/examples/scalapack
• The makefile is set up to compile all the tests. Procedure:
module purge
module load intel
module load openmpi_mx
make libem64t compiler=intel mpi=openmpi LIBdir=/opt/intel/Compiler/11.1/046/mkl/lib/em64t
• Sample link line (to illustrate how to link ScaLAPACK):
mpif77 -o ../xsdtlu_libem64t_openmpi_intel_noopt_lp64 psdtdriver.o psdtinfo.o psdtlaschk.o psdbmv1.o psbmatgen.o psmatgen.o pmatgeninc.o -L/opt/intel/Compiler/11.1/046/mkl/lib/em64t /opt/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_scalapack_lp64.a /opt/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_blacs_openmpi_lp64.a -L/opt/intel/Compiler/11.1/046/mkl/lib/em64t /opt/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_intel_lp64.a -Wl,--start-group /opt/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_sequential.a /opt/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_core.a -Wl,--end-group -lpthread
Profiling Tools on Triton
• FPMPI
• MPI profiling library: /home/beta/fpmpi/fpmpi-2 (PGI + MPICH MX)
• TAU
• Profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, and Python
• Available on Triton, compiled with the PGI compilers:
/home/beta/tau/2.19-pgi
/home/beta/pdt/3.15-pgi
Using FPMPI on Triton
• The library is located in: /home/beta/fpmpi/fpmpi-2/lib
• Needs PGI and MPICH MX:
> module purge
> module load pgi
> module load mpich_mx
• Just relink with the library. For example (see the minimal cpi-style sketch below):
/opt/pgi/mpichmx_pgi/bin/mpicc -o cpi cpi.o -L/home/beta/fpmpi/fpmpi-2/lib -lfpmpi
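The cpi program relinked above is the familiar pi-by-quadrature MPI sample. The sketch below is not that exact file, just a minimal program in the same spirit that could be compiled and relinked against FPMPI the same way; the file name is invented.

/* cpi_sketch.c -- hypothetical file name: a minimal pi-by-quadrature MPI program
   in the spirit of the cpi example above. No FPMPI-specific code is needed;
   profiling comes entirely from relinking with -lfpmpi. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i, n = 10000;
    double h, sum = 0.0, mypi, pi, t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    t0 = MPI_Wtime();
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Each rank integrates 4/(1+x^2) over its share of [0,1]. */
    h = 1.0 / (double)n;
    for (i = rank + 1; i <= n; i += size) {
        double x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %.16f, wall clock time = %f\n",
               pi, MPI_Wtime() - t0);

    MPI_Finalize();
    return 0;
}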
Using FPMPI on Triton
• Run the code normally:
> mpirun -machinefile $PBS_NODEFILE -np 2 ./cpi
Process 1 on tcc-2-25.local
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.036982
Process 0 on tcc-2-25.local
• Creates an output file (fpmpi_profile.txt) with the profile data
• Check the /home/diag/FPMPI directory for more examples
Sample FPMPI Output
Command: /mirage/mtatineni/TESTS/FPMPI/./cpi
Date: Wed Apr 21 16:44:04 2010
Processes: 2
Execute time: 0
Timing Stats: [seconds] [min/max] [min rank/max rank]
  wall-clock: 0 sec  0.000000 / 0.000000  0 / 0
Memory Usage Stats (RSS) [min/max KB]: 825/926
Average of sums over all processes
Routine      Calls  Time     Msg Length  %Time by message length
                                         0.........1........1........
                                                   K        M
MPI_Bcast  : 2      0.00179  4           0*00000000000000000000000000
MPI_Reduce : 1      0.0252   8           00*0000000000000000000000000
Sample FPMPI Output
Details for each MPI routine
Average of sums over all processes
                              % by message length
(max over                     0.........1........1........
 processes [rank])                      K        M
MPI_Bcast:
  Calls     : 2        2 [ 0]       0*00000000000000000000000000
  Time      : 0.00179  0.00356 [ 1] 0*00000000000000000000000000
  Data Sent : 4        8 [ 0]
  By bin    : 1-4 [2,2] [ 5.96e-06, 0.00356]
MPI_Reduce:
  Calls     : 1        1 [ 0]       00*0000000000000000000000000
  Time      : 0.0252   0.027 [ 0]   00*0000000000000000000000000
  Data Sent : 8        8 [ 0]
  By bin    : 5-8 [1,1] [ 0.0235, 0.027]
Summary of target processes for point-to-point communication:
1-norm distance of point-to-point with an assumed 2-d topology
(Maximum distance for point-to-point communication from each process)
0 0
Detailed partner data: source: dest1 dest2 ...
Size of COMM_WORLD 2
0:
1:
About TAU
TAU is a suite of Tuning and Analysis Utilities: www.cs.uoregon.edu/research/tau
• 11+ year project involving:
• University of Oregon Performance Research Lab
• LANL Advanced Computing Laboratory
• Research Centre Jülich at ZAM, Germany
• Integrated toolkit for:
• Performance instrumentation
• Measurement
• Analysis
• Visualization
Using TAU
• Load the papi and tau modules
• Gather information for the profile run:
• Type of run (profiling/tracing, hardware counters, etc.)
• Programming paradigm (MPI/OpenMP)
• Compiler (Intel/PGI/GCC…)
• Select the appropriate TAU_MAKEFILE based on your choices ($TAU/Makefile.*)
• Set up the selected PAPI counters in your submission script
• Run as usual and analyze using paraprof
• You can transfer the database to your own PC to do the analysis
TAU: Example
Set up the TAU environment (this will be provided via modules in the next Triton software stack):
export PATH=/home/beta/tau/2.19-pgi/x86_64/bin:$PATH
export LD_LIBRARY_PATH=/home/beta/tau/2.19-pgi/x86_64/lib:$LD_LIBRARY_PATH
Choose the TAU_MAKEFILE to use for your code. For example:
/home/beta/tau/2.19-pgi/x86_64/lib/Makefile.tau-mpi-pdt-pgi
So we set it up:
% export TAU_MAKEFILE=/home/beta/tau/2.19-pgi/x86_64/lib/Makefile.tau-mpi-pdt-pgi
And we compile using the wrapper provided by TAU (a minimal matmult.c sketch follows below):
% tau_cc.sh matmult.c
Run the job through the queue normally. Analyze the output using paraprof. (More detail in the Ranger part of the presentation.)
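The sketch below is not the matmult.c referenced above, just a small illustrative MPI matrix multiply that tau_cc.sh could instrument in the same way (automatic source instrumentation via PDT plus the MPI wrapper library from the chosen makefile); the file name is invented.

/* matmult_sketch.c -- hypothetical file name: a small MPI matrix multiply to
   illustrate the kind of code instrumented with tau_cc.sh. */
#include <stdio.h>
#include <mpi.h>

#define N 256

static double a[N][N], b[N][N], c[N][N];

int main(int argc, char *argv[])
{
    int rank, size, i, j, k;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Fill the input matrices with simple values. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
            c[i][j] = 0.0;
        }

    /* Each rank computes a contiguous block of rows of C = A * B;
       TAU times this loop nest and the MPI calls. */
    for (i = rank * N / size; i < (rank + 1) * N / size; i++)
        for (k = 0; k < N; k++)
            for (j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("done, c[0][0] = %f\n", c[0][0]);

    MPI_Finalize();
    return 0;
}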
Coming Soon on Triton
• Data Oasis version 0! We have the hardware on site and are working to set up the Lustre filesystem (~350 TB).
• Upgrade of the entire software stack. Many of the packages in /home/beta will become a permanent part of the stack (we have Rocks rolls for them). This will happen within a month.
• mpiP will be installed soon on Triton.
• PAPI/IPM needs the perfctr patch of the kernel. We need to integrate this into our stack (not in the current upgrade).