CUG14 BoF: Future Needs for Understanding User-Level Activity with ALTD
ALTD
• What it does
  • Intercepts the linker (ld) and the job launcher (aprun)
  • Uses the linker tracemap option to get all libraries
  • Stores all of this in a database
• What it gets
  • Full path of the executable
  • Static and dynamic libraries used by the executable
• What it can be used for
  • Which executables use the largest number of core hours?
    • Are they managed by the center? Do they use the system efficiently?
  • Which libraries, applications, or tools are being used?
    • Are there libraries we should remove? Are there libraries we should install?
  • What percentage of executables are scripts?
    • Are these scripts being used because the job starter isn’t sophisticated enough?
  • Are there any executables with modification times older than 1 year?
    • Should we ask the user to recompile?
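The capture step described on this slide can be illustrated with a minimal sketch. It assumes the linker's trace output has already been captured as text, with one input file per line and archive members possibly written as `(archive.a)member.o`. That layout is an assumption made for illustration, not a description of ALTD's actual parser, and the name `libraries_from_trace` is hypothetical.

```python
import re

# A path ending in a static (.a) or shared (.so, .so.N) library name.
# Parentheses are excluded so "(libc.a)printf.o" yields only "libc.a".
LIB_PATTERN = re.compile(r"[^\s()]+\.(?:a|so(?:\.\d+)*)(?=[\s()]|$)")


def libraries_from_trace(trace_text):
    """Return the library paths named in the trace text, in first-seen order."""
    found = []
    for line in trace_text.splitlines():
        found.extend(LIB_PATTERN.findall(line))
    # dict.fromkeys drops repeats while keeping the original order.
    return list(dict.fromkeys(found))


if __name__ == "__main__":
    sample = "\n".join([
        "/usr/lib64/crt1.o",
        "main.o",
        "(/usr/lib64/libc.a)printf.o",
        "/opt/fftw/3.3.0.1/x86_64/lib/libfftw3.a",
    ])
    for lib in libraries_from_trace(sample):
        print(lib)
```

Object files such as `main.o` are deliberately ignored, since the slide describes tracking libraries rather than every linker input.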
What Does NERSC Collect?
• ALTD
  • Tracks library usage at both compile time and run time
• Torque Logs
  • Job information, accounting
• ALPS Logs
  • Track application run-time data and options on the Cray systems
• Darshan
  • I/O profiling data
• IPM
  • MPI profiling data
• Performance Monitoring
  • Monitors system performance over the lifetime of the machines
• LMT
  • Lustre data
ALTD is enabled on all major computing platforms at NERSC
Applications of ALTD
• Understanding current library usage and planning for future software needs
• Providing usage statistics to developers and vendors
• Restoring the program environment where user applications were built
• Assisting with debugging system issues
An ALTD tool to restore the build environment for an application:
aryal@edison12:~> linkinfo.sh /global/homes/a/aryal/bin/gvasp5.3.2
User : zz217
Linked on : 2013-01-03
Executable Name: vasp
Libraries Used :
//usr/lib64/libhugetlbfs.a
../vasp.5.lib/libdmy.a
/opt/cray/atp/1.6.0/lib//libAtpSigHCommData.a
/opt/cray/atp/1.6.0/lib//libAtpSigHandler.a
/opt/cray/libsci/12.0.00/cray/81/sandybridge/lib/libsci_cray_mp.a
/opt/fftw/3.3.0.1/x86_64/lib/libfftw3.a
/opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpich_cray.a
/opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpl.a
/opt/cray/xpmem/0.1-2.0500.36799.3.6.ari/lib64/libxpmem.a
/opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
/opt/cray/ugni/4.0-1.0500.5836.7.58.ari/lib64/libugni.a
/opt/cray/udreg/2.3.2-1.0500.5931.3.1.ari/lib64/libudreg.a
/opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpslli.a
/opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpsutil.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libpgas-dmapp.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
/opt/cray/dmapp/4.0.1-1.0500.5932.6.5.ari/lib64/libdmapp.a
/opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libfi.a
/opt/gcc/4.4.4/snos/lib64/libstdc++.a
/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libf.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libcraymath.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libcraymp.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libcsup.a
//usr/lib64/librt.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libtcmalloc_minimal.a
//usr/lib64/libpthread.a
//usr/lib64/libc.a
/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
//usr/lib64/libm.a
/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc.a
ALTD at CSCS
• In production at CSCS since 2011
• Rock solid: just a single downtime in two years
• Rosa (Cray XE6) since March 2011
  • 600K compilations, 2.8M jobs
• Todi (Cray XK6/XK7) since October 2012
  • 470K compilations, 500K jobs
• Daint (Cray XC30) since March 2013
  • 100K compilations, 550K jobs
• We’ve added an additional SQL table, “accounting”, which logs more data about the application execution: number of cores used, number of cores claimed, number of threads, MPI processes, processes per node, …
• We want to be able to detect situations like the use of a buggy or non-performant library
How we mine data: a hypothetical situation
A critical bug has been identified in FFTW version 3.3.0.2, affecting code correctness
First, find which users have linked this library
mysql> select distinct username
    -> from altd_rosa_link_tags, altd_rosa_linkline
    -> where altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
    ->   and exit_code = 0
    ->   and linkline like '%fftw/3.3.0.2/%';
+----------+
| username |
+----------+
| tkachenn |
| boswald  |
| liang    |
| robinson |
| yunding  |
| zilia    |
+----------+
5 rows in set (4.33 sec)
• Querying the ALTD database reveals that several users have applications linked to the buggy library
Now, check if they are using the buggy application
mysql> select altd_rosa_jobs.*
    -> from altd_rosa_link_tags, altd_rosa_linkline, altd_rosa_jobs
    -> where altd_rosa_jobs.tag_id = altd_rosa_link_tags.tag_id
    ->   and altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
    ->   and exit_code = 0
    ->   and linkline like '%fftw/3.3.0.2/%'
    ->   and altd_rosa_jobs.username = "robinson";
+---------+--------+------------------------+----------+------------+--------+---------------+
| run_inc | tag_id | executable             | username | run_date   | job_id | build_machine |
+---------+--------+------------------------+----------+------------+--------+---------------+
| 2410158 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
| 2410172 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
| 2410198 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
| 2410222 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
+---------+--------+------------------------+----------+------------+--------+---------------+
4 rows in set (0.65 sec)
• And it’s confirmed that user “robinson” is running the application linked to the buggy library
• It’s now up to the user services group to contact the user and recommend relinking their applications against the newer version of FFTW, which has fixed the bug
This methodology is clearly unmanageable!
• Ideally, user support specialists would be alerted automatically to “situations of interest”
  • Users running applications linked to legacy, less-performant, or buggy libraries
  • Users running legacy versions of applications
  • Users building code with legacy compilers
  • Users making use of their own libs or apps, when more optimized versions are available centrally
How can we automate the processes of data mining, reporting and alerting?
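One possible shape for such automation is sketched below, purely as an illustration. It reuses the table and column names that appear in the queries on the earlier slides, but several things are assumptions: the placement of `username` and `exit_code` in `altd_rosa_link_tags`, the use of an in-memory SQLite database in place of the production MySQL server, the demo rows, and the hypothetical names `WATCH_LIST` and `find_alerts`.

```python
import sqlite3

# Hypothetical watch list: link-line substring -> reason to alert support staff.
WATCH_LIST = {"fftw/3.3.0.2/": "critical bug affecting code correctness"}

# Same join as the manual queries, but for every user who actually ran a job.
ALERT_QUERY = """
SELECT DISTINCT j.username, j.executable
FROM altd_rosa_jobs AS j
JOIN altd_rosa_link_tags AS t ON j.tag_id = t.tag_id
JOIN altd_rosa_linkline AS l ON t.linkline_id = l.linking_inc
WHERE t.exit_code = 0 AND l.linkline LIKE ?
ORDER BY j.username, j.executable
"""


def find_alerts(conn, watch_list=WATCH_LIST):
    """Return (username, executable, pattern, reason) for each flagged run."""
    alerts = []
    for pattern, reason in watch_list.items():
        rows = conn.execute(ALERT_QUERY, ("%" + pattern + "%",))
        for username, executable in rows:
            alerts.append((username, executable, pattern, reason))
    return alerts


def demo_database():
    """Build a tiny in-memory database with made-up rows (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE altd_rosa_linkline (linking_inc INTEGER, linkline TEXT);
        CREATE TABLE altd_rosa_link_tags (
            tag_id INTEGER, linkline_id INTEGER, username TEXT, exit_code INTEGER);
        CREATE TABLE altd_rosa_jobs (
            run_inc INTEGER, tag_id INTEGER, executable TEXT, username TEXT);
        INSERT INTO altd_rosa_linkline VALUES
            (1, 'ld ... /opt/fftw/3.3.0.2/x86_64/lib/libfftw3.a'),
            (2, 'ld ... /opt/fftw/3.3.0.2/x86_64/lib/libfftw3.a'),
            (3, 'ld ... /opt/fftw/3.3.0.1/x86_64/lib/libfftw3.a');
        INSERT INTO altd_rosa_link_tags VALUES
            (438583, 1, 'robinson', 0),
            (438584, 2, 'liang', 0),
            (438585, 3, 'alice', 0);
        INSERT INTO altd_rosa_jobs VALUES
            (2410158, 438583, '/users/robinson/mycode', 'robinson'),
            (2410172, 438583, '/users/robinson/mycode', 'robinson'),
            (2410300, 438585, '/users/alice/app', 'alice');
    """)
    return conn


if __name__ == "__main__":
    for alert in find_alerts(demo_database()):
        print("ALERT: %s ran %s (matched %s: %s)" % alert)
```

A script like this could be run on a schedule so that support staff receive a list of affected users instead of issuing the queries by hand. How alerts are delivered is left open here.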
Lariat (TACC)
• What it does
  • Intercepts the job launcher (ibrun)
  • Uses ldd to get shared libraries
  • Checks the run-time environment against the compile-time environment
• What it gets
  • Full path of the executable
  • Dynamic libraries used by the executable
  • Last modification time of the executable
  • Size of the executable (e.g. bss, data, text)
  • Unique hash of the executable
  • Whether the executable is a binary or a shell script
• What it can be used for
  • Which executables use the largest number of core hours?
    • Are they managed by TACC? Do they use the system efficiently?
  • Which libraries, applications, or tools are being used?
    • Are there libraries we should remove? Are there libraries we should install?
  • What percentage of executables are scripts?
    • Are these scripts being used because the job starter isn’t sophisticated enough?
    • Should we direct these users to our parametric job launcher?
  • Are there any executables with modification times older than 1 year?
    • Should we ask the user to recompile?
  • Are there any executables with large statically allocated arrays (bss)?
These can also be obtained with ALTD through straightforward modifications, as CSCS has already done
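Some of the per-executable facts listed on this slide can be gathered with a few lines of standard-library code. The sketch below is not Lariat's implementation. The function names, the choice of SHA-1 as the hash, and the detection rules (ELF magic bytes for a binary, `#!` for a script) are assumptions made for illustration.

```python
import hashlib
import os


def describe_executable(path):
    """Collect path, kind, hash, and modification time for one executable."""
    with open(path, "rb") as handle:
        data = handle.read()
    if data.startswith(b"\x7fELF"):
        kind = "binary"      # ELF files begin with these four magic bytes
    elif data.startswith(b"#!"):
        kind = "script"      # interpreter line, e.g. #!/bin/sh
    else:
        kind = "unknown"
    return {
        "path": os.path.abspath(path),
        "kind": kind,
        "sha1": hashlib.sha1(data).hexdigest(),
        "mtime": os.path.getmtime(path),   # seconds since the epoch
    }


def is_stale(info, now, max_age_days=365):
    """True if the executable was last modified more than max_age_days ago."""
    return (now - info["mtime"]) > max_age_days * 86400


if __name__ == "__main__":
    import sys
    import time

    info = describe_executable(sys.argv[0])
    print(info["kind"], info["sha1"], is_stale(info, time.time()))
```

The one-year default in `is_stale` mirrors the "older than 1 year" question on the slide. Reading the whole file into memory keeps the sketch short but would be wasteful for very large executables.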
TACC_Stats
• Job-level transparent performance monitoring from HPC compute nodes
  • CPU performance counters
  • IB statistics
  • Lustre statistics
  • Scheduler job statistics
  • Host data
  • OS statistics
• Analyses integrate available Lariat data
Nightly Analyses
• Automatically analyzes jobs nightly
• Highlights jobs worth looking at
• Tries to provide a one-stop view of a job for
  • Support staff
  • Sysadmins
  • And soon, users
Current Reports
• High levels of imbalance
• Low Flops (but other activity)
• Idle hosts
• Catastrophic performance drop
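As a rough illustration of how two of these reports might be computed, the sketch below flags idle hosts and load imbalance from per-host CPU utilization. The input format, the function name `flag_job`, and both thresholds are assumptions. The slides do not say how TACC_Stats defines these conditions.

```python
from statistics import mean


def flag_job(host_cpu_fraction, idle_threshold=0.05, imbalance_threshold=0.5):
    """Flag a job from a dict of host -> average CPU utilization in [0, 1].

    Both thresholds are illustrative assumptions, not values from TACC_Stats.
    """
    flags = []
    values = list(host_cpu_fraction.values())
    if not values:
        return flags

    # Hosts that did almost no work for the whole job.
    idle = sorted(h for h, u in host_cpu_fraction.items() if u < idle_threshold)
    if idle:
        flags.append(("idle hosts", idle))

    # Spread between busiest and least busy host, relative to the average.
    average = mean(values)
    if average > 0:
        spread = (max(values) - min(values)) / average
        if spread > imbalance_threshold:
            flags.append(("high imbalance", round(spread, 2)))
    return flags


if __name__ == "__main__":
    print(flag_job({"n1": 0.92, "n2": 0.90, "n3": 0.01}))
```

A nightly run over all finished jobs could then keep only the jobs with a non-empty flag list, which matches the "highlights jobs worth looking at" idea.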
XALT: Understanding the Software Needs of High End Computer Users
• Newly NSF-funded project
• Will combine the best of Lariat and ALTD
  • Collecting job-level and link-time data, with subsequent analytics
  • Building a community around analytics (potentially one of many tools)
• Will be made available to the community
• Optional interface to XDMoD/SUPReMM
XALT Goals
• Goal is a census of libraries and applications and automatic filtering of user issues
  • What additional user problems can we detect and report (perhaps correct) automatically?
  • How can we leverage lessons learned by the TACC Stats team to implement additional automatic filtering?
• Plan to add tracking of function calls as well
• Want to balance the need for portability with support for site-specific capabilities
• Want to simplify the processes system administrators use to install, configure, and manage XALT
Mark’s [not so hidden] agenda
• Do you know what libraries are being used? Can you help a user figure out what he did X months ago? Do you know how many users have trouble matching the runtime environment to the compile-time environment?
• Would you have strong opposition to intercepting the linker "ld"?
• Anyone willing to be a beta tester for our before and after study?
• Do you have any issues with dropping dot files in user home directories?
• Do you want to track library function calls?
• xalt-users@lists.sourceforge.net
• Want feedback, hungry for ideas
Thanks to
• Richard Gerber and Zhengji Zhao, NERSC
• Tim Robinson, CSCS
• Bill Barth, TACC
Contact Info
• Mark R. Fahey
  • mfahey@utk.edu
• Robert McLay
  • mclay@tacc.utexas.edu
Background Slides
Robert McLay, TACC
My Passions
• Protect new users but stay out of the vets' way
• Make staff support efficient and effective
• Automate detection, correction, prevention
• Make the repeat tickets go away!
Making a difference…
Maintain consistent, compatible software environment
Lmod and related tools
$ module swap mvapich2 impi
Inactive Modules:
  1) vasp
Due to MODULEPATH changes the following have been reloaded:
  1) fftw3/3.3.2
$ module load mvapich2
Lmod Error: You can only have one MPI module loaded at a time.
You already have impi loaded.
Making a difference…
Detect potential problems and alert users
Lariat and related tools
TACC: Starting up job 423224
******************************************************
WARNING: Your MPI Environment is: mvapich2/1.9a2
         Your executable was built with: impi/4.1.0.030
******************************************************
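The check behind a warning like this one can be sketched as a comparison of two module names. The code below is an assumption-laden illustration rather than Lariat's code. It supposes the build-time and run-time MPI modules are available as `name/version` strings and that only the module name (the part before the slash) needs to match. The function name is hypothetical, and the banner is modeled loosely on the slide.

```python
def mpi_mismatch_warning(runtime_module, build_module):
    """Return a warning banner if the MPI module names differ, else None.

    Assumes both arguments look like "name/version", e.g. "impi/4.1.0.030".
    """
    runtime_name = runtime_module.split("/", 1)[0]
    build_name = build_module.split("/", 1)[0]
    if runtime_name == build_name:
        return None
    bar = "*" * 54
    return "\n".join([
        bar,
        "WARNING: Your MPI Environment is: " + runtime_module,
        "         Your executable was built with: " + build_module,
        bar,
    ])


if __name__ == "__main__":
    message = mpi_mismatch_warning("mvapich2/1.9a2", "impi/4.1.0.030")
    if message:
        print(message)
```

Comparing only the module name means a version-only difference within the same MPI stack stays silent. A stricter site policy could compare the full strings instead.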
Making a difference…
Job-level usage data on libraries and applications
ALTD (Mark Fahey, NICS)
Joining forces…
• ALTD: Job-level usage data on libraries and applications
• Lariat: Detect potential problems and alert users
TACC: Starting up job 423224
******************************************************
WARNING: Your MPI Environment is: mvapich2/1.9a2
         Your executable was built with: impi/4.1.0.030
******************************************************
• ALTD and Lariat join forces in XALT
My own not-so-hidden agenda...
• Looking for XALT beta users
• Hungry for ideas, needs, feedback
• Wanting to begin a conversation with kindred souls
Lariat User #1
Bill Barth
Director of HPC, TACC
Co-PI, SUPReMM
NERSC Job Data
Richard Gerber, Zhengji Zhao
NERSC User Services
Expose job data via the web
• We try to make as much data available as possible via the web
  • For users to track usage
  • For users to check resource utilization
  • For users to monitor performance
  • For staff to help debug jobs
  • For summary reports
• The following are web screen shots
• All data collection is transparent to users
Monitoring Software Usage at CSCS
Dr Tim Robinson, CSCS
Drilling Down: Understanding User-Level Activity on Today’s Supercomputers