XSEDE14 BoF: Drilling Down: Understanding User-Level Activity on Today's Supercomputers

Outline
• Brief presentation
• Open discussion
• Demo

Can you
• Accurately say how many users and projects link a particular library into their code?
• Determine if a library was never used?
• Differentiate user-built app usage from center-provided app usage?
• Determine after the fact which users used a buggy library?
• Help a user figure out how they built their code (provenance information)?
• Determine usage trends for libraries and compilers?
• Catch differences between the run-time and compile-time environments?
• Determine which routines from a math or I/O library are used the most?
• Identify applications in use that are older than a certain age?

If not, but you want to
• We will describe our new tool, XALT
• First, a little background
• Then a brief description of XALT

Robert McLay, TACC

My Passions
• Protect new users but stay out of the veterans' way
• Make staff support efficient and effective
• Automate detection, correction, prevention
• Make the repeat tickets go away!

Making a difference…
Maintain consistent, compatible software environment
Lmod and related tools

    $ module swap mvapich2 impi

    Inactive Modules:
      1) vasp

    Due to MODULEPATH changes the following have been reloaded:
      1) fftw3/3.3.2

    $ module load mvapich2
    Lmod Error: You can only have one MPI module loaded at a time.
    You already have impi loaded.

Making a difference…
Detect potential problems and alert users
Lariat and related tools

    TACC: Starting up job 423224
    ******************************************************
    WARNING: Your MPI Environment is: mvapich2/1.9a2
             Your executable was built with: impi/4.1.0.030
    ******************************************************

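The check behind a warning like this can be sketched as a comparison between the MPI module loaded when the job starts and the MPI module recorded when the executable was built. The following is a minimal Python sketch under stated assumptions: the function name `mpi_mismatch_warning` and its two string arguments are hypothetical, and the choice to compare only the module family (not the version) is an assumption. It is not Lariat's actual implementation.

```python
from typing import Optional


def mpi_mismatch_warning(runtime_mpi: str, build_mpi: str) -> Optional[str]:
    """Return a warning banner if the MPI family differs between run time and build time.

    Both arguments are module names of the form "family/version"
    (hypothetical inputs; how they are obtained is outside this sketch).
    """
    # Assumption: only the family ("mvapich2" vs "impi") matters, not the version.
    runtime_family = runtime_mpi.split("/", 1)[0]
    build_family = build_mpi.split("/", 1)[0]
    if runtime_family == build_family:
        return None
    rule = "*" * 54
    return "\n".join([
        rule,
        f"WARNING: Your MPI Environment is: {runtime_mpi}",
        f"         Your executable was built with: {build_mpi}",
        rule,
    ])


# Values taken from the warning shown on the slide.
banner = mpi_mismatch_warning("mvapich2/1.9a2", "impi/4.1.0.030")
if banner is not None:
    print(banner)
```

Comparing families rather than full version strings keeps the warning quiet for minor version differences; a site could tighten the rule if version mismatches also matter.
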
Making a difference…
Job-level usage data on libraries and applications
ALTD (Mark Fahey -- NICS)

Joining forces…
• Lariat: detect potential problems and alert users
• ALTD: job-level usage data on libraries and applications
• Lariat + ALTD → XALT

    TACC: Starting up job 423224
    ******************************************************
    WARNING: Your MPI Environment is: mvapich2/1.9a2
             Your executable was built with: impi/4.1.0.030
    ******************************************************

My own not-so-hidden agenda...
• Looking for XALT beta users
• Hungry for ideas, needs, feedback
• Wanting to begin a conversation with kindred souls

Mark Fahey, UTK

ALTD
• Tracks
  • Which executables use the largest number of core hours?
    • Are they managed by the center? Do they use the system efficiently?
  • Which libraries, applications, or tools are being used?
    • Are there libraries we should remove? Are there libraries we should install?
  • What percentage of executables are scripts?
    • Are these scripts being used because the job starter isn’t sophisticated enough?
  • Are there any executables with modification times older than 1 year?
    • Should we ask the user to recompile?
• In use by several centers already
  • NERSC, NCCS, NICS, CSCS, NCSA/BW, and, most recently, KAUST

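As an illustration of one of these questions (executables with modification times older than one year), here is a minimal Python sketch. The record layout (path, modification time), the function name `stale_executables`, and the sample data are all hypothetical; the sketch does not reflect how ALTD stores or queries this information.

```python
from datetime import datetime, timedelta


def stale_executables(records, now, max_age_days=365):
    """Return paths of executables last modified more than max_age_days before now.

    records: iterable of (path, modification datetime) pairs (hypothetical layout).
    """
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, mtime in records if mtime < cutoff)


# Hypothetical sample data, for illustration only.
records = [
    ("/users/alice/old_app", datetime(2012, 1, 3)),
    ("/users/bob/new_app", datetime(2014, 6, 1)),
]
stale = stale_executables(records, now=datetime(2014, 7, 15))
print(stale)
```

The resulting list is the set of users a support group might ask to recompile.
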
ALTD is enabled on all major computing platforms at NERSC

Applications of ALTD
An ALTD tool to restore the build environment for an application:

    aryal@edison12:~> linkinfo.sh /global/homes/a/aryal/bin/gvasp5.3.2
    User            : zz217
    Linked on       : 2013-01-03
    Executable Name : vasp
    Libraries Used  :
      //usr/lib64/libhugetlbfs.a
      ../vasp.5.lib/libdmy.a
      /opt/cray/atp/1.6.0/lib//libAtpSigHCommData.a
      /opt/cray/atp/1.6.0/lib//libAtpSigHandler.a
      /opt/cray/libsci/12.0.00/cray/81/sandybridge/lib/libsci_cray_mp.a
      /opt/fftw/3.3.0.1/x86_64/lib/libfftw3.a
      /opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpich_cray.a
      /opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpl.a
      /opt/cray/xpmem/0.1-2.0500.36799.3.6.ari/lib64/libxpmem.a
      /opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
      /opt/cray/ugni/4.0-1.0500.5836.7.58.ari/lib64/libugni.a
      /opt/cray/udreg/2.3.2-1.0500.5931.3.1.ari/lib64/libudreg.a
      /opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpslli.a
      /opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpsutil.a
      /opt/cray/cce/8.1.2/craylibs/x86-64/libpgas-dmapp.a
      /opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
      /opt/cray/dmapp/4.0.1-1.0500.5932.6.5.ari/lib64/libdmapp.a
      /opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
      /opt/cray/cce/8.1.2/craylibs/x86-64/libfi.a
      /opt/gcc/4.4.4/snos/lib64/libstdc++.a
      /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
      /opt/cray/cce/8.1.2/craylibs/x86-64/libf.a
      /opt/cray/cce/8.1.2/craylibs/x86-64/libcraymath.a
      /opt/cray/cce/8.1.2/craylibs/x86-64/libcraymp.a
      /opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
      /opt/cray/cce/8.1.2/craylibs/x86-64/libcsup.a
      //usr/lib64/librt.a
      /opt/cray/cce/8.1.2/craylibs/x86-64/libtcmalloc_minimal.a
      //usr/lib64/libpthread.a
      //usr/lib64/libc.a
      /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
      //usr/lib64/libm.a
      /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc.a

• Understanding current library usage and planning for future software needs
• Providing usage statistics to developers and vendors
• Restoring the program environment where user applications were built
• Assisting with debugging system issues

ALTD at CSCS
• In production at CSCS since 2011
• Rock solid: just a single downtime in two years
• Rosa (Cray XE6) since March 2011
  • 600K compilations, 2.8M jobs
• Todi (Cray XK6/XK7) since October 2012
  • 470K compilations, 500K jobs
• Daint (Cray XC30) since March 2013
  • 100K compilations, 550K jobs
• We’ve added an additional SQL table “accounting” which logs more data about the application execution: number of cores used, number of cores claimed, number of threads, MPI processes, processes per node, …
• We want to be able to detect situations like the use of a buggy or non-performant library

How we mine data: a hypothetical situation
A critical bug has been identified in FFTW version 3.3.0.2, affecting code correctness

First, find which users have linked this library

    mysql> select distinct username
           from altd_rosa_link_tags, altd_rosa_linkline
           where altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
             and exit_code = 0
             and linkline like '%fftw/3.3.0.2/%';
    +----------+
    | username |
    +----------+
    | tkachenn |
    | boswald  |
    | liang    |
    | robinson |
    | yunding  |
    | zilia    |
    +----------+
    6 rows in set (4.33 sec)

• Querying the ALTD database reveals that several users have applications linked to the buggy library

Now, check if they are using the buggy application

    mysql> select altd_rosa_jobs.*
           from altd_rosa_link_tags, altd_rosa_linkline, altd_rosa_jobs
           where altd_rosa_jobs.tag_id = altd_rosa_link_tags.tag_id
             and altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
             and exit_code = 0
             and linkline like '%fftw/3.3.0.2/%'
             and altd_rosa_jobs.username = "robinson";
    +---------+--------+------------------------+----------+------------+--------+---------------+
    | run_inc | tag_id | executable             | username | run_date   | job_id | build_machine |
    +---------+--------+------------------------+----------+------------+--------+---------------+
    | 2410158 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    | 2410172 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    | 2410198 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    | 2410222 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    +---------+--------+------------------------+----------+------------+--------+---------------+
    4 rows in set (0.65 sec)

• This confirms that user “robinson” is running the application linked to the buggy library
• It’s now up to the user services group to contact the user and recommend relinking their applications against the newer version of FFTW, which has fixed the bug

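The two manual steps (who linked the library, and who then ran the result) can be collapsed into a single join. The following is a minimal Python sketch using the standard `sqlite3` module with an in-memory database. The table names (`linkline`, `link_tags`, `jobs`), their reduced column sets, and the sample users are simplified assumptions modeled on the queries shown on the slides, not the actual ALTD schema or data.

```python
import sqlite3


def users_running_library(conn, pattern):
    """Return (username, executable) pairs for jobs whose link line matches pattern."""
    query = """
        SELECT DISTINCT j.username, j.executable
        FROM jobs AS j
        JOIN link_tags AS t ON j.tag_id = t.tag_id
        JOIN linkline AS l ON t.linkline_id = l.linking_inc
        WHERE t.exit_code = 0          -- only links that succeeded
          AND l.linkline LIKE ?        -- parameterized library pattern
        ORDER BY j.username, j.executable
    """
    return conn.execute(query, (pattern,)).fetchall()


# Hypothetical toy schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE linkline (linking_inc INTEGER PRIMARY KEY, linkline TEXT);
    CREATE TABLE link_tags (tag_id INTEGER PRIMARY KEY, linkline_id INTEGER,
                            username TEXT, exit_code INTEGER);
    CREATE TABLE jobs (run_inc INTEGER PRIMARY KEY, tag_id INTEGER,
                       executable TEXT, username TEXT, run_date TEXT);
    INSERT INTO linkline VALUES
        (1, '/opt/fftw/3.3.0.2/lib/libfftw3.a'),
        (2, '/opt/fftw/3.3.0.4/lib/libfftw3.a');
    INSERT INTO link_tags VALUES
        (10, 1, 'alice', 0),
        (11, 2, 'bob', 0),
        (12, 1, 'carol', 0);
    INSERT INTO jobs VALUES
        (100, 10, '/users/alice/mycode', 'alice', '2013-11-05'),
        (101, 10, '/users/alice/mycode', 'alice', '2013-11-06'),
        (102, 11, '/users/bob/sim', 'bob', '2013-11-05');
""")

affected = users_running_library(conn, "%fftw/3.3.0.2/%")
print(affected)
```

In this toy data, `carol` linked against the flagged version but never ran a job, so only users with matching jobs are returned.
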
This methodology is clearly unmanageable!
• Ideally, user support specialists would be alerted automatically to “situations of interest”
  • Users running applications linked to legacy, less-performant, or buggy libraries
  • Users running legacy versions of applications
  • Users building code with legacy compilers
  • Users making use of their own libs or apps, when more optimized versions are available centrally
How can we automate the processes of data mining, reporting and alerting?

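One possible shape for such automation is a watch list of patterns that is applied to every recorded link line, with the matches grouped per user for the support staff. The following is a minimal Python sketch of that idea only. The names `WATCH_LIST` and `build_alerts`, the (username, link line) record layout, the sample records, and the choice of which versions count as "buggy" or "legacy" are all hypothetical.

```python
from collections import defaultdict

# Hypothetical watch list: (reason, substring to look for in a link line).
# Which versions count as buggy or legacy is an assumption for illustration.
WATCH_LIST = [
    ("buggy library", "fftw/3.3.0.2/"),
    ("legacy compiler", "gcc/4.4.4/"),
]


def build_alerts(link_records, watch_list=WATCH_LIST):
    """Group the watch-list reasons that apply to each user.

    link_records: iterable of (username, linkline) pairs (hypothetical layout).
    Returns a dict mapping username to a sorted list of reasons.
    """
    alerts = defaultdict(set)
    for username, linkline in link_records:
        for reason, needle in watch_list:
            if needle in linkline:
                alerts[username].add(reason)
    return {user: sorted(reasons) for user, reasons in sorted(alerts.items())}


# Hypothetical sample records, for illustration only.
records = [
    ("alice", "/opt/fftw/3.3.0.2/lib/libfftw3.a /opt/gcc/4.4.4/snos/lib64/libstdc++.a"),
    ("bob", "/opt/fftw/3.3.0.4/lib/libfftw3.a"),
]
alerts = build_alerts(records)
for user, reasons in alerts.items():
    print(f"{user}: {', '.join(reasons)}")
```

Keeping the watch list as data rather than code means staff could add a new "situation of interest" without touching the scanning logic.
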
TACC_Stats
• Job-level transparent performance monitoring from HPC compute nodes
  • CPU performance counters
  • IB statistics
  • Lustre statistics
  • Scheduler job statistics
  • Host data
  • OS statistics
• Analyses integrate available Lariat data (XALT in the future)

XALT: Understanding the Software Needs of High End Computer Users
• NSF-funded project
• Combining the best of Lariat and ALTD
• Collecting job-level and link-time data, with subsequent analytics
• Alpha version for collection
• Working on subsequent analytics
• Building a community around analytics, potentially one of many tools
• Will make it available to the community
• Optional interface to XDMoD/SUPReMM

XALT Goals
• Goal is a census of libraries and applications and automatic filtering of user issues
• What additional user problems can we detect and report (perhaps correct) automatically?
• How can we leverage lessons learned by the TACC_Stats team to implement additional automatic filtering?
• Plan to add tracking of function calls as well
• Want to balance the need for portability with support for site-specific capabilities
• Want to simplify the processes system administrators use to install, configure, and manage XALT

XALT Agenda
• New tracking infrastructure: XALT
  • Alpha version available today
  • Deployed at NICS and TACC
  • LANL and CSCS testing it
• Some new functionality still to add
  • Detect function calls
  • Check runtime environment versus compile-time environment
  • Analytics
• SourceForge
  • http://sourceforge.net/projects/xalt/
  • xalt-users@lists.sourceforge.net
• Want feedback, hungry for ideas

Thanks to
• Richard Gerber and Zhengji Zhao, NERSC
• Tim Robinson, CSCS
• Bill Barth, TACC
• Bilel Hadri, KAUST
• Julius Westerman, LANL

Contact Info
• Mark R. Fahey
  • mfahey@utk.edu
• Robert McLay
  • mclay@tacc.utexas.edu