Overview of the Computer Resource Team (CRT)
Blaise Barney (LLNL), Rob Cunningham (LANL), Barbara Jennings (Sandia)
PSAAP Kickoff Meeting, July 8, 2008, Albuquerque, NM
LLNL-PRES-405061
What Is The CRT?
• The Computer Resource Team (CRT) is the component of the PSAAP program that connects Alliance researchers to the High Performance Computing (HPC) resources required to perform their work
• The CRT consists of one representative from each NNSA Lab who is familiar with that lab's computing resources, personnel and policies. The following individuals serve on the CRT:
  • Blaise Barney, LLNL
  • Rob Cunningham, LANL
  • Barbara Jennings, SNL
• Our primary purpose is to provide assistance and guidance in all aspects of using the HPC resources located at LANL, LLNL, Sandia (and SDSC)
What Does The CRT Do For You?
• Track and facilitate the resolution of problems reported to each Lab's customer support “hotline”
• Provide training opportunities
• Collect and distribute monthly machine usage statistics
• Schedule and support special/dedicated runs
• Maintain a balance of machine usage between the Alliances
• Conduct annual Alliance visits to discuss HPC resources and user issues, and to offer technical consultation and/or training
• Showcase Alliance research in the NNSA/ASC research exhibit booth at the annual SC conference
• Assist with the establishment and use of computer accounts
• Assist with accessing compute resources
• Provide essential HPC user documentation
• Provide technical support and referral to in-depth consulting
• Conduct monthly telecons to keep Alliance users up-to-date with account, access, policy, scheduling and technical issues, and to address issues with HPC platform usage
• Interface with other individuals and groups within the Labs, such as management, networking, system administration, storage, customer support, etc., to facilitate the effective support of Alliance users
Computer Accounts
• Alliances need at least one account authorizer. This can be a PI, POC and/or a trustworthy, knowledgeable designee
• Account authorizers are responsible for overseeing the accounts and machine usage for all of their Center's users
• Each Lab has its own policies, forms and procedures; however, there is a single entry portal (sarape.sandia.gov) for requesting an account at any of the 3 labs
• Account processing for non-US citizens requires additional time and “paperwork” - allow 30-90 days (plan ahead)
• Having a “backup” authorizer is important if the primary authorizer is often not available
• The CRT has sent all PSAAP POCs and PIs “quick sheets” for getting started with account requests and account management
• Questions? Contact your CRT representative (depends upon the Lab where the account is requested)
Computer Access
• To access any machine, you must first have an account on that machine
• As with accounts, each lab has its own access policies and procedures
• All 3 labs require a valid computer account, ssh and use of a password generating device (cryptocard / one-time token), which is sent to you after your initial account request is approved
• Additionally, LLNL requires remote users to access resources through VPN (virtual private network):
  • Makes your local machine appear to be on the LLNL network
  • VPN accounts are included with original account applications
  • Requires a one-time software download, install and config - or - simply connect via a web interface
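The access rules on this slide can be sketched as a small pre-login checklist. This is an illustration only: the function and constant names are hypothetical and do not correspond to any lab-provided tool; the requirements themselves are taken from the bullets above.

```python
# Sketch only: encodes the access requirements listed on this slide.
# All names here are hypothetical, not part of any lab tool.

COMMON_REQUIREMENTS = [
    "valid computer account",
    "ssh client",
    "one-time password token (cryptocard)",
]

# Per the slide, only LLNL adds a VPN requirement for remote users.
EXTRA_REQUIREMENTS = {
    "LLNL": ["VPN connection (client install or web interface)"],
}

KNOWN_LABS = ("LLNL", "LANL", "Sandia")


def access_requirements(lab: str) -> list[str]:
    """Return the checklist a remote user needs before logging in at `lab`."""
    if lab not in KNOWN_LABS:
        raise ValueError(f"unknown lab: {lab!r}")
    # Build a new list so callers cannot mutate the shared constants.
    return COMMON_REQUIREMENTS + EXTRA_REQUIREMENTS.get(lab, [])
```

For example, `access_requirements("LLNL")` returns the three common items plus the VPN entry, while `access_requirements("LANL")` returns only the common items.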
User Documentation
• Most of what users need to know is available online via web pages hosted by each of the labs. Recommended starting points:
  • LLNL
    • computing.llnl.gov
    • computing.llnl.gov/tutorials/lc_resources
  • LANL
    • computing-int.lanl.gov
    • int.lanl.gov/projects/asci/training/Intro
  • Sandia
    • hpc.sandia.gov
    • clik.sandia.gov
  • SDSC
    • www.sdsc.edu/us
• Access to this information varies:
  • LLNL, SDSC: most web pages are open – no authentication required
  • Sandia, LANL: most web pages require authentication (you need an account set up first)
HPC Training
• Training is important – especially for new users
• Online tutorials are available (see previous User Documentation links)
• Workshops conducted at the Labs are open to Alliance users
• Training delivered at your Center or over the Access Grid is also possible
• Topics include:
  • Getting Started Information
  • Compilers
  • Performance tools, Optimization
  • Debuggers
  • Parallel programming (MPI, OpenMP, Pthreads…)
  • Batch schedulers
  • Architectures (Purple, Redstorm, TLCC, etc.)
  • Visualization tools
• Topic specific, customized training? The CRT can assist here too.
Customer Support and Problem Tracking
• Customer support for technical and accounting issues is available via phone and email during normal business hours.
• Problems and questions are tracked via a customer support database application (varies with each Lab).
• Most problems/questions are handled via “Tier 1” support – the “hotline” at each Lab.
• More in-depth issues are typically referred to local “Tier 2” support – a specialist.
• The labs coordinate with hardware and software vendors for issues that require outside “Tier 3” support.
• Off-hours support is handled by Operations staff.
• CRT reps coordinate regularly with each other on Tri-lab user issues.
Dedicated Runs (DATs)
• Normally, Alliance users share machine usage with other users - jobs are typically submitted to a batch system, queued, and wait their turn for execution.
• Additionally, there are limits on the number of nodes and number of hours that a job can use.
• Exclusive use of a machine (dedicated application time - DAT) can be requested by any Alliance. For example, at LLNL:
  • Most weekends are dedicated to Alliance use of the ALC and UP clusters
  • Normal node/time limits are not in effect
  • No other user jobs are run - only those of the scheduled Alliance(s)
• How to request a DAT:
  • LLNL: computing.llnl.gov/forms/ASC_dat_form.html
  • LANL: email to consult@lanl.gov
  • Sandia: email to redstorm-help@sandia.gov
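The difference between a normal batch job and a DAT job can be sketched as a simple limit check. This is a minimal illustration, not how any lab's scheduler is implemented: `QueueLimits`, `job_allowed` and the numeric limits are all hypothetical, since the slides do not state the actual node or hour limits.

```python
from dataclasses import dataclass


@dataclass
class QueueLimits:
    """Hypothetical per-job limits for a shared batch queue."""
    max_nodes: int
    max_hours: float


def job_allowed(nodes: int, hours: float, limits: QueueLimits,
                dedicated: bool = False) -> bool:
    """Return True if a job request fits the queue rules.

    During dedicated application time (DAT) the normal node/time
    limits are not in effect, so any positive request is accepted.
    """
    if nodes <= 0 or hours <= 0:
        return False
    if dedicated:
        return True
    return nodes <= limits.max_nodes and hours <= limits.max_hours


# Assumed values for illustration only; real limits vary by machine and lab.
EXAMPLE_LIMITS = QueueLimits(max_nodes=64, max_hours=12.0)
```

With these assumed limits, a 256-node, 48-hour request is rejected in normal operation but accepted when `dedicated=True`.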
Communications
• Monthly telecons and email list (asap-crt@lanl.gov)
  • Active participation by all 8 Alliances, LLNL, LANL, Sandia and SDSC
  • Forum for discussion/questions on user topics such as accounts, access, technical issues, machine schedules, etc.
  • First Wednesday of each month, 1:00pm Pacific time
  • Toll-free number hosted by the CRT: 866-914-3976 code: 187522#
  • Minutes are distributed via our email list to all Alliances, ASC HQ and various staff & managers within the Labs
  • Let us know if you want anyone else at your Center added to our list - initially it includes only your POC and PI
• Usage stats
  • Collected by the CRT and distributed with the telecon minutes
  • Present both aggregate and detailed usage (down to the user level) for each Lab (and SDSC).
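The aggregate and per-user views of the usage stats can be sketched as a two-level roll-up. This is only an illustration: the record layout, the `summarize_usage` name and the sample values are assumptions, not the CRT's actual reporting format or data.

```python
from collections import defaultdict


def summarize_usage(records):
    """Roll up (lab, alliance, user, node_hours) records.

    Returns two dicts: aggregate node-hours per (lab, alliance),
    and detailed node-hours per (lab, alliance, user).
    """
    per_alliance = defaultdict(float)
    per_user = defaultdict(float)
    for lab, alliance, user, node_hours in records:
        per_alliance[(lab, alliance)] += node_hours
        per_user[(lab, alliance, user)] += node_hours
    return dict(per_alliance), dict(per_user)


# Made-up sample records for illustration only.
SAMPLE = [
    ("LLNL", "Alliance A", "user1", 120.0),
    ("LLNL", "Alliance A", "user2", 30.0),
    ("LANL", "Alliance B", "user3", 75.5),
    ("LLNL", "Alliance A", "user1", 10.0),
]

aggregate, detailed = summarize_usage(SAMPLE)
```

Keeping both dictionaries keyed by lab first mirrors the slide's "for each Lab" breakdown; a real report would likely also be split by month.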
Communications
• Email & phone
  • Customer support staff at each lab are available for assistance and are also active in sending out important machine/network status notices.
  • The CRT can be contacted directly by any of your Center's users:
    • Blaise Barney (LLNL) blaiseb@llnl.gov 925-422-2578
    • Rob Cunningham (LANL) rtc@lanl.gov 505-665-4444 x05704
    • Barbara Jennings (Sandia) bjjenni@sandia.gov 505-845-8554
• Visits
  • Annual visits (2-4 hrs) to the Alliances by the CRT and Lab customer support staff:
    • Focus is on the Alliance users of HPC resources
    • Updates on architectures, policies, future plans at the Labs
    • Forum for discussing user issues, problems, questions
    • We can also include technical "training" sessions if desired
  • We'll be contacting you soon to set up an initial visit - after your users have accounts - possibly in the Sep-Oct time frame