The MSU Institute for Cyber Enabled Research As a Resource for Theoretical Physics 2010/11/09 – Eric McDonald
Outline • The Institute for Cyber Enabled Research (iCER) • The High Performance Computing Center (HPCC) • (Upcoming Computational Physics Talks at the Institute) • Work on the NuShellX Nuclear Shell Model Code Using HPCC Resources • How Can iCER Serve NSCL and FRIB? • NSCL-iCER Liaising • Questions? Discussion
What is iCER? • Established in 2009 to encourage and support the application of advanced computing resources and techniques by MSU researchers. • Goal is to maintain and enhance the university's national and international standing in computational disciplines and research thrusts.
Organization of iCER • High Performance Computing Center • HPC Programmers / Domain Specialists • Research Consultants and Faculty • Clerical Staff • External Advisory Board • Steering Committee • Executive Committee • Directorate
What does iCER provide? • High Performance Computing Systems (via HPCC) • Buy-in Opportunities (departmental liaisons / domain specialists, privileged access to additional HPCC hardware) • Education (weekly research seminars and HPCC technical talks, hosting of workshops and virtual schools, special topic courses) • Consulting • Collaboration and Grant Support
What is HPCC? • Established in late 2004 to provide a campus-wide high performance computing facility for MSU researchers. • HPC systems are free to use. No project or department accounts are charged. • One goal is to help researchers capture research and funding opportunities that might not otherwise be possible.
Organization of HPCC • Systems Administrators (full-time on-call staff; each has one or more areas of specialty, such as databases, scheduling, sensors, or storage) • iCER Domain Specialists / HPC Programmers (have “superuser” privileges on HPCC systems and work closely with the systems administrators) • Clerical Staff • Director
HPCC Hardware History I • 2005: 'green'. 1 “supercomputer” (128 Intel IA-64 cores), 512 GiB of RAM, ~500 GFLOPS. • 2005: 'amd05' cluster. 128 nodes (512 AMD Opteron x86-64 cores), 8 GiB of RAM per node (1 TiB total), ~2.4 TFLOPS. • 2007: 'intel07' cluster. 128 nodes (1024 Intel Xeon x86-64 cores), 8 GiB of RAM per node (1 TiB total), ~9.5 TFLOPS.
HPCC Hardware History II • 2009: 'amd09' cluster. 5 “fat nodes” (144 AMD Opteron x86-64 cores), 128 or 256 GiB per node (1.125 TiB total). • 2010: 'gfx10' GPGPU cluster. 32 nodes (256 Intel Xeon x86-64 cores), 18 GiB per node (576 GiB total), 2 nVidia Tesla M1060 GPGPU accelerators per node, 480 GPGPU cores per node (15360 GPGPU cores total).
HPCC Hardware History III • Coming Soon: 'intel10' cluster. 188 nodes (1504 Intel Xeon x86-64 cores), 24 GiB per node (~4.4 TiB total), ~14.7 TFLOPS. • (Installation pictures on next slide.) • What's next? You tell us. More fat nodes, another GPGPU cluster, ...?
Hardware Buy-In Program I • When HPCC is planning to purchase a new cluster, users are given a window in which they may purchase nodes to add to it. • Buy-in users have privileged access to their nodes in that they can preempt other users' jobs on those nodes. • When buy-in nodes are not specifically requested by their purchasers, jobs are scheduled on them just as on non-buy-in nodes.
Hardware Buy-In Program II • HPCC maintains a great deal of infrastructure which buy-in users can regard as a bonus: • Power • Cooling • Support Contracts • Mass Storage • High-Speed Interconnects • On-Call Staff • Security • Sensor Monitoring • Compare to building and maintaining your own cluster....
Data Storage I • Initial home directory storage quota is 50 GB. Can be temporarily boosted in 50 or 100 GB increments, up to 1 TB, if a good research reason is given. • http://www.hpcc.msu.edu/quota • Shared research spaces may also be requested. • http://www.hpcc.msu.edu/contact • Allocations beyond 1 TB are sold in 1 TB chunks, currently at US$500 per TB. This factors in staff time and infrastructure costs as well as raw storage costs.
Data Storage II • Snapshots of home directories are taken every hour. Users can recover accidentally-deleted files from these snapshots without intervention from systems administrators. • Home and research directories are automatically backed up daily. • Replication to off-site storage units is performed. • Scratch space is not backed up.
High Speed Networking • Most nodes are connected via an Infiniband fabric. • Individual node throughputs can reach up to 10 Gbps. • Network data storage is accessed over this link as well. • MPI libraries have Infiniband support, so MPI-parallelized codes can pass messages over these low-latency links (endpoint-to-endpoint latency as low as 1.1 μs). A minimal illustration follows below.
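As a rough illustration of what that latency means in practice, here is a minimal MPI ping-pong sketch (illustrative only, not HPCC-specific code): two ranks bounce a small message back and forth and report the average round-trip time. Build it against one of the MPI libraries installed on HPCC (MVAPICH or OpenMPI), and place the two ranks on different nodes so the Infiniband fabric is actually exercised.

    program pingpong
      use mpi
      implicit none
      integer, parameter :: nreps = 10000
      integer :: ierr, rank, nprocs, i, stat(MPI_STATUS_SIZE)
      double precision :: t0, t1, buf(1)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      if (nprocs < 2) then
         if (rank == 0) print *, 'Run with at least 2 MPI ranks.'
         call MPI_Finalize(ierr)
         stop
      end if
      buf = 0.0d0
      t0 = MPI_Wtime()
      do i = 1, nreps
         if (rank == 0) then
            ! Rank 0 sends the message and waits for the echo.
            call MPI_Send(buf, 1, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
            call MPI_Recv(buf, 1, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, stat, ierr)
         else if (rank == 1) then
            ! Rank 1 echoes the message straight back.
            call MPI_Recv(buf, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, stat, ierr)
            call MPI_Send(buf, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
         end if
      end do
      t1 = MPI_Wtime()
      if (rank == 0) write(*,'(A,ES10.3,A)') 'Average round trip: ', (t1 - t0) / nreps, ' s'
      call MPI_Finalize(ierr)
    end program pingpong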
Software Overview I • Operating System Kernel: Linux 2.6.x • Compiler collections (C, C++, Fortran 77*, Fortran 90/95/2003/2008-ish*) from multiple vendors available. • GCC (FSF): 4.1.2 (default), 4.4.4 • Intel: 9.0, 10.0.025, 11.0.083 (default), 11.1.072 • Open64: 4.2.3 • Pathscale: 2.2.1, 2.5, 3.1 (default) • PGI: 9.0, 10.1 (default), 10.9 * Some restrictions may apply. See compiler documentation for details.
Software Overview II • MPI Libraries • MVAPICH (essentially MPICH with Infiniband support): 1.1.0 • OpenMPI: 1.3.3, 1.4.2 (default version depends on compiler version) • Math Libraries • Intel Math Kernel Library (MKL): 9.1.021, 10.1.2.024 (default) • Fastest Fourier Transform in the West (FFTW): 2.1.5 (with MPI support), 3.2.2 (default)
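Since MKL bundles BLAS and LAPACK, dense linear algebra is available out of the box. As a minimal sketch (illustrative only, not tied to any particular code on the clusters), the standard LAPACK routine dsyev can be called to get the eigenvalues of a small symmetric matrix, using the usual workspace query; consult the HPCC documentation for the exact MKL link line.

    program dsyev_demo
      implicit none
      integer, parameter :: n = 4
      double precision :: a(n,n), w(n)
      double precision, allocatable :: work(:)
      integer :: lwork, info, i, j
      ! Fill a small symmetric test matrix.
      do j = 1, n
         do i = 1, n
            a(i,j) = 1.0d0 / dble(i + j)
         end do
      end do
      ! First call is a workspace query (lwork = -1); second call diagonalizes.
      allocate(work(1))
      lwork = -1
      call dsyev('V', 'U', n, a, n, w, work, lwork, info)
      lwork = int(work(1))
      deallocate(work)
      allocate(work(lwork))
      call dsyev('V', 'U', n, a, n, w, work, lwork, info)
      if (info /= 0) stop 'dsyev failed'
      write(*,'(A,4F12.6)') 'Eigenvalues: ', w
    end program dsyev_demo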
Software Overview III • Debuggers • gdb 6.6 • TotalView 8.7 (has GUI interface and MPI support) • GPGPU Programming Tools • PGI Fortran Compilers (CUDA support in version 10.x) • nVidia CUDA Toolkit: 2.2, 2.3 (default), 3.1
Software Overview IV • Matlab 2009a • Version Control Systems • CVS • Git • Mercurial • Subversion • Miscellaneous Developer Tools • cmake • Doxygen
Coming Software... • Math Libraries • AMD Core Math Library (ACML) • ATLAS • PETSc • SuperLU • Trilinos • Mathematica 7.1 • SciPy (including NumPy) • Visualization Tools • ParaView • VisIt
Using HPCC I • Interactive logins via 'gateway.hpcc.msu.edu'. • You need an SSH client. Several good ones exist for Windows; try 'PuTTY', for example. Mac OS X and Linux distributions come bundled with one. • 'gateway' is only a gateway: log in to the developer nodes from there to do real work. (You can, however, submit jobs to the batch queues directly from 'gateway'.) • Developer nodes are for compiling codes and running short (<10 minutes) tests.
Using HPCC II • Architectures of developer nodes are representative of cluster compute nodes. • 'white' ↔ 'green' • 'dev-amd05' ↔ 'amd05' cluster • 'dev-intel07' ↔ 'intel07' cluster • 'dev-gfx08' ↔ (no matching cluster) • 'dev-intel09' ↔ (no matching cluster) • 'dev-amd09' ↔ 'amd09' fat nodes • 'dev-gfx10' ↔ 'gfx10' GPGPU cluster • 'dev-intel10' ↔ 'intel10' cluster
Using HPCC IV • Your files can be accessed without an SSH login to 'gateway'. • The file servers containing your home directories and research spaces can be reached via a CIFS connection. CIFS is the same protocol that Windows uses for file sharing, and Mac OS X speaks it as well. • For more information: • https://wiki.hpcc.msu.edu/x/VYCe
Upcoming iCER Seminars • Erich Ormand – LLNL • “Is High Performance Computing the Future of Theoretical Science?” • 2010/11/11 (this Thursday) 10:30 AM • BPS 1445A • Bagels and coffee • Joe Carlson – LANL • 2010/12/09 10:30 AM • For more information: • https://wiki.hpcc.msu.edu/x/q4Cy
NuShellX I • Nuclear shell model code. • Developed by Bill Rae of Oxford, who has also created other shell model codes: Oxbash, MultiShell, and NuShell. • Uses a Lanczos iterative solver to find the low-lying energies (sketched below). • The code can also produce transition rates and spectroscopic factors.
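For background, the Lanczos iteration builds, from a normalized starting vector v_1 (with the β_1 v_0 term taken as zero), an orthonormal basis in which the Hamiltonian H becomes tridiagonal. A minimal sketch of the recurrence:

    \[
      \beta_{j+1}\, v_{j+1} = H v_j - \alpha_j v_j - \beta_j v_{j-1},
      \qquad
      \alpha_j = v_j^{T} H v_j,
      \qquad
      \beta_{j+1} = \bigl\lVert H v_j - \alpha_j v_j - \beta_j v_{j-1} \bigr\rVert .
    \]

After m iterations the coefficients α_j and β_j form a small m × m tridiagonal matrix whose lowest eigenvalues converge rapidly toward the lowest eigenvalues (energies) of H, which is why a modest number of iterations suffices for the low-lying spectrum.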
NuShellX II • Another iterative solver, which implements a scheme known as Thick Restart Lanczos, is also available. • Part of my project with Alex is to verify the correctness of this implementation and fix accuracy issues that may arise in some cases. • May also investigate implementation of application-level checkpointing in this context.
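To give a flavor of what application-level checkpointing could look like in a Lanczos context, here is a hypothetical sketch: it dumps the current iteration count, the Lanczos vectors built so far, and the tridiagonal coefficients to an unformatted file so a long run can be resumed after a wall-time limit or node failure. The routine name and argument list are illustrative only, not actual NuShellX interfaces.

    subroutine write_checkpoint(iter, n, m, v, alpha, beta)
      ! Hypothetical checkpoint writer: not a NuShellX routine.
      implicit none
      integer, intent(in) :: iter, n, m                  ! current iteration, vector length, max iterations
      double precision, intent(in) :: v(n, m)            ! Lanczos vectors built so far
      double precision, intent(in) :: alpha(m), beta(m)  ! tridiagonal coefficients
      integer, parameter :: u = 17
      open(unit=u, file='lanczos.chk', form='unformatted', &
           status='replace', action='write')
      write(u) iter, n, m
      write(u) v(:, 1:iter)
      write(u) alpha(1:iter), beta(1:iter)
      close(u)
    end subroutine write_checkpoint

A matching reader would restore the same arrays at startup and continue the iteration from iter + 1.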
NuShellX III • Some work already done on NuShellX using HPCC: • Ported the code, including Alex's wrappers, to Linux, and provided a Makefile-based build system. • Generalized the code away from the Intel Fortran compiler (ifort), so that other compilers may be used to build it. In particular, HPCC's PGI compilers were used. • Explored the scalability of the OpenMP version of the code on HPCC fat nodes (see the sketch below).
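For reference, the kind of measurement involved in that scalability exploration looks roughly like the sketch below: an illustrative threaded matrix-vector product (a stand-in, not NuShellX code) timed with omp_get_wtime(). Compile with OpenMP enabled, then rerun with different OMP_NUM_THREADS settings on a fat node (e.g. via 'dev-amd09') and compare timings; for memory-bound kernels like this one, the speedup typically flattens once memory bandwidth becomes the limit.

    program omp_scaling_demo
      use omp_lib
      implicit none
      integer, parameter :: n = 4000
      double precision, allocatable :: h(:,:), v(:), w(:)
      double precision :: t0, t1
      integer :: i, j
      allocate(h(n,n), v(n), w(n))
      call random_number(h)
      call random_number(v)
      t0 = omp_get_wtime()
      ! Each thread handles a block of rows of the matrix-vector product.
      !$omp parallel do private(j)
      do i = 1, n
         w(i) = 0.0d0
         do j = 1, n
            w(i) = w(i) + h(i, j) * v(j)
         end do
      end do
      !$omp end parallel do
      t1 = omp_get_wtime()
      write(*,'(A,I0,A,F8.4,A)') 'threads = ', omp_get_max_threads(), &
           ',  time = ', t1 - t0, ' s'
    end program omp_scaling_demo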
NuShellX IV • Some planned work on NuShellX which will use HPCC resources: • Working bugs out of the MPI version. (In progress.) • Performance tuning the MPI version. • Experimenting with a CUDA implementation on the GPGPU cluster. • Goal is to prepare for big, long runs on NERSC and ANL ALCF machines.
What can iCER do for you? • Can work together to pursue grant opportunities, especially in computational physics. • Can provide dedicated computing power via the HPCC buy-in program. • Can provide a cost-free stepping stone to the petascale “leadership class” machines at national labs, Blue Waters, etc.... (No charges for CPU time while debugging and testing on a significant scale.)
iCER Online Resources I • The core of the iCER/HPCC web: • https://wiki.hpcc.msu.edu • iCER Home Page: • http://icer.msu.edu/ • Announcements: • https://wiki.hpcc.msu.edu/x/QAGQ • Calendar: • https://wiki.hpcc.msu.edu/x/q4Cy
iCER Online Resources II • New Account Requests (requires login with MSU NetID and password): • http://www.hpcc.msu.edu/request • Software Installation Requests, Problem Reports, etc... (requires login with MSU NetID and password): • http://rt.hpcc.msu.edu/ • Documentation: • https://wiki.hpcc.msu.edu/x/A4AN
iCER-NSCL Liaising I • A successful partnership between BMB and iCER has produced a domain specialist / HPC programmer who acts as a liaison between the two organizations. • iCER and HPCC leadership are looking to extend this model to other organizations. • Hence this talk, and hence the availability of someone with a physics background to serve as a liaison. • (Work with NSCL at large is paid for by iCER, not by Alex Brown.)
iCER-NSCL Liaising II • Proposed liaison is fluent in C, C++, Python, and Mathematica, and also has some proficiency in Fortran 77/90. • Proposed liaison has experience with parallelization. • Proposed liaison has accounts at several DOE facilities (and is always looking to add more). Knowing what to expect at the petascale can help preparations at the “deca-terascale”.
iCER-NSCL Liaising III • Proposed liaison has superuser privileges on the HPCC clusters. • Can rectify many problems without waiting for a systems administrator to address them. • Can install software without needing special permission or assistance. • Proposed liaison is “in the loop”. Participates in both HPCC technical meetings and iCER organizational meetings.
Thank you. Questions? Discussion.