MURI Hardware Resources
Ray Garcia, Erik Olson
Space Science and Engineering Center, University of Wisconsin-Madison

Resources for Researchers
• CPU cycles
• Memory
• Storage space
• Network
• Software
  • Compilers
  • Models
  • Visualization programs

Original MURI hardware
• 16 PIII processors
• Storage server with 0.5 TB
• Gigabit networking
• Purpose:
  • Provide a working environment for collaborative development.
  • Enable running of the large multiprocessor MM5 model.
  • Gain experience working with clustered systems.

Capabilities and Limitations
• Successfully ran initial MM5 model runs, algorithm development (fast model), and modeling of GIFTS optics (FTS simulator).
• MM5 model runs for 140 by 140 domains; one 270 by 270 run with very limited time steps.
• OpenPBS scheduled hundreds of jobs on the system (a submission sketch follows below).
• Idle CPU time was given over to FDTD raytracing.
• Expanded to 28 processors using funding from B. Baum, the IPO, and others.
• However, MM5 model runtime limited domain size, and storage space limited the number of output time steps.

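For context, queuing hundreds of jobs like this typically means scripting calls to qsub. The sketch below is a minimal, hypothetical submission loop: the batch script name, case names, and resource request are assumptions, while qsub and its -N/-l/-v options are standard OpenPBS/Torque usage.

    import subprocess

    # Hypothetical sketch: submit one MM5 case per OpenPBS job.
    cases = [f"run_{i:03d}" for i in range(100)]

    for case in cases:
        subprocess.check_call([
            "qsub",
            "-N", case,              # job name shown in the queue
            "-l", "nodes=4:ppn=2",   # e.g. four dual-CPU nodes
            "-v", f"CASE={case}",    # hand the case name to the job script
            "run_mm5.sh",            # hypothetical PBS batch script
        ])
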
CY2003 Upgrade
• NASA provided funding for 11 dual Pentium 4 processor nodes:
  • 4 GB DDR RAM
  • 2.4 GHz CPUs
• Expressly purposed for running large IHOP field program simulations (400 by 400 grid point domain).

Cluster “Mark 2”
• Gains:
  • Larger-scale model runs and instrument simulations as needed for IHOP
  • Terabytes of experimental and simulation data online through NAS-hosted RAID arrays
• Limitations to further work at even larger scale:
  • Interconnect limitations slowed large model runs
  • 32-bit memory limit blocked huge model set-up jobs for MM5 and WRF (see the footprint sketch below)
  • An increasing number of small storage arrays

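To make the 32-bit constraint concrete, here is a rough, assumption-laden estimate of a set-up job's memory footprint. The horizontal grid matches the IHOP domain above; the level and field counts are illustrative guesses.

    # Back-of-envelope footprint for a 400 x 400 set-up job (counts assumed).
    nx, ny, nz = 400, 400, 35   # horizontal grid points, vertical levels
    nfields = 60                # 3-D state plus working arrays (a guess)
    bytes_each = 4              # single-precision float

    gib = nx * ny * nz * nfields * bytes_each / 2**30
    print(f"{gib:.1f} GiB")     # ~1.3 GiB per copy; a set-up code holding
                                # several copies soon exceeds the ~3 GB a
                                # 32-bit Linux process can address
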
3 Years of Cluster Work
• Inexpensive:
  • Adding CPUs to the system
• Costly:
  • Adding users to the system
  • Adding storage to the system
• Easily understood:
  • Matlab
• Not so well understood:
  • Distributed system (computing, storage) capabilities

Along comes DURIP
• H.L. Huang / R. Garcia DURIP proposal awarded May 2004.
• Purpose: provide hardware for next-generation research and education programs.
• Scope: identify computing and storage systems to serve the need to expand simulation, algorithm research, data assimilation, and limited operational product generation experiments.

Selecting Computing Hardware
• Cluster options for numerical modeling were evaluated and found to require significant time investment.
• Purchased an SGI Altix in fall 2004 after extensive test runs with WRF and MM5:
  • 24 Itanium 2 processors running Linux
  • 192 GB of RAM
  • 5 TB of FC/SATA disk
• Recently upgraded to 32 CPUs and 10 TB of storage.

SGI Altix Capabilities
• Large contiguous RAM allows a 1600 by 1600 grid point domain, larger than the CONUS area at 4 km resolution (see the arithmetic below). The largest run so far is 1070 by 1070.
• NUMAlink interconnect provides fast turnaround for model runs.
• Presents itself as a single 32-CPU Linux machine.
• Intel compilers ease porting and optimizing Fortran/C on 32-bit and 64-bit hardware.

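The same illustrative arithmetic backs both claims: CONUS spans very roughly 5000 km by 3000 km, needing about 1250 by 750 points at 4 km spacing, and a 1600 by 1600 set-up at the field counts assumed earlier wants on the order of 20 GiB, hopeless for one 32-bit process but comfortable in 192 GB of shared RAM.

    # CONUS extent is approximate; field counts are the same guesses as before.
    print(5000 // 4, "x", 3000 // 4)    # ~1250 x 750 points at 4 km spacing

    nx = ny = 1600
    nz, nfields, bytes_each = 35, 60, 4
    gib = nx * ny * nz * nfields * bytes_each / 2**30
    print(f"{gib:.1f} GiB")             # ~20 GiB in one shared address space
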
Storage Class: Home Directory
• Small size, for source code (preferably also held under CVS control) and critical documents
• Nightly incremental backups (a backup sketch follows below)
• Quota enforcement
• Current implementation:
  • Local disks on the cluster head
  • Backup by TC (Technical Computing)

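As a minimal sketch of one common way to run nightly incrementals, the script below uses rsync's --link-dest snapshot scheme; the source and destination paths are hypothetical, and a real setup would run from cron with logging.

    import datetime
    import os
    import subprocess

    SRC = "/home/"                  # hypothetical home-directory tree
    DEST = "/backup/home"           # hypothetical backup volume
    today = datetime.date.today().isoformat()

    # Unchanged files hard-link against the previous snapshot, so each
    # nightly run stores only what changed (rsync's --link-dest scheme).
    subprocess.check_call([
        "rsync", "-a", "--delete",
        f"--link-dest={DEST}/latest",
        SRC, f"{DEST}/{today}",
    ])

    # Point 'latest' at the new snapshot for tomorrow's run.
    latest = os.path.join(DEST, "latest")
    if os.path.islink(latest):
        os.remove(latest)
    os.symlink(today, latest)
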
Storage Class: Workspace
• Optimized for speed
• Automatic flushing of unused files (a scrubber sketch follows below)
• No insurance against disk failure; users are expected to move important results to long-term storage
• Current implementation:
  • RAID5 or RAID0 drive arrays within the cluster systems

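A minimal sketch of that flushing policy, assuming a /workspace mount point and a 30-day retention window: delete any file not accessed within the window. A production scrubber would add logging, a dry-run mode, and exclusion lists.

    import os
    import time

    WORKSPACE = "/workspace"        # hypothetical mount point
    MAX_AGE_DAYS = 30               # assumed retention policy
    cutoff = time.time() - MAX_AGE_DAYS * 86400

    for dirpath, _dirnames, filenames in os.walk(WORKSPACE):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:  # last access too old
                    os.remove(path)
            except OSError:
                pass                # vanished or unreadable; skip it
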
Storage Class: Long-term
• Large amount of space
• Redundant, preferably backed up to tape
• Managed directory system, preferably with metadata (a sidecar sketch follows below)
• Current implementation:
  • Many project-owned NAS devices with partial redundancy (RAID5)
  • NFS spaghetti
  • Ad-hoc tape backup

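One lightweight way to approximate a managed directory with metadata is a JSON sidecar written next to each dataset. The record fields and file path below are purely illustrative.

    import json
    import pathlib

    def write_sidecar(dataset: pathlib.Path, **fields) -> None:
        """Write an illustrative JSON metadata record beside a dataset."""
        sidecar = dataset.parent / (dataset.name + ".meta.json")
        sidecar.write_text(json.dumps(fields, indent=2))

    # Hypothetical usage for a long-term model output file:
    write_sidecar(pathlib.Path("/longterm/ihop/mm5_d01_20020612.nc"),
                  project="IHOP", source="MM5 simulation",
                  created="2002-06-12", owner="rgarcia")
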
DURIP phase 2: Storage
• Long-term storage scaling and management goals:
  • Reduce or eliminate NFS ‘spaghetti’
  • Include a hardware phase-in / phase-out strategy in the purchase decision
  • Acquire the hardware to seed a Storage Area Network (SAN) in the Data Center, improving uniformity and scalability
  • Reduce overhead costs (principally human time)
  • Work closely with the Technical Computing group on system setup and operations for a long-term facility

Immediate Options
• Red Hat GFS: size limitations and hardware/software mix-and-match; support costs offset the free source code.
• HP Lustre: more likely a candidate for workspace; expensive.
• SDSC SRB (Storage Resource Broker): stability, documentation, and maturity found inadequate at the time of testing.
• Apple Xsan: plays well with third-party storage hardware; straightforward to configure and maintain; affordable.

Dataset Storage Purchase Plan
• 64-bit storage servers and a metadata server
• QLogic Fibre Channel switch to move data between hosts and drive arrays
• SAN software to provide a distributed filesystem:
  • Focusing on Apple Xsan for a 1-3 year span
  • Follow up with a 1-year assessment, with the option of re-competing
• Storage arrays: competing Apple XRAID and Western Scientific Tornado

Target System for 2006
• Scalable dataset storage accessible from clusters, workstations, and the supercomputer
• Backup strategy
• Update existing cluster nodes to ROCKS:
  • Simplifies management and improves uniformity
  • Proven on other clusters deployed by SSEC
• Retire/repurpose slower cluster nodes
• Reduce bottlenecks to workspace disk
• Improve ease of use and understanding

Long-term Goals
• 64-bit shared memory system scaled to huge job requirements (Altix)
• Complementary compute farm migrating to x86-64 (Opteron) hardware
• Improved workspace performance
• Scalable storage with full metadata for long-term and published datasets
• Software development tools for multiprocessor algorithm development