
MURI Hardware Resources




  1. MURI Hardware Resources • Ray Garcia, Erik Olson • Space Science and Engineering Center at the University of Wisconsin-Madison

  2. Resources for Researchers • CPU cycles • Memory • Storage space • Network • Software: compilers, models, visualization programs

  3. Original MURI hardware • 16 Pentium III processors • Storage server with 0.5 TB • Gigabit networking • Purpose: provide a working environment for collaborative development; enable running of the large multiprocessor MM5 model; gain experience working with clustered systems.

  4. Capabilities and Limitations • Successfully ran initial MM5 model runs, algorithm development (fast model), and modeling of GIFTS optics (FTS simulator). • MM5 model runs for 140 by 140 domains, plus one 270 by 270 run with very limited time steps. • OpenPBS system scheduling hundreds of jobs. • Idle CPU time given to FDTD raytracing. • Expanded to 28 processors using funding from B. Baum, IPO, and others. • However, MM5 model runtime limited the domain size, and storage space limited the number of output time steps.
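
A minimal sketch of how hundreds of small jobs might be prepared for a PBS scheduler such as OpenPBS. The slides do not show the actual job scripts, so the helper names, the resource requests, and the `./fdtd_run` command are hypothetical. The `#PBS -N`, `#PBS -l` directives and the `PBS_O_WORKDIR` variable are standard PBS features.

```python
def make_pbs_script(job_name, command, nodes=1, ppn=1, walltime="01:00:00"):
    """Return the text of a PBS batch script that runs one command.

    The resource values are illustrative defaults, not the settings
    used on the MURI cluster.
    """
    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",                # job name shown by qstat
        f"#PBS -l nodes={nodes}:ppn={ppn}",   # nodes and processors per node
        f"#PBS -l walltime={walltime}",       # wall-clock limit
        "cd $PBS_O_WORKDIR",                  # directory qsub was run from
        command,
    ]
    return "\n".join(lines) + "\n"


def make_job_batch(prefix, commands):
    """Map a generated job name to a script, one per command."""
    batch = {}
    for index, command in enumerate(commands):
        name = f"{prefix}_{index:03d}"
        batch[name] = make_pbs_script(name, command)
    return batch


if __name__ == "__main__":
    # Hypothetical command line; substitute the real executable.
    jobs = make_job_batch("fdtd", [f"./fdtd_run case_{i}.cfg" for i in range(3)])
    for name, script in jobs.items():
        print(f"--- {name}.pbs ---")
        print(script)
```

Each generated script could then be written to a file and submitted with `qsub`; keeping every job small lets the scheduler fit them into otherwise idle CPU time.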

  5. CY2003 Upgrade • NASA provided funding for 11 dual-Pentium 4 processor nodes (4 GB DDR RAM, 2.4 GHz CPUs) • Expressly purposed for running large IHOP field program simulations (400 by 400 grid point domain).

  6. Cluster “Mark 2” • Gains: larger-scale model runs and instrument simulations as needed for IHOP; terabytes of experimental and simulation data online through NAS and hosted RAID arrays • Limitations to further work at even larger scale: interconnect limitations slowed large model runs; 32-bit memory limitation on huge model set-up jobs for MM5 and WRF; increasing number of small storage arrays
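
The 32-bit memory limitation can be made concrete with a little arithmetic. This is a rough upper bound only: it assumes single-precision (4-byte) values and pretends a process could use the entire 32-bit address space, which real 32-bit systems do not allow. The function names are illustrative.

```python
ADDRESS_BITS = 32      # width of a 32-bit pointer
BYTES_PER_VALUE = 4    # assumption: single-precision floats


def max_values_in_address_space(address_bits=ADDRESS_BITS,
                                bytes_per_value=BYTES_PER_VALUE):
    """Upper bound on how many values one process can address."""
    return 2 ** address_bits // bytes_per_value


def max_2d_slices(nx, ny, address_bits=ADDRESS_BITS,
                  bytes_per_value=BYTES_PER_VALUE):
    """Upper bound on how many full nx-by-ny fields fit in the address space.

    One "slice" is one variable on one vertical level.
    """
    return max_values_in_address_space(address_bits, bytes_per_value) // (nx * ny)


if __name__ == "__main__":
    # 400 by 400 is the IHOP domain size quoted in the slides.
    print("values addressable:", max_values_in_address_space())
    print("400x400 slices (upper bound):", max_2d_slices(400, 400))
```

Because a set-up job must hold many variables on many levels at once, along with its code and working buffers, the usable number of slices is well below this bound.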

  7. 3 Years of Cluster Work • Inexpensive: adding CPUs to the system • Costly: adding users to the system; adding storage to the system • Easily understood: Matlab • Not so well understood: distributed system (computing, storage) capabilities

  8. Along comes DURIP • H.L. Huang / R. Garcia DURIP proposal awarded May 2004. • Purpose: provide hardware for next-generation research and education programs. • Scope: identify computing and storage systems to serve the need to expand simulation, algorithm research, data assimilation, and limited operational product generation experiments.

  9. Selecting Computing Hardware • Cluster options for numerical modeling were evaluated and found to require significant time investment. • Purchased an SGI Altix in fall 2004 after extensive test runs with WRF and MM5: 24 Itanium 2 processors running Linux; 192 GB of RAM; 5 TB of FC/SATA disk • Recently upgraded to 32 CPUs and 10 TB of storage.

  10. SGI Altix Capabilities • Large, contiguous RAM allows a 1600 by 1600 grid point domain (larger than the CONUS area at 4 km resolution). • Largest run so far is 1070 by 1070. • NUMAlink interconnect provides fast turnaround for model runs • Presents itself as a single 32-CPU Linux machine • Intel compilers for ease of porting and optimizing Fortran/C on 32-bit and 64-bit hardware.
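
A short worked calculation of the domain sizes quoted on this slide. The extent is approximated as grid points times grid spacing. Applying the 4 km spacing to the 1070 by 1070 run is an assumption, since the slide gives the resolution only for the 1600 by 1600 case.

```python
def domain_extent_km(n_points, spacing_km):
    """Approximate side length of a square domain, in kilometers."""
    return n_points * spacing_km


def grid_columns(n_points):
    """Number of horizontal grid columns in an n-by-n domain."""
    return n_points * n_points


if __name__ == "__main__":
    spacing_km = 4  # resolution quoted for the 1600 by 1600 domain
    for n in (1070, 1600):
        print(f"{n} by {n}: about {domain_extent_km(n, spacing_km)} km per side, "
              f"{grid_columns(n)} columns")
    ratio = grid_columns(1600) / grid_columns(1070)
    print(f"1600 by 1600 has {ratio:.2f} times the columns of 1070 by 1070")
```

Memory and run time per time step both scale at least with the number of columns, so the step from 1070 to 1600 points per side more than doubles the work.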

  11. Storage Class: Home Directory • Small size for source code (preferably also held under CVS control) and critical documents • Nightly incremental backups • Quota enforcement • Current implementation: local disks on cluster head; backup by TC (the Technical Computing group)

  12. Storage Class: Workspace • Optimized for speed • Automatic flushing of unused files • No insurance against disk failure • Users expected to move important results to Long-term Storage • Current implementation: RAID5 or RAID0 drive arrays within the cluster systems
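
A minimal sketch of what automatic flushing of unused workspace files could look like. The slides do not describe the actual mechanism, so the 30-day threshold, the function names, and the `/workspace` path are assumptions. "Unused" is taken here to mean neither read nor modified within the threshold; note that access times are unreliable on filesystems mounted with `noatime`.

```python
import os
import time

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_age_days, now=None):
    """Return sorted paths under root not used within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # Most recent of last access and last modification.
            last_used = max(info.st_atime, info.st_mtime)
            if last_used < cutoff:
                stale.append(path)
    return sorted(stale)


def flush_stale_files(root, max_age_days, dry_run=True, now=None):
    """Delete stale files unless dry_run is set; return the stale paths."""
    stale = find_stale_files(root, max_age_days, now=now)
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale


if __name__ == "__main__":
    # Hypothetical workspace path; the dry run only reports candidates.
    for path in flush_stale_files("/workspace", max_age_days=30, dry_run=True):
        print("would remove:", path)
```

Defaulting to a dry run is a deliberate choice: workspace has no insurance against loss, so a flush policy should be easy to preview before it deletes anything.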

  13. Storage Class: Long-term • Large amount of space • Redundant, preferably backed up to tape • Managed directory system, preferably with metadata • Current implementation: many project-owned NAS devices with partial redundancy (RAID5); NFS spaghetti; ad-hoc tape backup

  14. DURIP phase 2: Storage • Long-term storage scaling and management goals: reduce or eliminate NFS ‘spaghetti’; include a hardware phase-in / phase-out strategy in the purchase decision; acquire the hardware to seed a Storage Area Network (SAN) in the Data Center, improving uniformity and scalability; reduce overhead costs (principally human time); work closely with the Technical Computing group on system setup and operations for a long-term facility

  15. Immediate Options • Red Hat GFS: size limitations and hardware/software mix-and-match; support costs offset the free source code. • HP Lustre: more likely to be a candidate for workspace; expensive. • SDSC SRB (Storage Resource Broker): stability, documentation, and maturity at time of testing found to be inadequate. • Apple Xsan: plays well with third-party storage hardware; straightforward to configure and maintain; affordable.

  16. Dataset Storage Purchase Plan • 64-bit storage servers and metadata server • QLogic Fibre Channel switch: move data between hosts and drive arrays • SAN software to provide a distributed filesystem: focusing on Apple Xsan for a 1-3 year span; follow up with a 1-year assessment, with the option of re-competing • Storage arrays: competing Apple XRAID and Western Scientific Tornado

  17. Target System for 2006 • Scalable dataset storage accessible from clusters, workstations, and supercomputer • Backup strategy • Update existing cluster nodes to ROCKS: simplify management and improve uniformity; proven on other clusters deployed by SSEC • Retire/repurpose slower cluster nodes • Reduce bottlenecks to workspace disk • Improve ease of use and understanding

  18. Long-term Goals • 64-bit shared memory system scaled to huge job requirements (Altix) • Complementary compute farm migrating to x86-64 (Opteron) hardware • Improved workspace performance • Scalable storage with full metadata for long-term and published datasets • Software development tools for multiprocessor algorithm development
