Final Implementation of a High Performance Computing Cluster at Florida Tech
P. FORD, X. FAVE, K. GNANVO, R. HOCH, M. HOHLMANN, D. MITRA
Physics and Space Sciences Dept. & Computer Sciences Dept., Florida Institute of Technology, 150 W. University Blvd, Melbourne, FL 32901
Abstract

The HPC cluster at Florida Tech is two years into development and has met several milestones that effectively finalize its construction and implementation. The system has been upgraded to the latest versions of the Rocks OS and the Condor batch-job manager. In addition to software upgrades, the cluster has been integrated into the Open Science Grid production grid and has become an official USCMS Tier-3 compute element, having contributed 125,000 hours of CMS data processing to date. We have also allowed several faculty members to use our resources alongside our own Muon Tomography simulations. The hardware has been upgraded with top-of-the-line machines, resulting in 160 available processor cores. We detail the final design and performance of the cluster, as well as the core configuration of the system. The concept of Tier-3 sites and our participation in the CMS project are also outlined.

Software

• ROCKS Cluster Operating System
We have upgraded the cluster operating system to Rocks 5.0 and optimized the installation profiles of the compute nodes and the network-attached storage.

• CMS Software and Computing
In addition to our contribution of resource hours to the CMS experiment, we have been actively participating in improving the overall operation of the Grid. First, we attended the CMS Tier 3 conference at the Fermi National Accelerator Laboratory in Illinois (a Tier 1 site), where we discussed the scope and objectives of a Tier 3 site with other Tier 3 administrators. On the hardware side, we have performed network filesystem tests with the University of Florida to determine whether a Lustre filesystem mount is an effective way to distribute software and data between sites. A sample of our results is given below.
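The network filesystem tests with the University of Florida mentioned above report write and read throughput measured with dd and bonnie++. As a rough illustration of what a dd-style sequential test does, here is a minimal Python sketch. It is not the procedure used for the poster results; the function name `measure_throughput` and its parameters are hypothetical, and throughput is reported in MB/s with 1 MB = 10^6 bytes, as dd does.

```python
import os
import tempfile
import time


def measure_throughput(directory, total_mib=64, block_kib=1024):
    """Sequentially write, then read, one scratch file in `directory`.

    Hypothetical helper for illustration only. Returns the byte counts
    and the write/read throughput in MB/s (1 MB = 10**6 bytes).
    """
    block = b"\0" * (block_kib * 1024)
    n_blocks = (total_mib * 1024) // block_kib
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Write phase: stream fixed-size blocks, then force them to storage
        # so the timing is not just a measurement of the page cache.
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        write_s = time.perf_counter() - start

        # Read phase: stream the file back in the same block size.
        # Caveat: on a local disk this may be served from the page cache.
        bytes_read = 0
        start = time.perf_counter()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(len(block))
                if not chunk:
                    break
                bytes_read += len(chunk)
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file

    bytes_written = n_blocks * len(block)
    return {
        "bytes_written": bytes_written,
        "bytes_read": bytes_read,
        "write_mb_s": bytes_written / 1e6 / write_s,
        "read_mb_s": bytes_read / 1e6 / read_s,
    }
```

Calling `measure_throughput("/mnt/lustre")` (a hypothetical mount point) would time a 64 MiB write and read on the mounted filesystem. For a meaningful read figure, the file should be larger than the client's memory or be read from a different host than the one that wrote it; otherwise cached data inflates the result.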
Results from 2 benchmarks (write: FIT -> UF, read: UF -> FIT):

Benchmark   Write          Read
dd          107.82 MB/s    72.47 MB/s
bonnie++    90.79 MB/s     65.27 MB/s

• Condor Batch-job Manager
Condor has been upgraded to version 7.2, giving the cluster improved security and troubleshooting capabilities. We have reconfigured the batch scheduler to disable job pre-emption, meaning that a running job always runs to completion and only then gives up its slot to a higher-priority job. This is an optimization for grid computing, since most batch jobs do not checkpoint (save) their progress periodically and would lose their completed work if pre-empted.

Figure 3: Ganglia Cluster Monitoring
Figure 4: Machines available to Condor (above), and running jobs (right)

Hardware

Phase I of the FLTECH cluster hardware has reached completion. The cluster consists of 20 new professionally built servers, each with 8 Xeon CPU cores and 16 GB of RAM. The network-attached storage (NAS) is a similar machine, but with ~10 TB of data storage in a RAID6 configuration. User home directories and important research data are stored on the NAS. The benchmarked performance of the cluster is a quarter trillion floating point operations per second (250 GFLOPS).

Figure 1: New High-end Cluster Hardware (NAS)
Figure 6: Utilization of CMS T3 sites as monitored by the Gratia tool
[Plot axis labels: Wallclock Hours; Job Count]

Conclusion, Summary & Outlook

Being firmly established on the OSG and contributing computing resources to CMS simulations, our site has become an official CMS Tier 3 site, thus concluding Phase I of the project. We are currently adding a dedicated development node with 64 GB of RAM for running experimental code that can have a large memory footprint (such as Expectation Maximization algorithms). We will also be expanding the types of CMS jobs that the cluster can process to include data sets recorded by the CMS detector at the Large Hadron Collider. Visit http://uscms1.fltech-grid3.fit.edu/wiki to follow this project.

Open Science Grid & CMS Tier 3 Production Site

In the summer of 2008, we moved the cluster registration from the integration test bed to the OSG production grid and began processing real grid workflows. Having achieved this, we opened our resources to the Worldwide LHC Computing Grid for CMS data processing. Thanks to our optimizations for grid computing, meeting the requirements for processing CMS jobs was straightforward and required only a few tweaks to our grid middleware. We are now recognized as a CMS Tier 3 site and have since contributed well over 125,000 resource hours to CMS.

Figure 5: A map of CMS Tier 2 and 3 sites. Our site is located on the east coast of Florida. (B. Kim, U. of Florida)
Figure 2: New cluster topology including all new servers (above). All new hardware incorporated into a single 50U rack (left).

References and Acknowledgments

Rocks Clusters User Guide: http://www.rocksclusters.org/roll-documentation/base/5.0/
Open Science Grid: http://www.opensciencegrid.org/
Condor v7.2 Manual: http://www.cs.wisc.edu/condor/manual/v7.2/

For further information, contact pford@fit.edu.
Thanks to Bockjoo Kim (UF-USCMS) and the OSG-GOC for their guidance.