“Beowulfery” – Cluster Computing Using GoldSim to Solve Embarrassingly Parallel Problems
Presented to: GoldSim Users Conference 2007, October 25, 2007, San Francisco, CA
Presented by: Patrick D. Mattie, M.S., P.G., Senior Member of Technical Staff, Sandia National Laboratories
Contributions by: Stefan Knopf, GTG, and Randy Dockter, SNL-YMP
OFFICIAL USE ONLY
Presentation Outline
• Cluster Computing Defined
• GoldSim and Beowulf?
• ‘COTS’ Cluster Computing Using GoldSim
• GoldSim and E.T.?
• Example Cluster: TSPA-Wulf
• What is next? Pushing the limits…
Background
What is Cluster Computing? What is a Beowulf Cluster?
Cluster Computing Defined
• What is a compute cluster? “Cluster” is a widely used term for independent computers combined into a unified system through software and networking. At the most fundamental level, when two or more computers are used together to solve a problem, they constitute a cluster.
• Clusters are typically used either for High Availability (HA), to provide greater reliability, or for High Performance Computing (HPC), to provide greater computational power than a single computer can deliver.
Beowulf Class Cluster
• A Beowulf class cluster is a simple design for a high-performance computing cluster built on inexpensive personal computer hardware.
• Originally developed in 1994 by Thomas Sterling and Donald Becker at NASA.
• Beowulf clusters are scalable performance clusters based on commodity hardware that require no custom hardware or software.
• A Beowulf cluster is constructed from commodity computer hardware (Dell, HP, IBM, etc.); it can be as simple as two networked computers sharing a file system on the same LAN, or as complex as thousands of nodes connected by a high-speed, low-latency interconnect.
• Common uses are traditional technical applications such as simulations, biotechnology, and petroleum, as well as financial market modeling, data mining, and stream processing.
• http://www.beowulf.org
Advantages of a Beowulf Class Cluster
• Less computation time than running a serial process
• COTS – ‘Commodity Off The Shelf’
• Doesn’t require a big budget
• Doesn’t require a specialized skill set
• Can be built using existing computer resources and Local Area Networks (LANs)
• Can be constructed from different system configurations/brands/resources
• Useful for solving embarrassingly parallel problems
Why do I need a cluster?
• An embarrassingly parallel problem is one for which no particular effort is needed to segment the problem into a very large number of parallel tasks, and there is no essential dependency (or communication) between those parallel tasks.
• A Monte Carlo simulation is an embarrassingly parallel problem. For example, a 100-realization simulation can be broken into 100 separate problems, each solved independently of the others.
• http://en.wikipedia.org/wiki/Embarrassingly_parallel
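As a loose illustration of the idea (it is not GoldSim itself), the sketch below farms a hypothetical 100-realization Monte Carlo run out to independent worker processes; run_realization() is an assumed stand-in for a single realization:

```python
# A minimal sketch of an embarrassingly parallel Monte Carlo run (not GoldSim).
# run_realization() is a hypothetical stand-in for one realization; because
# realizations are independent, they can be farmed out to worker processes
# with no communication between them.
import random
from multiprocessing import Pool

def run_realization(index):
    """Stand-in for one realization: sample inputs, return a summary value."""
    rng = random.Random(index)                 # independent random seed per realization
    return sum(rng.random() for _ in range(1000)) / 1000.0

if __name__ == "__main__":
    realizations = range(100)                  # e.g., a 100-realization simulation
    with Pool(processes=4) as pool:            # four workers ~ four computers or cores
        results = pool.map(run_realization, realizations)
    print(f"mean over {len(results)} realizations: {sum(results) / len(results):.4f}")
```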
Why do I need a cluster?
• A 100-realization run, at 1 minute per realization, takes:
• On one computer (or core): ~1.6 hours
• On four computers (or cores): 25 minutes
• On ten computers (or cores): 10 minutes
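A back-of-the-envelope sketch of those timings, assuming ideal scaling (the work divides evenly across workers and transfer/scheduling overhead is negligible):

```python
# Ideal-scaling estimate of wall-clock time for an embarrassingly parallel run.
def wall_clock_minutes(n_realizations, minutes_per_realization, n_workers):
    return n_realizations * minutes_per_realization / n_workers

for workers in (1, 4, 10):
    print(f"{workers} worker(s): {wall_clock_minutes(100, 1, workers):g} minutes")
# -> 1 worker: 100 minutes (~1.6 hours); 4 workers: 25 minutes; 10 workers: 10 minutes
```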
Cluster Computing Using GoldSim Pro
• GoldSim Distributed Processing Module
• The Distributed Processing Module uses multiple copies of GoldSim running on multiple machines (and/or multiple processes within a single machine that has a multi-core CPU).
• Grid computing: a Master process distributes work to multiple Slave processes.
Cluster Computing - Distributed Processing
• “Distributed” or “grid” computing is, in general, a special type of parallel computing that relies on complete computers (with onboard CPU, storage, power supply, network interface, etc.) connected to a network (private, public, or the internet) by a conventional network interface, such as Ethernet.
• Example: the SETI@home Project (http://setiathome.ssl.berkeley.edu/), which analyzes radio telescope data in search of extraterrestrial intelligence.
Cluster Computing Using GoldSim Pro
There are two versions of the Distributed Processing Module:
• GoldSim DP (comes with all versions of GoldSim)
• GoldSim DP Plus (licensed separately)
“Beowulfery” - YMP & GoldSim
A Cluster Computing Example
TSPA-Wulf – Cluster Configuration
• Windows Server 2003 and Windows 2000 Advanced Server (3 GB)
• Networked simulations (master-slave)
• About 220 Intel Xeon 3.6 GHz dual-processor nodes with 8 GB RAM per machine, on a GigE LAN
• 60 Intel Xeon 3.0 GHz dual-processor, dual-core nodes with 16 GB RAM per machine, on a GigE LAN
• One realization per slave CPU; after a slave CPU finishes one realization, it accepts another from the master server
• 680 processors available (plus 62 legacy processors), 752 total
Running the Model – Overview
• File Server: storage area for the TSPA model file; controlled storage area for the Parameter Database, DLLs, and input files; storage area for completed TSPA cases.
• Master Computer: cases are run by GoldSim as a distributed process from a directory on the Master.
• Slave Computers: individual realizations are run by GoldSim processes on the Slaves.
Set-Up on the Master Computer
(1) Manually move the TSPA model file from the file server to a directory on the Master computer.
(2) Set up the model file to run the specific case.
(3a) Global download of parameter values from the Parameter Database (which holds parameter values and links to DLLs and input files) to the model file.
(3b) The global download also transfers the input files and DLLs from the storage areas on the file server to the Master computer.
(4) Document changes: conceptual write-up, check list, version control file.
(Transfers occur over the LAN.)
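A minimal sketch of the file transfer in step (3b), using hypothetical share and run-directory paths; in the actual workflow this transfer is driven by the model file's global download rather than a script:

```python
# Staging DLLs and input files from the file server to the Master run directory.
# Paths are hypothetical placeholders, not the project's real storage locations.
import pathlib
import shutil

FILE_SERVER = pathlib.Path(r"\\fileserver\tspa\controlled_storage")   # hypothetical share
MASTER_DIR = pathlib.Path(r"C:\tspa_runs\case_001")                   # hypothetical run directory

for area in ("dlls", "input_files"):          # controlled storage areas copied over the LAN
    shutil.copytree(FILE_SERVER / area, MASTER_DIR / area, dirs_exist_ok=True)
print("staged DLLs and input files to", MASTER_DIR)
```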
Running - Transfers to Slaves
(1) At the start of the distributed process:
• A “Networked” directory is created for each processor on each Slave computer (e.g., PA02\Networked1 and PA02\Networked2, and likewise on PA03, PA04, and the 144 other slave computers).
• A GoldSim slave process is started for each processor on each Slave computer.
• The model file, DLLs, and input files are transferred from the directory on the Master server to each Slave.
(2) Information (i.e., LHS sampling) for each realization is transferred to slave processes as they become available.
(Transfers occur over the LAN.)
Running - Transfers from Slaves
(1) A .gsr file (one per realization) is transferred from the Slave’s “Networked” directory back to the directory on the Master computer as each realization is completed.
(2) GoldSim loads the .gsr files into the model file when all realizations are completed.
(Transfers occur over the LAN.)
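As a language-neutral illustration of this master/slave pattern (not GoldSim's actual network protocol), the sketch below uses Python multiprocessing queues: realization indices stand in for the per-realization LHS samples handed out by the master, and the returned values stand in for the .gsr result files collected back.

```python
# A sketch of the master/slave dispatch-and-collect pattern described above,
# using multiprocessing queues in place of GoldSim's own LAN transfers.
from multiprocessing import Process, Queue

def slave(tasks, results):
    """Each slave pulls realizations until the master sends a None sentinel."""
    while True:
        realization = tasks.get()
        if realization is None:
            break
        results.put((realization, realization ** 0.5))   # placeholder "result"

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    n_slaves, n_realizations = 4, 100

    workers = [Process(target=slave, args=(tasks, results)) for _ in range(n_slaves)]
    for w in workers:
        w.start()

    for r in range(n_realizations):        # master hands out realizations one at a time
        tasks.put(r)
    for _ in workers:                      # one sentinel per slave to shut it down
        tasks.put(None)

    collected = [results.get() for _ in range(n_realizations)]   # gather all results
    for w in workers:
        w.join()
    print(f"master collected {len(collected)} realization results")
```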
TSPA Model Architecture
• File size and count:
• 645 input files (approximately 5 GB in total)
• 14 DLLs
• GoldSim file with no results (pre-run) is about 200 MB in size
• GoldSim file after a run is about 5 to 6 GB in size (compressed); however, there is no intrinsic limitation other than the slowness of file manipulation on a 32-bit operating system
TSPA-Wulf Benchmarks
• 1,000 realizations at 90 minutes per realization:
• 62.5 days to run in serial mode
• On 120 processors: ~12.5 hours (99% faster)
• A typical 1,000,000-year, 1,000-realization run (about 470 time steps) requires 24 hours on 150 CPUs (75 dual-processor, single-core nodes, 32-bit, 2.8-3.0 GHz)
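The headline numbers follow from the same ideal-scaling assumption sketched earlier (total work divided evenly across processors, no transfer overhead):

```python
# Reproducing the benchmark estimates under an ideal-scaling assumption.
n_realizations, minutes_each = 1000, 90
serial_hours = n_realizations * minutes_each / 60            # 1,500 hours of work
cluster_hours = n_realizations * minutes_each / (60 * 120)   # spread over 120 processors
print(f"serial: {serial_hours / 24:.1f} days")               # -> 62.5 days
print(f"120 processors: {cluster_hours:.1f} hours")          # -> 12.5 hours
print(f"reduction: {1 - cluster_hours / serial_hours:.0%}")  # -> 99%
```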
What comes next?
SNL/GoldSim HPCC R&D
• GoldSim evolution/migration to Microsoft HPC
• Migration from 32-bit to 64-bit architecture?
• Optimize the modeling system for Microsoft HPC (combined SNL/Microsoft/GoldSim task)
• Link GoldSim with the Microsoft CCS scheduler tool to automatically queue jobs and prioritize or re-prioritize job resources ‘on the fly’ (Microsoft’s developers working with GoldSim)
• True parallel processing? Using OpenMP to take advantage of multiple cores
• Optimize HPC software for a large compute cluster (combined SNL/Microsoft task)