Progress Towards Petascale Virtual Machines Al Geist Oak Ridge National Laboratory www.csm.ornl.gov/~geist EuroPVM-MPI 2003 Venice, Italy September 30, 2003
Petascale Virtual Machine Another kind of “PVM” • This talk will describe: • DOE Genomes to Life Project • PVM use today in the Genomics Integrated Supercomputer Toolkit for fault tolerance and high availability in a dynamic environment • Harness Project (next generation of PVM) and its features to help scale to Petascale systems • Distributed peer-to-peer control • H2O – the self-adapting core of Harness • FTMPI – fault tolerant MPI • Latest superscalable algorithms with natural fault tolerance for petascale environments.
DOE Genomes to Life Program Understanding the Essential Processes of Living Systems Follow-on to the Human Genome Program • Determined the entire DNA sequence for humans • 24 chromosomes in 6 ft of DNA • 3 billion nucleotides code for ~35,000 genes • Only about 0.1% difference between people. The instructions to build a human fit on a DVD (3 GB) Genomes to Life Program goal is to read those instructions, starting with simple single-cell organisms (microbes) • Molecular Machines • Regulatory Pathways • Multi-cell Communities Develop new computational methods to understand complex biological systems $100M effort
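A quick sanity check on the DVD figure: stored as roughly one character per base, 3 billion nucleotides come to about 3 GB, comfortably within a single DVD's ~4.7 GB capacity.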
Molecular Machines Fill Cells Many interlinked proteins form interacting machines From The Machinery of Life, David S. Goodsell, Springer-Verlag, New York, 1993. www.genomes-to-life.org
Regulatory Networks Control the Machines Gene regulation controls which genes are expressed, and the proteome changes over time and with environmental conditions www.genomes-to-life.org
GTL will Require Petascale Systems [Chart: computing requirements (1 TF* to 1000 TF) vs. biological complexity — comparative genomics and genome-scale protein threading near current U.S. computing; constrained rigid docking and constraint-based flexible docking; protein machine interactions and molecular machine classical simulation; cell, pathway, and network simulation; community metabolic, regulatory, and signaling simulations; molecule-based cell simulation; cell-based community simulation at the high end. *Teraflops]
Biology for the 21st Century GTL is going to rely on high-performance computing and data analysis to process high-throughput experimental data. The new computational biology environments will be conceptually integrated “knowledge enabling” environments that couple diverse sets of distributed data, advanced informatics methods, experiments, modeling, and simulation. [Diagram: experiments, raw data, genomes, protein structure, pathways, and regulatory elements feeding data analysis, modeling, simulation, and models.]
Genome Integrated Supercomputer Toolkit GIST is a framework for large-scale biological application deployment • provides a transparent and high-performance interface to biological applications • provides transparent access to distributed data sets • utilizes PVM to launch and manage jobs across a wide diversity of supercomputers • highly fault tolerant and adapts to dynamic changes in the environment using PVM (see the sketch below) • next step: deploy across ORNL, ANL, PNNL, SNL as a multi-site “Bio-Grid” • thousands of users for execution of genome analysis and simulation. [Diagram: web portal feeding a protein analysis engine; raw data, genomes, and pathways exchanged as XML; PVM spanning heterogeneous supercomputers — P4 cluster (64 proc), Cray X1 (256 proc), IBM p690 (864 proc), SGI Altix (256 proc).]
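As an illustration of how PVM supports this style of fault-tolerant job management, the following minimal C sketch spawns worker tasks and asks the PVM daemons for notification when a task exits or a host drops out of the virtual machine, so the master can re-spawn work elsewhere. The worker executable name ("gist_worker") and the message tags are hypothetical placeholders, not part of GIST.

```c
/* Minimal sketch (not GIST code): PVM job management with failure
 * notification.  "gist_worker" and the message tags are hypothetical. */
#include <stdio.h>
#include <pvm3.h>

#define TAG_TASK_EXIT 101
#define TAG_HOST_FAIL 102
#define NWORK 4
#define MAXHOSTS 64

int main(void)
{
    int tids[NWORK], dtids[MAXHOSTS];
    int nhost, narch, n, i;
    struct pvmhostinfo *hosts;

    pvm_mytid();                              /* enroll in the virtual machine */

    /* Spawn workers anywhere in the heterogeneous virtual machine. */
    n = pvm_spawn("gist_worker", NULL, PvmTaskDefault, "", NWORK, tids);

    /* Ask the daemons to send a message when a worker exits ... */
    pvm_notify(PvmTaskExit, TAG_TASK_EXIT, n, tids);

    /* ... or when a host drops out of the virtual machine. */
    pvm_config(&nhost, &narch, &hosts);
    for (i = 0; i < nhost && i < MAXHOSTS; i++)
        dtids[i] = hosts[i].hi_tid;
    pvm_notify(PvmHostDelete, TAG_HOST_FAIL, i, dtids);

    for (int ev = 0; ev < 100; ev++) {        /* handle a bounded number of events */
        int bufid = pvm_recv(-1, -1);         /* block for any notification */
        int bytes, tag, src, dead, newtid;
        pvm_bufinfo(bufid, &bytes, &tag, &src);
        pvm_upkint(&dead, 1, 1);              /* tid packed by the daemon   */

        if (tag == TAG_TASK_EXIT) {
            /* Re-spawn the lost worker on whatever hosts survive.  (A real
             * master would first check whether the task simply finished.)  */
            if (pvm_spawn("gist_worker", NULL, PvmTaskDefault, "", 1, &newtid) == 1)
                pvm_notify(PvmTaskExit, TAG_TASK_EXIT, 1, &newtid);
        } else if (tag == TAG_HOST_FAIL) {
            fprintf(stderr, "host (pvmd t%x) left the VM; adapting\n", dead);
        }
    }
    pvm_exit();
    return 0;
}
```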
The GIST Developers really want Harness • They ask us regularly about the next generation of PVM called Harness because they want the increased adaptability and fault tolerance that Harness promises. • Harness is being developed by the same team that developed PVM: • Vaidy Sunderam – Emory University • Al Geist – Oak Ridge National Lab • Jack Dongarra – University of Tennessee and ORNL
Harness II Design Goals • Harness is a distributed virtual machine environment that goes beyond the features of PVM: • Allow users to dynamically customize, adapt, and extend a virtual machine's features • to more closely match the needs of their application • to optimize the virtual machine for the underlying computer resources. • Is being designed to scale to petascale virtual machines • distributed control • minimized global state • no single point of failure • Allows multiple virtual machines to join and split in temporary micro-grids
HARNESS II Architecture [Diagram: each host (A, B, C, D) runs a component-based HARNESS daemon built on top of the H2O kernel with the DVM pluglet loaded; user processes such as FT-MPI run above the daemons; customization and extension come from dynamically adding pluglets; operation within the virtual machine uses distributed control; the virtual machine can merge with or split from another VM.]
Symmetric Peer-to-Peer Distributed Control Characteristics • No single point (or set of points) of failure for Harness. It survives as long as one member still lives. • All members know the state of the virtual machine, and their knowledge is kept consistent w.r.t. the order of changes of state. (An important parallel programming requirement!) • No member is more important than any other (at any instant), i.e. there isn't a pass-around “control token” • For petascale systems the control members can be a distributed subset of all the processors in the system
Harness Distributed Control Control is asynchronous and parallel: • supports multiple simultaneous updates • supports fast host adding • fast host delete or recovery from a fault • parallel recovery from multiple host failures [Diagram: add-host and multi-host-failure recovery message flows among the control daemons.]
HARNESS: Petascale Virtual Machine Variable Distributed Control Loop Size Size of the control loop: 1 <= S <= (size of VM) For a small VM and ultimate fault tolerance, S = (size of VM). For a large VM, a random selection of a few hosts (e.g. S = 10) gives a balance of multi-point failure tolerance and performance. For S = 1, distributed control reduces to a simple client/server model.
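A minimal sketch of the idea (illustrative only, not Harness code): given the list of hosts in the virtual machine, pick a random subset of size S to act as the control set, trading tolerance of simultaneous failures against the cost of keeping S copies of the state consistent. The function name below is hypothetical.

```c
/* Hypothetical sketch: choose S control hosts out of the VM's host list. */
#include <stdlib.h>

/* Partial Fisher-Yates shuffle: the first s entries of hosts[] become a
 * uniformly random control subset.  s = 1 degenerates to client/server,
 * s = nhosts gives maximum fault tolerance at maximum consistency cost. */
void choose_control_set(int *hosts, int nhosts, int s)
{
    for (int i = 0; i < s && i < nhosts; i++) {
        int j = i + rand() % (nhosts - i);
        int tmp = hosts[i];
        hosts[i] = hosts[j];
        hosts[j] = tmp;
    }
}
```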
H2O kernel – Overview H2O is a multithreaded lightweight kernel that is dynamically configured by loading “pluglets”. Resources are provided as services through pluglets. Services may be deployed by any authorized party: provider, client, or third-party reseller. H2O itself is stateless and resource independent. In Harness the DVM service, which includes distributed control of services, must be installed on each host. Pluglets can provide multiple programming models; Java and C implementations are being developed. [Diagram: clients access the functional interfaces of (suspendible) pluglets inside the kernel; programming models include FT-MPI, PVM, Java RMI, active objects, OGSA, and P2P.]
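To make the pluglet idea concrete, here is a rough C sketch (an illustration under assumed names, not the H2O API) of a kernel extending itself by dlopen-ing a shared library and calling an agreed-upon entry point; pluglet_init and the kernel context are invented for this example.

```c
/* Illustrative only: a dlopen-based "pluglet" loader (link with -ldl).
 * The pluglet_init entry point is an invented convention, not H2O's API. */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*pluglet_init_fn)(void *kernel_ctx);

int load_pluglet(const char *path, void *kernel_ctx)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "cannot load %s: %s\n", path, dlerror());
        return -1;
    }
    /* Every pluglet is expected to export pluglet_init(), which registers
     * the services it provides with the (here opaque) kernel context. */
    pluglet_init_fn init = (pluglet_init_fn) dlsym(handle, "pluglet_init");
    if (!init) {
        fprintf(stderr, "%s has no pluglet_init: %s\n", path, dlerror());
        dlclose(handle);
        return -1;
    }
    return init(kernel_ctx);    /* e.g. an FT-MPI or DVM pluglet */
}
```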
H2O kernel – RMIX Communication H2O is built on top of a flexible P2P communication layer called RMIX • provides interoperability between kernels and other web services • adopts common RMI semantics • designed for easy porting between protocols • dynamic protocol negotiation • scalable P2P design [Diagram: H2O kernels connected peer-to-peer; Java, web-service, RPC, and SOAP clients reach the H2O kernel through RMIX over a networking layer speaking RPC, IIOP, JRMP, SOAP, and other protocols.]
Registration and Discovery [Diagram: providers, clients, resellers, and developers publish, find, and deploy components (native code, legacy applications) through repositories and registries such as UDDI, JNDI, LDAP, DNS, GIS, or even e-mail and phone.] H2O can support a wide range of distributed computing models — flexibility beyond the PVM/MPI model: • Grid / web portal (like Genome Channel, Biology Workbench) • Web service / Internet computing (like SETI@home, Entropia, United Devices) • Cluster computing (like PVM, Harness, LAM/MPI)
Harness Fault Tolerant MPI Plug-in [Diagram: MPI applications linked against libftmpi, each started through a startup plugin on an H2O kernel and coordinated by a name service and an ftmpi_notifier.] FT-MPI is built in layers with tuned collectives, tuned derived-datatype handling, and good point-to-point bandwidth. Works with MPE profiling and tools such as JUMPSHOT from ANL. Application performance is on par with MPICH-2. FT-MPI available at SC2003.
Harness Fault Tolerant MPI Plug-in FT-MPI is a system-level, fault-tolerant, full MPI 1.2 implementation. Process failures are detected and passed back to the user's application through MPI objects. The user's application decides how best to reconfigure the system and continue. Recovery options for affected communicators: • ABORT: just do as other implementations, i.e. checkpoint/restart • BLANK: leave a hole where the failed process was • SHRINK: re-order processes to make a contiguous communicator • REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD [Diagram: a 9-process communicator in which processes 3, 4, and 7 fail; BLANK keeps holes at those ranks, SHRINK yields a contiguous 6-process communicator, REBUILD restores all 9 ranks.]
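A hedged sketch of how an application might react to a detected failure: everything below uses standard MPI-1.2 calls; the comments describing where FT-MPI's REBUILD recovery would kick in are assumptions about the programming pattern, not FT-MPI's exact API.

```c
/* Sketch of failure-aware MPI code (standard MPI-1.2 calls only).
 * The recovery mode (ABORT/BLANK/SHRINK/REBUILD) is assumed to be
 * selected when the FT-MPI job is launched; that mechanism is not shown. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, rc;
    double local = 1.0, sum;

    MPI_Init(&argc, &argv);
    /* Have errors returned to the application instead of aborting it. */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    rc = MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* A peer failed during the collective.  Under REBUILD, FT-MPI
         * re-spawns the lost ranks; the application then enters its own
         * recovery path, e.g. restoring state and retrying the step.    */
        fprintf(stderr, "rank %d: collective failed (rc=%d), recovering\n",
                rank, rc);
        /* ... application-specific recovery goes here ... */
    }

    MPI_Finalize();
    return 0;
}
```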
Large-scale Fault Tolerance Taking fault tolerance beyond checkpoint/restart. Developing fault-tolerant algorithms is not trivial; anything beyond simple checkpoint/restart is beyond most scientists. Many recovery issues must be addressed. Restarting 90,000 tasks because of the failure of 1 task may be a very inefficient use of resources. When and what are the recovery options for large-scale simulations?
Fault Tolerance – a petascale perspective Future systems are being designed with 100,000 processors. The time before some failure will be measured in minutes. Checkpointing and restarting this large a system could take longer than the time to the next failure! What to do? Autonomic? Self-healing? • Development of algorithms that are naturally fault tolerant, i.e. a failure anywhere can be ignored and the computation still gets the right answer • No monitoring • No notification • No recovery • Is this possible? YES!
Progress on Super-scalable algorithms Demonstrated that scale invariance and natural fault tolerance can exist for both local and global algorithms • Finite Difference (Christian Engelman) • Demonstrated natural fault tolerance with a chaotic-relaxation, meshless, finite-difference solution of Laplace and Poisson problems • Global information (Kasidit Chancio) • Demonstrated natural fault tolerance in a global-maximum problem with random, directed graphs • Gridless Multigrid (Ryan Adams) • Combines the fast convergence of multigrid with the natural fault tolerance property; a hierarchical implementation of the finite difference work above • Three different asynchronous updates explored
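As a toy illustration of natural fault tolerance by chaotic relaxation (a sketch inspired by the description above, not the project's code), the loop below relaxes a 1-D Laplace problem by updating interior points in random order from their neighbors' last-known values, while randomly dropping a fraction of updates to mimic failed tasks; the surviving updates still drive the solution toward the correct linear steady state.

```c
/* Toy sketch: chaotic (asynchronous) relaxation of a 1-D Laplace problem.
 * Randomly skipped updates stand in for failed tasks; the computation
 * still converges to the right answer without monitoring or recovery.   */
#include <stdio.h>
#include <stdlib.h>

#define N 32            /* grid points, including the two boundary points */
#define SWEEPS 200000   /* random single-point updates                     */
#define FAIL_RATE 0.2   /* fraction of updates "lost" to failures          */

int main(void)
{
    double u[N] = {0.0};
    u[0] = 0.0;          /* fixed boundary values */
    u[N - 1] = 1.0;

    for (int s = 0; s < SWEEPS; s++) {
        int i = 1 + rand() % (N - 2);           /* pick an interior point    */
        if ((double) rand() / RAND_MAX < FAIL_RATE)
            continue;                            /* "failed" task: no update  */
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);      /* relax with last-known values */
    }

    /* Exact steady state is the straight line u(x) = x / (N - 1). */
    printf("u[N/2] = %f (exact %f)\n", u[N / 2], (double)(N / 2) / (N - 1));
    return 0;
}
```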
Further Information • www.csm.ornl.gov/~geist • Genomes to Life • Harness • Naturally Fault Tolerant Algorithms Questions?