GridLab WP2: CGAT Cactus Grid Application Toolkit Gabrielle Allen GridLab/Cactus Max Planck Institute for Gravitational Physics (AEI)
WP2: CGAT • Making use of GAT within the Cactus framework • Grid-enabling applications using Cactus • Devising and implementing scenarios • Testing GridLab services and tools on multiple testbeds, with “real” applications • User requirements • Interacting/disseminating with different groups using Cactus to make them aware of Grid and GridLab GGF7, 2003
Cactus: www.cactuscode.org
… a framework for HPC applications
• Open source
• Modular (flesh and thorns)
• Portable
• Collaborative
• Provides parallelism, IO, toolkits, …
• Generic applications
Nothing to do with the Grid, but by design very well suited for use on the Grid …
… and our main users (e.g. Denis) want/need the services the Grid will provide
Cactus User Community: Using and Developing Physics Thorns
Numerical Relativity and Other Applications: AEI, Southampton, Wash U, RIKEN, Chemical Engineering (U. Kansas), Goddard, Penn State, Thessaloniki, Bio-Informatics (Canada), Tuebingen, TAC, SISSA, Portsmouth, EU Astrophysics Network, Austin, UNAM, LSU, Early Universe (LBL), Brownsville, Pittsburgh, New EU Astrophysics Network ???, Astrophysics (Zeus), Climate Modeling (Utrecht, NASA, +), CFD (KISTI, LSU), Crack Prop. (Cornell)
Black Hole simulations using the Cactus framework
(Typical: 50GB, 600 TeraFlops, 1TB output, 50hrs, 15000SUs)
Numerical Relativity
Simulations performed at NERSC/NCSA by the AEI numerical relativity group
Visualization by Werner Benger, ZIB
Grid-Cactus Development
• TeraGrid (distributed runs, Visapult)
• GrADs project (Also using Cactus)
• GriKSL (Data/Visualization)
• Cactus Development Team (Adding needed infrastructure)
• GridLab (GAT, services, scenarios, implementation)
• NumRel/EU Users (Ideas and Testing)
• MetaCactus (DFG Proposal)
• ASC Project (Ending this year)
WP2: CGAT Cactus/GAT Integration
[Diagram: Cactus Flesh; Thorns (Physics and Computational Infrastructure Modules); CGAT thorn (Cactus GAT wrappers, additional functionality, build system); GAT Library; GridLab Services]
Grid-enabled Cactus Apps
• Generic Cactus framework, e.g.
• Checkpointing
• Portability
• Flexible make system
• Switchable parallel layers
• Steering/control API and interfaces
• Socket layer
• and integration of Grid services with the GAT
means that all Cactus applications are trivially grid-enabled.
What do our users want?
• Larger computational resources: Memory/CPU
• Faster throughput: Cleverer scheduling, configurable scheduling, co-scheduling, exploitation of unused cycles
• Easier use of resources: Portals, grid application frameworks, information services, mobile devices
• Remote interaction with simulations and data: Notification, steering, visualization, data management
• Collaborative tools: Notification, visualization, video conferencing, portals
• Dynamic applications, new scenarios: Grid application frameworks connecting to services
Application Scenarios
• Dynamic Staging: move to faster/cheaper/bigger machine
• Multiple Universe: create clone to investigate steered parameter
• Automatic Convergence Testing: from initial data or initiated during simulation
• Look Ahead: spawn off and run coarser resolution to predict likely future
• Spawn Independent/Asynchronous Tasks: send to cheaper machine, main simulation carries on
• Application Profiling: best machine/queue, choose resolution parameters based on queue
• Dynamic Load Balancing: inhomogeneous loads, multiple grids
• Portal: user/virtual organisation interface to the grid
• Intelligent Parameter Surveys: farm out to different machines
• Make use of: running with management tools such as Condor, Entropia, etc.; scripting thorns (management, launching new jobs, etc.); dynamic use of e.g. MDS for finding available resources
Motivation for GAT
Why do applications need a framework for using the Grid?
We (application developers) need a layer between applications and grid infrastructure:
• Higher level than existing grid APIs, hide complexity, abstract grid functionality through application oriented APIs
• Insulate against rapid evolution of grid infrastructure
• Choose between different grid infrastructures
• Make it possible for grid developers to develop new infrastructures
• Make it possible for application developers to use and develop for the grid independent of the state of deployment of the grid infrastructure
SC2002, Baltimore
Varied applications deployed on the GGTC testbed
• Cactus Black Hole Simulations
• ASC Portal
• Smith-Waterman
• Nimrod-G
• Task Farming scenario
• Visapult
Highlights
• GGTC won 2 of the 3 HPC Awards
• Won (with Visapult/LBL group) Bandwidth Challenge
• $2000 prize money to UNICEF children's fund
Global Grid Testbed Collaboration (GGTC)
• Driven by GGF APPS and GridLab testbed and applications
• Whole testbed constructed very swiftly (few weeks)
• 5 continents: North America, Europe, Asia, Africa, Australia
• Over 14 countries, including: China, Japan, Singapore, S.Korea, Egypt, Australia, Canada, Germany, UK, Netherlands, Czech, Hungary, Poland, USA
• About 70 machines, with thousands of processors (~7500)
• Many hardware types, including PS2, IA32, IA64, MIPS, IBM Power, Alpha, Hitachi/PPC, Sparc
• Many OSs, including Linux, Irix, AIX, OSF, Tru64, Solaris, Hitachi
• Many different organizations (big centers/individuals)
• All ran same Grid infrastructure! (Globus)
Global Grid Testbed Collaboration GGF7, 2003
User Portal
• Access to Grid: Myproxy/GRAM/MDS/GridFTP/GSI-SOAP
• Start jobs: GRAM, GRMS (OGSA)
• Move/browse files: GridFTP
• Track and monitor announced jobs
• Connect to simulation web interfaces for steering and viz
• New framework based on portlets: www.gridsphere.org
Notification
[Diagram: Running Applications, “TestBed”, SMS Server, Portal Server, Mail Server]
OpenDX, Amira, … HDF5 GridFTP VFD Stream VFD Remote Data Visualization Tool Hyperslabbing, Downsampling IOStreamedHDF5 GridFTP Remote Data Server Simulation GGF7, 2003
Bandwidth Challenge: Highest Performing Application
• Distributed simulations using Cactus, Globus and Visapult
• With John Shalf/LBL and others
• 16.8 Gigabits/second
• Six sites: USA/Dutch/Czech
• scinet.supercomp.org/bwc
Task Farming on the Grid
• TFM implemented in Cactus
• GAT (GRAM, GRMS) used for starting remote TFMs
• Designed for the Grid
• Tasks can be anything
[Diagram: several TFM boxes; tasks started by fork/exec]
Task Farming Motivation
• Requested by local physics group
• Parameter surveys, e.g. looking for critical phenomena in gravitational wave collapse by varying amplitude, testing different formalisms of Einstein Equations for evolving same initial data
• Scenario is inherently quite robust and fault tolerant
• Good migration path to the Grid
• Start easy (not too much Grid!), task farm across local homogeneous workstations and on single supercomputers.
• Use public keys first, then test standard Grid infrastructure
• Use of GAT then means users can start testing GridLab services (should still work for them if services not ready)
• CGAT team can then test real physics runs using wider Grid and GridLab services.
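The parameter-survey pattern on this slide can be sketched as a small local task farm. This is only an illustration in Python, not the Cactus TFM: the names `run_task` and `farm`, and the stand-in task (a child process that squares the amplitude), are assumptions made for the example. A real survey would launch a simulation binary instead.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_task(amplitude):
    """Launch one task as a separate process and return (amplitude, result)."""
    # Stand-in task (assumption): a child Python process that squares the value.
    cmd = [sys.executable, "-c", f"print({amplitude} ** 2)"]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return amplitude, float(done.stdout.strip())


def farm(amplitudes, max_workers=4):
    """Run one task per amplitude, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_task, amplitudes))


if __name__ == "__main__":
    for amplitude, value in sorted(farm([0.1, 0.2, 0.3, 0.4]).items()):
        print(amplitude, value)
```

Because every task is an independent process, a failed task can be rerun without touching the others, which is one reason the scenario tolerates faults well.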
Task Farming on the Grid
[Diagram: Generic Part, Application Specific]
Grid-xclock
Simple application for testing and debugging. xclock is a standard X utility that runs on any machine with X installed.
• Requires:
• xclock binary
• X libraries
• To display remotely, need to open outgoing ports from the machine it is running on to the machine displaying
Grid-Black Holes
• Task farm small Cactus black hole simulations across testbed
• Parameter survey: black hole corotation parameter
• Results steer a large production black hole simulation
• Now push to bring this to physics userbase and incorporate GridLab services
• Requires:
• Black hole binary
• C/Fortran/MPI libraries
• How to run MPI jobs on a known set of nodes
• To contact Steering server, need to open outgoing ports from machine it is running on to server
What we did …
• Need a Cactus black hole (MPI/Fortran) binary on each machine
• Log in interactively to each machine (gsissh)
• Set up standard user environment (paths, scratch space, …)
• Install Cactus and utilities in standard location (e.g. $HOME/binary/cactus_blackhole)
• Test that the executable runs in the usual login environment
Testbed Problems
• Organization
• People working in the testbed collaboration not always in close contact with local administrators/policy makers
• General coordination and status reporting of 70 machines
• Accounts
• Local policies for creating accounts differ
• Basically no way to create limited access/use accounts for us
• Different resources available: e.g. file spaces, inodes
• Lack of access via gsissh a big problem with many machines, requiring lots of coordination with administrators
• Really need group accounts for such an endeavor (e.g. CAS)
• Needed some gymnastics with gridmap files (existing accounts)
Testbed Problems
• Machines
• Resources at main centers usually well documented
• (although Grid software, installations and support usually not documented)
• Other resources not usually documented, need to find compilers, scratch space etc.
• Local changes to “standard” queuing systems etc.
• Setting up user environment
• A few machines have rather strange setups
• Firewalls
• Many machines firewalled in different ways.
• Need a lot of lobbying at big centers to open needed ports
• Often ports only opened to specific addresses (hard for demoing in Baltimore)
Testbed Problems
• Application Installation
• MPI is sometimes hard to use (many different implementations, LAM, MPICH, ScaliMPI, Native, …)
• Even with very portable applications initial compilation and testing can be very time consuming
• Need robust tools to help with this e.g. GridMake (AEI)
• Grid Installations
• Not well (or at all) documented
• Different versions and patches
• Local tweaks to installations
• Firewalls can change even daily
• Functioning of software can change even daily!!
• Incomplete installations (e.g. no gsissh)
• Certificates
• Various problems with all the different machine and user certificates
Testbed Problems
• Globus Infrastructure
• Main problems with Globus are with deployment
• Proxy delegation
• Start a run, get a limited proxy which can’t be used to start another run
• Setting user environment for deploying applications
• MPI runs set up different environments on different processors?
• Xclock not on standard path
• X libraries not on standard library path
Deployment of Applications
• To run any application need correct user environment
• Path to any executables
• Home directory and other directories
• Location of needed libraries, X, C, Fortran, MPI
• Could be many others depending on the application
• Note that machines typically have multiple compilers, MPI installations … have to use correct ones for a given executable
• In usual interactive use of machines many of these are set in e.g. user’s .cshrc
• Globus user environment
• Starting jobs with Globus only provides a minimal environment
• Rationale is that resources are not used interactively, correct environment should be passed in from outside
• RSL syntax provides way to pass in requested environment
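As a rough illustration of passing an environment in through RSL, the sketch below assembles an RSL string in Python. The `executable`, `count`, `jobtype` and `environment` attributes are standard RSL relations; the helper name `build_rsl`, the executable path and the variable values are hypothetical, and the quoting rule (double an embedded quote) should be checked against the RSL specification for the Globus version in use.

```python
def build_rsl(executable, environment, count=1, jobtype="single"):
    """Return an RSL string that passes `environment` to the job."""

    def quote(value):
        # Assumption: RSL quoted literals escape an embedded quote by doubling it.
        return '"' + str(value).replace('"', '""') + '"'

    env = "".join(
        f"({name} {quote(value)})" for name, value in environment.items()
    )
    return (
        f"&(executable={quote(executable)})"
        f"(count={count})"
        f"(jobtype={jobtype})"
        f"(environment={env})"
    )


if __name__ == "__main__":
    # Hypothetical path and values, for illustration only.
    print(
        build_rsl(
            "/home/user/binary/cactus_blackhole",
            {"PATH": "/usr/local/bin:/usr/bin", "LD_LIBRARY_PATH": "/usr/local/lib"},
            count=4,
        )
    )
```

The string only carries values the caller already knows, so finding the correct values for each machine remains a separate problem.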
Deployment of Applications
• For our current use of resources this is a real problem.
• Even though you can pass in user environment how do you get the correct values for a given machine (it isn’t on MDS now).
• How do you get the correct executable in the first place?
• Could provide statically linked executables (executable repository) but still need to provide them at least for each machine, each OS version, each MPI/F90 combination.
• Applications will need to provide a list of which variables need to be set to be run (standard way to specify this?)
• Do we need a Grid equivalent of “modules” functionality (module load gnu, module load mpi-mpich)
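One possible shape for such a list of required variables is sketched below in Python. No standard format is implied: the function name `missing_variables` and the variable names in the example manifest (including `MPI_HOME`) are hypothetical.

```python
import os


def missing_variables(required, environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    # Hypothetical manifest an application might ship with its executable.
    required = ["PATH", "LD_LIBRARY_PATH", "MPI_HOME"]
    missing = missing_variables(required)
    if missing:
        print("cannot run, unset variables:", ", ".join(missing))
    else:
        print("environment looks complete")
```

Running a check like this as the first step of a grid job reports an incomplete environment before the real executable fails in a less obvious way.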
Deployment of Applications
• Frustrating right now, because user environment is usually correctly set for interactive use, but how can we make use of this in a grid environment?
• Use globusrun to invoke the correct interactive shell on any machine? E.g. run “csh –csh”
• In practice this worked
• Around 35 machines worked for grid-xclock
• Only 7 machines worked for grid-blackholes (MPI/Fortran)
• Currently investigating why it didn’t fully work by comparing environment obtained on a machine when entering in different ways
• Machines not consistently set up?
• Environment passed in inconsistent manner to all processors?
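The comparison of environments mentioned on this slide could be done along the following lines. This Python sketch assumes the output of `env` has been captured once from an interactive login and once from a grid-launched job on the same machine; the function names and sample values are hypothetical.

```python
def parse_env(text):
    """Parse `env`-style output (NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            env[name] = value
    return env


def diff_env(interactive, grid):
    """Map each differing variable to (interactive value, grid value)."""
    report = {}
    for name in sorted(set(interactive) | set(grid)):
        a, b = interactive.get(name), grid.get(name)
        if a != b:
            report[name] = (a, b)
    return report


if __name__ == "__main__":
    # Hypothetical captures from one machine.
    login = parse_env("PATH=/usr/bin:/opt/mpi/bin\nMPI_HOME=/opt/mpi\n")
    job = parse_env("PATH=/usr/bin\n")
    for name, (a, b) in diff_env(login, job).items():
        print(f"{name}: interactive={a!r} grid={b!r}")
```

A value of `None` on one side means the variable was not set at all in that environment.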
MPI/Fortran Applications
• Require many more details about environment
• Location of MPI/Fortran libraries for a particular compiler and MPI implementation
• Problems with interpretation of RSL keywords on some machines
• Wanted to be given a set of processors on which the TFM would start different MPI tasks
• E.g. jobtype=“single”, count = 4 would sometimes start up 4 versions of the TFM instead of a single TFM in control of 4 processors
• How can you tell which processors you were actually allocated?
• On clusters the TFM typically needs this information in order to start MPI runs with a machines file
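Assuming the allocated host names are known, dividing them among MPI tasks and writing a machines file for each could look like the Python sketch below. It is illustrative only: the function names and host names are hypothetical, and the one-host-per-line layout is the simple form that several MPI launchers accept, so the format should be checked for the MPI implementation in use.

```python
def split_hosts(hosts, procs_per_task):
    """Partition allocated hosts into equal groups, one group per MPI task.

    Hosts left over after the last full group are not used.
    """
    if procs_per_task < 1:
        raise ValueError("procs_per_task must be at least 1")
    usable = len(hosts) - len(hosts) % procs_per_task
    return [hosts[i:i + procs_per_task] for i in range(0, usable, procs_per_task)]


def render_machines_file(hosts):
    """Render one host name per line (one entry per processor)."""
    return "\n".join(hosts) + "\n"


if __name__ == "__main__":
    # Hypothetical allocation: four processors on two nodes.
    allocated = ["node1", "node1", "node2", "node2"]
    for index, group in enumerate(split_hosts(allocated, 2)):
        print(f"# machines file for task {index}")
        print(render_machines_file(group), end="")
```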
Lessons Learnt from SC2002
• Need to really think about the design of scenarios for the Grid (firewalls, NAT/internal cluster nodes, environment)
• Need more communication of requirements and problems with infrastructure developers (GridFTP, Globus, RSL)
• Real testbeds and real applications are crucial! (70 GGTC testbed machines, 35 “worked” with Grid x-clock, 7 “worked” with Grid black holes [Fortran/MPI])
• Need to think more about compute resources
• General machine setup (environment)
• Deployment of Grid software
• Intermachine connectivity (firewalls, NAT, IPv6?)
• Need reliable Grid tools: Testbed tests/status, gridmake (AEI), grid debuggers, grid profilers.
Summary
• Lots of problems with running real applications on today’s machines with today’s Grid infrastructure
• This is what GridLab is addressing: co-development of applications and infrastructure on a real testbed
• GAT will allow us to develop our applications ready for the Grid
• Applications can still run as they do today, but can test/make use of (anyone’s) services as they are ready
• Allows us to simultaneously work with our resources to also make them ready for the Grid