NDGF CO2 Community Grid Olli Tourunen NORDUnet 2008 Espoo, Finland April 10th 2008
Topics • Project overview • First use case • Requirements and architecture • Implementation • Experiences • Statistics • Future
CO2-CG overview • The NDGF Community Grid (CO2-CG) project aims to build an application environment for scientists studying CO2 sequestration • CO2-CG was selected in the NDGF call for community projects, along with BioGrid • NDGF provides the project coordinator and a half FTE for application grid integration, plus funding for a full FTE for community software development • One-year project, started in fall 2007 • Project coordinator: Michael Gronager (NDGF) • Project leader: Klaus Johannsen (BCCS, Bergen) • Science specialist: Philip Binning (DTU, Copenhagen) • Software developer: Csaba Anderlik (BCCS, Bergen) • Grid specialist: Olli Tourunen (NDGF)
First use case for CO2-CG • Parameter study of different attributes of potential CO2 sequestration reservoirs • Software: MUFTE-UG, a general-purpose simulator for multi-phase, multi-component flow in porous media • Pilot user: Andreas Kopp (University of Stuttgart)
First use case (contd.) • On the order of hundreds of 32- to 64-processor parallel simulations, computationally bound (not data intensive) • One simulation covers a time frame of approximately 50 years, starting from CO2 injection into the reservoir • Why parallel? Isn’t this a parameter study after all? • A single 32-process run typically takes 3-4 days to complete • With 16 processes a run might take over a week • Resources for these simulations are provided mainly by NOTUR, the Norwegian national infrastructure for computational science
Requirements • Main target: provide scientists with transparent access to computational resources in the grid • Input: the user’s working directory containing the MUFTE-UG source code and the input files for the simulation • Output: simulation results returned to the user in a user-specified directory • Support for the NorduGrid ARC middleware • Standard grid credential handling, to avoid the need for custom security policies at participating sites
Architecture overview (diagram): an application server hosts the command line UI, the job database (job descriptions 1-3) and the Grid Job Manager; software (S) is submitted through ARC to the grid resources (Cluster A, Cluster B, Supercomputer C), each providing the MUFTE Runtime Environment, and results (R) flow back to the application server.
Architecture • Command line UI (application server) • Introduces one keyword ‘grid’ which can be invoked with different options, à la openssl • Example: • The user prepares the source code and input files for a simulation in a directory of her choice • The user issues a command like ‘grid submit -np 32’ • The submit module packages the simulation directory into the spool directory and inserts the parameters into the database • The user tracks the progress by running ‘grid status’ • The results are made available to the user when the job finishes (a minimal sketch of such a front end follows below)
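Below is a minimal sketch of how such a ‘grid’ front end could be structured as argparse subcommands. The spool path, the jobs table layout and the sqlite stand-in for the real PostgreSQL database are assumptions made for illustration, not the project’s actual code.

```python
# Illustrative sketch of a 'grid'-style command line front end built on
# argparse subcommands (invoked a la openssl). The spool path, table layout
# and sqlite stand-in for PostgreSQL are assumptions made for this example.
import argparse
import datetime
import shutil
import sqlite3
import uuid
from pathlib import Path

SPOOL = Path("/var/spool/co2-cg") / "demo-user"   # per-user spool area (illustrative)

def submit(args):
    """Package the current simulation directory and record the job parameters."""
    SPOOL.mkdir(parents=True, exist_ok=True)
    job_id = uuid.uuid4().hex
    archive = shutil.make_archive(str(SPOOL / job_id), "gztar", root_dir=".")
    with sqlite3.connect(SPOOL / "jobs.db") as db:
        db.execute("CREATE TABLE IF NOT EXISTS jobs "
                   "(id TEXT, np INTEGER, state TEXT, created TEXT)")
        db.execute("INSERT INTO jobs VALUES (?, ?, 'NEW', ?)",
                   (job_id, args.np, datetime.datetime.utcnow().isoformat()))
    print(f"submitted {job_id} ({args.np} processes), package: {archive}")

def status(args):
    """List all jobs and their current states."""
    with sqlite3.connect(SPOOL / "jobs.db") as db:
        for row in db.execute("SELECT id, np, state FROM jobs"):
            print("%-34s np=%-3d %s" % row)

def main():
    parser = argparse.ArgumentParser(prog="grid")
    sub = parser.add_subparsers(dest="command", required=True)
    p_submit = sub.add_parser("submit")
    p_submit.add_argument("-np", type=int, default=32, help="number of MPI processes")
    p_submit.set_defaults(func=submit)
    sub.add_parser("status").set_defaults(func=status)
    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```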
Architecture (contd.) • Grid Job Manager (application server) • Scans the database for new jobs • Prepares the new jobs for the grid based on their job parameters • Submits the jobs to the grid • Keeps track of the grid jobs • Downloads the results when a job is ready • Downloads the evidence for an autopsy when a job fails (a sweep sketch follows below) • MUFTE Runtime Environment (grid resource) • Standard ARC Runtime Environment • Compiles the software based on the local configuration and environment • Runs the simulation
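One GJM sweep could look roughly like the sketch below, assuming the standard ARC user commands ngsub, ngstat and ngget; the database helpers (fetch_jobs, update), the job description builder and the state strings are hypothetical placeholders.

```python
# Rough sketch of one Grid Job Manager sweep (run periodically from cron,
# under the user's credentials). db.fetch_jobs/db.update and build_xrsl are
# hypothetical placeholders; ngsub, ngstat and ngget are the standard ARC
# client commands the GJM drives via subprocess.
import subprocess

def sweep(db):
    # 1. Submit jobs that the command line UI has marked as NEW.
    for job in db.fetch_jobs(state="NEW"):
        xrsl_file = build_xrsl(job)                    # write the job description to a file
        out = subprocess.run(["ngsub", "-f", xrsl_file],
                             capture_output=True, text=True, check=True)
        job.grid_job_id = out.stdout.split()[-1]       # naive: take the last token as the job ID
        db.update(job, state="SUBMITTED")

    # 2. Poll jobs that are already running in the grid.
    for job in db.fetch_jobs(state="SUBMITTED"):
        out = subprocess.run(["ngstat", job.grid_job_id],
                             capture_output=True, text=True)
        if "FINISHED" in out.stdout:
            # 3. Download the results into the spool area when the job is ready.
            subprocess.run(["ngget", "-dir", job.spool_dir, job.grid_job_id], check=True)
            db.update(job, state="DONE")
        elif "FAILED" in out.stdout:
            # 4. Download whatever evidence is available for the autopsy.
            subprocess.run(["ngget", "-dir", job.spool_dir, job.grid_job_id])
            db.update(job, state="FAILED")
```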
Implementation • Grid Job Manager (GJM) • There is one GJM instance per user • A “one sweep at a time” job, intended to be launched from cron • Runs under the user’s credentials • Spools active jobs in /var/spool/co2-cg/&lt;user&gt; • Written in Python • Uses an object-relational mapper, SQLAlchemy (mapping sketch below) • Interacts with the ARC grid middleware through the standard user commands • A Python API for ARC is also available; it might be used in the future
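A hypothetical sketch of how the jobs table could be mapped with SQLAlchemy follows; the real schema (three main tables plus auxiliary ones, see the next slide) is not shown in the slides, so the columns and names here are assumptions.

```python
# Hypothetical SQLAlchemy mapping for a jobs table; the real schema
# (three main tables plus auxiliary ones) differs, these columns are
# only illustrative.
from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class SimulationJob(Base):
    __tablename__ = "simulation_job"
    id = Column(Integer, primary_key=True)
    owner = Column(String, nullable=False)            # grid identity of the submitter
    num_processes = Column(Integer, nullable=False)   # e.g. 16, 32 or 64
    state = Column(String, default="NEW")             # NEW / SUBMITTED / DONE / FAILED
    grid_job_id = Column(String)                      # ID returned by ngsub
    created = Column(DateTime)

engine = create_engine("postgresql:///co2cg")         # the standard PostgreSQL backend
Session = sessionmaker(bind=engine)

# A GJM sweep can then pick up new work with something like:
#   session.query(SimulationJob).filter_by(state="NEW").all()
```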
Implementation (contd.) • Database • Standard PostgreSQL relational database • 3 main tables plus some auxiliary ones • Runtime Environment • Compilation is done on the ARC server host before the job is submitted, using the user’s credentials • Compilation and execution parameters are based on the job attributes in the DB • Supported levels of parallelism are encoded in the RE name (e.g. MUFTE-MPI-64-1.0), see the job description sketch below
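Since the supported degree of parallelism is encoded in the RE name, the job description only needs to request the matching runtime environment. The sketch below (playing the role of the build_xrsl placeholder in the sweep sketch above) illustrates this; the xRSL attributes executable, count and runtimeenvironment are standard, while the wrapper script name and job name are illustrative.

```python
# Sketch: the parallelism encoded in the RE name doubles as the brokering
# hint, so the job description only has to request the matching RE.
# The wrapper script and job name below are illustrative assumptions.
def make_job_description(executable: str, num_processes: int, re_version: str = "1.0") -> str:
    runtime_env = f"MUFTE-MPI-{num_processes}-{re_version}"   # e.g. MUFTE-MPI-64-1.0
    return (
        "&"
        f'(executable="{executable}")'
        f"(count={num_processes})"
        f'(runtimeenvironment="{runtime_env}")'
        '(jobname="co2-cg-simulation")'
    )

print(make_job_description("run_mufte.sh", 64))
```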
Challenges • Transparent grid credential handling • Balancing security policies against ease of use • Parameterization of parallel runs • User needs vs. types of resources vs. available resources • No explicit brokering support for this in ARC • This can be done with clever RE naming • Database access right management (not really an issue until this goes to a bigger scale) • Lots of different possibilities to solve this if needed (DB-level access rights, per-user tablespaces, row change staging, an n-layer architecture outside the DB…). So far we have applied KISS.
Experiences: User side • Users can access a significant number of distributed resources in a transparent manner • Peak so far: 512 cores simultaneously in use • Problems • Memory specifications for the jobs • Walltime specifications for the jobs • Getting all the information needed to debug jobs that have crashed • Non-converging jobs
Experiences: Operator side • It takes around a day to set up the MUFTE RE on a new cluster • If the site has experience in running MPI jobs through ARC, the process is quite straightforward • In one case we have also had to set up a cross-compiling facility • AA is easy to configure since the users are managed in the NDGF VOMS • Since not that many parallel jobs are run in the grid, the ARC LRMS interface needed some tweaks on some clusters • Thanks to all the sysadmins that have helped us along the way!
Statistics • Since February 12th 2008, over 400 simulations of 16 to 64 processors each have been run • Total compute time: around 230,000 hours • Disclaimer: measurements were done on the application server side, not from resource accounting.
Future developments • Switching focus to operation • Software and application server hardening • Automated tests for the runtime environments + blacklisting • Cleanup procedures • Integrate CO2-CG into the NDGF accounting system • Track the simulations that are not converging • Easier certificate handling • Possibly a web portal for job tracking and collaboration • Include the new Cray XT4 in Bergen
Conclusions • With moderate effort, simple tools and an application-specific user interface, grid resource usage can be made easy for end users • On-demand compilation works for selected applications • Parallel jobs can be run at large scale in the grid with little effort
Thank you! Questions, comments?