NDGF CO2 Community Grid Olli Tourunen NORDUnet 2008 Espoo, Finland April 10th 2008
Topics • Project overview • First use case • Requirements and architecture • Implementation • Experiences • Statistics • Future
CO2-CG overview • The NDGF Community Grid (CO2-CG) project aims to build an application environment for scientists studying CO2 sequestration • CO2-CG was selected in the NDGF call for community projects, along with BioGrid • NDGF provides the project coordinator and a half FTE for application grid integration, plus funding for a full FTE for community software development • One-year project, started in fall 2007 • Project coordinator: Michael Gronager (NDGF) • Project leader: Klaus Johannsen (BCCS, Bergen) • Science specialist: Philip Binning (DTU, Copenhagen) • Software developer: Csaba Anderlik (BCCS, Bergen) • Grid specialist: Olli Tourunen (NDGF)
First use case for CO2-CG • Parameter study of different attributes of potential CO2 sequestration reservoirs • Software: MUFTE-UG, a general-purpose simulator for multi-phase, multi-component flow in porous media • Pilot user: Andreas Kopp (University of Stuttgart)
First use case (contd.) • On the order of hundreds of 32- to 64-processor parallel simulations, computationally bound (not data intensive) • One simulation covers a time frame of approximately 50 years, starting from CO2 injection into the reservoir • Why parallel? Isn’t this a parameter study after all? • A single 32-process run typically takes 3-4 days to complete • With 16 processes a run might take over a week • Resources for these simulations are provided mainly by NOTUR, the Norwegian national infrastructure for computational science
Requirements • Main target: provide scientists with transparent access to computational resources in the grid • Input: the user’s working directory containing the MUFTE-UG source code and the input files for the simulation • Output: simulation results returned to the user in a user-specified directory • Support for the NorduGrid ARC middleware • Standard grid credential handling, to avoid the need for custom security policies at participating sites
Architecture overview (diagram): an application server hosts the command line UI, the job database (job descriptions 1-3) and the Grid Job Manager; software (S) is submitted through ARC to the grid resources (Cluster A, Cluster B, Supercomputer C), each providing the MUFTE Runtime Environment, and results (R) flow back to the application server.
Architecture • Command line UI (application server) • Introduces one keyword ‘grid’ which can be invoked with different options, à la openssl • Example: • The user prepares the source code and input files for a simulation in a directory of her choice • The user issues a command like ‘grid submit -np 32’ • The submit module packages the simulation directory into the spool directory and inserts the parameters into the database • The user tracks the progress by running ‘grid status’ • The results are made available to the user when the job finishes (a minimal sketch of such a front end follows below)
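Below is a minimal sketch of how such a ‘grid’ front end could be structured as argparse subcommands. The spool path, the jobs table layout and the sqlite stand-in for the real PostgreSQL database are assumptions made for illustration, not the project’s actual code.

```python
# Illustrative sketch of a 'grid'-style command line front end built on
# argparse subcommands (invoked a la openssl). The spool path, table layout
# and sqlite stand-in for PostgreSQL are assumptions made for this example.
import argparse
import datetime
import shutil
import sqlite3
import uuid
from pathlib import Path

SPOOL = Path("/var/spool/co2-cg") / "demo-user"   # per-user spool area (illustrative)

def submit(args):
    """Package the current simulation directory and record the job parameters."""
    SPOOL.mkdir(parents=True, exist_ok=True)
    job_id = uuid.uuid4().hex
    archive = shutil.make_archive(str(SPOOL / job_id), "gztar", root_dir=".")
    with sqlite3.connect(SPOOL / "jobs.db") as db:
        db.execute("CREATE TABLE IF NOT EXISTS jobs "
                   "(id TEXT, np INTEGER, state TEXT, created TEXT)")
        db.execute("INSERT INTO jobs VALUES (?, ?, 'NEW', ?)",
                   (job_id, args.np, datetime.datetime.utcnow().isoformat()))
    print(f"submitted {job_id} ({args.np} processes), package: {archive}")

def status(args):
    """List all jobs and their current states."""
    with sqlite3.connect(SPOOL / "jobs.db") as db:
        for row in db.execute("SELECT id, np, state FROM jobs"):
            print("%-34s np=%-3d %s" % row)

def main():
    parser = argparse.ArgumentParser(prog="grid")
    sub = parser.add_subparsers(dest="command", required=True)
    p_submit = sub.add_parser("submit")
    p_submit.add_argument("-np", type=int, default=32, help="number of MPI processes")
    p_submit.set_defaults(func=submit)
    sub.add_parser("status").set_defaults(func=status)
    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```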
Architecture (contd.) • Grid Job Manager (application server) • Scans the database for new jobs • Prepares the new jobs for the grid based on their job parameters • Submits the jobs to the grid • Keeps track of the grid jobs • Downloads the results when a job is ready • Downloads the evidence for an autopsy when a job fails (a sweep sketch follows below) • MUFTE Runtime Environment (grid resource) • Standard ARC Runtime Environment • Compiles the software based on the local configuration and environment • Runs the simulation
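One GJM sweep could look roughly like the sketch below, assuming the standard ARC user commands ngsub, ngstat and ngget; the database helpers (fetch_jobs, update), the job description builder and the state strings are hypothetical placeholders.

```python
# Rough sketch of one Grid Job Manager sweep (run periodically from cron,
# under the user's credentials). db.fetch_jobs/db.update and build_xrsl are
# hypothetical placeholders; ngsub, ngstat and ngget are the standard ARC
# client commands the GJM drives via subprocess.
import subprocess

def sweep(db):
    # 1. Submit jobs that the command line UI has marked as NEW.
    for job in db.fetch_jobs(state="NEW"):
        xrsl_file = build_xrsl(job)                    # write the job description to a file
        out = subprocess.run(["ngsub", "-f", xrsl_file],
                             capture_output=True, text=True, check=True)
        job.grid_job_id = out.stdout.split()[-1]       # naive: take the last token as the job ID
        db.update(job, state="SUBMITTED")

    # 2. Poll jobs that are already running in the grid.
    for job in db.fetch_jobs(state="SUBMITTED"):
        out = subprocess.run(["ngstat", job.grid_job_id],
                             capture_output=True, text=True)
        if "FINISHED" in out.stdout:
            # 3. Download the results into the spool area when the job is ready.
            subprocess.run(["ngget", "-dir", job.spool_dir, job.grid_job_id], check=True)
            db.update(job, state="DONE")
        elif "FAILED" in out.stdout:
            # 4. Download whatever evidence is available for the autopsy.
            subprocess.run(["ngget", "-dir", job.spool_dir, job.grid_job_id])
            db.update(job, state="FAILED")
```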
Implementation • Grid Job Manager (GJM) • There is one GJM instance per user • A “one sweep at a time” job, intended to be launched from cron • Runs under the user’s credentials • Spools active jobs in /var/spool/co2-cg/&lt;user&gt; • Written in Python • Uses an object-relational mapper, SQLAlchemy (mapping sketch below) • Interacts with the ARC grid middleware through the standard user commands • A Python API for ARC is also available; it might be used in the future
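A hypothetical sketch of how the jobs table could be mapped with SQLAlchemy follows; the real schema (three main tables plus auxiliary ones, see the next slide) is not shown in the slides, so the columns and names here are assumptions.

```python
# Hypothetical SQLAlchemy mapping for a jobs table; the real schema
# (three main tables plus auxiliary ones) differs, these columns are
# only illustrative.
from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class SimulationJob(Base):
    __tablename__ = "simulation_job"
    id = Column(Integer, primary_key=True)
    owner = Column(String, nullable=False)            # grid identity of the submitter
    num_processes = Column(Integer, nullable=False)   # e.g. 16, 32 or 64
    state = Column(String, default="NEW")             # NEW / SUBMITTED / DONE / FAILED
    grid_job_id = Column(String)                      # ID returned by ngsub
    created = Column(DateTime)

engine = create_engine("postgresql:///co2cg")         # the standard PostgreSQL backend
Session = sessionmaker(bind=engine)

# A GJM sweep can then pick up new work with something like:
#   session.query(SimulationJob).filter_by(state="NEW").all()
```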
Implementation (contd.) • Database • Standard PostgreSQL relational database • 3 main tables plus some auxiliary ones • Runtime Environment • Compilation is done on the ARC server host before the job is submitted, using the user’s credentials • Compilation and execution parameters are based on the job attributes in the DB • Supported levels of parallelism are encoded in the RE name (e.g. MUFTE-MPI-64-1.0), see the job description sketch below
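Since the supported degree of parallelism is encoded in the RE name, the job description only needs to request the matching runtime environment. The sketch below (playing the role of the build_xrsl placeholder in the sweep sketch above) illustrates this; the xRSL attributes executable, count and runtimeenvironment are standard, while the wrapper script name and job name are illustrative.

```python
# Sketch: the parallelism encoded in the RE name doubles as the brokering
# hint, so the job description only has to request the matching RE.
# The wrapper script and job name below are illustrative assumptions.
def make_job_description(executable: str, num_processes: int, re_version: str = "1.0") -> str:
    runtime_env = f"MUFTE-MPI-{num_processes}-{re_version}"   # e.g. MUFTE-MPI-64-1.0
    return (
        "&"
        f'(executable="{executable}")'
        f"(count={num_processes})"
        f'(runtimeenvironment="{runtime_env}")'
        '(jobname="co2-cg-simulation")'
    )

print(make_job_description("run_mufte.sh", 64))
```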
Challenges • Transparent grid credential handling • Balancing security policies against ease of use • Parameterization of parallel runs • User needs vs. types of resources vs. available resources • No explicit brokering support for this in ARC • This can be done with clever RE naming • Database access right management (not really an issue until this goes to a bigger scale) • Lots of different possibilities to solve this if needed (DB-level access rights, per-user tablespaces, row change staging, an n-layer architecture outside the DB…). So far we have applied KISS.
Experiences: User side • Users can access a significant number of distributed resources in a transparent manner • Peak so far: 512 cores simultaneously in use • Problems • Memory specifications for the jobs • Walltime specifications for the jobs • Getting all the information needed to debug jobs that have crashed • Non-converging jobs
Experiences: Operator side • It takes around a day to set up the MUFTE RE on a new cluster • If the site has experience in running MPI jobs through ARC, the process is quite straightforward • In one case we have also had to set up a cross-compiling facility • AA is easy to configure since the users are managed in the NDGF VOMS • Since not that many parallel jobs are run in the grid, the ARC LRMS interface needed some tweaks on some clusters • Thanks to all the sysadmins that have helped us along the way!
Statistics • Since February 12th 2008, over 400 simulations of 16 to 64 processors each have been run • Total compute time: around 230,000 hours • Disclaimer: measurements were done on the application server side, not from resource accounting.
Future developments • Switching focus to operation • Software and application server hardening • Automated tests for the runtime environments + blacklisting • Cleanup procedures • Integrate CO2-CG into the NDGF accounting system • Track the simulations that are not converging • Easier certificate handling • Possibly a web portal for job tracking and collaboration • Include the new Cray XT4 in Bergen
Conclusions • With moderate effort, simple tools and an application-specific user interface, grid resource usage can be made easy for end users • On-demand compilation works for selected applications • Parallel jobs can be run at large scale in the grid with little effort
Thank you! Questions, comments?