Activities of the COST D37 GridChem Computational Chemistry Workflow Group. EGEE'07 Conference Budapest 01.10.2007. Partners in the CCWF Working Group. København. Thomas Steinke, Tim Clark (DE) Hans-Peter Lüthi, Martin Brändle (CH) Peter Murray-Rust , Henry Rzepa (UK)
Activities of the COST D37 GridChem Computational Chemistry Workflow Group
EGEE'07 Conference, Budapest, 01.10.2007
Thomas Steinke, Zuse Institute Berlin (ZIB) <www.zib.de>, steinke@zib.de
Partners in the CCWF Working Group
• Thomas Steinke, Tim Clark (DE)
• Hans-Peter Lüthi, Martin Brändle (CH)
• Peter Murray-Rust, Henry Rzepa (UK)
• Antonio Márquez (ES)
• Kurt Mikkelsen (DK)
• CSCS (Manno, CH)
• ZIB (Berlin, DE)
Map labels: København, Cambridge, Berlin, London, Erlangen, Zürich, Manno, Sevilla
“Traditional” Workflow in Computational Chemistry in the 80’s – 90’s
Workflows have a long tradition in the CC domain.
Flow diagram: start → knowledge base (DB search) → molecular structures (automated/manually edited) → molecular simulations (method / program A, method / program B, …) → properties → primary visualization / quality control → analysis / archival / DB storage → new insights?
Databases: Computational protocol (T. Clark, 1998)
• Complete protocol runs automatically with less than 0.5% failure rate.
• Cleanup
• 2D → 3D conversion
• VAMP optimization
• Calculate properties
• ~3,000 compounds per processor day (3 GHz Xeon)
Reference: B. Beck, A. Horn, J. E. Carpenter, and T. Clark, “Enhanced 3D-Databases: A Fully Electrostatic Database of AM1-Optimized Structures”, J. Chem. Inf. Comput. Sci. 1998, 38, 1214-1217.
source: Tim Clark, Uni Erlangen
Distributed Computing Environment in the 90’s QM packages
Distributed Computing Environment in the 90’s Example: UniChem distributed environment for quantum-chemical simulations Cray Research Inc. 1991-(2004)
CCWF Chemical Illustrator Applications
• Molecular design of functionalised enzymes: Hans-Peter Lüthi, Martin Brändle, Zürich; Peter Murray-Rust, Cambridge; Henry Rzepa, London
• Quantum chemical based QSAR/QSPR: Tim Clark, Erlangen; Jon Essex, Southampton
• High-order dynamic and static electrostatic molecular properties: Kurt Mikkelsen, Copenhagen
• Computational heterogeneous catalysis: Antonio M. Márquez Cruz, Javier Fdez. Sanz, Sevilla
Molecular Design Workflow (Enzyme Design)
Diagram: QC Input → QC Application → QC Output → Parser → DB → XPath Query → XML → XSLT → Input → Statistical Analysis → Output
Steps:
• Generation and archiving of data
• Extraction: XPath queries
• Statistical analysis
source: Hans-Peter Lüthi, ETH Zürich
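The extraction step can be illustrated with a minimal Python sketch. It assumes that the parser has already written the QC output to XML; the element and attribute names (`results`, `calculation`, `property`) and all numbers are hypothetical and are not taken from the talk. Python's standard library supports only a subset of XPath, which is enough for this kind of attribute filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical parser output: element names, attributes and values are
# illustrative only, not the schema used by the working group.
SAMPLE = """
<results>
  <calculation id="mol1" method="AM1">
    <property name="energy" units="kcal/mol">-12.5</property>
    <property name="dipole" units="debye">1.8</property>
  </calculation>
  <calculation id="mol2" method="AM1">
    <property name="energy" units="kcal/mol">-9.0</property>
  </calculation>
</results>
"""

def extract_property(xml_text, name):
    """Return {calculation id: value} for every property with the given name."""
    root = ET.fromstring(xml_text)
    values = {}
    for calc in root.findall("./calculation"):
        # ElementTree implements a limited XPath subset, including [@attr='value'].
        prop = calc.find(f"./property[@name='{name}']")
        if prop is not None:
            values[calc.get("id")] = float(prop.text)
    return values

# Extraction followed by a trivial "statistical analysis" (the mean).
energies = extract_property(SAMPLE, "energy")
mean_energy = sum(energies.values()) / len(energies)
```

In a real setup the query would run against the project database or an XSLT stage rather than an in-memory string.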
Quantum Chemical Based QSAR and QSPR
• generate structures, conformations and protonation states
• semiempirical MO geometry optimization and electron density
• generate isodensity surfaces, spherical-harmonic fits and local properties
• apply models
Diagram labels: 2D-Database; 2D → 3D; Conformations, Tautomers; QSPR; VAMP; Materials Design; Virtual Screening; ParaSurf; Multiscale Modeling; ADME/Tox.; Pharmacokinetics; Property Optimization; Molecular Info
source: Tim Clark, Uni Erlangen
Properties: Free Energies of Hydration
N = 362, MUE = 0.85 kcal mol⁻¹, RMSD = 1.09 kcal mol⁻¹, r² = 0.88, q² = 0.83
source: Tim Clark, Uni Erlangen
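For readers unfamiliar with the error measures quoted above, the following sketch shows how MUE (mean unsigned error) and RMSD are commonly computed from predicted and reference values. Taking r² as the squared Pearson correlation is an assumption, since the slide does not state which definition was used. q² is a cross-validated analogue of r² and is not computed here. The sample numbers are made up and are not the hydration data from the talk.

```python
import math

def mue(pred, ref):
    """Mean unsigned error: average of |predicted - reference|."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

def rmsd(pred, ref):
    """Root-mean-square deviation between predicted and reference values."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def r_squared(pred, ref):
    """Squared Pearson correlation (assumed definition of r2)."""
    n = len(ref)
    mean_p, mean_r = sum(pred) / n, sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov * cov / (var_p * var_r)

# Made-up example values in kcal/mol, for illustration only.
reference = [-6.3, -4.7, -0.5, 1.9]
predicted = [-5.8, -5.1, -1.2, 2.3]
stats = {
    "MUE": mue(predicted, reference),
    "RMSD": rmsd(predicted, reference),
    "r2": r_squared(predicted, reference),
}
```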
Computing the NCI database (P. Murray-Rust, ’05) MOPAC PM5 Workflow built with Taverna source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
Times to run jobs source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
Diagram labels: Unsuitable Data; Program Crashes; Pathological Behaviour; Inform Developer; Protocol; System Crashes; Log Files; Statistics; Science; Errors; Parse; Analysis; Other Science; Disseminate Results
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
Conclusions from NCI “Experiment” (2005) • Protocols can be automated • Machines can highlight unusual behaviour, geometries and distribution of results for humans to consider • Computational programs can provide high quality “experimental” molecular properties source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
Motivation
• The orchestration of complex workflow scenarios is on today’s agenda:
  • complex scientific solution paths
  • linking in-house and (commercial) legacy codes
• Transformation of scientific ventures into a scientifically validated protocol
  • allowing highly (semi-)automated data generation (pre-processing) and data processing steps
Goals of the CCWF Working Group
• implementation of workflow environments for QC by adapting standard (Grid) technologies
• fostering standard techniques (interfaces) for handling quantum chemical data in a flexible and extensible format, to ensure interoperability of application programs and to support efficient access to chemical information based on a CC ontology
• implementation of computational chemistry illustrator scenarios to demonstrate the applicability of our approach
Generic Workflow • Automatic generation + validation of input data • Submission, monitoring, and gathering of output data of simulation jobs • Integration of results (primary data) into project database • Data mining and visualization techniques to reduce complexity • Knowledge generation by applying methods of statistical analysis and pattern recognition. • On-line publication and archiving of valuable scientific data.
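The first steps of the generic workflow can be sketched as a chain of small Python functions. This is only an illustration of the structure: all function names are hypothetical, the "simulation" is a placeholder that performs no chemistry, and the project database is a plain dictionary.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Job:
    name: str
    input_text: str
    output: dict = field(default_factory=dict)

def generate_inputs(molecules):
    """Generate and validate input data (validation here: non-empty name)."""
    return [Job(name=m, input_text=f"optimize {m}") for m in molecules if m.strip()]

def run_jobs(jobs, simulate):
    """Submit jobs and gather their output; `simulate` stands in for a QC program."""
    for job in jobs:
        job.output = simulate(job.input_text)
    return jobs

def store_results(jobs, database):
    """Integrate the primary data into the project database (a dict here)."""
    for job in jobs:
        database[job.name] = job.output
    return database

def analyse(database, key):
    """Reduce the stored data to one summary statistic."""
    return mean(record[key] for record in database.values())

def fake_simulation(input_text):
    # Placeholder: returns a made-up number derived from the input length.
    return {"energy": -float(len(input_text))}

# The empty molecule name is rejected by the validation step.
db = store_results(
    run_jobs(generate_inputs(["water", "", "methane"]), fake_simulation), {}
)
summary = analyse(db, "energy")
```

A workflow system replaces the hand-written chaining at the bottom with a declarative description and adds monitoring, persistence and distribution.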
Challenges
• Diversity: Molecular properties derived from state functions obtained with electronic-structure methods.
  • ab-initio, semi-empirical, DFT, approximate potentials
  • Gaussian, COLUMBUS, Dalton, Turbomole, MOPAC, Vamp, CPMD, …
• Data formats: How to implement seamless data export/import?
  • ~80 relevant formats known in CC: XYZ, MDL, SDF, PDB, …
  • OpenBABEL
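To make the data-format problem concrete, here is a minimal reader and writer for the XYZ format, one of the simplest formats listed above (atom count, comment line, then one line per atom with element symbol and Cartesian coordinates). It is a hand-written sketch for illustration and does not use Open Babel; the water geometry is approximate example data.

```python
def parse_xyz(text):
    """Parse XYZ text into (comment, [(symbol, x, y, z), ...])."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    comment = lines[1]
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms

def write_xyz(comment, atoms):
    """Serialise the structure back to XYZ text."""
    body = [f"{s} {x:.6f} {y:.6f} {z:.6f}" for s, x, y, z in atoms]
    return "\n".join([str(len(atoms)), comment, *body]) + "\n"

# Approximate water geometry in Angstrom, for illustration only.
WATER = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""

comment, atoms = parse_xyz(WATER)
round_trip = parse_xyz(write_xyz(comment, atoms))
```

Each additional format needs its own reader and writer of this kind, which is why a shared conversion library or a common XML representation becomes attractive once many programs are involved.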
Challenges (cont.)
• Scaling, Robustness, Load Balancing: I can handle O(10) jobs by hand but… what about campaigns of O(1000) jobs?
  • workflow system
  • computational resources → distributed computing
  • persistence, automated failure recovery, …
  • long simulation times, sometimes unpredictable
• Acceptance:
  • ease of use, GUI + CLI
What I Want…
• ease of use:
  • workflow orchestration
  • usage
  • installation / maintenance
• sharing of workflow descriptions with my colleagues
  • standard languages
• support in a heterogeneous environment
  • laptop – server – cluster – supercomputer – grid
Which Workflow System? … to be spoilt for choice?
Some Assessment Criteria
• workflows in distributed systems
• supported batch systems: PBS (LSF)
• support for managing large files
• recovery / backup
• quality of the documentation
• customizability
• PKI / security
• required installation effort
• Web interface
• WF language
• robustness, stability
• Grid environment
• open source
• restart/stop/debugging
• user/installation base
• status & exception handling
• legacy codes and Web services
• project development activity
• GUI
TRIANA Experiences (2005/06)
• workflow orchestration
• integration of web services
• semantic check of WSDL files
• support for self-written Triana modules
• negligible control logic overhead
• pre-requisite for migration to Grid environments
• proprietary workflow description language in TRIANA (BPEL is announced)
• GUI robustness for very complex workflow definitions
GWES Experiences (MediGRID, since 2006)
• integration of web services and legacy codes
• monitoring + debugging support
• Grid environments
• under active development (A. Hoheisel et al./FhG FIRST)
• workflow orchestration (WF GUI builder in preparation)
• proprietary workflow description language
OMII Server: Attracting Features
Workflows
• language: BPEL (Active BPEL)
• WF editor (Eclipse)
Web Services
• customization
Jobs
• submission & monitoring via WS
• job manager API
• persistent (job recovery), in-memory (via Hibernate)
Distributed Resource Management (DRM)
• Condor-G, Globus GRAM
• SSH-exec
• your own plug-ins, e.g. PBS
Data
• GridSAM file staging support
• within job (JSDL): file stage in/out
• Apache Virtual File System library (vfs)
  • FTP, local files, http, https, sftp
  • zip, jar, tar, bzip2, gzip
  • ram - data in memory
• GridFTP
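File stage-in and stage-out within a job description can be illustrated by generating a small JSDL document. The sketch below builds the XML with Python's standard library; the element names and namespaces follow the JSDL 1.0 specification as understood here and should be validated against the schema before use. The executable path and the URIs are hypothetical, and `build_job` is an illustrative helper, not part of GridSAM or OMII.

```python
import xml.etree.ElementTree as ET

JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"

def _tag(ns, name):
    return f"{{{ns}}}{name}"

def _add_staging(description, filename, uri, stage_in):
    """Append one DataStaging element: Source for stage-in, Target for stage-out."""
    staging = ET.SubElement(description, _tag(JSDL, "DataStaging"))
    ET.SubElement(staging, _tag(JSDL, "FileName")).text = filename
    ET.SubElement(staging, _tag(JSDL, "CreationFlag")).text = "overwrite"
    endpoint = ET.SubElement(staging, _tag(JSDL, "Source" if stage_in else "Target"))
    ET.SubElement(endpoint, _tag(JSDL, "URI")).text = uri

def build_job(executable, stage_in, stage_out):
    """Build a minimal job definition with file staging (illustrative helper)."""
    ET.register_namespace("jsdl", JSDL)
    ET.register_namespace("jsdl-posix", POSIX)
    definition = ET.Element(_tag(JSDL, "JobDefinition"))
    description = ET.SubElement(definition, _tag(JSDL, "JobDescription"))
    application = ET.SubElement(description, _tag(JSDL, "Application"))
    posix = ET.SubElement(application, _tag(POSIX, "POSIXApplication"))
    ET.SubElement(posix, _tag(POSIX, "Executable")).text = executable
    for filename, uri in stage_in.items():
        _add_staging(description, filename, uri, stage_in=True)
    for filename, uri in stage_out.items():
        _add_staging(description, filename, uri, stage_in=False)
    return definition

# Hypothetical executable path and URIs.
job = build_job(
    "/path/to/qc_program",
    {"input.dat": "ftp://example.org/in/input.dat"},
    {"output.log": "ftp://example.org/out/output.log"},
)
jsdl_text = ET.tostring(job, encoding="unicode")
```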
OMII/Active BPEL Experiences (3 months)
• workflow orchestration (Eclipse plugin)
• standardized WF language
• monitoring support
• Grid environments
• security features: https + signed messages (X.509 cert.)
• active development (UK eScience)
• deployment requires manual workarounds
• learning barrier (BPEL)
• BPEL editor not fully mature (validation of BPEL workflows)
Summary
• a couple of workflow systems are available
• design and development of workflow systems is still ongoing research
• our working group has not yet decided on a system
• barriers: ease of use vs. robustness
• middleware stack: more complicated Grid environments vs. script-based approaches on clusters
• standards vs. proprietary but powerful/sufficient WF languages
• BPEL has a high chance to survive
Acknowledgement
Core members of the D37 CCWF working group
• Hans-Peter Lüthi, ETH Zurich
• Tim Clark, CCC Uni Erlangen
• J. A. Townsend, P. Murray-Rust, S. M. Tyrrell, Y. Zhang, Uni Cambridge/Unilever Inst.
• developers of the workflow systems mentioned in this talk