UC DAVIS Department of Computer Science
San Diego Supercomputer Center
Kepler/SPA Extensions for Scientific Workflows – Now and Upcoming
Ilkay Altintas, SWAT lead, San Diego Supercomputer Center, altintas@sdsc.edu
Bertram Ludäscher, Dept. of Computer Science & Genome Center, University of California, Davis, ludaesch@ucdavis.edu
+ many other SDM/SPA & Kepler contributors!
KEPLER/CSP: Contributors, Sponsors, Projects
www.kepler-project.org
• Ilkay Altintas: SDM, NLADR, Resurgence, EOL, …
• Kim Baldridge: Resurgence, NMI
• Chad Berkley: SEEK
• Shawn Bowers: SEEK
• Terence Critchlow: SDM
• Tobin Fricke: ROADNet
• Jeffrey Grethe: BIRN
• Christopher H. Brooks: Ptolemy II
• Zhengang Cheng: SDM
• Dan Higgins: SEEK
• Efrat Jaeger: GEON
• Matt Jones: SEEK
• Werner Krebs: EOL
• Edward A. Lee: Ptolemy II
• Kai Lin: GEON
• Bertram Ludaescher: SDM, SEEK, GEON, BIRN, ROADNet
• Mark Miller: EOL
• Steve Mock: NMI
• Steve Neuendorffer: Ptolemy II
• Jing Tao: SEEK
• Mladen Vouk: SDM
• Xiaowen Xin: SDM
• Yang Zhao: Ptolemy II
• Bing Zhu: SEEK
• •••
LLNL, NCSU, SDSC, UCB, UCD, UCSB, UCSD, …, Zurich
SPA Collab. tools: IRC, cvs, skype, Wiki: hotTopics, FAQs, ..
GEON Dataset Generation & Registration (a co-development in KEPLER)
[Figure annotations: “% Makefile $> ant run”; SQL database access (JDBC)]
Contributors: Matt, Chad, Dan et al. (SEEK); Efrat (GEON); Ilkay (SDM/SPA); Yang (Ptolemy); Xiaowen (SDM/SPA); Edward et al. (Ptolemy)
Update: endo-SPA (exo-Kepler), endo-Kepler (exo-SPA), … w/o counting peas…
• No/minor changes:
  • XSLT, email, …
• Web service actor (SDM)
  • Updated: dynamic operation display, error reporting
• Command line actor (SDM)
  • Updated: improved interface and error handling
• SSH2 actor (SDM)
  • New: implements ssh2 protocol for remote execution (no plain password sent over the wire)
• Timestamp actor (SDM)
  • New: for logging
• BrowserUIv2.0 (SDM)
  • reimplemented, improved interface
  • v3.0 planned (“catching” http-get/post via localhost)
• Execution logger (SDM)
  • New: workflow “black box” for keeping track of runs
• Documentation framework (SDM)
  • Autogenerated actor documentation (new doclets and taglets)
• Ontology-based actor and dataset classification (SEEK)
  • Finding relevant components: actors and datasets, suggesting possible connections, …
• Kepler/SRB toolkit (GEON, SDM, SEEK, …)
  • improved interfaces, new functions
• …
Application Pull vs Technology Push
• Use case driven (application pull)
  • PIW, TSI-1, TSI-2, …
  • Solve technology issues along the way
(+) solves the particular scientists’ problem
(-) one-of-a-kind solutions, little generic & reusable technology
Example:
• TSI-1 and TSI-2 are conceptually almost identical scientific (“Grid/HPC/HTC”) workflows
• but they are implemented very differently, so reuse is limited, e.g., evolving/customizing one into the other is hard/impossible…
Application Pull vs Technology Push
• Technology driven (technology push)
• Generic application integration mechanisms:
  • web service actor, harvester, command-line actors, ssh2 actor, BrowserUI, …
• Specialized interfaces to HPC/HTC systems:
  • Large-scale data management:
    • SDSC SRB toolkit (set of SRB actors),
    • SRM?, PVFS2?, MPI-IO?, …
  • Interfacing with generic job schedulers:
    • NIMROD, Condor, APST, …
  • Interfacing with scientific packages:
    • Statistics toolkit (R, …), GIS (Grass, ArcIMS, Mapserver…)
    • GAMESS toolkit, APBS (visualization)…
(+) developing reusable technology / toolkits
(!) still need guidance by domain scientists’ problems, but need to lift one-off solutions into a general SWF engineering methodology
… creating prototype workflows and test cases (for automated tests) …
… putting them together in generic, reusable packages, e.g. Kepler/SRB toolkit
SRB holdings @ SDSC only: 404 TB in 59 million files across 5167 users (12/16/’04, Reagan Moore)
KEPLER/R Toolkit (under development) Source: Dan Higgins, Kepler/SEEK
Ontology-based Actor & Dataset Discovery
[Screenshots: ontology-based actor (service) and dataset search; result display]
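The idea behind ontology-based search can be illustrated as subsumption over semantic types: a query for a concept also returns components annotated with any of its sub-concepts. The following is only a minimal sketch in Python under stated assumptions. The toy ontology, the component names, and the helpers `is_subsumed_by` and `find_components` are all hypothetical and do not reflect Kepler's actual search implementation or API.

```python
# Hypothetical toy ontology: each concept maps to its parent (None = root).
ONTOLOGY = {
    "Measurement": None,
    "Biomass": "Measurement",
    "SpeciesAbundance": "Measurement",
    "Model": None,
    "RegressionModel": "Model",
}

# Hypothetical registry: component (actor or dataset) -> semantic type.
COMPONENTS = {
    "BiomassDataset2004": "Biomass",
    "AbundanceReader": "SpeciesAbundance",
    "LinearRegressionActor": "RegressionModel",
}


def is_subsumed_by(concept, ancestor):
    """Return True if `concept` equals `ancestor` or lies below it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)  # walk up one level
    return False


def find_components(query_concept):
    """Return sorted names of components whose type falls under the query."""
    return sorted(
        name
        for name, sem_type in COMPONENTS.items()
        if is_subsumed_by(sem_type, query_concept)
    )


if __name__ == "__main__":
    print(find_components("Measurement"))
```

Searching for the general concept finds both the dataset and the reader actor, which is the sense in which semantic annotations help locate relevant components.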
Example: GAMESS Quantum-mechanics cheminformatics workflow
• Job management infrastructure in place
• Results database: under development
• Goal: 1000’s of GAMESS jobs (quantum mechanics)
Technology-oriented meeting: May 12th Ptolemy/Kepler Miniconference in Berkeley
What’s needed, what’s next
• Build generic toolkits / packages
  • Don’t reinvent – Reuse!
  • Improved R coupling, SCIRun coupling, …
• SWF Framework that lets scientists choose…
  • SRB (Sput, Sget, …), SRM, MPI-IO, GlobusTK (GridFTP, …), Sabul, …, pNetCDF, parallel-R, … packages
  • Condor, Nimrod, … schedulers
  • GRASS, …
• General purpose SWF system/PSE that scientists can use themselves
Towards a KEPLER School of Expression (Flow-based Design Patterns)
• Generality vs specialization of actors
  • also loosely coupled vs tightly coupled
• Data transformation pipelines
  • alternate compute and data transformation steps
• Stage-execute-fetch pattern (Grid/HPC/HTC-WFs)
• Loops, higher-order functions (map, foldr, …)
  • cf. Taverna’s automatic loop insertion based on data types
• JDBC/SRB connection tokens, proxies, certificates
[Figure: connect A, B, C (methods, functions); map applies f to a producer’s [x1, x2, …, xn], giving [f(x1), …, f(xn)]; F-map takes functions [f1, f2, …, fn] and X]
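The patterns named on this slide can be sketched in ordinary code. The example below is a rough illustration in Python, not Kepler actor code. `stage`, `execute`, `fetch`, `pipeline`, and the in-memory `REMOTE_STORE` are hypothetical stand-ins (a real workflow would move data with SRB, GridFTP, ssh, or similar). It shows a stage-execute-fetch pipeline applied to a list of jobs with `map`, and the results combined with a right fold.

```python
from functools import reduce

# Simulated remote site (assumption: stands in for SRB/GridFTP/ssh storage).
REMOTE_STORE = {}


def stage(job):
    """Stage: copy the job's input data to the (simulated) remote site."""
    job_id, data = job
    REMOTE_STORE[f"{job_id}/in"] = list(data)
    return job_id


def execute(job_id):
    """Execute: run a stand-in computation next to the staged data."""
    REMOTE_STORE[f"{job_id}/out"] = sum(REMOTE_STORE[f"{job_id}/in"])
    return job_id


def fetch(job_id):
    """Fetch: bring the result back into the workflow."""
    return REMOTE_STORE[f"{job_id}/out"]


def pipeline(*steps):
    """Connect steps A, B, C so each consumes the previous step's output."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def foldr(f, initial, xs):
    """Right fold: foldr(f, z, [x1, x2]) == f(x1, f(x2, z))."""
    return reduce(lambda acc, x: f(x, acc), reversed(list(xs)), initial)


stage_execute_fetch = pipeline(stage, execute, fetch)

if __name__ == "__main__":
    jobs = [("job1", [1, 2, 3]), ("job2", [10, 20])]
    results = list(map(stage_execute_fetch, jobs))  # one result per job
    total = foldr(lambda x, acc: x + acc, 0, results)
    print(results, total)
```

Treating the three-step pattern as a single composed function is what lets `map` apply it uniformly to every job, which is one way to read the "higher-order functions" bullet.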
Kepler@UC Davis Genome Center: Scientific Workflows to Support the Complete (Wet-lab) Experiment Lifecycle
• Try to capture and (semi-)automate the Experiment Lifecycle:
  • Discover similar experiments, …
  • reuse, customize,
  • execute, monitor,
  • manage results,
  • Register back to an experiment repository
• Support Experiment Design, Execution, & Reuse
• Scientific workflows and semantic extensions (ontologies, metadata++)
Summary: What we could/should do
• Push technology:
  • Distributed Kepler & “detached” execution
  • Making Kepler more X-aware, where …
    • … X=Data plumbing (SRB toolkit, GridTK, others, …)
    • … X=Grid & Scheduling (need a “Grid director”? Condor director?)
    • … X=Parameter-sweep (“Nimrod/APST”… director?)
    • … X=Statistics & other specialized packages (R, parallel-R?, …, Grass, …)
    • … X=Visualization (SCIRun, …)
• Semantic extensions
  • Actors and datasets have “semantic types” to support resource discovery, WF design, …
• Create “Packages” or “Rolls”
  • … targeting certain scientific user groups & communities
• SWF Life-cycle support:
  • Design, execution, monitoring, archival, re-use/re-run
• Design patterns, “Kepler School of Expression”
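To make the parameter-sweep item concrete: such a component would, roughly, run the same job once for every combination of parameter values. The sketch below is a generic Python illustration of that idea only. It is not the Nimrod or APST interface, and `parameter_sweep`, `run_sweep`, `toy_job`, and the parameter names are hypothetical.

```python
from itertools import product


def parameter_sweep(parameters):
    """Yield one dict per point in the Cartesian product of the value lists."""
    names = list(parameters)
    for values in product(*(parameters[name] for name in names)):
        yield dict(zip(names, values))


def run_sweep(job, parameters):
    """Run `job` once per combination; return (point, result) pairs."""
    return [(point, job(**point)) for point in parameter_sweep(parameters)]


if __name__ == "__main__":
    # Hypothetical stand-in for a real job such as one simulation run.
    def toy_job(basis_size, temperature):
        return basis_size * temperature

    sweep = {"basis_size": [1, 2], "temperature": [100, 200]}
    for point, result in run_sweep(toy_job, sweep):
        print(point, result)
```

In a real setting each call to `job` would be a submission to a scheduler rather than a local function call; the enumeration of parameter points is the part this sketch is meant to show.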