Under the Hood of a Workflow Manager
Matthew Shields, BiodiversityWorld GRID workshop, NeSC, 30 June - 1 July
Triana
Outline
• What is Workflow management?
• Why should I care?
• Current State of the Art
• Workflow Languages
• Other Projects
• Triana, Architecture & Services
• Extending Triana for BDWorld
• Conclusion
What is Workflow Management?
• Concept comes from business world
• Many years of research and practice
• Process capture and reuse
• Repeatability, provenance, audit trails & accountability
• Domain expert knowledge capture
• Analysis and optimization
What Can a Workflow Manager do for Me?
• Scientific workflow has a different focus from business workflow
  • Large-scale data collection
  • Querying
  • Analysis
  • Visualization
• Similar goals
  • Component & workflow reuse
  • Knowledge capture
• Additional goals
  • Simplified application/experiment design
  • Environment/complexity abstraction
State of the Art
• Schedule workflow tasks (Grid/distributed environment)
• Monitor/Control execution
• Active visualization and computational steering
• User interaction
• Pause and restart
• Data provenance
• Component and sub-workflow reuse
• Analysis and optimization
Workflow Languages
• No current agreed standard
• Most projects use DAG or Petri-Net
• Data vs control flow
• Dependency vs scripting language
• Many XML schemas
• Business workflow standards - BPEL
  • Not a good enough fit
• GGF WFM-RG
  • Attempting to solicit agreement on standards
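To make the DAG, dependency-based style concrete, here is a minimal sketch in Python. It is not taken from any of the projects in this talk: the task names are hypothetical, and the ordering uses the standard-library `graphlib` module (Python 3.9+) as a stand-in for a real workflow engine's scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "collect": set(),
    "query": {"collect"},
    "analyse": {"query"},
    "visualise": {"analyse"},
    "archive": {"query"},
}


def execution_order(dag):
    """Return one valid order in which the tasks could be scheduled."""
    return list(TopologicalSorter(dag).static_order())


print(execution_order(workflow))
```

A dependency description like this only says what must finish before what; a scripting-style language would instead spell out the order of the steps explicitly.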
Workflow Management Projects
• myGrid/Taverna - Southampton & others
  • XML/DAG based workflow language
  • Initially WS choreography tool - now incorporates local tools/components
  • Grid integration with databases via OGSA Distributed Query Processor
  • myGrid Project main users - Bioinformatics
• Kepler - SDSC
  • Based on Ptolemy - modeling, simulation & design of real time & concurrent systems
  • Concurrent dataflow
  • Actors (components), Directors (workflow engines)
  • Local, Web Service & Grid Service actors
  • Ecology, biology, chemistry, oceanography, and the geosciences
WM Projects 2
• Karajan/Commodity Grid (CoG) Kit, Argonne & Berkeley
  • Scripting workflow language for Grid tasks
  • Integration with Globus Toolkit GT3 & GT4
  • Pure control flow
  • Data flow performed by data tasks - GridFTP
• And many more… See
  • http://www.gridworkflow.org/snips/gridworkflow/
  • http://www.extreme.indiana.edu/swf-survey/
Triana
• Cardiff University! PPARC funded
• Java based Scientific Workflow Tool or PSE
• Originally designed for Signal Processing
• Now domain independent
  • Bioinformatics - obviously!
  • Signal Processing - gravitational wave detection & radio astronomy
  • Design optimisation
  • Data mining
  • Medical imaging
  • Distributed Audio Processing
Triana Components
• Local Java components
• Service-oriented Components
  • Web services as components (WSRF coming soon)
  • Web service workflow
  • Peer 2 Peer services as components
  • Distributed service workflow
• Grid-oriented Components
  • Grid file and job primitives as components
  • Complex Grid workflow
  • Legacy code components via GridMonSteer
• Mix and Match composition
Workflow
• Inherently data flow based
  • Control flow through “messages”
• XML/DCG workflow format
• Internally workflow language independent
  • Migration to standards based language
• Simple Parent/Child relationship between tasks
• Context based implied actions
  • Local file -> local file = file copy
  • Local file -> remote file = file transfer
• Import/Export other workflow formats
  • Pegasus/EGEE read/write DAGMan format
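The context based implied actions above can be sketched as a small lookup on where each end of a connection lives. This is an illustrative Python sketch, not Triana's Java implementation: `FileRef`, `implied_action` and the use of "localhost" to mark a local file are all assumptions made for the example. Only the two cases named on the slide are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRef:
    """Hypothetical file reference; host == "localhost" means a local file."""
    host: str
    path: str

    @property
    def is_local(self):
        return self.host == "localhost"


def implied_action(source, target):
    """Pick the action implied by connecting source to target."""
    if source.is_local and target.is_local:
        return "file copy"
    if source.is_local and not target.is_local:
        return "file transfer"
    # Remote sources are not covered on the slide, so they are left open here.
    raise NotImplementedError("action for a remote source is not specified")


local_in = FileRef("localhost", "/data/in.dat")
local_out = FileRef("localhost", "/data/out.dat")
remote_out = FileRef("grid.example.org", "/scratch/out.dat")  # hypothetical host

print(implied_action(local_in, local_out))
print(implied_action(local_in, remote_out))
```

The point of the design is that the user only draws the connection; the workflow manager works out which operation that connection implies.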
Triana Architecture
• A Graphical Grid Computing Environment or Portal
• GAP Interface - Service Based Computing: deployment, discovery and communication with distributed services, e.g. P2P and (GSI) Web services
  • P2PS (P2PS Discovery, P2PS Pipes)
  • JXTA (JXTA Discovery, JXTA Pipes)
  • Web Services (UDDI, SOAP)
• GAT Interface - Grid Computing: job submission, file services
• [Architecture diagram; remaining labels: Grid services, Condor, Unicore, GridFTP, GRMS, WSRF, Globus, RLS, PBS, .NET, GridLab, SSH, SGE, LDR, Other..]
Triana in a SO World
• Service Discovery: Dynamic? Decentralized?
• Communication
  • Message Format: SOAP?
  • Transport Protocol: TCP? UDP?
• [Diagram: an en_fr BabelFish service (babelfish.altavista.com) reached across the network through GAP; input "hello", output "bonjour"]
GAP Interface
• A simple service based API for
  • Service Deployment
  • Service Discovery
  • Pipe Based Communication
• Static application interface with multiple middleware bindings
  • P2PS
  • JXTA
  • Web services
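The idea of a static application interface over swappable middleware bindings can be sketched as follows. This is a Python illustration only and does not reproduce the real (Java) GAP API: `Binding`, `InMemoryBinding` and `ServiceApi` are hypothetical names, and the in-memory binding stands in for P2PS, JXTA or Web services. The translation service borrows the en_fr "hello" to "bonjour" example from the earlier slide.

```python
from abc import ABC, abstractmethod


class Binding(ABC):
    """Hypothetical middleware binding (stand-in for P2PS, JXTA or Web services)."""

    @abstractmethod
    def deploy(self, name, handler): ...

    @abstractmethod
    def discover(self, name): ...

    @abstractmethod
    def send(self, name, message): ...


class InMemoryBinding(Binding):
    """Toy binding that keeps services in a dict instead of on a network."""

    def __init__(self):
        self._services = {}

    def deploy(self, name, handler):
        self._services[name] = handler

    def discover(self, name):
        return name in self._services

    def send(self, name, message):
        return self._services[name](message)


class ServiceApi:
    """Application-facing interface; it stays the same whichever binding is used."""

    def __init__(self, binding):
        self._binding = binding

    def deploy(self, name, handler):
        self._binding.deploy(name, handler)

    def call(self, name, message):
        if not self._binding.discover(name):
            raise LookupError(f"service {name!r} not found")
        return self._binding.send(name, message)


api = ServiceApi(InMemoryBinding())
api.deploy("en_fr", lambda text: {"hello": "bonjour"}.get(text, text))
result = api.call("en_fr", "hello")
print(result)
```

Swapping `InMemoryBinding` for another `Binding` subclass would change the transport without touching application code, which is the property the slide describes.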
WSPeer
• High Level Interface to Web Services
  • Discovery
  • Invocation
  • Deployment
  • Hosting
• Abstracts from usual Web Service Discovery and Communication Mechanisms (i.e. UDDI and HTTP)
  • P2PS Web Service Discovery?
• Uses Apache AXIS as SOAP Engine
• Extends Capabilities of Apache AXIS
  • Stubless Invocation (including complex types)
  • Non Standard Transports (i.e. P2PS)
[Figure: two panels, "WSPeer – HTTP/UDDI" and "WSPeer – P2PS". Component labels: Application, WSPeer, UDDI, HTTP Server. Operation labels: deploy, publish, locate, invoke, launch server.]
Extending Triana for BDWorld
• BDWorld proxy components talk to Web Services
• Workflow Design Assistant (WfDA)
  • Selection and composition of BDWorld workflows from available services
  • Uses Meta Data Repository (MDR) & Meta Data Agent (MDA)
  • MDR contains mapping from proxies to resources
  • WfDA captures domain knowledge in constraints
  • Constraints used to limit the possible components at each stage of composition
  • Simplifies valid workflow creation
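As a rough illustration of constraints limiting the possible components at each stage, the Python sketch below filters a catalogue by a single assumed constraint: a component may follow only if it consumes what the previous stage produced. The component names, data type labels and the `candidates` function are all hypothetical and are not drawn from the WfDA or the MDR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical component description: one input type, one output type."""
    name: str
    consumes: str
    produces: str


# Hypothetical catalogue, standing in for metadata held in a repository.
CATALOGUE = [
    Component("species_lookup", "species_name", "taxon_id"),
    Component("occurrence_query", "taxon_id", "occurrence_records"),
    Component("climate_model", "occurrence_records", "distribution_map"),
    Component("map_viewer", "distribution_map", "image"),
]


def candidates(catalogue, current_output):
    """Return the components allowed next, given the current stage's output type."""
    return [c for c in catalogue if c.consumes == current_output]


for component in candidates(CATALOGUE, "taxon_id"):
    print(component.name)
```

Offering only the components that pass the constraints is what keeps every workflow the user can build a valid one.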
Conclusion
• A workflow manager should:
  • Simplify scientific experimentation
  • Enable reuse at multiple levels
    • Component
    • Sub-workflow/Compound components
    • Collaboration
  • Abstract component and environment complexities
    • Think of all components as a service that performs a known task
    • Implied/Context based operations - file copy/move
• Put the scientist back in control of the science, not the computing