Building Scientific Workflows with Taverna and BPEL: a Comparative Study in caGrid. Wei Tan (1), Paolo Missier (2), Ravi Madduri (1), Ian Foster (1). foster@mcs.anl.gov, http://www-fp.mcs.anl.gov/~foster/. (1) University of Chicago and Argonne National Laboratory, USA. (2) School of Computer Science, University of Manchester, Manchester, U.K.
Agenda • Introduction to caGrid • Why scientific workflows in caGrid? • BPEL and Taverna comparison • Service discovery • Service composition & workflow execution • Data-driven vs. control-driven modeling • Implicit vs. explicit definition of data • Implicit vs. explicit iteration on data • Workflow result analysis • Conclusion
Introduction: caBIG and caGrid
As of Oct 19, 2008: 122 participants; 105 services (70 data, 35 analytical)
Introduction: caGrid and workflow • [Figure: the scientific workflow lifecycle (discovery, composition, execution, analysis) around a community that generates and reuses workflows, running on caGrid, which provides connectivity, virtualization, and security over instruments, data, and computation resources]
Challenges faced by caGrid users • Locating needed services • Determining their function • Accessing services from a workflow • A GUI for building workflows easily • Executing workflows efficiently • Persisting and visualizing results • Sharing and reusing workflows • [Figure: workflow lifecycle on caGrid: discovery, composition, execution, analysis, and a community that generates and reuses workflows]
Our goals in this paper • Communicate practical experiences based on our work in the caGrid project • Cover the entire scientific workflow lifecycle, from service discovery to service composition, workflow execution, and workflow result analysis • Based on caGrid requirements for workflow language and tooling • Also applicable to other areas in data-intensive and exploratory science?
BPEL and Taverna • Not the only two but they are representative choices • BPEL • XML-based specification for web service based process behavior • Industry standard adopted by IBM, SAP, Oracle, etc. • Has also attracted attention from the scientific community because of its support for SOA paradigm • Taverna • Open-source, from the myGrid consortium in UK • Design and execution of scientific workflows • Plug-in architecture for extension (access more applications, visualize more data types, etc.)
Querying semantic data in cancer research • Identify description logic concepts relating to a particular context, e.g., “caCore” 1. Query all projects related to the context “caCore” 2. Find the UML classes in each project 3. Use the project and UML class information to query the semantic metadata 4. Retrieve the concept code • We adopt this query as a use case to guide our comparison
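The query chain in this use case can be sketched in Python as follows. This is only an illustration of how the steps feed each other: the in-memory dictionaries stand in for the caGrid services, the function `find_semantic_metadata` and all data values (project names, class names, concept codes) are hypothetical placeholders, and none of this reflects the real CaDSR service interface.

```python
# Hypothetical in-memory stand-ins for the services; placeholder data only.
PROJECTS_BY_CONTEXT = {"caCore": ["ProjectA", "ProjectB"]}
CLASSES_BY_PROJECT = {"ProjectA": ["Gene", "Protein"], "ProjectB": ["Taxon"]}
CONCEPT_CODE_BY_CLASS = {
    ("ProjectA", "Gene"): "CODE-1",
    ("ProjectA", "Protein"): "CODE-2",
    ("ProjectB", "Taxon"): "CODE-3",
}


def find_projects(context):
    """Step 1: all projects related to a context."""
    return PROJECTS_BY_CONTEXT.get(context, [])


def find_classes_in_project(project):
    """Step 2: the UML classes of one project."""
    return CLASSES_BY_PROJECT.get(project, [])


def find_semantic_metadata(project, uml_class):
    """Step 3 (hypothetical name): semantic metadata for one class."""
    return {
        "project": project,
        "uml_class": uml_class,
        "concept_code": CONCEPT_CODE_BY_CLASS[(project, uml_class)],
    }


def concept_codes_for_context(context):
    """Chain steps 1 to 4 and collect the concept codes."""
    codes = []
    for project in find_projects(context):
        for uml_class in find_classes_in_project(project):
            metadata = find_semantic_metadata(project, uml_class)
            codes.append(metadata["concept_code"])  # Step 4
    return codes


if __name__ == "__main__":
    print(concept_codes_for_context("caCore"))
```

The nested loops are the part that the two workflow languages express differently, which is why this query serves as the comparison case.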
Support for service discovery • Before building a workflow • Need to find appropriate services to be composed • Service endpoints are not naturally known to users • Exact semantics of those services are not known • Taverna offers • An extensible scavenger interface for arbitrary service discovery according to users' needs (see next slide) • A native semantic discovery facility called Feta: myGrid-ontology-based service annotation and search • BPEL offers • UDDI, which is not widely adopted • Research efforts such as WSMO and OWL-S, which operate more at the specification level • No open-source tool is available that works with a service query component in an integrated way
Solution for caGrid: Metadata-based service query • caGrid service metadata • Types of query • String based • Property based • Semantic based • caGrid scavenger: query the CaDSR Service in the use case 1. Semantic/metadata based service discovery. 2. Build a workflow using the services obtained by discovery. 3. Execute the workflow and view the results.
Service composition & workflow execution • Data-driven vs. control-driven modeling • Implicit vs. explicit definition of data • Implicit vs. explicit iteration on data
Data-driven vs. control-driven modeling • [Table: comparison of BPEL and Taverna (Scufl) with respect to control flow and data flow]
Implicit vs. explicit definition of data • Taverna • Processors have input/output ports with an associated data type • Data travels from the output port of a processor to the input port of one or more downstream processors • Interaction among processors is defined entirely by the arcs in the dataflow graph • BPEL • Requires the explicit definition of variables, and explicit initialization for complex types • Data are shared among activities (i.e., are global) • More complexity, but more power and flexibility in data handling
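The contrast between the two data-handling styles can be sketched in Python. This is an analogy under simplifying assumptions, not Scufl or BPEL syntax: `run_dataflow` mimics a linear chain of processors connected by arcs, `run_with_variables` mimics activities that read and write declared shared variables, and the processors `to_upper` and `add_suffix` are hypothetical.

```python
from functools import reduce


def to_upper(text):
    """Hypothetical processor: upper-case its input."""
    return text.upper()


def add_suffix(text):
    """Hypothetical processor: append a marker to its input."""
    return text + "!"


def run_dataflow(chain, value):
    """Dataflow style: data moves from one processor's output to the
    next processor's input; no intermediate variable is ever named."""
    return reduce(lambda data, processor: processor(data), chain, value)


def run_with_variables(value):
    """Variable style: every piece of data is declared up front,
    initialized explicitly, and shared by all activities."""
    variables = {"request": None, "intermediate": None, "response": None}
    variables["request"] = value  # explicit initialization
    variables["intermediate"] = to_upper(variables["request"])
    variables["response"] = add_suffix(variables["intermediate"])
    return variables["response"]


if __name__ == "__main__":
    print(run_dataflow([to_upper, add_suffix], "cacore"))
    print(run_with_variables("cacore"))
```

Both functions compute the same result; the second is longer, but any activity can reach any variable, which is the extra flexibility the slide attributes to BPEL.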
Implicit vs. explicit iteration on data • Implicit iteration in Taverna • Occurs when an input port that expects a single element receives a list • E.g., a processor that outputs a “list of strings” can legally be connected to a processor with an input port of type “string” • Taverna interprets this type mismatch as an indication that the destination processor must be invoked repeatedly, once for each element of the input list • This behavior is defined by Taverna's functional programming model • Explicit iteration in BPEL • BPEL does not allow type mismatches, so iteration must be defined explicitly • Again, BPEL offers more flexibility to define more advanced iteration patterns (with more complexity in the model, though)
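The idea of implicit iteration can be sketched in Python with a small wrapper. This is a simplified analogy, not Taverna's implementation: the name `implicit_iteration` is hypothetical, the check is done on the runtime value rather than on declared port types, and the sketch does not model how several list inputs to the same processor are combined.

```python
def implicit_iteration(processor):
    """Wrap a single-item processor so that a list input triggers one
    invocation per element, recursing into nested lists."""

    def invoke(value):
        if isinstance(value, list):
            # Depth mismatch: apply the processor element by element.
            return [invoke(item) for item in value]
        return processor(value)

    return invoke


if __name__ == "__main__":
    shout = implicit_iteration(str.upper)
    print(shout("gene"))
    print(shout(["gene", "protein"]))
```

The caller never writes a loop; connecting a list-producing step to `shout` is enough, which is the behavior the slide describes for mismatched ports.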
Implicit vs. explicit iteration in CaDSR • findProjects returns an array, Project[] • findClassesInProject receives a single Project and finds all UML classes in that project • In Taverna, an xmlsplitter extracts the project array and feeds it directly into findClassesInProject • In BPEL, a ForEach construct is needed to iterate over the Project[] array
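The two ways of connecting findProjects to findClassesInProject can be sketched in Python. The service stubs, the `Project` fields, and all data values below are hypothetical placeholders; the counter-based loop is only an analogy for a ForEach construct and the `map` call only an analogy for Taverna's implicit iteration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Project:
    short_name: str  # hypothetical field, for illustration only


# Placeholder data standing in for the CaDSR service.
_CLASSES = {"ProjectA": ["Gene", "Protein"], "ProjectB": ["Taxon"]}


def find_projects(context: str) -> List[Project]:
    """Stub for findProjects: returns an array of projects."""
    if context == "caCore":
        return [Project("ProjectA"), Project("ProjectB")]
    return []


def find_classes_in_project(project: Project) -> List[str]:
    """Stub for findClassesInProject: accepts exactly one project."""
    return _CLASSES.get(project.short_name, [])


def classes_explicit(context: str) -> List[List[str]]:
    """Explicit style: declare the loop, the counter, and the result list."""
    projects = find_projects(context)
    results = []
    for index in range(len(projects)):
        results.append(find_classes_in_project(projects[index]))
    return results


def classes_implicit(context: str) -> List[List[str]]:
    """Implicit style: hand the whole array to the single-item service
    and let the framework (here, map) invoke it per element."""
    return list(map(find_classes_in_project, find_projects(context)))


if __name__ == "__main__":
    print(classes_explicit("caCore"))
    print(classes_implicit("caCore"))
```

Both variants return one list of classes per project; only the amount of iteration machinery the workflow author must write differs.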
Workflow result analysis • Workflows provide a natural framework for data tracking and analysis • In both Taverna and BPEL • Taverna offers native provenance support • More precise linkage annotation between services' inputs and outputs • Semantic support • Not the focus of our project; see refs. [16], [17] for more details
Conclusion: Taverna offers lifecycle support • Provides a compact set of primitives that eases the modeling of data flows • Allows users to specify “what to do” instead of “how to do it” • Discovery • Scavenger: for customized service discovery • Feta: service annotation and discovery • Composition • Scufl: compact modeling of data flow • Built-in processors: Soaplab, BioMart, etc. • Customized processors as plug-ins • Execution • Implicit iteration: handles parallel execution • Analysis • Result persistence and visualization • Community • A community for sharing workflows
Conclusion: BPEL offers unique features • Build time • A comprehensive set of primitives to model processes of all flavors • Control-flow oriented • Data-flow oriented (although a little verbose) • Event driven, etc. • Full featured • Process logic, data manipulation, event and message processing, fault handling, etc. • Run time • BPEL engines typically run inside application servers with • Persistent state storage • Reliability and scalability guarantees • Important for long-running and computation-intensive workflows • For now, the Taverna engine does not provide these capabilities
Conclusion • Factors in deciding which language/tool to choose • User IT expertise • Some prefer a scripting language, others a friendly GUI • Problem size • Taverna often runs on a desktop and handles problems of moderate size (currently common in bioinformatics) • Grid/server-based systems like Swift can deal with huge volumes of data and intensive computation (for example, applications in medical informatics, neuroscience, physics) • Applications involved • Web services, batch jobs, shell scripts, etc. • Future work • Enrich the caGrid workflow tool set based on Taverna • Build more real workflows to help scientific investigation • Address issues of scale as they arise