Workflow design and implementation issues in the VL-e project. P.Adriaans A Belloum. Outline. Background The Workflow design problem Virtual Laboratory for e-Science Our approach Challenges and research lines Activities. Workflow Design: The problem. Solution 1: Incremental clustering.
Workflow design and implementation issues in the VL-e project
P. Adriaans, A. Belloum
Outline
• Background
• The workflow design problem
• Virtual Laboratory for e-Science
• Our approach
• Challenges and research lines
• Activities
[Figure: two panels labelled "1700 comparisons" and "3500 comparisons"]
The workflow design problem
• A workflow is an inherent part of the problem-solving heuristics
• Induction of optimal workflows is an important research issue
• Manipulating workflows is an important aspect of e-Science
The KDD process
• Data selection
• Cleaning: domain consistency, de-duplication, disambiguation
• Enrichment
• Coding
• Data mining: clustering, segmentation, prediction
• Reporting & application
[Diagram labels: information requirements; action; external data; feedback]
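The stages listed on this slide can be read as a chain of functions, each consuming the output of the previous one. The sketch below is only an illustration under assumptions: the record fields (`name`, `age`), the age threshold, and all function names are hypothetical, the enrichment stage is omitted, and the "mining" step is a simple per-segment count standing in for a real clustering or prediction algorithm.

```python
from collections import Counter


def select(records, required_fields):
    """Data selection: keep only records that carry every required field."""
    return [r for r in records if all(f in r for f in required_fields)]


def clean(records):
    """Cleaning: normalise names and drop duplicates (de-duplication)."""
    seen = set()
    cleaned = []
    for r in records:
        key = (r["name"].strip().lower(), r["age"])
        if key not in seen:
            seen.add(key)
            cleaned.append({"name": r["name"].strip().title(), "age": r["age"]})
    return cleaned


def code(records):
    """Coding: map raw ages onto coarse categories (hypothetical threshold)."""
    return [
        dict(r, age_group="under_40" if r["age"] < 40 else "40_plus")
        for r in records
    ]


def mine(records):
    """Data mining stand-in: count records per segment."""
    return Counter(r["age_group"] for r in records)


def kdd(records):
    """Chain the stages: selection, cleaning, coding, mining."""
    return mine(code(clean(select(records, ["name", "age"]))))


RAW = [
    {"name": "ada ", "age": 36},
    {"name": "Ada", "age": 36},  # duplicate once the name is normalised
    {"name": "Bob", "age": 52},
    {"name": "Eve"},             # no age, dropped at selection
]

print(kdd(RAW))
```

Keeping each stage as a separate function is what makes the chain a workflow: stages can be reordered, replaced, or rerun after feedback without touching the others.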
Adaptive Information Disclosure
[Diagram labels, query handling: formulate query; fire query; search; construct answer; display results]
User support: alternatives, disambiguation, query expansion, filtering, relevance score, link to concept tree
[Other diagram labels: data selection; preprocessing; named entity recognition; relation recognition; advanced constraint recognition; validation; version management; ontology; domain selection; ontology learning; information retrieval]
[Figure: application versus ICT overhead]
• Traditional position of ICT in science:
• Application running on a single machine…
• Little ICT overhead, no collaboration or sharing of data and information
• Technological developments such as Web & Grid and Service Oriented Architecture allow sharing of data and information, enabling scientific applications to run experiments that were not possible before…
• Larger ICT overhead
• e-Science is based on Web & Grid and other application-supporting ICT…
• Infrastructure will be helpful!
[Figure: application versus ICT overhead]
• Typical e-Science applications require more than a single resource, as well as sharing of resources
• Moreover, resources (computing, storage, networks) are often geographically distributed across different security domains. Building such a system:
  • introduces a large ICT overhead
  • requires extensive ICT knowledge
• The application scientist is forced to focus on ICT problems rather than science
• Recent Web & Grid based e-Science frameworks such as VL-e provide basic services that help hide the computing resources, boosting the development of data- and compute-intensive e-Science on a large-scale distributed infrastructure
• The application scientist can focus on their own science rather than on ICT problems
[Diagram: VL-e layered architecture]
• Applications: Medical Application, Telescience, Bio ASP (labels: application-specific service, application potential, application feedback)
• Generic services & Virtual Lab services: Virtual Lab rapid prototyping (interactive simulation), Virtual Laboratory
• Additional Grid Services (OGSA services)
• Grid Middleware: Grid & Network Services
• SURFnet: Network Service (lambda networking)
• Environments: VL-e Experimental Environment, VL-e Proof of Concept Environment (labels: stable application & VL-e component, unstable application & VL-e component)
• VL-e Certification Environment: a set of tests that have to be passed before any application software or VL-e component can be deployed on the VL-e Proof of Concept environment
Mission
Effectively reuse existing workflow management systems, and provide a generic e-Science framework for different application domains.
A generic framework can:
• Improve the reuse of workflow components and workflows across experiments
• Reduce the cost of learning different systems
• Allow users to work in a consistent environment when the underlying infrastructure changes
Two phase approach
• Recommend suitable workflow systems for different application domains:
  • Analyze typical application use cases
  • Define small projects with different application domains
  • Review existing workflow systems
  • Recommend four workflow systems: Triana, Taverna, Kepler, and VLAMG
• Long term:
  • Extend VLAMG and develop our own generic workflow framework
Recommendation report: scientific workflow management in PoC R1, VL-e internal report, Oct 17, 2005.
Lessons learned from phase 1
• In the scientific community there are two types of workflow users: end-users and application developers
• The two categories of users have completely different requirements: easy to use, easy to develop new applications with, and easy to migrate legacy applications to
• How to introduce a new WMS to a domain scientist?
  • Because it has a well-defined architecture?
  • Or because it allows them to keep their current working style?
• How to reuse existing work?
  • Support multiple WMSs or add more options to one WMS?
• How to efficiently include the user in the computing loop?
Z. Zhao et al., “Scientific workflow management: between generality and applicability”, QSIC 2005, Australia
Computer support for problem solving
[Diagram labels: distributed data sharing & dissemination; distributed resources; distributed parallel computing; visualization; remote resource invocation]
• Problem Solving Environment (E. Gallopoulos et al., IEEE CS Eng. 1994):
  • Organizes different software components/tools
  • Allows a user to assemble these tools at a high level of abstraction
  • Controls the runtime behavior of experiments
  • Examples: MATLAB, Ptolemy, etc.
• Scientific Workflow Management: organize and execute on grid-enabled resources!
• Traditional PSE: organize and execute resources locally!
Diversity in SWMS
• Taverna:
  • Web services based language: Scufl
  • Engine: FreeFluo
  • Graphical visualization of the workflow
• Triana:
  • Components
  • Task graph
  • Data/control flow
• Kepler:
  • Actor, director
  • MoML
  • Execution models
• Pegasus:
  • Based on DAGMan
  • VDL
  • DAG …
• DAGMan:
  • Computing tasks
  • DAG
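Several of the systems above describe a workflow as a DAG of computing tasks. As a rough illustration of that model only (not of DAGMan's or Pegasus's actual input formats or schedulers), the sketch below orders a hypothetical task graph with Python's standard `graphlib.TopologicalSorter` (Python 3.9+) so that each task runs after everything it depends on. The task names and the `run_dag` helper are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "fetch": set(),
    "clean": {"fetch"},
    "analyse": {"clean"},
    "plot": {"analyse"},
    "report": {"analyse", "plot"},
}


def run_dag(dependencies, actions):
    """Run every task after all of its dependencies; return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for task in order:
        actions[task]()
    return order


# Each action only records its own name, standing in for a real computing task.
log = []
ACTIONS = {name: (lambda n=name: log.append(n)) for name in DEPENDENCIES}

order = run_dag(DEPENDENCIES, ACTIONS)
print(order)
```

A data-flow system such as Triana or Kepler adds typed connections and execution models on top of this ordering; the dependency graph is the part the systems have in common.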
A workflow bus paradigm
[Diagram: workflow bus]
Z. Zhao et al., “Workflow bus for e-Science”, to appear, IEEE e-Science 2006, Amsterdam
ws-VLAM Engine: architecture
[Diagram labels: client; ws-RTSM Client; Delegate; service host(s) and compute element(s); GT4 Java Container; ws-RTSM Factory; ws-RTSM Instance; Delegation service; GRAM services; job functions; pre-ws-GRAM; GRAM; worker nodes; workflow components]
Ongoing work
• Objective:
  • Invoke the ws-VLAM RTSM GT4 service from the Kepler/Taverna environment to execute a predefined Application workflow
• ws-VLAM Application workflow:
  • Scientific experiments composed of software components that need to be executed on Grid-enabled resources (CPU intensive)
• A potential VLAM Application workflow can be described as:
  • a pipeline of processes exchanging streams of data
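The "pipeline of processes exchanging streams of data" can be pictured with chained generators, where each stage pulls items from the previous one as they become available. This is a single-process sketch under assumptions: the stage names, the scaling factor, and the threshold are hypothetical, and nothing here uses the ws-VLAM or RTSM interfaces, which run the components on Grid resources.

```python
def source(values):
    """Producer: emit raw samples one at a time."""
    for v in values:
        yield v


def scale(stream, factor):
    """Transformer: multiply every item in the stream by a factor."""
    for v in stream:
        yield v * factor


def threshold(stream, limit):
    """Filter: pass on only the items that reach the limit."""
    for v in stream:
        if v >= limit:
            yield v


def run_pipeline(values):
    """Connect the stages; items flow through one by one, not in batches."""
    return list(threshold(scale(source(values), 10), 25))


print(run_pipeline([1, 2, 3, 4]))  # prints [30, 40]
```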
Execute the ws-VLAM workflow in Kepler/Taverna
• A predefined Application workflow developed in VLAM can be executed as a single step in Kepler/Taverna (no need to recompose the whole workflow graphically)
• The predefined Application workflow will be executed on any remote computing resource where the VLAM-RTSM GT4 Web service is installed
• Advantages:
  • Compose workflows in which the sub-workflows that require grid resources are executed on grid-enabled resources, while the rest of the workflow is executed using other Kepler actors or Taverna processors
  • It is also more efficient, since it avoids the overhead that would result from wrapping every workflow component as a separate web service or a separate remote grid execution
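The idea of running a predefined workflow as a single step of a larger one can be sketched as plain composition: if a workflow is itself callable like a step, it can be nested inside another workflow. The `Workflow` class and the step functions below are hypothetical and run locally; in the setting described on this slide the nested part would be dispatched to a remote grid service instead.

```python
from typing import Callable, List

Step = Callable[[dict], dict]


class Workflow:
    """A linear list of steps; callable, so it can itself be used as a step."""

    def __init__(self, name: str, steps: List[Step]):
        self.name = name
        self.steps = steps

    def __call__(self, data: dict) -> dict:
        for step in self.steps:
            data = step(data)
        return data


def load(data: dict) -> dict:
    return {**data, "values": [3, 1, 2]}


def sort_values(data: dict) -> dict:
    return {**data, "values": sorted(data["values"])}


def double(data: dict) -> dict:
    return {**data, "values": [v * 2 for v in data["values"]]}


def summarise(data: dict) -> dict:
    return {**data, "total": sum(data["values"])}


# The predefined sub-workflow appears as ONE step of the parent workflow.
grid_part = Workflow("predefined-subworkflow", [sort_values, double])
parent = Workflow("parent", [load, grid_part, summarise])

result = parent({})
print(result)
```

The parent sees three steps, not four: the inner components are never wrapped or invoked individually, which is the efficiency argument made above.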
Execute the ws-VLAM workflow in Kepler/Taverna
[Diagram labels: Kepler/Taverna workbench; VLAM actor or Taverna processor (to be developed); RTSM client; workflow description (XML); (1) proxy; (2) service invocation; RTSM-GT4 Web service (available on DAS2); DAS2 or PoC facilities; GT4 Java Container; ws-RTSM Factory; ws-RTSM Instance; Delegate; Delegation service; GRAM services; pre-ws-GRAM; worker nodes; workflow components]
• Kepler/Taverna users can access some of the parameters of the Application workflow to change the default values
• Kepler/Taverna users have to specify the location of the input data file as a URL, and will get back a URL if the workflow generates data files
• Graphical output of the Application workflow is handled automatically by the VLAM Taverna processor/Kepler actor
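The user-facing contract described in these bullets (default parameters that may be overridden, input given as a URL) can be sketched as a small request builder. Everything named here is an assumption for illustration: the parameter names in `DEFAULTS`, the `build_request` function, and the shape of the returned dictionary are hypothetical and do not reflect the actual VLAM actor or RTSM service interface.

```python
from urllib.parse import urlparse

# Hypothetical default parameters exposed by a predefined workflow.
DEFAULTS = {"iterations": 100, "tolerance": 1e-3}


def build_request(input_url, overrides=None):
    """Merge user overrides into the defaults and validate the input URL."""
    parsed = urlparse(input_url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"input must be given as a URL, got {input_url!r}")

    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"not an exposed parameter: {sorted(unknown)}")

    # Later keys win, so user-supplied values replace the defaults.
    return {"input": input_url, "parameters": {**DEFAULTS, **overrides}}


request = build_request("http://example.org/data/in.dat", {"iterations": 5})
print(request)
```

Rejecting unknown parameter names keeps the set of exposed parameters explicit, matching the statement that users can change only some of the workflow's parameters.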
Research scope and lines
• Focus 1: Interoperability and integration between workflow systems
• Focus 2: Composition of meta workflows
• Focus 3: Provenance at meta workflows
• Focus 4: Enactment and orchestration of meta workflows
• Focus 5: Human-in-the-loop computing in meta workflows
Z. Zhao, A. Belloum, M. Bubak: A research plan of VL-e SP2.5, V0.2, September 9, 2006