
A Tale of Two Workflows
Roger Barga, Microsoft Research (MSR); Nelson Araujo, Dean Guo, Jared Jackson, Microsoft Research



Presentation Transcript


  1. A Tale of Two Workflows
     Roger Barga, Microsoft Research (MSR)
     Nelson Araujo, Dean Guo, Jared Jackson, Microsoft Research
     The creative input of the Trident MSR summer '08 interns

  2. MSR (Trident) Summer '08 Interns
     • Eran Chinthaka, Indiana University
     • David Koop, University of Utah
     • Satya Sahoo, Wright State University
     • Matt Valerio, Ohio State University

  3. Trident Project Objectives
     Demonstrate that a commercial workflow management system can be used to implement scientific workflows
     Offer this system as an open source accelerator
     • Write once, deploy and run anywhere...
     • Abstract parallelism (HPC and many core);
     • Automatic provenance capture, for both workflow and results;
     • Costing model for estimating resources required;
     • Integrated data storage and access, in particular cloud computing;
     • Reproducible research;
     Develop this in the context of real eScience applications
     • Make sure we solve a real problem for actual project(s).
     And this is where things got really interesting...

  4. Scientific Workflow for Oceanography

  5. Workflow and the Neptune Array
     Workflow is a bridge between the underwater sensor array (instrument) and the end users
     Mandate
     • Make data available to researchers in (near-) real time
     • Store data for long-term time-series studies
     Features
     • Allow human interaction with instruments;
     • Deployed instruments will change regularly, as will the analysis;
     • Facilitate automated, routine "survey campaigns";
     • Support automated event detection and reaction;
     • Users able to access through the web (or custom client software);
     • Best effort for most workflows is acceptable;

  6. Pan-STARRS Sky Survey
     Slide compliments of Yogesh Simmhan
     • One of the largest visible light telescopes
     • 4 unit telescopes acting as one
     • 1 Gigapixel per telescope
     • Surveys entire visible universe in 1 week
     • Catalog solar system, moving objects/asteroids
     • ps1sc.org: UHawaii, Johns Hopkins, …
     Haleakala Observatory, Maui, Hawaii!!

  7. Pan-STARRS Highlights
     Slide compliments of Yogesh Simmhan
     • 30TB of processed data/year
     • ~1PB of raw data
     • 5 billion objects; 100 million detections/week
     • Updated every week
     • SQL Server 2008 for storing detections
     • Distributed view over spatially partitioned databases
     • Replicated for fault tolerance
     • Windows 2008 HPC Cluster
     • Schedules workflows, monitors system

  8. Pan-STARRS Data Flow
     Slide compliments of Yogesh Simmhan
     [Diagram labels: IPP producing CSV files; Shared Data Store; Load Merge 1 to 6; L1, L2; Slice 1 to 8 holding partitions S1 to S16 in HOT and WARM copies; Main; Distributed View]

  9. The Pan-STARRS Workflows
     Slide compliments of Yogesh Simmhan
     [Diagram labels: The Pan-STARRS Science Cloud, divided into "Behind the Cloud" and "User facing services"; Data Creators (Telescope, Image Processing Pipeline (IPP), CSV Files); Data Valet Workflows (Load Workflow, Merge Workflow, Flip Workflow, Slice Fault Recover Workflow); Load DB; Cold, Warm and Hot Slice DB 1 and 2; Distributed View; CASJobs Query Service; MyDB; Astronomers (Data Consumers) with Data Consumer Queries & Workflows; Validation Exception Notification; Admin & Load-Merge Machines; Production Machines]
     Data flows in one direction, except for error recovery
     Supporting Provenance for the Scientist & the Data Valet

  10. Pan-STARRS Architecture
      Slide compliments of Yogesh Simmhan
      Workflow is just a member of the orchestra

  11. Workflow and Pan-STARRS
      Workflow carries out the data loading and merging
      Features
      • Support scheduling of workflows for nightly load and merge;
      • Offer only controlled (protected) access to the workflow system;
      • Workflows are tested, hardened and seldom change;
      • Not a unit of reuse or knowledge sharing;
      • Fault tolerance – ensure recovery and cleanup from faults;
      • Assign cleanup workflows to undo state changes;
      • Provenance as a record of state changes (system management);
      • Performance monitoring and logging for diagnostics;
      • Must "play well" in a distributed system;
      • Provide ground truth for the state of the system;
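
The fault-tolerance bullets on this slide (recover from faults, assign cleanup workflows to undo state changes, keep provenance as a record of state changes) can be illustrated with a minimal Python sketch. It is not Trident code: `Step`, `RunRecord`, `run_with_cleanup`, and the load/merge example are hypothetical names chosen for illustration, and a plain dictionary stands in for the system state.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One workflow step paired with the cleanup action that undoes it (hypothetical)."""
    name: str
    action: Callable[[dict], None]
    cleanup: Callable[[dict], None]


@dataclass
class RunRecord:
    """Provenance kept as an ordered record of state changes."""
    events: List[str] = field(default_factory=list)


def run_with_cleanup(steps: List[Step], state: dict, record: RunRecord) -> bool:
    """Run steps in order; on a fault, undo the completed steps in reverse order."""
    done: List[Step] = []
    for step in steps:
        try:
            step.action(state)
        except Exception as exc:
            record.events.append(f"fault:{step.name}:{exc}")
            for finished in reversed(done):
                finished.cleanup(state)
                record.events.append(f"cleanup:{finished.name}")
            return False
        done.append(step)
        record.events.append(f"done:{step.name}")
    return True


def load(state: dict) -> None:
    state["loaded"] = True


def unload(state: dict) -> None:
    state.pop("loaded", None)


def merge(state: dict) -> None:
    raise RuntimeError("merge failed")  # simulated fault


state: dict = {}
record = RunRecord()
ok = run_with_cleanup(
    [Step("load", load, unload), Step("merge", merge, lambda s: None)],
    state,
    record,
)
```

In this sketch the failing step's own cleanup is not run; only steps that completed are undone. A real system would also need to handle a step that changed state before faulting.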

  12. Other Partner Applications

  13. Differing and Lurking Requirements
      • I want to do this more than once and get exactly the same answer.
      • I want to do this more than once, but don't care if I get exactly the same answer.
      • I'm only going to do this once and don't care about keeping the data or the results long term (but I need to remember the inputs);
      • I want to store the data in <local file, SQL Server, in the cloud, etc>
      • I want full provenance to validate a result, OPM compliant;
      • I want to use my own provenance management system;
      • Each group may wish a different UI (no WF), or authoring tool
      • I want any data from any agency or investigator even if the measurement sites aren't quite co-located; I'll deal with it later.
      • I only want NCAR, MBARI, etc. data because I trust it.
      • I know that Jon really wants my results to drive his model and I want to share my workflow and executables.
      Each of these potentially impacts the technology, user interface, and API design

  14. Why pay the price to architect?
      Divide and conquer
      • You can see all of the application components;
      • Different components share interfaces;
      • Different components developed by different people work together, even if someone else implements them;
      Go from working to working
      • Change one component, the rest keep working;
      • Scale up or down over time;
      • Testing components independently is possible;
      Full design, incremental implementation
      • Build what you need as you go;
      • Integrate new data sources, data types, and analysis tools by leveraging the stable interfaces. Plug and play…

  15. Why not architect?
      • It's hard
        • You have to accumulate user scenarios, map them to the technical components, and then understand the implications.
        • What are the dimensions of change/flexibility?

  16. Why not architect?
      • It's hard
        • You have to accumulate user scenarios, map them to the technical components, and then understand the implications.
        • What are the dimensions of change/flexibility?
      • It doesn't feel like you're making progress
        • You spend a lot of time discovering what you already know. User scenarios often contain many of the same technical requirements again and again.
      • It's not fun
        • You have to keep your interfaces stable longer (because you have dependencies on them), so that great idea has to wait for the next release
        • The design discussions can be rather "energetic"
      • It takes a team commitment

  17. How we decided to architect
      • Drive workflow development with 20 queries (workflows)
        • representative of the science
        • diverse enough to drive the design

  18. 20 Workflows for Neptune

  19. 20 Workflows for Neptune

  20. How we decided to architect
      • Drive workflow development with 20 queries (workflows)
        • representative of the science
        • diverse enough to drive the design
      • Introduce a registry as single ground truth for all state and objects.

  21. Trident Registry: Registry Management

  22. Trident Registry: Registry Management
      • Provides ground truth state for Trident
      • Captures provenance for workflows
      • Records information on running jobs
      • Metadata for all objects in Trident

  23. How we decided to architect
      • Drive workflow development with 20 queries (workflows)
        • representative of the science
        • diverse enough to drive the design
      • Introduce a registry as single ground truth for all state and objects.
      • Introduce an event blackboard for service communication;

  24. Trident Blackboard Overview
      Matt Valerio, Satya Sahoo, Jared Jackson
      [Diagram labels: Shared Ontology; Logging (Tracking, Design); Monitoring (Tracking, Resource Usage, User-Defined, …); Provenance (Tracking, Design, Data); Other publishers; Blackboard with Publisher Store and Subscription Store; BlackboardMessage (concept-value pairs); Subscription Profile (concepts)]

  25. Workflow Tracking
      • Workflow Events: Aborted, Changed, Completed, Created, Idle, Loaded, Persisted, Resumed, Started, Suspended, Terminated, Unloaded
      • Activity Events: Cancelling, Closed, Compensating, Executing, Faulting, Initialized
      • User Events: User-defined
      • Tracking Data: Instance ID, Activity Type, Activity Name, Timestamp, …
      [Diagram: tracking data is mapped through the ontology to concept-value pairs (concept1 value1 through concept4 value4), filtered against the aggregate subscription profile (concept1, concept3), and sent to the Blackboard as a BlackboardMessage (concept1 value1, concept3 value3)]
      • Why filter at the publisher?
        • Minimize network usage
        • Optimize performance (more messages/sec)
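
The publisher-side filtering on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the Trident implementation: the `Blackboard` class, its methods, and the `publish` function are hypothetical, and the aggregate subscription profile is assumed to be the union of the concepts requested by all subscribers. The concept and value names mirror the slide's diagram.

```python
from typing import Dict, List, Set


class Blackboard:
    """Hypothetical in-memory blackboard holding subscriptions and messages."""

    def __init__(self) -> None:
        self.subscriptions: Dict[str, Set[str]] = {}
        self.messages: List[Dict[str, str]] = []

    def subscribe(self, subscriber: str, concepts: Set[str]) -> None:
        self.subscriptions[subscriber] = set(concepts)

    def aggregate_profile(self) -> Set[str]:
        """Union of every concept that at least one subscriber asked for."""
        profile: Set[str] = set()
        for concepts in self.subscriptions.values():
            profile |= concepts
        return profile

    def post(self, message: Dict[str, str]) -> None:
        self.messages.append(message)


def publish(board: Blackboard, pairs: Dict[str, str]) -> bool:
    """Filter at the publisher: send only subscribed concepts, or nothing at all."""
    profile = board.aggregate_profile()
    message = {concept: value for concept, value in pairs.items() if concept in profile}
    if not message:
        return False  # nothing of interest, so no network traffic
    board.post(message)
    return True


board = Blackboard()
board.subscribe("provenance", {"concept1"})
board.subscribe("monitoring", {"concept3"})
sent = publish(
    board,
    {"concept1": "value1", "concept2": "value2",
     "concept3": "value3", "concept4": "value4"},
)
```

Only the `concept1` and `concept3` pairs reach the blackboard here, which is the effect the slide attributes to filtering at the publisher: fewer and smaller messages on the wire.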

  26. Workflow Monitoring
      • Goals
        • Real-time resource usage graphs (e.g. Silverlight)
        • Subscriber-initiated
        • Activity-initiated
        • Creation of cost models for each type of activity
      • Implementation
        • Subscribers listen for a specific resource concept
        • The monitoring service polls a resource monitor at regular intervals
        • The results are sent to the blackboard
      [Chart: CPU Monitor, 0% to 100% over time, for SequenceActivity1, CpuIntensiveActivity1, MemoryIntensiveActivity1, CpuIntensiveActivity2]
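
The implementation bullets (poll a resource monitor at regular intervals, send the results to the blackboard) could look roughly like the Python sketch below. Everything here is an assumption made for illustration: `poll_resource`, its parameters, and the `cpu_percent` concept name are hypothetical, and the resource monitor and blackboard are passed in as plain callables so the loop can be exercised without real hardware counters.

```python
import time
from typing import Callable, Dict, List


def poll_resource(
    read_sample: Callable[[], float],
    post: Callable[[Dict[str, object]], None],
    concept: str,
    interval_s: float,
    samples: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll a resource monitor at a fixed interval and post each reading."""
    for index in range(samples):
        post({"concept": concept, "value": read_sample(), "sample": index})
        if index < samples - 1:
            sleep(interval_s)  # wait between polls, but not after the last one


# Stand-ins for a real CPU monitor and a real blackboard (hypothetical).
readings = iter([12.5, 97.0, 40.0])
posted: List[Dict[str, object]] = []
poll_resource(
    read_sample=lambda: next(readings),
    post=posted.append,
    concept="cpu_percent",
    interval_s=1.0,
    samples=3,
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
```

Subscribers that listen for the `cpu_percent` concept would then receive one message per poll, which is enough to draw a usage graph or to fit a per-activity cost model.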

  27. Illustration of Monitoring in Action

  28. How we decided to architect
      • Drive workflow development with 20 queries (workflows)
        • representative of the science
        • diverse enough to drive the design
      • Introduce a registry as single ground truth for all state and objects.
      • Introduce an event blackboard for service communication;
      • Choose specific interfaces between components and stick to them
        • APIs, object models, browser user screens and forms
        • Everything can be replaced and/or augmented

  29. Trident Registry Provider API
      Eran Chinthaka and Nelson Araujo
      [Diagram labels: Native API, Managed API, Web Services]

  30. Trident Registry Provider API
      Eran Chinthaka and Nelson Araujo

  31. How we decided to architect
      • Drive workflow development with 20 queries (workflows)
        • representative of the science
        • diverse enough to drive the design
      • Introduce a registry as single ground truth for all state and objects.
      • Introduce an event blackboard for service communication;
      • Choose specific interfaces between components and stick to them
        • APIs, object models, browser user screens and forms
        • Everything can be replaced and/or augmented
      • Separate the user interface to solve specific tasks
        • Separate authoring UI from runtime
        • Separate execution UI from runtime. It's a workflow – what parameters do you want to set? What parts do you want to pause? Do over? Never do again?
        • Some things only work on the desktop; some things work best in the cloud. Enable users to select at runtime.

  32. Trident Interface for Neptune

  33. Trident Interface for Neptune

  34. Workflow Selection
      David Koop, Nelson Araujo
      • Show me the workflows that
        • Process these data sets (sensor types);
        • Produce this kind of result (type of visualization, analysis);
      • Order these workflows by the time they were last used;
      • Now apply this workflow to "this" area of the ocean;
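
The selection query on this slide (filter by data sets and result kind, then order by last use) can be sketched in Python as follows. The `WorkflowEntry` record, the `select_workflows` function, and the catalog contents are hypothetical. Two assumptions are made: a workflow "processes these data sets" when it accepts every requested sensor type, and ordering by last use means most recent first.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class WorkflowEntry:
    """Hypothetical catalog record for one registered workflow."""
    name: str
    inputs: FrozenSet[str]   # sensor types the workflow can process
    result_kind: str         # e.g. a type of visualization or analysis
    last_used: datetime


def select_workflows(
    catalog: Iterable[WorkflowEntry],
    sensor_types: Set[str],
    result_kind: str,
) -> List[WorkflowEntry]:
    """Keep workflows that accept all requested sensor types and produce the
    requested kind of result, ordered by most recent use first (assumption)."""
    matches = [
        entry for entry in catalog
        if sensor_types <= entry.inputs and entry.result_kind == result_kind
    ]
    return sorted(matches, key=lambda entry: entry.last_used, reverse=True)


# Illustrative catalog; names, sensor types, and dates are made up.
catalog = [
    WorkflowEntry("profile_plot", frozenset({"ctd", "adcp"}), "plot", datetime(2008, 7, 1)),
    WorkflowEntry("ctd_only_plot", frozenset({"ctd"}), "plot", datetime(2008, 8, 1)),
    WorkflowEntry("summary_table", frozenset({"ctd", "adcp"}), "table", datetime(2008, 8, 15)),
    WorkflowEntry("event_plot", frozenset({"ctd", "adcp", "seismometer"}), "plot", datetime(2008, 8, 10)),
]
selected = select_workflows(catalog, {"ctd", "adcp"}, "plot")
selected_names = [entry.name for entry in selected]
```

Applying the chosen workflow to a particular area of the ocean, the last bullet on the slide, is left out of this sketch because the slide does not say how regions are represented.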

  35. Trident Logical Architecture

  36. myExperiment

  37. Questions
      Scientific workflows for streamlining the data pipeline
