
Building the Trident Scientific Workflow Workbench for Data Management in the Cloud Roger Barga, MSR


Presentation Transcript


  1. Building the Trident Scientific Workflow Workbench for Data Management in the Cloud Roger Barga, MSR Yogesh Simmhan, Ed Lazowska, Alex Szalay, and Catharine van Ingen

  2. Trident Project Objectives
Demonstrate that a commercial workflow management system can be used to implement scientific workflow. Offer this system as an open source accelerator:
• Write once, deploy and run anywhere;
• Abstract parallelism (HPC and many-core);
• Automatic provenance capture, for both workflow and results;
• Costing model for estimating resources required;
• Integrated data storage and access, in particular cloud computing;
• Reproducible research.
Develop this in the context of real eScience applications:
• Make sure we solve a real problem for actual project(s).
And this is where things started to get interesting...

  3. Research Questions
• Role of workflow in data-intensive eScience
• Explore architectural patterns/best practices
  • Scalability
  • Fault tolerance
  • Provenance
• Reference architecture, to handle data from creation/capture to curated reference data, and serve as a platform for research

  4. Scientific Workflow for Oceanography

  5. Workflow and the Neptune Array
Workflow is a bridge between the underwater sensor array (instrument) and the end users.
Features:
• Allow human interaction with instruments;
• Create 'on demand' visualizations of ocean processes;
• Store data for long-term time-series studies;
• Deployed instruments will change regularly, as will the analysis;
• Facilitate automated, routine "survey campaigns";
• Support automated event detection and reaction;
• Users able to access through the web (or custom client software);
• Best effort for most workflows is acceptable.

  6. Pan-STARRS Sky Survey
• One of the largest visible-light telescopes
• 4 unit telescopes acting as one
• 1 gigapixel per telescope
• Surveys entire visible universe once per week
• Catalogs solar system moving objects/asteroids
• ps1sc.org: UHawaii, Johns Hopkins, …

  7. Pan-STARRS Highlights
• 30 TB of processed data/year; ~1 PB of raw data
• 5 billion objects; 100 million detections/week, updated every week
• SQL Server 2008 for storing detections
  • Distributed over spatially partitioned databases
  • Replicated for fault tolerance
• Windows 2008 HPC Cluster
  • Schedules workflows, monitors the system
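The spatial partitioning of detections across slice databases can be sketched as follows. This is a hypothetical illustration (the real system uses SQL Server partition functions and its own sky-partitioning scheme; the zone count, declination-based mapping, and function names here are all invented for the example):

```python
# Hypothetical sketch of routing detections to spatially partitioned
# slice databases. Assumption: 16 slices, partitioned by declination.
NUM_SLICES = 16

def slice_for_detection(ra_deg: float, dec_deg: float,
                        num_slices: int = NUM_SLICES) -> int:
    """Map a detection's sky position to a slice-database index.

    Partitions the sky into equal-width declination zones; each zone
    maps to one slice DB, which is then replicated for fault tolerance.
    """
    # dec ranges over [-90, 90]; clamp so +90 does not index out of range
    zone = int((dec_deg + 90.0) / 180.0 * num_slices)
    return min(zone, num_slices - 1)

# Route a (tiny, made-up) batch of detections to their slice databases.
batch = [(210.8, -19.5), (83.6, 22.0), (266.4, -29.0)]
routed = {i: [] for i in range(NUM_SLICES)}
for ra, dec in batch:
    routed[slice_for_detection(ra, dec)].append((ra, dec))
```

With this kind of mapping, the weekly detection stream can be bulk-loaded into each slice independently, which is what makes the parallel load-merge stages in the next slides possible.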

  8. Pan-STARRS Data Flow
[Diagram: CSV files from the Image Processing Pipeline (IPP) enter a shared data store and pass through parallel load-merge stages (Load Merge 1–6) into spatially partitioned slice databases (Slice 1–8, each holding sub-slices s1–s16), maintained in HOT and WARM replicas behind a distributed view over the main databases.]

  9. Pan-STARRS Workflows
[Diagram: The Pan-STARRS science cloud. Behind the cloud, Data Valet workflows (Load, Merge, Flip, and Fault Recovery) move CSV files from the telescope's Image Processing Pipeline (IPP) through Load DBs into Cold, Warm, and Hot Slice DBs behind a distributed view, running on admin, load-merge, and production machines with validation-exception notification. User-facing services (the CASJobs query service and MyDB) serve astronomers, the data consumers, via queries and workflows.]
Data flows in one direction, except for error recovery.

  10. Pan-STARRS Architecture Workflow is just a member of the orchestra

  11. Workflow and Pan-STARRS
Workflow carries out the data loading and merging.
Features:
• Support scheduling of workflows for nightly load and merge;
• Offer only controlled (protected) access to the workflow system;
• Workflows are tested, hardened, and seldom change;
  • Not a unit of reuse or knowledge sharing;
• Fault tolerance: ensure recovery and cleanup from faults;
  • Assign clean-up workflows to undo state changes;
• Provenance as a record of state changes (system management);
• Performance monitoring and logging for diagnostics;
• Must "play well" in a distributed system;
• Provide ground truth for the state of the system.
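The "clean-up workflows to undo state changes" feature above is a compensation pattern. A minimal sketch, not Trident's actual API (class and method names are invented): each activity that changes system state registers an undo action, and a fault triggers the clean-up in reverse order.

```python
# Illustrative compensation sketch: activities register undo actions;
# a fault launches a clean-up pass that rolls back state changes.
class Workflow:
    def __init__(self):
        self._undo_stack = []
        self.log = []  # provenance-style record of state changes

    def run_activity(self, name, action, undo):
        self.log.append(("start", name))
        action()
        self._undo_stack.append((name, undo))  # remember how to roll back
        self.log.append(("done", name))

    def execute(self, activities):
        """Run activities in order; on any fault, undo completed ones."""
        try:
            for name, action, undo in activities:
                self.run_activity(name, action, undo)
            return "committed"
        except Exception:
            # Fault detected: run the clean-up workflow in reverse order
            while self._undo_stack:
                name, undo = self._undo_stack.pop()
                undo()
                self.log.append(("undone", name))
            return "rolled-back"
```

Because the log doubles as a record of state changes, it also serves the "provenance for system management" role listed above.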

  12. Other Partner Applications

  13. Data Creation, Ingest to End Use
• Scientific data from sensors and instruments
  • Time series, spatially distributed
• Need to be ingested before use
  • Go from Level 1 to Level 2 data
• Potentially large, continuous stream of data
• A variety of end users (consumers) of this data
Workflows shepherd raw bits from instruments to usable data in databases in the cloud.
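The "Level 1 to Level 2" step above can be sketched as a small ingest function. This is a hedged illustration only: the linear gain/offset calibration and the field names are placeholders, not any instrument's real processing.

```python
# Sketch of Level 1 -> Level 2 promotion: raw instrument samples are
# calibrated and quality-flagged before landing in a queryable store.
def to_level2(level1_readings, gain=2.0, offset=0.5):
    """Promote raw time-series samples to calibrated, flagged records.

    gain/offset stand in for a real instrument calibration.
    """
    level2 = []
    for t, raw in level1_readings:
        if raw is None:
            # Missing sample: keep the gap visible downstream
            level2.append({"t": t, "value": None, "qc": "missing"})
        else:
            level2.append({"t": t, "value": gain * raw + offset, "qc": "ok"})
    return level2
```

In the architectures described here, a workflow would run this kind of step continuously as the instrument stream arrives, writing the Level 2 records into the cloud database.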

  14. Users in an eScience Ecosystem
[Diagram: Producers (data creators) upload new data; Data Valets (curators and publishers) validate each upload, rejecting or fixing bad data and applying data corrections, then publish accepted data as data products; Consumers query and download the published data, with query results flowing back to them.]

  15. Generalized Architecture: GrayWulf
[Diagram: Shared compute resources host both Valet workflows and user workflows. Data Valets and users each have their own interfaces and queryable data stores, alongside a shared queryable data store and user storage, all under configuration management and health/performance monitoring. Arrows distinguish data flow from control flow; operators have a separate interface.]

  16. Pan-STARRS Workflows

  17. PS Load & Merge Workflows
Load workflow:
• Determine affine Slice Cold DB for CSV batch
• Sanity check of network files, manifest, checksum
• Create, register empty LoadDB from template
• For each CSV file in batch:
  • Validate CSV file & table schema
  • BULK LOAD CSV file into table
  • Perform CSV file/table validation
• Perform LoadDB/batch validation
• On fault: detect load fault, launch recovery operations, notify admin
Merge workflow:
• Determine 'merge worthy' Load DBs & Slice Cold DBs
• UNION ALL over Slice & Load DBs into temp; filter on partition bound
• For each partition in Slice Cold DB:
  • Switch OUT slice partition to temp
  • Switch IN temp to slice partition
  • Post-partition load validation
• Slice column recalculations & updates
• Post-slice load validation
• On fault: detect merge fault, launch recovery operations, notify admin
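The load workflow's control flow can be sketched in a few lines. This is an assumption-laden outline, not the production code: the real system drives SQL Server BULK INSERT and validation stored procedures, and the helper functions here are invented stand-ins passed in as parameters.

```python
# Minimal sketch of the per-batch load workflow's control flow.
import hashlib

def checksum_ok(data: bytes, expected_md5: str) -> bool:
    """Sanity-check a network file against its manifest checksum."""
    return hashlib.md5(data).hexdigest() == expected_md5

def load_batch(batch, create_load_db, bulk_load, validate, notify_admin):
    """Create an empty LoadDB from a template, then validate and
    bulk-load each CSV file; on any fault, notify the admin so
    recovery operations can be launched."""
    load_db = create_load_db()            # create, register empty LoadDB
    for csv_file in batch:                # for each CSV file in batch
        try:
            validate(csv_file)            # schema + checksum validation
            bulk_load(load_db, csv_file)  # BULK LOAD CSV file into table
        except Exception as fault:
            notify_admin(csv_file, fault) # detect fault, notify admin
            return None                   # hand off to recovery workflow
    return load_db
```

Keeping the fault path as an explicit exit (rather than retrying in place) matches the slides' design: faults are handed to dedicated recovery workflows rather than handled ad hoc.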

  18. System State Matters…
• Monitor state of the system
  • Data-centric and process-centric views
• What is the load/merge state of each database in the system?
• What are the active workflows in the system?
• Drill down into actions performed:
  • On a particular database to date
  • By a particular workflow

  19. Provenance for the Data Valet
Provenance drivers for Pan-STARRS:
• Need a way to monitor state of the system
  • Databases & workflows
• Need a way to recover from error states
• Database states are modeled as a state-transition diagram
  • Workflows cause transitions from one state to another
Provenance forms an intelligent system log.
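The idea that database states form a state-transition diagram, with workflows causing the transitions and provenance logging each one, can be sketched directly. The states and transition names below are illustrative only, not the actual Pan-STARRS diagrams shown on the next slides:

```python
# Sketch: DB states as a transition diagram; workflows drive the
# transitions and provenance records each one (the "system log").
ALLOWED = {  # (current state, workflow) -> next state; invented names
    ("empty", "load"): "loaded",
    ("loaded", "merge"): "merged",
    ("merged", "flip"): "hot",
}

class SliceDB:
    def __init__(self, name):
        self.name, self.state = name, "empty"
        self.provenance = []  # the intelligent system log

    def apply(self, workflow):
        key = (self.state, workflow)
        if key not in ALLOWED:
            # Invalid transition: the state machine catches bad runs
            raise ValueError(f"{workflow!r} not valid from {self.state!r}")
        old, self.state = self.state, ALLOWED[key]
        self.provenance.append((workflow, old, self.state))
```

Because every transition is logged with its before/after states, recovery workflows can read the provenance to decide which state a database should be returned to after a fault.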

  20. Load DB State Diagram

  21. Slice DB State Diagram

  22. Fault Recovery
• Faults are just another state
• PS aims to support 2 degrees of failure
  • Up to 2 replicas out of 3 can fail and still be recovered
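The "two degrees of failure" property is a simple invariant over the replica set: with 3 replicas per slice, a slice remains recoverable as long as at least one healthy copy survives. A minimal check, with invented names:

```python
# Sketch of the 2-degrees-of-failure invariant for a 3-replica slice.
def recoverable(replica_states, max_failures=2):
    """A slice survives if failures stay within the tolerated degree
    and at least one replica remains healthy to re-copy from."""
    failed = sum(1 for s in replica_states if s == "failed")
    return failed <= max_failures and failed < len(replica_states)
```

Treating "faults are just another state" then means the recovery workflow is simply another transition in the state diagram: from a degraded replica set back to a fully replicated one.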

  23. Fault Recovery
• Provenance logs need to identify type and location of failure
• Verification of fault paths
• Attribution of failure to human error, infrastructure failure, or data error
• Global view of system state during fault

  24. Provenance Data Model
• Fine-grained workflow activities
  • Activity does one task
  • Eases failure recovery
• Capture inputs and outputs from workflow/activity
• Relational/XML model for storing provenance
  • Generic model supports complex .NET types
• Identify stateful data in parameters
  • Build a relational view on the data states
• Domain-specific view
  • Encodes semantic knowledge in view query
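A fine-grained provenance event of the kind described above might look like the following sketch. The field names are invented for illustration; the real model stores relational/XML records and serializes complex .NET parameter types generically (JSON stands in for that here):

```python
# Sketch of capturing one provenance event per activity, with its
# inputs and outputs serialized generically.
import json
import datetime

def record_provenance(store, workflow_id, activity, inputs, outputs):
    """Append one provenance event; parameter values are serialized
    generically (here as JSON, standing in for the XML/relational model)."""
    store.append({
        "workflow": workflow_id,
        "activity": activity,
        "inputs": json.dumps(inputs),
        "outputs": json.dumps(outputs),
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
```

Because each activity does exactly one task, a failure can be localized to a single event, and a domain-specific relational view over these records can reconstruct the data-state history.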

  25. Reliability of Provenance System
• Fault recovery depends on provenance
  • Missing provenance can cause an unstable system upon faults
• Provenance collection is synchronous
• Provenance events published using reliable (durable) messaging
  • Guarantee that the event will eventually be delivered
• Provenance is reliably persisted
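The synchronous-capture rule above can be sketched as: a workflow step does not proceed until its provenance event is durably acknowledged. The in-memory queue below is a stand-in for a real durable messaging service; all names are illustrative.

```python
# Sketch of synchronous provenance capture over a durable queue:
# the event must be persisted (acked) before the step completes,
# so a crash cannot lose the record.
class DurableQueue:
    def __init__(self):
        self._persisted = []

    def publish(self, event):
        # A real implementation writes to stable storage before acking;
        # delivery to consumers is then guaranteed eventually.
        self._persisted.append(event)
        return True  # ack: event durably stored

def run_step(queue, step_name, step):
    """Run one workflow step; block on durable provenance publication."""
    result = step()
    # Synchronous: do not proceed until the provenance event is durable.
    if not queue.publish({"step": step_name, "result": result}):
        raise RuntimeError("provenance not persisted; halting workflow")
    return result
```

The cost of this design is latency on every step; the slides accept that trade because fault recovery depends on the provenance being complete.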

  26. Trident Logical Architecture
[Diagram: The Trident Workbench (visualization, design, workflow packages, runtime) sits over Windows Workflow Foundation and the Trident runtime services: provenance, WinHPC scheduling, fault tolerance, and a monitoring service. An administration console, web portal, community scientific workflows, and a service & activity registry surround the workbench; a workflow monitor, workflow launcher, and archiving complete the runtime. A database-agnostic data object model provides data access to SQL Server, SSDS cloud DB, S3, and others.]

  27. Research Questions
• Role of workflow in data-intensive eScience → Data Valet
• Explore architectural patterns/best practices → scalability, fault tolerance, and provenance implemented through workflow patterns
• Reference architecture, to handle data from creation/capture to curated reference data, and serve as a platform for research → GrayWulf reference architecture
