Building the Trident Scientific Workflow Workbench for Data Management in the Cloud
Roger Barga, MSR
Yogesh Simmhan, Ed Lazowska, Alex Szalay, and Catharine van Ingen
Trident Project Objectives
Demonstrate that a commercial workflow management system can be used to implement scientific workflows.
Offer this system as an open source accelerator:
• Write once, deploy and run anywhere;
• Abstract parallelism (HPC and many core);
• Automatic provenance capture, for both workflows and results;
• Costing model for estimating the resources required;
• Integrated data storage and access, in particular cloud computing;
• Reproducible research.
Develop this in the context of real eScience applications:
• Make sure we solve a real problem for actual project(s).
And this is where things started to get interesting...
Research Questions
• Role of workflow in data intensive eScience
• Explore architectural patterns/best practices:
  • Scalability
  • Fault tolerance
  • Provenance
• Reference architecture, to handle data from creation/capture to curated reference data, and serve as a platform for research
Workflow and the Neptune Array
Workflow is a bridge between the underwater sensor array (instrument) and the end users.
Features:
• Allow human interaction with instruments;
• Create 'on demand' visualizations of ocean processes;
• Store data for long-term time-series studies;
• Deployed instruments will change regularly, as will the analysis;
• Facilitate automated, routine "survey campaigns";
• Support automated event detection and reaction;
• Users able to access through the web (or custom client software);
• Best effort for most workflows is acceptable.
Pan-STARRS Sky Survey
• One of the largest visible light telescopes
• 4 unit telescopes acting as one
• 1 gigapixel per telescope
• Surveys the entire visible universe once per week
• Catalogs solar system moving objects/asteroids
• ps1sc.org: UHawaii, Johns Hopkins, …
Pan-STARRS Highlights
• 30 TB of processed data/year
• ~1 PB of raw data
• 5 billion objects; 100 million detections/week
• Updated every week
• SQL Server 2008 for storing detections
• Distributed over spatially partitioned databases (see the sketch below)
• Replicated for fault tolerance
• Windows 2008 HPC cluster schedules workflows and monitors the system
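The detections are spread over spatially partitioned slice databases, each kept in several replicas. Below is a minimal sketch of how a detection might be routed to a slice by its declination; the equal-width zone scheme, slice count, replica names, and function names are illustrative assumptions, not the actual Pan-STARRS partitioning.

```python
# Illustrative sketch: route a detection to a spatially partitioned slice.
# Zone scheme, slice count, and replica layout are assumptions, not the
# real Pan-STARRS schema.

NUM_SLICES = 8                       # hypothetical number of slice databases
REPLICAS = ("hot", "warm", "cold")   # each slice held on multiple servers

def slice_for(dec_degrees: float, num_slices: int = NUM_SLICES) -> int:
    """Map a declination (-90..+90) to a slice index by equal-width zones."""
    zone = (dec_degrees + 90.0) / 180.0            # normalize to [0, 1]
    return min(int(zone * num_slices), num_slices - 1)

def replica_databases(slice_id: int):
    """Names of the replicated databases holding this slice."""
    return [f"Slice{slice_id + 1}_{kind}" for kind in REPLICAS]

if __name__ == "__main__":
    dec = 12.5                                     # example detection declination
    s = slice_for(dec)
    print(s, replica_databases(s))
```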
Pan-STARRS Data Flow (diagram): CSV files from the IPP arrive in a shared data store and are processed by Load Merge machines 1-6 into spatially partitioned slice databases (Slices 1-8, sub-slices s1-s16), maintained as HOT and WARM replicas and exposed through main databases and a distributed view.
Pan-STARRS Workflows (diagram): the Pan-STARRS Science Cloud. Behind the cloud, Data Valet workflows run on admin and load-merge machines: CSV files from the Image Processing Pipeline (IPP) at the telescope pass through Load workflows into Load DBs, Merge workflows into cold slice DBs, and Flip workflows that swap hot and warm slice DBs, with validation, exception notification, and a slice fault recovery workflow. User-facing production machines serve astronomers (data consumers) through the CASJobs query service, MyDB, and the distributed view. Data flows in one direction, except for error recovery.
Pan-STARRS Architecture
Workflow is just a member of the orchestra.
Workflow and Pan-STARRS
Workflow carries out the data loading and merging.
Features:
• Support scheduling of workflows for nightly load and merge;
• Offer only controlled (protected) access to the workflow system;
• Workflows are tested, hardened and seldom change;
• Not a unit of reuse or knowledge sharing;
• Fault tolerance: ensure recovery and cleanup from faults;
• Assign cleanup workflows to undo state changes (see the sketch below);
• Provenance as a record of state changes (system management);
• Performance monitoring and logging for diagnostics;
• Must "play well" in a distributed system;
• Provide ground truth for the state of the system.
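Pairing each state-changing workflow with a cleanup workflow that undoes its effects is a compensation pattern. Here is a minimal sketch of that pattern, assuming hypothetical step and undo functions; it is not the Trident or Pan-STARRS implementation.

```python
# Sketch of the compensation pattern: each state-changing step registers a
# cleanup action; if a later step faults, the cleanups run in reverse order.
# The step names in the usage example are hypothetical.

def run_with_cleanup(steps):
    """steps: list of (do, undo) callables. Undo completed steps on failure."""
    done = []
    try:
        for do, undo in steps:
            do()
            done.append(undo)
    except Exception:
        for undo in reversed(done):    # roll back state changes
            undo()
        raise                          # re-raise so the fault is recorded

# Example with hypothetical steps:
# run_with_cleanup([(create_load_db, drop_load_db),
#                   (bulk_load_csv, truncate_tables)])
```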
Other Partner Applications
Data Creation, Ingest to End Use
• Scientific data from sensors and instruments
• Time series, spatially distributed
• Need to be ingested before use
• Go from Level 1 to Level 2 data
• Potentially large, continuous stream of data
• A variety of end users (consumers) of this data
Workflows shepherd raw bits from instruments to usable data in databases in the Cloud.
Users in an eScience Ecosystem (diagram): producers upload new data to Data Valets, who reject/fix or accept it and apply data corrections; curators publish new data through publishers as data products; consumers issue queries, receive query results, and download the accepted data.
Generalized Architecture: GrayWulf (diagram): valet workflows and user workflows run over shared compute resources. Data flows from a Data Valet queryable data store through a shared queryable data store to a user queryable data store and user storage; the Data Valet and end users each have their own user interface; configuration management with health and performance monitoring spans the system. The diagram distinguishes data flow from control flow.
PS Load & Merge Workflows
Load workflow: Start → Determine affine Slice Cold DB for CSV batch → Sanity check of network files, manifest, checksum → Create and register empty LoadDB from template → For each CSV file in the batch: validate CSV file & table schema; BULK LOAD CSV file into table; perform CSV file/table validation → Perform LoadDB/batch validation → End. On a fault: detect load fault, launch recovery operations, notify admin.
Merge workflow: Start → Determine 'merge worthy' Load DBs & Slice Cold DBs → UNION ALL over Slice & Load DBs into temp, filtered on the partition bound → For each partition in the Slice Cold DB: switch OUT the Slice partition to temp; switch IN temp to the Slice partition; post-partition load validation → Slice column recalculations & updates → Post-slice load validation → End. On a fault: detect merge fault, launch recovery operations, notify admin.
A sketch of the load workflow as fine-grained activities follows below.
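The sketch below shows the shape of the load workflow: fine-grained activities run in sequence over a batch of CSV files, and any fault triggers recovery and admin notification. All activity bodies are stubs standing in for the real operations; names and behavior are placeholders, not the production Pan-STARRS code.

```python
# Sketch of the load workflow as a sequence of fine-grained activities.
# Each activity does one task so faults can be located and compensated.
# The activity bodies below are stubs, not the real implementations.

def determine_affine_slice(batch): return "Slice3_cold"
def sanity_check(batch):           pass   # manifest present, checksums match
def create_load_db(slice_db):      return f"LoadDB_for_{slice_db}"
def validate_schema(f):            pass   # CSV columns vs. table schema
def bulk_load(db, f):              pass   # bulk insert the file
def validate_file_load(db, f):     pass   # per-file row counts, ranges
def validate_batch_load(db):       pass   # LoadDB/batch-level checks
def launch_recovery(fault):        pass   # run cleanup workflow
def notify_admin(fault):           pass   # alert the operator

def load_workflow(csv_batch):
    try:
        slice_db = determine_affine_slice(csv_batch)
        sanity_check(csv_batch)
        load_db = create_load_db(slice_db)
        for csv_file in csv_batch:
            validate_schema(csv_file)
            bulk_load(load_db, csv_file)
            validate_file_load(load_db, csv_file)
        validate_batch_load(load_db)
        return load_db
    except Exception as fault:
        launch_recovery(fault)
        notify_admin(fault)
        raise

if __name__ == "__main__":
    print(load_workflow(["detections_1.csv", "detections_2.csv"]))
```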
System State Matters…
• Monitor the state of the system
• Data centric and process centric views
• What is the load/merge state of each database in the system?
• What are the active workflows in the system?
• Drill down into actions performed:
  • On a particular database to date
  • By a particular workflow
Provenance for the Data Valet
Provenance drivers for Pan-STARRS:
• Need a way to monitor the state of the system (databases & workflows)
• Need a way to recover from error states
• Database states are modeled as a state transition diagram
• Workflows cause the transition from one state to another (see the sketch below)
Provenance forms an intelligent system log.
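A minimal sketch of the idea that database states form a transition diagram driven by workflows, with every transition appended to a provenance log. The state names and allowed transitions here are illustrative guesses, not the actual Pan-STARRS state model.

```python
# Sketch: database states as a transition table; workflows cause transitions,
# and each transition is appended to a provenance log. State names and the
# allowed transitions are illustrative assumptions.

ALLOWED = {
    ("empty", "load"):   "loaded",
    ("loaded", "merge"): "merged",
    ("merged", "flip"):  "serving",
    # any disallowed transition moves the database to "faulted"
}

provenance_log = []   # acts as an intelligent system log

def transition(db, state, workflow):
    new_state = ALLOWED.get((state, workflow), "faulted")
    provenance_log.append({"db": db, "workflow": workflow,
                           "from": state, "to": new_state})
    return new_state

if __name__ == "__main__":
    state = "empty"
    for wf in ("load", "merge", "flip"):
        state = transition("Slice1_cold", state, wf)
    print(state, provenance_log)
```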
Fault Recovery
• Faults are just another state
• Pan-STARRS aims to support 2 degrees of failure
• Up to 2 replicas out of 3 can fail and still be recovered
Fault Recovery
• Provenance logs need to identify the type and location of a failure
• Verification of fault paths
• Attribution of failure to human error, infrastructure failure, or data error
• Global view of system state during a fault (see the sketch below)
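Since up to two of a slice's three replicas may fail and still be recovered, the recovery workflow needs a global view of replica states. Below is a small sketch of that recoverability check; the record layout and threshold parameter are assumptions made for illustration.

```python
# Sketch: decide whether a slice is still recoverable from its replica states.
# Pan-STARRS aims to survive up to 2 failed replicas out of 3; the dictionary
# layout here is an assumption for illustration only.

def recoverable(replica_states, max_failures=2):
    """replica_states: dict like {"hot": "ok", "warm": "faulted", "cold": "ok"}."""
    failed = [name for name, state in replica_states.items() if state != "ok"]
    return len(failed) <= max_failures, failed

if __name__ == "__main__":
    ok, failed = recoverable({"hot": "faulted", "warm": "ok", "cold": "faulted"})
    print("recoverable:", ok, "failed replicas:", failed)
```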
Provenance Data Model
• Fine-grained workflow activities: each activity does one task, which eases failure recovery
• Capture inputs and outputs from each workflow/activity
• Relational/XML model for storing provenance; the generic model supports complex .NET types
• Identify stateful data in parameters
• Build a relational view on the data states: a domain specific view that encodes semantic knowledge in the view query
(A sketch of such a provenance record follows below.)
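A minimal sketch of a provenance record that captures one activity's inputs and outputs and flags which parameters carry state, in the spirit of the model above; the field names and structure are assumptions, not Trident's actual schema.

```python
# Sketch of a provenance record for one fine-grained activity: inputs and
# outputs are captured, and parameters that represent stateful data (e.g. a
# database) are flagged so a relational view of data states can be built.
# Field names are illustrative, not Trident's schema.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class ActivityProvenance:
    workflow_id: str
    activity: str
    inputs: dict[str, Any] = field(default_factory=dict)
    outputs: dict[str, Any] = field(default_factory=dict)
    stateful: set[str] = field(default_factory=set)   # parameters that carry state
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

if __name__ == "__main__":
    rec = ActivityProvenance(
        workflow_id="load-2009-04-01",
        activity="BulkLoadCsv",
        inputs={"csv_file": "detections_1.csv", "load_db": "LoadDB_17"},
        outputs={"rows_loaded": 125000, "load_db": "LoadDB_17"},
        stateful={"load_db"},
    )
    print(rec)
```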
Reliability of the Provenance System
• Fault recovery depends on provenance; missing provenance can leave the system unstable after a fault
• Provenance collection is synchronous
• Provenance events are published using reliable (durable) messaging, guaranteeing that each event will eventually be delivered
• Provenance is reliably persisted
(A sketch of durable provenance publication follows below.)
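A minimal sketch of that write-ahead idea: the event is made durable before the workflow continues. A local append-only journal stands in for the durable message queue, and the function name is an assumption; a real deployment would publish to a durable message bus.

```python
# Sketch of reliable provenance publication: the event is appended to a
# durable journal (a stand-in for a durable message queue) and flushed to
# disk before the activity proceeds, so a crash cannot lose the event.

import json
import os

def publish_provenance(event: dict, journal_path: str = "provenance.journal") -> None:
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(json.dumps(event) + "\n")
        journal.flush()
        os.fsync(journal.fileno())   # durable before returning to the workflow

if __name__ == "__main__":
    publish_provenance({"workflow": "merge-42", "activity": "SwitchPartition",
                        "db": "Slice5_cold", "status": "completed"})
```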
Trident Logical Architecture
• Trident Workbench: visualization, design, workflow packages, runtime, community scientific workflows
• Clients: Workbench, Administration Console, Web Portal
• Service & Activity Registry over Windows Workflow Foundation (see the sketch below)
• Trident Runtime Services: workflow monitor, archiving, workflow launcher, provenance, WinHPC scheduling, fault tolerance, monitoring service
• Data Access: Data Object Model (database agnostic abstraction) over SQL Server, SSDS Cloud DB, S3, …
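To make the registry idea concrete, here is a toy sketch of registering activities and composing them into a workflow handed to a runtime. Trident itself builds on Windows Workflow Foundation and .NET, so this Python sketch only illustrates the concept; all names and structure are hypothetical.

```python
# Toy sketch of a service/activity registry and workflow composition.
# Trident is built on Windows Workflow Foundation (.NET); names and
# structure here are hypothetical illustrations of the concept.

ACTIVITY_REGISTRY = {}

def register(name):
    """Decorator that adds an activity to the shared registry."""
    def wrap(fn):
        ACTIVITY_REGISTRY[name] = fn
        return fn
    return wrap

@register("LoadCsv")
def load_csv(data):
    return data + ["loaded"]

@register("Validate")
def validate(data):
    return data + ["validated"]

def compose(activity_names):
    """Build a workflow (a callable pipeline) from registered activities."""
    activities = [ACTIVITY_REGISTRY[n] for n in activity_names]
    def workflow(data):
        for activity in activities:
            data = activity(data)
        return data
    return workflow

if __name__ == "__main__":
    wf = compose(["LoadCsv", "Validate"])
    print(wf(["detections_1.csv"]))
```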
Research Questions
• Role of workflow in data intensive eScience → the Data Valet
• Explore architectural patterns/best practices → scalability, fault tolerance and provenance implemented through workflow patterns
• Reference architecture, to handle data from creation/capture to curated reference data, and serve as a platform for research → the GrayWulf reference architecture