Scientific Workflow for Project Neptune MICS’08 January 23rd 2008 Trident
Trident: I’ll give an overview of this… and tell you a little about Scientific Workflow for Project Neptune… but first…
Roger Barga, Principal Architect, Technical Computing at Microsoft
http://www.microsoft.com/science
A New Science Paradigm
A thousand years ago: Experimental Science - description of natural phenomena
Last few hundred years: Theoretical Science - Newton’s laws, Maxwell’s equations…
Last few decades: Computational Science - simulation of complex phenomena
Today: e-Science, or data-centric science - unify theory, experiment, and simulation, on regional and global scale
• Data captured by instruments
• Data generated by simulations
• Data generated by sensor networks
Slide thanks to Jim Gray
Data-Intensive Supercomputing: The Case for DISC. Randal E. Bryant, May 10, 2007. CMU-CS-07-128
“When a teenage boy wants to find information about his idol by using Google with the search query “Britney Spears,” he unleashes the power of several hundred processors operating on a data set of over 200 terabytes. Why then can’t a scientist seeking a cure for cancer invoke large amounts of computation over a terabyte-sized database of DNA microarray data at the click of a button?”
Technical Computing at Microsoft. Tony Hey, Corporate Vice President, Microsoft Research. http://www.microsoft.com/science
Engaging with Researchers
Jim Gray and his work with TerraServer and with the astronomy research community: SkyServer contains 3 TB of SDSS data, built on SQL Server and .NET; 380M web hits in 6 years, nearly 1M distinct users, 1,600 papers.
David Heckerman, working with HIV/AIDS researchers. Simon Mallal, Perth, Australia: “We had been analyzing the data on our own for over 5 years, and despite the luxury of thinking about it for so long and so deeply, we were amazed that, within a few months using powerful machine learning techniques, Microsoft were able to zero in on the same observations much more quickly than we had been able to.”
Stephen Emmott, European Science Initiative: 2020 Science Report and Nature; over 100,000 copies and downloads; the NSF ‘Computational Thinking’ program owes much to the report.
Architecture Engagement Strategy Engage with scientists to: • Understand the requirements, across multiple projects • Demonstrate prototypes and proofs of concept • Develop software tools and technologies that support an “eResearch platform” for science • Leverage this experience, try to transfer it to new research communities, and make it easy to deploy and use. How do we move from heroic scientists doing heroic science with heroic infrastructure to everyday scientists doing science they couldn’t do before? It’s the democratisation of eResearch!
Scientific Data Servers for Hydrology Work with the Berkeley Water Center to use modern (relational) database technology. 149 Ameriflux sites across the Americas, reporting a minimum of 22 common measurements. Carbon-climate data published to and archived at Oak Ridge. Total data reported to date on the order of 192M half-hourly measurements since 1994. http://public.ornl.gov/ameriflux/ Microsoft Project PI: Catharine van Ingen
Mashup of Ameriflux Sites Virtual Earth integration by Savas Parastatidis
eResearch Platform: end-to-end support for research
• SQL Server: GrayWulf; data services, à la SkyServer; Famulus, institutional repository
• SharePoint Server 2007: support research collaboration; manage research objects (VRE); eJournal
• Compute Cluster Server: integration with CCS; “HPC Live”, hosted services
• Open Source Tools: machine learning; social networking tools
• Windows Live: portal for research projects; domain-specific services; shared document libraries; scalable storage; elastic computation
• Office 2007: Ribbon for Research; reference manager; embedded provenance; ChemDraw for Word
• Silverlight: rich interactive apps across the web; interop across Windows, Linux, Mac
• Windows Workflow Foundation: workflow design and execution; data analysis pipelines; Trident
How will we know that we have succeeded?
A. When everyone is using Microsoft products and technologies
B. When there are scientific advances, using both commercial and open source software, that wouldn’t have happened otherwise
Not just accelerated science, but science they couldn’t do before…
Project Neptune Oceanography Today
What is Neptune? The first large-scale, long-term ocean observatory. Consists of 30-50 nodes (observation “stations”) placed strategically around the Juan de Fuca plate. Provides real-time data to researchers, educators, and students across the North American continent.
Four Main Elements (How do you think funds are to be allocated?)
• Fiber optic cabling and fixed nodes comprising a network allowing for high-speed communication
• An array deployed on the seafloor as well as in the water column
• An on-shore operations center to control all elements of the network
• A data management and archive center
How Does it Work? Fiber optic cabling will connect observatories and equipment, forming a 200,000 sq km grid covering the Juan de Fuca plate (~80 kW; 20 Gbps aggregate to shore). Instruments will be deployed in boreholes and suspended in the water column, as well as housed in the observatories, ready for remote deployment.
Shore Station
• Start and/or end of the backbone cable
• Power feed equipment
• IP (and SONET) gear, fibre illumination equipment
• Data acquisition software
• Data buffering
• End of the backhaul line
Data Management and Archive Center
The software system that is the link between the underwater infrastructure and the users.
Mandate:
• Make data available to researchers in (near-) real time
• Store data for long-term time-series studies
Features:
• Allow human interaction with instruments
• Facilitate automated, routine “survey campaigns”
• Facilitate automated event detection and reaction
• User access through the web (or custom client software)
• Support for scientific workflows (hours, not days…)
Scientific Workflow • E-Science laboris • Workflows are the new rock and roll of eScience • Machinery for coordinating the execution of (scientific) services and linking together (scientific) resources • Era of service-oriented apps (SOA) • Repetitive and mundane tasks made easier (data cleaning…) • Facilitates sharing of science • Slide thanks to David De Roure
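The coordination role described above can be made concrete with a minimal sketch. This is illustrative Python, not the Trident API: a workflow is an ordered pipeline of activities, each a callable that transforms the data it receives, and the engine's job is simply to thread the data through them. The cleaning pipeline is a hypothetical example.

```python
# Illustrative sketch (not Trident): a workflow as an ordered pipeline
# of activities, each a callable that transforms the data it receives.

def run_workflow(activities, data):
    """Run each activity in sequence, feeding each one's output to the next."""
    for activity in activities:
        data = activity(data)
    return data

# Hypothetical data-cleaning pipeline: drop missing readings, then average.
drop_missing = lambda readings: [r for r in readings if r is not None]
mean = lambda readings: sum(readings) / len(readings)

result = run_workflow([drop_missing, mean], [3.0, None, 5.0])
# result == 4.0
```

Real engines add scheduling, persistence, and branching on top, but the core contract (activities composed over shared data) is the same.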
Do we really need another workflow engine... Kepler Triana BPEL Ptolemy II
Trident Scientific Workflow Workbench • Visually program workflows, through a web browser • Libraries of activities, workflows and services • Social annotations and search, connect to myexperiment. • Streaming data, CEDR temporal stream model (CIDR’07) • Abstract parallelism, for many core platforms (CCR) • Adaptive workflows, to detect and respond to events • Automatic provenance capture, open provenance model • Costing model, resources include time, power, data xfer • Integrated data storage and access • Distribution, moving the work closer to the sensor array • Fault tolerance, facilitate smart reruns, what-if analysis • Factory scheduling of workflows
Trident Implementation: built on top of an industrial workflow engine, Windows Workflow Foundation • Workflow in a general-purpose framework • Part of Microsoft’s .NET Framework 3.0/3.5
Workflow Execution Provenance Scientists routinely record the provenance of bench experiments in lab notebooks; this is essential for computational experiments as well. For a workflow management system, provenance identifies what activities were executed, the parameters supplied at runtime, the data passed between activities, the intermediate results generated, etc. Provenance: • Explains how a result was created, sufficient to establish trust; • Provides a replication recipe; • Guides the development of future experiments.
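The kind of record described above can be sketched in a few lines. This is a hedged illustration, not Trident's schema: each activity run logs what executed, the runtime parameters, and the result, which together form the "replication recipe". The `regrid` activity and its field names are hypothetical.

```python
# Sketch of workflow provenance capture (record format is illustrative,
# not Trident's): log what executed, with which parameters, producing what.
import time

provenance_log = []

def record_provenance(activity_name, params, result):
    provenance_log.append({
        "activity": activity_name,
        "params": params,
        "result": result,
        "timestamp": time.time(),
    })

def regrid(data, resolution):
    # Hypothetical activity: snap each value onto a grid of the given step.
    out = [round(x / resolution) * resolution for x in data]
    record_provenance("regrid", {"resolution": resolution}, out)
    return out

regrid([1.2, 3.7], resolution=0.5)
# provenance_log now explains how the result was created -- enough to
# establish trust in it and to rerun the step with the same parameters.
```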
Provenance in Trident The enactment engine documents all steps linking the original inputs with the final result, so an execution can be verified, reproduced, or rerun: provenance as a first-class data product… • Provenance capture is automatic and transparent • Persists provenance data for a fixed period of time • Supports multiple levels of representation • Storage provided by the underlying system • Interface to query and reason over provenance data • Efficient storage representation and query performance • IPAW’06, Concurrency Practice and Experience 2007, AAAI SES’07 • Couples provenance collection and versioning
Motivation for a Layered Representation
Result provenance in context:
• What codes (activities) did I invoke to get this result, and what were the parameters? Which version of MM5 did I use?
• What machine was used to perform the regridding? How much time did it take?
• Were any steps skipped in this experiment, or were any adapters inserted?
• Did the experiment design differ between these two results? If so, where?
• Are there any branches in the workflow that have not been explored?
• What experiments in my collection utilize regridding tools?
Captured by the set of objects to which a result is incrementally bound.
Additional considerations:
• Explain, validate, or recreate: one size doesn’t fit all
• Allow the user to control what is shared/exposed
• The result of a provenance query is an executable workflow
Distinct Levels in our Model
• Experiment Design (L0): abstract workflow; serialize the workflow schedule (XML)
• Workflow Instantiation (L1): bindings or instances of activities and data sets
• Runtime Trace (L2): invocation of specific activities, events and rules; deviations from the defined schedule (activities skipped or inserted)
• External Interaction (L3): input variables, runtime parameters, activation inputs; services invoked, return value(s), etc.
• Job Provenance (L4): start/complete time, system id, etc.
What about internal state?
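One way to picture the five layers is as distinct record types, each capturing one level at which a result is bound. This is a hedged sketch; the field names are illustrative, not the published model.

```python
# Illustrative record types for the five provenance layers (L0-L4).
# Field names are assumptions, not the model's actual schema.
from dataclasses import dataclass, field

@dataclass
class ExperimentDesign:        # L0: abstract workflow (serialized schedule)
    workflow_xml: str

@dataclass
class Instantiation:           # L1: bound activities and data sets
    activity_bindings: dict

@dataclass
class RuntimeTrace:            # L2: invocations, events, deviations
    events: list = field(default_factory=list)

@dataclass
class ExternalInteraction:     # L3: inputs, parameters, services invoked
    inputs: dict
    services: list

@dataclass
class JobProvenance:           # L4: start/complete time, system id
    start: float
    complete: float
    system_id: str

# A result is incrementally bound to one record per layer:
job = JobProvenance(start=0.0, complete=42.5, system_id="node-07")
```

Separating the layers is what lets a query answer "which version of MM5?" (L1) without dragging in machine timings (L4), and vice versa.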
Implementation: extend the base activity class
Activities are the basic building blocks:
• The root of the entire workflow is itself an activity
• Composite activities contain other activities, e.g. Sequence, Parallel, Synchronize, Exclusive Choice, Merge, …
• Basic activities are steps within a workflow
Activities are simply classes. We introduce properties and events to intercept and pass data to the provenance capture service at runtime… (we really just post to an event “blackboard”)
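The "activities are simply classes" idea can be sketched directly. This is illustrative Python, not the Windows Workflow Foundation API: a base class whose `execute()` posts enter/exit events to a shared blackboard (the interception point the slide describes), a composite `Sequence` containing children, and a basic `Double` activity as a hypothetical leaf step.

```python
# Illustrative sketch (not the WF API): activities as classes, with
# execute() posting events to a shared "blackboard" that a provenance
# capture service could read from.

blackboard = []  # the event "blackboard"

class Activity:
    def execute(self, data):
        blackboard.append(("enter", type(self).__name__))
        result = self.run(data)
        blackboard.append(("exit", type(self).__name__, result))
        return result

    def run(self, data):
        raise NotImplementedError

class Sequence(Activity):
    """Composite activity: contains other activities, run in order."""
    def __init__(self, *children):
        self.children = children

    def run(self, data):
        for child in self.children:
            data = child.execute(data)
        return data

class Double(Activity):
    """Basic activity: a single step within the workflow."""
    def run(self, data):
        return data * 2

result = Sequence(Double(), Double()).execute(3)
# result == 12; blackboard now holds the full enter/exit trace.
```

Because interception lives in the base class, every activity (basic or composite) is traced without its author writing any logging code.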
Transparent Interception and Logical Logging
A SEQUENCE activity wraps workflow activities 1 … N. Each activity creates an operation history: a time-serial stream of provenance records. Each record represents a change in workflow state, such as advancing a sequence, a synchronize or branch, or activities passing data or invoking web services. Replay of the log is an accurate repeated history of state changes, up to and including the “present” state. The Provenance Service “weaves” these records into the workflow XML, recording LSNs for individual activities, insertions (shims), etc.
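The logical-log-and-replay idea above can be sketched in miniature. The record format here is an assumption for illustration, not Trident's: each record carries a log sequence number (LSN) and describes one state change, and replaying records in LSN order rebuilds the "present" state.

```python
# Sketch of logical logging and replay (record format is illustrative):
# each record is one workflow state change, ordered by LSN.

log = [
    {"lsn": 1, "event": "sequence_advanced", "position": 0},
    {"lsn": 2, "event": "data_passed", "value": 42},
    {"lsn": 3, "event": "sequence_advanced", "position": 1},
]

def replay(records):
    """Reapply records in LSN order to rebuild the present state."""
    state = {"position": None, "last_value": None}
    for record in sorted(records, key=lambda r: r["lsn"]):
        if record["event"] == "sequence_advanced":
            state["position"] = record["position"]
        elif record["event"] == "data_passed":
            state["last_value"] = record["value"]
    return state

state = replay(log)
# state == {"position": 1, "last_value": 42}
```

Replaying a prefix of the log (records with LSN below some cutoff) gives the state at any earlier point, which is what makes smart reruns and what-if analysis possible.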
Workflow Execution (with provenance capture)
1. The app calls StartWorkflow(…), e.g. rt.StartWorkflow(typeof(WF1));
2. The Instance Manager loads the workflow type, creates an instance, and enqueues WF1 with the Scheduler
3. The Scheduler dequeues WF1, serializes the XOML, and calls the Executor (SequentialWorkflow base), which enqueues Sequence
4. Dequeue Sequence and call the Executor, which serializes the activity record and enqueues OnEvent1
5. Dequeue OnEvent1, serialize the activity record, and call the Executor, which subscribes to the event; the workflow executes until idle
6. The Instance Manager calls Flush() on WF1 (the Activity base class) to flush provenance records and gets back a stream
7. The Instance Manager calls the Provenance service, passing the serialized stream; the Provenance Storage service persists it to disk
Trident Runtime Services: pluggable service extensions
Runtime services run in the host application’s app domain:
• Monitoring Service: monitors events and policies for dynamic workflow adaptation
• SQL Persistence Service: stores and retrieves instance state
• Tracking Service: manages profiles and stores tracked information
• Workflow Provenance Service: automatic capture of provenance for workflows
• Transaction Service: default resource service for managing threading and creating transactions
• Fault Tolerance Service
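The pluggable-services pattern above can be sketched as a runtime holding a registry of named services that all observe workflow events; the host adds or swaps implementations without touching the engine. The interfaces here are hypothetical, not Windows Workflow Foundation's.

```python
# Illustrative sketch (not the WF API) of pluggable runtime services:
# the runtime broadcasts workflow events to whatever services the host
# has registered.

class Runtime:
    def __init__(self):
        self.services = {}

    def add_service(self, name, service):
        self.services[name] = service

    def notify(self, event):
        # Broadcast the event to every registered service.
        for service in self.services.values():
            service.on_event(event)

class ProvenanceService:
    """Hypothetical service: captures every event it sees."""
    def __init__(self):
        self.records = []

    def on_event(self, event):
        self.records.append(event)  # automatic, transparent capture

rt = Runtime()
prov = ProvenanceService()
rt.add_service("provenance", prov)
rt.notify("activity_started")
# prov.records == ["activity_started"]
```

The same registry would hold persistence, tracking, or monitoring services; each sees the identical event stream, which is why provenance capture can be added without changing any workflow.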
Visual Programming: Vision for Trident, the Oceanographer’s Workbench
Windows Workflow Base Activity Library: basic and composite activities
Activities: An Extensible Approach
• Base Activity Library: out-of-box activities, workflow types, and base types; general-purpose; defines basic workflow constructs
• Custom Activity Libraries: create/extend/compose activities; read from sensors, data pipelines, etc.; first-class citizens
• Domain-Specific Workflow Packages (e.g. oceanography, biology, CRM, RosettaNet): domain-specific activities
Convert Custom to Parameterized Activities Users simply select activities they wish to expose
Publish Custom Activities to the Web Parameterized and Web Accessible
Remote Authoring via Web Browser
A simple point-and-click interface for creating and executing workflows; users do not need to write any code or XML.
Trident for Neptune: Connected and Controllable Over the Internet
Sensors, instruments, and workflows controlled over the Internet. Source: Tony Hey, Microsoft
Looking Ahead
Trident Scientific Workflow Workbench: gathering requirements, research prototypes; development begins in March
• Expect first beta Oct 2008
• Small beta program on real science projects
Data-intensive eScience:
• A new wave of scientific research is at hand
• It’s about enabling new discoveries, not just accelerating them
• Researchers are not just consumers of infrastructure
• Offer tools that empower them…