This overview provides information on the Kepler Project, its goals, features, collaborators, and advancements. It highlights the project's aim to create a powerful analytical tool that supports scientists in biology, ecology, astronomy, and other disciplines. The text also discusses the distributed execution capabilities and the need for integrated management of external data in the Kepler system. Additionally, it introduces two new projects, REAP and ChIP-chip, that extend the functionality of Kepler in managing environmental observatory data and accelerating genome-scale biological research.
The Kepler Project: Overview, Status, and Future Directions • Matthew B. Jones, on behalf of the Kepler Project team • National Center for Ecological Analysis and Synthesis, University of California, Santa Barbara
The Kepler Project • Goals • Produce an open-source scientific workflow system • enable scientists to design scientific workflows and execute them • Support scientists in a variety of disciplines • e.g., biology, ecology, astronomy • Important features • access to scientific data • flexible means for executing complex analyses • enable use of Grid-based approaches to distributed computation • semantic models of scientific tasks • effective UI for workflow design
Kepler Collaboration • Open-source • Builds on Ptolemy II • Collaborators • SEEK Project • SciDAC SDM Center • Ptolemy Project • GEON Project • ROADNet Project • Resurgence Project • Goals • Create powerful analytical tools that are useful across disciplines • Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, …
Usage statistics • Projects using Kepler: • SEEK (ecology) • SciDAC (molecular bio, ...) • CPES (plasma simulation) • GEON (geosciences) • CiPRes (phylogenetics) • CalIT2 • ROADNet (real-time data) • LOOKING (oceanography) • CAMERA (metagenomics) • Resurgence (computational chemistry) • NORIA (ocean observing CI) • NEON (ecology observing CI) • ChIP-chip (genomics) • COMET (environmental science) • Cheshire Digital Library (archival) • Digital preservation (DIGARCH) • Cell Biology (Scripps) • DART (X-ray crystallography) • Ocean Life • Assembling the Tree of Life project • Processing Phylodata (pPOD) • FermiLab (particle physics) • Source code access • 154 people accessed source code • 30 members have write permission • Kepler downloads • Total = 9204 • Beta = 6675 [Chart: downloads by platform; red = Windows, blue = Macintosh]
Kepler advances • Data and Actor search • EarthGrid data access system • Kepler Component Library • Kepler Archive (KAR) format • Integrated support for LSID identifiers for all objects • Object Manager and cache • Web service execution • RExpression & MatlabExpression actors • Redesigned user interface • Authentication subsystem • Null-value handling
More advances • Documentation • Collection-oriented workflows (COMAD) • Domain-specific actors for case studies • e.g., GARP, phylogenetics actors • Provenance system • Grid computing support • NIMROD, Globus, ssh, ... • Semantics support • annotation, search, workflow validation, integration
Distributed execution • Opportunities for parallel execution • Fine-grained parallelism • Coarse-grained parallelism • Few or no cycles • Limited dependencies among components • ‘Trivially parallel’ • Many science problems fit this mold • parameter sweep, iteration of stochastic models • Current ‘plumbing’ approaches to distributed execution • workflow acts as a controller • stages data resources • writes job description files • controls execution of jobs on nodes • requires expert understanding of the Grid system • Scientists need to focus on just the computations • try to avoid plumbing as much as possible
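The "trivially parallel" case above can be illustrated with a small parameter sweep. This is a minimal sketch, not Kepler code: `stochastic_model` and `parameter_sweep` are hypothetical names, the model is a toy, and a local thread pool stands in for the Grid nodes that a real workflow would target.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def stochastic_model(growth_rate, seed, steps=50):
    """Toy stochastic growth model (hypothetical, for illustration only)."""
    rng = random.Random(seed)  # a per-job seed keeps every run reproducible
    population = 100.0
    for _ in range(steps):
        population *= 1.0 + growth_rate + rng.gauss(0.0, 0.01)
    return population


def parameter_sweep(growth_rates, replicates=5, max_workers=4):
    """Run every (parameter, replicate) job independently, then average per parameter."""
    jobs = [(rate, seed) for rate in growth_rates for seed in range(replicates)]
    # The jobs share no state, so they can run in any order on any worker.
    # The thread pool is only a local stand-in for remote nodes.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda job: stochastic_model(*job), jobs))
    means = {}
    for i, rate in enumerate(growth_rates):
        chunk = results[i * replicates:(i + 1) * replicates]
        means[rate] = sum(chunk) / replicates
    return means


sweep = parameter_sweep([0.00, 0.01, 0.02])
```

Because no job depends on another, the only coordination needed is collecting results in order, which is what makes this class of problem a good fit for distributed execution.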
Distributed Kepler • Higher-order component for executing a model on one or more remote nodes • Master and slave controllers handle setup and communication among nodes, and establish data channels • Extremely easy for scientists to use • requires no knowledge of grid computing systems [Diagram: IN and OUT ports; Master controller and Slave controller]
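The idea of a higher-order component that hides the plumbing can be sketched as a function that wraps a model and returns a distributed version of it. This is an illustrative sketch only, not the Kepler implementation: `distributed` and `slave_controller` are hypothetical names, threads stand in for remote nodes, and in-process queues stand in for the data channels.

```python
import queue
import threading


def slave_controller(model, inbox, outbox):
    """Run the model on each token from the input channel until told to stop."""
    while True:
        item = inbox.get()
        if item is None:  # stop signal sent by the master
            break
        index, token = item
        outbox.put((index, model(token)))


def distributed(model, node_count=2):
    """Higher-order component: wrap a model so it runs across several nodes."""
    def run(tokens):
        tokens = list(tokens)
        inbox, outbox = queue.Queue(), queue.Queue()  # the data channels
        # Master side: set up the nodes (threads here, remote hosts in practice).
        nodes = [
            threading.Thread(target=slave_controller, args=(model, inbox, outbox))
            for _ in range(node_count)
        ]
        for node in nodes:
            node.start()
        for index, token in enumerate(tokens):
            inbox.put((index, token))
        for _ in nodes:
            inbox.put(None)  # one stop signal per node
        for node in nodes:
            node.join()
        # Reassemble the outputs in the original input order.
        results = [None] * len(tokens)
        while not outbox.empty():
            index, value = outbox.get()
            results[index] = value
        return results
    return run


square_remote = distributed(lambda x: x * x, node_count=3)
outputs = square_remote([1, 5, 2])
```

The caller only supplies a model and a node count; setup, channel creation, and teardown stay inside the wrapper, which is the property the slide describes.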
Data Management • Need for integrated management of external data • EarthGrid access is partial and needs refactoring • Include other data sources, such as JDBC, OpeNDAP, etc. • Data needs to be a first-class object in Kepler, not just represented as an actor • Need support for data versioning to support provenance • Need to pass data by reference • workflows contain large data tokens (100s of megabytes) • intelligent handling of unique identifiers (e.g., LSID) [Diagram labels: actors A and B; tokens {1,5,2} and ref-276]
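Passing data by reference can be sketched as a token that carries only an identifier, with the payload fetched on demand. The sketch below is an assumption-laden illustration, not Kepler's design: `DataStore`, `RefToken`, `actor_a`, and `actor_b` are hypothetical names, the store is an in-memory dictionary, and the `ref-` identifier format is invented here and is not an LSID.

```python
import hashlib


class DataStore:
    """In-memory stand-in for an external data repository (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes) -> str:
        # Deriving the identifier from the content means changed data gets a
        # new identifier, which gives a simple form of versioning.
        ref = "ref-" + hashlib.sha256(payload).hexdigest()[:12]
        self._objects[ref] = payload
        return ref

    def get(self, ref: str) -> bytes:
        return self._objects[ref]


class RefToken:
    """Token that carries only an identifier; the data is fetched on demand."""

    def __init__(self, store, ref):
        self.store = store
        self.ref = ref

    def resolve(self) -> bytes:
        return self.store.get(self.ref)


def actor_a(store):
    """Produce a large data set, but emit only a small reference token."""
    data = bytes([1, 5, 2]) * 1000
    return RefToken(store, store.put(data))


def actor_b(token):
    """Consume the token and dereference it only where the data is needed."""
    return sum(token.resolve())


store = DataStore()
token = actor_a(store)
total = actor_b(token)
```

Only the short identifier travels between the two actors, so the cost of moving a token does not grow with the size of the data it names.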
New projects: REAP • Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System • Extend Kepler to: • Manage and monitor sensor networks • Consume data from sensors • Integrate sensor data handling with data archive handling • Terrestrial ecology and oceanography use cases • PIs: Jones, Altintas, Estrin, Seabloom, Gallagher, Cornillon, Hosseini, Ludäscher, Schildhauer, Reichman, Baru, Potter, Borer • Institutions: UCSB, UCD, UCSD, UCLA, OSU, OpeNDAP
New projects: ChIP-chip • A Collaborative Scientific Workflow Environment for Accelerating Genome-Scale Biological Research • CS/IT: Ludäscher, Bowers, McPhillips • Bio: Peggy Farnham, Mark Bieda • Integrate a web-based "experiment workspace" environment with a flexible scientific workflow system • Support rapid prototyping and easy addition of new "methods" • templates • details of key steps are left out until runtime, then one or more specific algorithms or data are late-bound • which is the best motif-finding algorithm? • parts of a workflow have similar sets of steps • need to compare results from parallel analyses • Support client-server (i.e., enterprise) deployments for group/lab-wide collaboration • different people have different roles [Software dev (Tim), Bioinformatics specialist (Mark), Biologist (Peggy)]
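The template idea, where a key step is left open until runtime and then bound to one or more algorithms, can be sketched with plain functions. Everything here is hypothetical and for illustration: `make_template`, `normalize`, and `most_common_kmer` are invented names, and the k-mer counters are toy stand-ins rather than real motif-finding tools.

```python
from collections import Counter


def most_common_kmer(k):
    """Build a toy 'motif finder' that reports the most frequent k-mer."""
    def find(sequences):
        counts = Counter(
            seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
        )
        return counts.most_common(1)[0][0]
    return find


def normalize(sequences):
    """Fixed step shared by every run: drop blank entries and uppercase the rest."""
    return [s.strip().upper() for s in sequences if s.strip()]


def make_template(preprocess):
    """Template with one open slot; the motif finder is bound at runtime."""
    def bind(motif_finder):
        def run(sequences):
            return motif_finder(preprocess(sequences))
        return run
    return bind


template = make_template(normalize)
candidates = {"3-mer": most_common_kmer(3), "4-mer": most_common_kmer(4)}
sequences = ["acgtacgt", "ttacgtaa", " "]

# Bind each candidate to the same template and collect results side by side.
comparison = {name: template(finder)(sequences) for name, finder in candidates.items()}
```

Binding several candidates to one template keeps the shared steps identical across runs, so any difference in `comparison` comes from the algorithm in the slot.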
[Figure from Bowers and McPhillips] • Experiment Workspace (setup, run, and manage) • Setup “protocol” • Import/Export Data • Data Display, Visualization • Run Experiment • Peggy, “biologist” • experiment repository • Workflow Automation (configuration and execution support) • Configuration Management • Execution Management • Monitoring Support • Provenance Tracking • Kepler Workflow Engine • provenance repository • Steps: (1) select design template and configure, (2) generate optimized executable workflow • Workflow Specification (workflow design and template creation) • Mark, “bioinformatics specialist” • design repository • Component Specification (wrapping, integration, and creation of components) • Tim, “software developer” • component repository • External components and services • ChIP-chip Data Analysis (ChIPOTle, HMM, …) • Motif Finding Algorithms (MEME, MDscan, …) • Visualization Packages and Statistics Tools • Public Databases & Services (GenBank, David, TransFac, …)
Kepler C.O.R.E. proposal • Development of Kepler CORE -- A Comprehensive, Open, Robust, and Extensible Scientific Workflow Infrastructure • Ludäscher, Altintas, Bowers, Jones, McPhillips • Goals • Reliable • refactored build • more modular design • improved engineering practices • Independently extensible • Open architecture, open project • improved governance
Kepler C.O.R.E. -- Sustainability • How does Kepler persist? • Now, via research grants • unsustainable for production purposes • Future • new models for financial support • support contracts? • extension contracts? • new science domains? • continued research dollars? • foundations? • exploring a 501(c)(3) organization that can sustain Kepler and similar open-source initiatives
Acknowledgements • Funding • The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, 0225676, and 0619060. • The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus. • The Andrew W. Mellon Foundation • The Department of Energy • Collaborators • NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas, University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis • Kepler contributors • SEEK, Ptolemy II, SDM/SciDAC, GEON, ROADNet, EOL, Resurgence