Leveraging Open Source Technologies to Enable Scientific Discovery
Dan Crichton
Program Manager, Data Systems and Technology, Earth Science
Program Manager, Planetary Data System Engineering, Solar System Exploration
Principal Computer Scientist
October 20, 2010
A Quick Note about me
• Jet Propulsion Laboratory since 1995
• Interests in highly distributed, data intensive systems; multi-organizational systems; software and data system architectures
• Program Manager, Earth Science Data Systems and Technology & Planetary Data System Engineering
• Principal Investigator/co-investigator for multiple projects including Early Detection Research Network and the Object Oriented Data Technology project (released to Apache Software Foundation)
• Also involved in standards development (e.g., CCSDS, IPDA, …)
“eScience” Trend
• Highly distributed, multi-organizational systems
  • Systems are moving towards loosely coupled systems or federations in order to solve science problems which span center and institutional environments
• Sharing of data and services which allow for the discovery, access, and transformation of data
  • Systems are moving towards publishing of services and data in order to address data and computationally-intensive problems
  • Infrastructures are being built to handle future demand
  • Use of commodity services to address elasticity
• Address complex modeling, inter-disciplinary science and decision support needs
  • Need a dynamic environment where data and services can be used quickly as the building blocks for constructing predictive models and answering critical science questions
  • Need to ensure the information architecture supports the varying science needs
• Changing the way in which data analysis is performed
  • Moving towards analysis of distributed data to increase the study power
  • Enabling greater collaboration across centers
  • Systematizing, where possible
Conceptual End-to-End Space Data Systems Architecture
[Diagram. Elements: Spacecraft and Scientific Instruments; Spacecraft / lander; Relay Satellite; Data Acquisition and Command; Mission Operations; Instrument/Sensor Operations; Science Data Processing; Data Analysis and Modeling; Science Data Archive; Science Team; External Science Community. Information objects exchanged: Simple Information Object; Primitive Information Object; Telemetry Information Package; Science Information Package; Science Products (Information Objects); Planning Information Object; Instrument Planning Information Object.]
• Common Meta Models for Describing Space Information Objects
• Common Data Dictionary end-to-end
Highly Distributed Science Environments
• Highly distributed/federated
• Collaborative
• Information-centric
• Discipline-specific
• Growing/evolving
• Heterogeneous (Implementations)
Why Software Architecture?
• Software Architecture: The fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution. (ANSI/IEEE Std. 1471-2000)
• Architecture is about strategy to address key architectural concerns…
  • How can we exploit common patterns to improve reuse?
  • Can we develop software product lines?
  • Can we improve interoperability?
  • Can we reduce dependencies?
• What are the architectural principles? Loosely coupled, information-driven, highly distributed, commodity services, service oriented, collaborative/multi-institutional
Notional Service Architectures Concept
[Diagram labels: Client A; Client B; Service Interface; C; Service. Style: C2 Architectural Style]
• The service architecture concept exploits many of the architectural concepts discussed
  • Loosely coupled
  • Elasticity (e.g., commodity-based)
  • Multi-organizational
  • etc.
• At an enterprise scale, architectures don’t need to prescribe what’s inside services… just their interfaces, function, behavior, etc.
• Services might include…
  • Data discovery
  • Data access
  • Security
  • Transformation
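As a minimal sketch of the point that an architecture prescribes a service's interface rather than its internals, the Python below defines a discovery interface and one implementation behind it. Every name here (Product, DiscoveryService, InMemoryDiscovery, client_a, the product identifiers) is hypothetical and chosen for illustration; none of it is taken from the talk or from OODT.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Product:
    """A data product reference; the fields are illustrative only."""
    product_id: str
    location: str


class DiscoveryService(Protocol):
    """Interface only: clients never see how the search is implemented."""

    def find(self, keyword: str) -> list[Product]: ...


class InMemoryDiscovery:
    """One possible implementation, hidden behind the interface."""

    def __init__(self, products: list[Product]) -> None:
        self._products = products

    def find(self, keyword: str) -> list[Product]:
        # Match on the product identifier; a real service could query a catalog.
        return [p for p in self._products if keyword in p.product_id]


def client_a(service: DiscoveryService) -> list[str]:
    """A client that depends on the interface, not on a concrete service."""
    return [p.location for p in service.find("ocean")]


catalog = InMemoryDiscovery([
    Product("ocean-winds-001", "node-a/ocean-winds-001"),
    Product("mars-image-042", "node-b/mars-image-042"),
])
locations = client_a(catalog)
```

Swapping InMemoryDiscovery for a remote or federated implementation would not change client_a, which is the loose coupling the slide describes.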
What does this have to do with open source?
• In general, most NASA projects use open source software to some extent. However, …
• Core software product lines and tools that can be reused are excellent opportunities to create open source projects
  • Across a federation of organizations, systems and users, what can be developed and shared?
  • How can software components be developed in generic ways, but allow for extensions?
• Open source itself is a strategy
  • Can improve collaborations
  • Can drive a robust set of reusable software components and tools
  • Can push standards development
  • Can encourage use of common architectural patterns
Open Source Models
• Software sharing with an open source license (e.g., BSD-style license)
• Software distribution through open source organizations (e.g., SourceForge)
• Software projects under the governance of an open source community/foundation (e.g., Apache Software Foundation)
• Ad hoc open source project communities with their own governance
Open Source Models: My Opinion
• Software sharing with an open source license (e.g., BSD-style license)
  • It’s a great start
  • Limited community involvement
• Software distribution through open source organizations (e.g., SourceForge)
  • Provides good software distribution support
• Software projects under the governance of an open source community/foundation (e.g., Apache Software Foundation)
  • This moves from just distribution support to collaboration and governance over the development
• Ad hoc open source project communities with their own governance
  • This can make a lot of sense for larger federations…
The Apache Software Foundation
• Largest open source software development entity in the world
  • 2300+ committers
  • 3500+ contributors
• 84 Top Level Projects
  • 36 Incubating
  • 30 Lab Projects
  • 8 retired projects in the “Attic”
• Over 1.2 million revisions
• Over 10M successful requests served a day across the world
• HTTPD web server used on 100+ million web sites (52+% of the market)
The Apache Software Foundation Model
• Apache is interesting because of the structure in which it runs projects
• Projects are not owned by one group (committers, reviewers, etc.)
• Projects get substantial support (issue tracking, mailing lists, release management, etc.)
• Projects get substantial review (level of maturity, rules and governance, etc.)
Apache Maturity Model
• Start out with Incubation
  • Grow community
  • Make releases
  • Gain interest
  • Diversify
• When the project is ready, graduate into
  • Top-Level Project (TLP)
  • Sub-project of TLP
• Increasingly, sub-projects are discouraged compared to TLPs
Apache Organization
• Apache is a meritocracy
  • You earn your keep and your credentials
• Start out as Contributor
  • Patches, mailing list comments, etc.
  • No commit access
• Move on to Committer
  • Commit access, evolve the code
• PMC Members
  • Have binding VOTEs on releases/personnel
• Officer (VP, Project)
  • PMC Chair
• ASF Member
  • Have binding VOTE on the state of the foundation
  • Elect Board of Directors
• Director
  • Oversight of projects, foundation activities
SourceForge (a different model)
• Project Proposal
  • Accepted? Get going!
• No foundation-wide oversight
  • Tons of dormant projects with no communities of interest
• Goal is to host infrastructure and host technologies
  • Goal is not to build communities
• No foundation-wide rules or guidelines for committership or for project management
  • Dealt with locally by the progenitor of the project
  • Can lead to BDFL (benevolent dictator for life) syndrome
• No foundation-wide license requirements
  • BSD, GPL (v2, v3), MIT, LGPL, etc. all allowed
OODT: An Open Source Framework for Building Distributed Science Data Mgmt Environments
• Focus on
  • distributed environments
  • science data generation
  • data capture, end-to-end
  • access to science data by the community
• A set of building blocks/services to exploit common system patterns for reuse
• Entered the Apache incubator program in January 2010
• Used for a number of science data system activities
http://incubator.apache.org/oodt/
e-Science Examples and OODT
• Planetary Science Data System
  • Highly diverse (40 years of science data from NASA and Int’l missions)
  • Geographically distributed; moving int’l
  • New centers plugging in (i.e., data nodes)
  • Multi-center data system infrastructure
  • Heterogeneous nodes with common interfaces
  • Integrated based on enterprise-wide data standards
  • Sits on top of COTS-based middleware
• EDRN Cancer Research
  • Highly diverse (30+ centers performing parallel studies using different instruments)
  • Geographically distributed
  • New centers plugging in (i.e., data nodes)
  • Multi-center data system infrastructure
  • Heterogeneous sites with common interfaces allowing access to distributed portals
  • Integrated based on common data standards
  • Secure (e.g., encryption, authentication, authorization)
Reuse as an SDS for Missions
• Leveraged OODT software framework for constructing ground data systems for earth science missions
  • Used OODT Catalog and Archive Service software
  • Focus is on “process management”
  • Constructed “workflows”
  • Execution of “processors” based on a set of rules
  • Explicit separation of workflow management from management of computational resources
  • Provided “lights out” operations
• Multiple Missions
  • SeaWinds
  • QuikSCAT
  • Orbiting Carbon Observatory (OCO), OCO-2…
  • NP Sounder PEATE
  • SMAP
[Image: SeaWinds on ADEOS II (Launched Dec 2002)]
Credit: D. Freeborn, C. Mattmann, D. Woollard
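The idea of running “processors” from a set of rules, while keeping workflow management separate from the management of computational resources, can be sketched as follows. This is an illustrative Python sketch only: Rule, WorkflowManager, LocalResourceManager, and the metadata keys are hypothetical names, not the OODT Catalog and Archive Service API.

```python
from dataclasses import dataclass
from typing import Callable

Metadata = dict[str, str]


@dataclass
class Rule:
    """Pairs a condition on product metadata with the processor to run."""
    condition: Callable[[Metadata], bool]
    processor: Callable[[Metadata], str]


class LocalResourceManager:
    """Decides where a job runs; this stand-in simply runs it in-process."""

    def run(self, job: Callable[[], str]) -> str:
        return job()


class WorkflowManager:
    """Decides what to run; delegates how and where to a resource manager."""

    def __init__(self, rules: list[Rule], resources: LocalResourceManager) -> None:
        self.rules = rules
        self.resources = resources

    def handle(self, metadata: Metadata) -> list[str]:
        results = []
        for rule in self.rules:
            if rule.condition(metadata):
                # Bind the rule now so the job does not depend on the loop variable.
                job = lambda r=rule: r.processor(metadata)
                results.append(self.resources.run(job))
        return results


rules = [
    Rule(lambda m: m["level"] == "0", lambda m: f"level-1 product from {m['name']}"),
    Rule(lambda m: m["level"] == "1", lambda m: f"level-2 product from {m['name']}"),
]
manager = WorkflowManager(rules, LocalResourceManager())
outputs = manager.handle({"name": "granule-A", "level": "0"})
```

Because WorkflowManager only calls run, the in-process resource manager could be replaced by one that submits jobs to a cluster without touching the rules.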
Reuse as an SDS for Airborne
[Diagram labels: Users & Science Community; Modeling & Visualization; Spacecraft & Other Data Sources; Airborne Data; Ground Sensors]
A full service stack is deployed for each mission, utilizing any mission-provided resources as well as a cloud computing infrastructure. Mission proprietary data is presented in a mission-specific secure portal while publicly available data is aggregated in the public portal.
Credit: D. Freeborn, D. Woollard, E. Law, D. Crichton, L. Kay-Im
OODT in Climate… Climate Data Exchange
[Diagram label: Specific Tools (H2O, CO2, …)]
Credit: A. Braverman, C. Mattmann, D. Crichton, L. Cinquini, M. Cayanan
OODT in Cancer Research: Early Detection Research Network
• EDRN has pioneered the use of informatics technologies to support biomarker research
  • Both in capture and access to data
  • It has developed and successfully deployed its vision for a knowledge environment
• EDRN has developed a comprehensive infrastructure to support biomarker data management across EDRN’s distributed cancer centers
  • Collaborative, Distributed
  • Informatics elements span multiple organizations
  • Model for biomarker discovery and validation
• It supports capture and access to a diverse collection of distributed sets of information and results based on a core ontology for biomarker research
  • Biomarkers
  • Biospecimens
  • Scientific Data Sets
  • Protocols
[Diagram label: From: distributed research databases]
Credit: D. Crichton, C. Mattmann, S. Kelly, A. Hart, H. Kincaid, S. Hughes
Why Open Source Federations to Enable Research?
• Building open source communities around product lines can be useful, but there is also a need to “rethink” how large federations work together
• As mentioned, collaboration is an increasingly important theme in science
  • Access to data in various systems
  • Sharing of services, etc.
  • Common ways to represent information
• There is also a need for federations to work with other federations
  • Planetary -> International Planetary Data Alliance
  • Earth -> Improve access to models and observations
  • Cancer -> Gov’t and non-profit enterprises being brought together (e.g., Canary Foundation and NIH)
Earth System Grid Federation
• DOE-funded federation to distribute climate model output to the climate modeling community
• Common services for access to repositories and portals/gateways
• Highly decoupled
• Open source framework (software packaged and distributed) mandated by DOE SciDAC Program
• A recent question… how do you link federations?
Implications of Open Source Federations
• Cross-organizational teams are critical
• A common architectural strategy is key
• Monopolistic attitudes don’t work
• Need to have a governance structure and adequate infrastructure to support software development
• Need to address the IP/policy issues
Lessons Learned
• Reference architectures can’t be overly prescriptive
• Software product lines are useful for driving reuse strategies
  • Product lines need to be defined abstractly; mission-specific requirements need to be extensions
  • These make great open source projects
• Get involved in open source communities
  • Our experience in working with Apache has been excellent
• The IP/legal issues need to be worked
• Work with the mission, science analysis, and other teams as early as possible to make the design trades
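One way to read “defined abstractly, with mission-specific requirements as extensions” is an abstract component with a single extension point. The sketch below is a hedged illustration in Python; Ingester, ExampleMissionIngester, and the filename convention are assumptions made up for this example, not part of any real product line.

```python
from abc import ABC, abstractmethod


class Ingester(ABC):
    """Generic product-line component: the ingest sequence is fixed here."""

    def ingest(self, filename: str) -> dict[str, str]:
        # Common behavior shared by every mission.
        metadata = {"filename": filename}
        # Mission-specific behavior is supplied by the subclass.
        metadata.update(self.extract_metadata(filename))
        return metadata

    @abstractmethod
    def extract_metadata(self, filename: str) -> dict[str, str]:
        """Extension point for mission-specific requirements."""


class ExampleMissionIngester(Ingester):
    """Hypothetical mission extension; assumes names like 'example_00042.dat'."""

    def extract_metadata(self, filename: str) -> dict[str, str]:
        orbit = filename.split("_")[1].split(".")[0]
        return {"mission": "example", "orbit": orbit}


record = ExampleMissionIngester().ingest("example_00042.dat")
```

The generic class stays free of mission details, so it is the part that could plausibly be shared as an open source component while each mission keeps its own subclass.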
A perspective to leave you with…
• I believe NASA is at a point where a NASA earth science federation, based on an open source/collaborative model, might be very attractive for NASA’s next generation earth science data system enterprise for the following reasons:
  • Science benefits: can drive a growing enterprise of shared science services and software infrastructure support
  • Technology benefits: can drive innovation through its peer review and collaboration process
  • Infusion benefits: creates a defined process for contributing new ideas and capabilities
  • Architecture benefits: helps you build towards a common architectural vision and drive community standards
  • Cost benefits: can enable better leveraging and reuse of skills and capabilities across institutions
  • Tech transfer benefits: may benefit other science (and non-science) disciplines
Questions? Thank You!!!
Dan Crichton
Dan.Crichton@jpl.nasa.gov
(818) 354-9155
Note… we have several papers and book chapters on data intensive systems, etc. that we’d be happy to share! A few key ones…
• D. Crichton, C. Mattmann, J. S. Hughes, S. Kelly, and A. Hart. “A Multi-Disciplinary, Model-Driven, Distributed Science Data System Architecture.” Guide to e-Science: Next Generation Scientific Research and Discovery. X. Yang, L. L. Wang, W. Jie, eds. Springer Verlag, 2010. To appear.
• D. Crichton, S. Kelly, C. Mattmann, Q. Xiao, J. S. Hughes, J. Oh, M. Thornquist, D. Johnsey, S. Srivastava, L. Esserman, and B. Bigbee. “A Distributed Information Services Architecture to Support Biomarker Discovery in Early Detection of Cancer.” Accepted for publication at the 2nd IEEE International Conference on e-Science and Grid Computing, Amsterdam, the Netherlands, December 4th-6th, 2006.
• C. Mattmann, D. Crichton, N. Medvidovic, and S. Hughes. “A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications.” In Proceedings of the 28th International Conference on Software Engineering (ICSE06), pp. 721-730, Shanghai, China, May 20th-28th, 2006.