
The Apache OODT Ecosystem: A Birds Eye View


Presentation Transcript


  1. The Apache OODT Ecosystem: A Birds Eye View Chris A. Mattmann Senior Computer Scientist, NASA Jet Propulsion Laboratory Adjunct Assistant Professor, Univ. of Southern California Member, Apache Software Foundation

  2. And you are? • Apache Member involved in • OODT (VP, PMC), Tika (VP,PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata(Mentor) • Senior Computer Scientist at NASA JPL in Pasadena, CA USA • Software Architecture/Engineering Prof at Univ. of Southern California NCAR-SEA-2012

  3. Agenda • Overview of OODT and its history • How we got it to Apache • How other projects can follow our model • Existing successful deployments of OODT • Pointers to papers, and more information including case studies

  4. Lessons from 90’s era missions • Increasing data volumes (exponential growth) • Increasing complexity of instruments and algorithms • Increasing availability of proxy/sim/ancillary data • Increasing rate of technology refresh • … all of this while NASA Earth Mission funding was decreasing A data system framework based on a standard architecture and reusable software components for supporting all future missions.

  5. Enter OODT • Object Oriented Data Technology, http://oodt.apache.org • Funded initially in 1998 by NASA’s Office of Space Science • Envisaged as a national software framework for sharing data across heterogeneous, distributed data repositories • OODT is both an architecture and a reference implementation providing • Data Production • Data Discovery • Data Distribution • Data Access • OODT is Open Source and available from the Apache Software Foundation

  6. Apache OODT • Entered “incubation” at the Apache Software Foundation in 2010 • Selected as a top level Apache Software Foundation project in January 2011 • Developed by a community of participants from many companies, universities, and organizations • Used for a diverse set of science data system activities in planetary science, earth science, radio astronomy, biomedicine, astrophysics, and more • OODT Development & user community includes: http://oodt.apache.org

  7. Apache OODT Press

  8. Why Apache and OODT? • OODT is meant to be a set of tools to help build data systems • It’s not meant to be “turn key” • It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science • Each discipline/project extends • Apache is the elite open source community for software developers • Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop) • Differs from other open source communities; it provides a governance and management structure

  9. Governance Model+NASA=♥ • NASA and other government agencies have tons of process • They like that

  10. Publicly accessible and searchable archives • http://svnsearch.org/svnsearch/repos/ASF/search?path=%2Foodt • http://mail-archives.apache.org/mod_mbox/oodt-dev/ • http://mail-archives.apache.org/mod_mbox/oodt-user/ • 100+ ML list subscriptions

  11. Great Metrics and Insight • http://www.ohloh.net/p/oodt

  12. Movement to the ASF • Meeting held June 15, 2007 at JPL with ASF President Justin Erenkrantz • Develop plan moving forward to bring first NASA project to Apache • Discuss obstacles, sponsorship • Discuss outlook

  13. 2007: original goals • Come up with incubation proposal • Chris Mattmann was one of the principal contributors to the proposal for the Tika project, and to other Incubation activities (Apache SIS) • Send out emails to the Incubator mailing list • Look for mentors • Get sponsorship from ranking Apache PMC member or board member • Justin and others • Top-level project versus sub project outlook heading out of incubation

  14. OODT Incubator Planning • Monthly Updates (for first 3 months, then quarterly) • Status • Progress • Community • Acceptance • Plan for exiting incubation • How to have a solid user base • How to operate as a unit in the Apache way • Maintenance of user interest and community going forward

  15. OODT’s next steps circa 2007 • JPL to tackle legal issues • Is OODT releasable as an Apache product? • http://www.apache.org/licenses/software-grant.txt • This needs to be signed by the appropriate parties at JPL • Contributor License Agreement • Do we need a corporate one? • In parallel to this • Draft OODT incubation proposal • Start identifying who would initially be interested • The more external, non-JPL people who are interested, the better • Justin to get slides from other incubator people

  16. …2 years later • Worked it out with JPL legal • Turns out the ALv2 license is extremely friendly and is something that JPL (note not all of NASA) was amenable to • Developed OODT incubator proposal • http://wiki.apache.org/incubator/OODTProposal • Found willing Apache mentors besides Justin • Jean-Frederic Clere, Ross Gardler, Ian Holsman • …Put OODT at Apache!

  17. Apache OODT Community • Includes PMC members from • NASA JPL, Univ. of Southern California, Google, Children’s Hospital Los Angeles (CHLA), Vdio, South African SKA Project • Projects that are deploying it operationally at • Decadal-survey recommended NASA Earth science missions, NIH, and NCI, CHLA, USC, South African SKA project • Use in the classroom • My graduate-level software architecture and search engines courses

  18. OODT Framework and PCS • [Architecture diagram. Labels: OODT/Science Web Tools; Archive Client; Navigation Service; Object Oriented Data Technology Framework; Catalog & Archive Service (CAS); Process Control System (PCS); Bridge to External Services; Archive Service; Profile Service; Product Service; Query Service; Other Service 1; Other Service 2; Profile XML Data; Data System 1; Data System 2] • CAS has recently become known as Process Control System when applied to mission work.

  19. Current PCS deployments • Orbiting Carbon Observatory (OCO-2) - spectrometer instrument • NASA ESSP Mission, launch date: TBD 2013 • PCS supporting Thermal Vacuum Tests, Ground-based instrument data processing, Space-based instrument data processing and Science Computing Facility • EOM Data Volume: 61-81 TB in 3 yrs Processing Throughput: 200-300 jobs/day • NPP Sounder PEATE - infrared sounder • Joint NASA/NPOESS mission, launch date: October 2011 • PCS supporting Science Computing Facility (PEATE) • EOM Data Volume: 600 TB in 5 yrs Processing Throughput: 600 jobs/day • QuikSCAT - scatterometer • NASA Quick-Recovery Mission, launch date: June 1999 • PCS supporting instrument data processing and science analyst sandbox • Originally planned as a 2-year mission • SMAP - high-res radar and radiometer • NASA decadal study mission, launch date: 2014 • PCS supporting radar instrument and science algorithm development testbed

  20. Other PCS applications • Astronomy and Radio • Prototype work on MeerKAT with South Africans and KAT-7 telescope • Discussions ongoing with NRAO Socorro (EVLA and ALMA) • Bioinformatics • National Institutes of Health (NIH) National Cancer Institute’s (NCI) Early Detection Research Network (EDRN) • Children’s Hospital LA Virtual Pediatric Intensive Care Unit (VPICU) • Earth Science • NASA’s Virtual Oceanographic Data Center (VODC) • JPL’s Climate Data eXchange (CDX) • Technology Demonstration • JPL’s Active Mirror Telescope (AMT) • White Sands Missile Range

  21. PCS Core Components • All Core components implemented as web services • XML-RPC used to communicate between components • Servers implemented in Java • Clients implemented in Java, scripts, Python, PHP and web-apps • Service configuration implemented in ASCII and XML files
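Because the components communicate over XML-RPC and clients may be written in Python, the interaction pattern can be sketched with Python's standard library alone. The server below is a throwaway stand-in, not an OODT service: the method name `is_alive`, the helper `start_stub_service`, and the localhost binding are all assumptions made for illustration.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def start_stub_service():
    """Start a throwaway XML-RPC server on a free local port (hypothetical stand-in)."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    # Hypothetical method name; real PCS services define their own methods.
    server.register_function(lambda: True, "is_alive")
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


def ping(url):
    """Call the stub's is_alive method over XML-RPC and return the result."""
    with ServerProxy(url) as proxy:
        return proxy.is_alive()


server = start_stub_service()
host, port = server.server_address
result = ping(f"http://{host}:{port}/")
server.shutdown()
server.server_close()
print(result)
```

A client for a real deployment would point `ServerProxy` at the service's own URL and call whatever methods that service actually exposes.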

  22. Core Capabilities • File Manager does Data Management • Tracks all of the stored data, files & metadata • Moves data to appropriate locations before and after initiating PGE runs and from staging area to controlled access storage • Workflow Manager does Pipeline Processing • Automates processing when all run conditions are ready • Monitors and logs processing status • Resource Manager does Resource Management • Allocates processing jobs to computing resources • Monitors and logs job & resource status • Copies output data to storage locations where space is available • Provides the means to monitor resource usage

  23. PCS Ingestion Use Case

  24. File/Metadata Capabilities

  25. PCS Processing Use Case

  26. Advanced Workflow Monitoring

  27. Resource Monitoring

  28. How do we deploy PCS for a mission? • We implement the following mission-specific customizations • Server Configuration: implemented in ASCII properties files • Product metadata specification: implemented in XML policy files • Processing Rules: implemented as Java classes and/or XML policy files • PGE Configuration: implemented in XML policy files • Compute Node Usage Policies: implemented in XML policy files • Here’s what we don’t change • All PCS Servers (e.g. File Manager, Workflow Manager, Resource Manager) • Core data management, pipeline process management and job scheduling/submission capabilities • File Catalog schema • Workflow Model Repository Schema

  29. Server and PGE Configuration

  30. What is the Level of Effort for personalizing PCS? • PCS Server Configuration – “days” • Deployment specific • Addition of New File (Product) Type – “days” • Product metadata specification • Metadata extraction (if applicable) • Ingest Policy specification (if remote pull or remote push) • Addition of a New PGE – (initial integration, ~ weeks) • Policy specification • Production rules • PGE Initiation • Estimates based on OCO and NPP experience

  31. A typical PCS service (e.g., fm, wm, rm)

  32. What’s PCS configuration? • Configuration follows typical Apache-like server configuration • A set of properties and flags that are set in an ASCII text file that initialize the service at runtime • Properties configure • The underlying subsystems of the PCS service • For file manager, properties configure e.g., • Data transfer chunk size • Whether or not the catalog database should use quoted strings for columns • What subsystems are actually chosen (e.g., database versus Lucene, remote versus local data transfer) • Can we see an example?
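To make the idea of a properties-driven service concrete, here is a minimal sketch that parses a small properties text and reads the three kinds of setting named on the slide. The key names (`example.datatransfer.chunkSize` and so on) are hypothetical and chosen only to mirror the bullets; they are not claimed to match the real File Manager property names.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Hypothetical keys, chosen only to mirror the bullets on the slide.
EXAMPLE = """
# data transfer chunk size, in bytes
example.datatransfer.chunkSize = 1024
# should the catalog database quote column names?
example.catalog.quoteFields = false
# which catalog subsystem to use: database or lucene
example.catalog.backend = lucene
"""

config = parse_properties(EXAMPLE)
chunk_size = int(config["example.datatransfer.chunkSize"])
quote_fields = config["example.catalog.quoteFields"] == "true"
print(chunk_size, quote_fields, config["example.catalog.backend"])
```

The point of the pattern is that switching a subsystem (database versus Lucene, say) is a one-line change in a text file rather than a code change.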

  33. The concept of “production rules” • Production rules are common terminology to refer to the identification of the mission specific variation points in • PGE pipeline processing • Product cataloging and archiving • So far, we’ve discussed • Configuration • Policy • Policy is one piece of the puzzle in production rules

  34. Production rule areas of concern • Policy defining file ingestion • (1) What metadata should PCS capture per product? • (2) Where do product files go? • Policy defining PGE data flow and control flow • (3) PGE pre-conditions • (4) File staging rules • (5) Queries to the PCS file manager service • Items 1-5 are implemented in PCS (depending on complexity) as either: • Java Code • XML files • Some combination of Java code and XML files
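As an illustration of what a PGE pre-condition might check, the sketch below treats a task as ready only when every required product type has at least one cataloged file. The in-memory `catalog` dictionary, the product type names, and both function names are hypothetical; in PCS such rules are written as Java code and/or XML policy, and the lookup would be a query to the file manager.

```python
def missing_inputs(required_types, catalog):
    """Return the required product types that have no cataloged files."""
    return [t for t in required_types if not catalog.get(t)]


def ready_to_run(required_types, catalog):
    """A pre-condition: every required product type has at least one file."""
    return not missing_inputs(required_types, catalog)


# Hypothetical catalog contents: product type -> list of file names.
catalog = {"RawData": ["raw_001.dat"], "Calibration": []}

print(ready_to_run(["RawData"], catalog))
print(missing_inputs(["RawData", "Calibration"], catalog))
```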

  35. PCS Task Wrapper aka CAS-PGE • Gathers information from the file manager • Files to stage • Input metadata (time ranges, flags, etc.) • Builds input file(s) for the PGE • Executes the PGE • Invokes PCS crawler to ingest output product and metadata • Notifies Workflow and Resource Managers about task (job) status • Can optionally • Generate PCS metadata files
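The wrapper steps listed above can be sketched as a small Python function. This is not CAS-PGE itself: `run_wrapped_task`, `toy_pge`, the `pge_input.txt` file name, and the `*.out` naming convention are all assumptions for illustration, and the crawler and manager notifications are reduced to a returned product list and status string.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_wrapped_task(input_metadata, command_for, work_dir):
    """Build an input file, execute the PGE command, and collect its outputs."""
    work_dir = Path(work_dir)
    # Step 1: build the PGE input file from the gathered metadata.
    input_file = work_dir / "pge_input.txt"
    input_file.write_text(
        "\n".join(f"{k}={v}" for k, v in sorted(input_metadata.items()))
    )
    # Step 2: execute the PGE as an external process.
    completed = subprocess.run(command_for(input_file, work_dir), check=False)
    status = "SUCCESS" if completed.returncode == 0 else "FAILURE"
    # Step 3: list the output products a crawler would then ingest.
    products = sorted(p.name for p in work_dir.glob("*.out"))
    # Step 4: return the status; a real wrapper would notify the managers here.
    return status, products


def toy_pge(input_file, work_dir):
    """Hypothetical PGE: a Python one-liner that copies its input to result.out."""
    script = "import sys, shutil; shutil.copy(sys.argv[1], sys.argv[2])"
    return [sys.executable, "-c", script, str(input_file), str(work_dir / "result.out")]


with tempfile.TemporaryDirectory() as tmp:
    status, products = run_wrapped_task(
        {"start": "2011-05-01", "flag": "cal"}, toy_pge, tmp
    )
print(status, products)
```

Keeping the science executable behind a `command_for` callable mirrors the design intent of the slide: the wrapper stays the same while each mission plugs in its own PGE.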

  36. Some relevant experience with NRAO: EVLA prototype • Explore JPL data system expertise • Leverage Apache OODT • Leverage architecture experience • Build on NRAO Socorro F2F given in April 2011 and Innovations in Data-Intensive Astronomy meeting in May 2011 • Define achievable prototype • Focus on EVLA summer school pipeline • Heavy focus on CASApy, simple pipelining, metadata extraction, archiving of directory-based products • Ideal for OODT system

  37. Architecture

  38. Pre-Requisites • Apache OODT • Version: 0.3 • JDK6, Maven2.2.1 • Stock Linux box

  39. Installed Services • File Manager • http://ska-dc.jpl.nasa.gov:9000 • Crawler • http://ska-dc.jpl.nasa.gov:9020 • Tomcat5 • Curator: http://ska-dc.jpl.nasa.gov:8080/curator/ • Browser: http://ska-dc.jpl.nasa.gov/ • PCS Services: http://ska-dc.jpl.nasa.gov:8080/pcs/services/ • CAS Product Services: http://ska-dc.jpl.nasa.gov:8080/fmprod/ • Workflow Monitor: http://ska-dc.jpl.nasa.gov:8080/wmonitor/ • Met Extractors • /usr/local/ska-dc/pge/extractors (Cube, Cal Tables) • PCS package • /usr/local/ska-dc/pcs (scripts dir contains pcs_stat, pcs_trace, etc.)

  40. Demonstration Use Case • Run EVLA Spectral Line Cube generation • First step is to ingest EVLARawDataOutput from Joe • Then fire off evlascube event • Workflow manager writes CASApy script dynamically • Via CAS-PGE • CAS-PGE starts CASApy • CASApy generates Cal tables and 2 Spectral Line Cube Images • CAS-PGE ingests them into the File Manager • Gravy: UIs, Cmd Line Tools, Services

  41. Results: Workflow Monitor

  42. Results: Data Portal

  43. Results: Prod Browser

  44. Results: PCS Trace Cmd Line

  45. Results: PCS Stat Cmd Line

  46. Results: PCS REST Services: Trace curl http://host/pcs/services/pedigree/report/flux_redo.cal

  47. Results: PCS REST Service: Health curl http://host/pcs/services/health/report Read up on https://issues.apache.org/jira/browse/OODT-139 Read documentation on PCS services: https://cwiki.apache.org/confluence/display/OODT/OODT+REST+Services
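For scripted access, the two `curl` calls on the trace and health slides can be mirrored by building the same URLs in Python. The paths come from the slides; the helper names are hypothetical, and `http://host` is the slides' placeholder rather than a real server. The response format is not shown on the slides, so this sketch stops at URL construction.

```python
from urllib.parse import quote


def pedigree_report_url(base, product_name):
    """Build the trace (pedigree) report URL for one product."""
    return f"{base.rstrip('/')}/pcs/services/pedigree/report/{quote(product_name)}"


def health_report_url(base):
    """Build the health report URL."""
    return f"{base.rstrip('/')}/pcs/services/health/report"


# "http://host" is the placeholder used on the slides, not a real server.
print(pedigree_report_url("http://host", "flux_redo.cal"))
print(health_report_url("http://host/"))
```

Against a live deployment, either URL could be passed to `urllib.request.urlopen` in place of `curl`.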

  48. Results: RSS feed of prods

  49. Results: RDF of products

  50. Who’s doing what? • Children’s Hospital Los Angeles • Improving upon XMLPS, and CAS (Andrew Hart + Ricky Nguyen will talk about this) • Supporting data analytics • Google • Brian Foster working on command line improvements and data protocol push/pull • SKA South Africa • Deploying file manager and crawler for use in KAT-7 pipeline ingestion • NIH/NCI • Maintaining the XMLPS components, and CAS components • Helping with user interfaces • Various JPL and NASA research projects • OPeNDAPps, XMLPS • Various NASA missions • Workflow, PCS, services, OPSui, other web apps
