Introduction to Apache OODT Yang Li Mar 9, 2012
What is OODT • Object Oriented Data Technology • Science data management • Archiving Systems that span scientific disciplines • Enable interoperability among data-agnostic systems (astrophysics, planetary, space science data systems, open source web analytics)
History • 2001: deployed to build a virtual specimen bank for the Early Detection Research Network (oncology) • 2004: core architectural software of the Planetary Data System data distribution, deployed by NASA (planetary science) • 2007: deployed for the Orbiting Carbon Observatory and SeaWinds missions (earth science) • 2008: deployed for the National Polar-orbiting Operational Environmental Satellite System (atmospheric science)
Framework • Catalog & Archive • Utilities • Grid • Agility
Catalog & Archive • Deal with large-scale ingest of data, metadata extraction of data, post-processing of data into derived and higher-order products, cataloging of data, searching of catalogs, versioning, and retrieval • Components: • Catalog, Crawling framework, Curation, File manager, Metadata, PCS, Push/Pull framework, Resource management, Workflow, CAS install, Web apps
Catalog • Virtualize underlying catalogs for use in the CAS system • Heterogeneous catalog models are mapped to a common dictionary and then integrated locally, so that queries and ingestion can span all of them
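The mapping idea on this slide can be illustrated with a minimal sketch. This is not OODT code: the catalog names, field names, and the functions `to_common` and `query_across` are all hypothetical, chosen only to show how two differently shaped catalogs might be queried through one shared vocabulary.

```python
# Hypothetical per-catalog field mappings onto a shared dictionary.
# None of these names come from OODT itself.
COMMON_DICTIONARY = {
    "catalog_a": {"fname": "FileName", "obs_date": "ObservationDate"},
    "catalog_b": {"file": "FileName", "date": "ObservationDate"},
}


def to_common(catalog_name, record):
    """Rename a catalog-specific record's keys to the common dictionary terms."""
    mapping = COMMON_DICTIONARY[catalog_name]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def query_across(catalogs, term, value):
    """Return every record, from any catalog, whose common term equals value."""
    hits = []
    for name, records in catalogs.items():
        for record in records:
            common = to_common(name, record)
            if common.get(term) == value:
                hits.append(common)
    return hits


catalogs = {
    "catalog_a": [{"fname": "a.dat", "obs_date": "2012-03-09"}],
    "catalog_b": [
        {"file": "b.dat", "date": "2012-03-09"},
        {"file": "c.dat", "date": "2011-01-01"},
    ],
}
results = query_across(catalogs, "ObservationDate", "2012-03-09")
```

The caller only ever uses the common terms (`FileName`, `ObservationDate`); the per-catalog field names stay hidden behind the mapping.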
CAS Crawler • Standardize the common ingestion activities • identification of files and directories to crawl • satisfaction of ingestion pre-conditions • metadata extraction • ingestion
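The four crawler activities listed above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the CAS Crawler API: the function names, the "file must be non-empty" pre-condition, and the dictionary used as a catalog are all made up for the example.

```python
import os
import tempfile


def crawl(root):
    """Step 1: identify the files under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def preconditions_met(path):
    """Step 2: a hypothetical pre-condition, here 'the file is not empty'."""
    return os.path.getsize(path) > 0


def extract_metadata(path):
    """Step 3: derive a few metadata fields from the file itself."""
    return {
        "FileName": [os.path.basename(path)],
        "FileSize": [str(os.path.getsize(path))],
    }


def ingest(root, catalog):
    """Step 4: record metadata for every file that passes its pre-conditions."""
    for path in crawl(root):
        if preconditions_met(path):
            catalog[path] = extract_metadata(path)
    return catalog


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as staging:
    with open(os.path.join(staging, "full.dat"), "w") as handle:
        handle.write("abc")
    open(os.path.join(staging, "empty.dat"), "w").close()
    demo_catalog = ingest(staging, {})
    ingested = sorted(meta["FileName"][0] for meta in demo_catalog.values())
```

Keeping each activity in its own function is what makes the steps swappable: a different pre-condition or extractor can be dropped in without touching the crawl loop.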
Curation • A web application for managing policy for products and files and metadata that have been ingested via the CAS component • Use a servlet container to deploy the web app • Staging area • Directories on local machine holding data products • Metadata generation area • Create metadata files to associate with data products
File Manager • Provide everything needed to catalog, archive, and manage files and directories and their associated metadata • Separate data stores from metadata stores, each behind a standard interface
Workflow • Provide everything needed to execute workflows and science processing pipelines • Separate workflow repositories from workflow engines, each behind a standard interface
Resource Management • Job management • Execution, monitoring, tracking • Underlying software system and hardware resources • e.g. disk space, computational resources, and shared identity
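The separation of data stores from metadata stores can be sketched as two independent interfaces that a file manager composes. The class and method names below are hypothetical and are not taken from the OODT File Manager; in-memory dictionaries stand in for real archives and catalogs.

```python
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Interface for wherever the product bytes live (hypothetical)."""

    @abstractmethod
    def put(self, product_id, content): ...

    @abstractmethod
    def get(self, product_id): ...


class MetadataStore(ABC):
    """Interface for wherever the product metadata lives (hypothetical)."""

    @abstractmethod
    def put(self, product_id, metadata): ...

    @abstractmethod
    def find(self, key, value): ...


class InMemoryDataStore(DataStore):
    def __init__(self):
        self._blobs = {}

    def put(self, product_id, content):
        self._blobs[product_id] = content

    def get(self, product_id):
        return self._blobs[product_id]


class InMemoryMetadataStore(MetadataStore):
    def __init__(self):
        self._records = {}

    def put(self, product_id, metadata):
        self._records[product_id] = metadata

    def find(self, key, value):
        # Metadata is multi-valued: each key maps to a list of strings.
        return [pid for pid, meta in self._records.items()
                if value in meta.get(key, [])]


class FileManager:
    """Composes one data store and one metadata store."""

    def __init__(self, data_store, metadata_store):
        self.data_store = data_store
        self.metadata_store = metadata_store

    def ingest(self, product_id, content, metadata):
        self.data_store.put(product_id, content)
        self.metadata_store.put(product_id, metadata)

    def search(self, key, value):
        return [(pid, self.data_store.get(pid))
                for pid in self.metadata_store.find(key, value)]


manager = FileManager(InMemoryDataStore(), InMemoryMetadataStore())
manager.ingest("p1", b"raw bytes", {"Instrument": ["OCO"]})
manager.ingest("p2", b"other bytes", {"Instrument": ["SeaWinds"]})
found = manager.search("Instrument", "OCO")
```

Because `FileManager` depends only on the two interfaces, either store could be replaced (for example, a disk-backed data store with a database-backed metadata store) without changing the manager.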
Resource Management (Cont) • Critical objects • Job, Job Input, Job Spec, Job Instance, Resource Node
Metadata • A multi-valued, generic Metadata container class • Internal map of string keys pointing to vectors of strings • [std::string key] ⇒ std::vector of std::strings
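The container shape described here, string keys that each point to a list of strings, can be sketched in a few lines. This is a minimal illustration, not the OODT Metadata class; the method names `add`, `get_all`, and `get_first` are assumptions made for the example.

```python
from collections import defaultdict


class Metadata:
    """A multi-valued container: each string key maps to a list of strings."""

    def __init__(self):
        self._map = defaultdict(list)

    def add(self, key, value):
        """Append a value under key, keeping any values already stored."""
        self._map[key].append(str(value))

    def get_all(self, key):
        """Return a copy of every value stored under key (empty if absent)."""
        return list(self._map.get(key, []))

    def get_first(self, key):
        """Return the first value stored under key, or None if absent."""
        values = self._map.get(key)
        return values[0] if values else None


meta = Metadata()
meta.add("Mission", "OCO")
meta.add("Mission", "SeaWinds")
meta.add("Year", 2007)
```

Storing every value as a string keeps the container generic; interpreting a value as a date or a number is left to whoever reads it.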
Framework • Catalog & Archive • Common Utilities • Grid • Agility
Common Utilities • Provide needed support for catalogs, archives, and grids • Query Expression • Platform neutral and extensible way of posing questions • Single Sign On • Commons • Lots of miscellaneous utilities, including I/O streams, logging, XML, and more
Query Expression • Provide a way to express queries in a generic manner • Use boolean postfix expressions to capture the domain, range, and constraint of a query, regardless of the source of the query • Encapsulate the results of a query • Provide a standard way to pass a query and its results between servers, clients, nodes, and other components
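To make the postfix idea concrete, here is a minimal sketch of evaluating a boolean postfix expression against one multi-valued metadata record using a stack. The token layout (`("elem", ...)`, `("literal", ...)`, `("op", ...)`) and the operator names `EQ`, `AND`, `OR`, `NOT` are assumptions for illustration; they are not the actual OODT query classes.

```python
def matches(postfix, record):
    """Evaluate a postfix boolean expression against one metadata record.

    postfix is a list of (kind, value) tokens, where kind is "elem",
    "literal", or "op". record maps each key to a list of strings.
    """
    stack = []
    for kind, value in postfix:
        if kind in ("elem", "literal"):
            stack.append(value)
        elif kind == "op" and value == "EQ":
            literal = stack.pop()
            elem = stack.pop()
            stack.append(literal in record.get(elem, []))
        elif kind == "op" and value == "AND":
            right, left = stack.pop(), stack.pop()
            stack.append(left and right)
        elif kind == "op" and value == "OR":
            right, left = stack.pop(), stack.pop()
            stack.append(left or right)
        elif kind == "op" and value == "NOT":
            stack.append(not stack.pop())
        else:
            raise ValueError(f"unknown token: {(kind, value)!r}")
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


# Infix: Instrument = OCO AND Year = 2007
query = [
    ("elem", "Instrument"), ("literal", "OCO"), ("op", "EQ"),
    ("elem", "Year"), ("literal", "2007"), ("op", "EQ"),
    ("op", "AND"),
]
oco = {"Instrument": ["OCO"], "Year": ["2007"]}
seawinds = {"Instrument": ["SeaWinds"], "Year": ["2007"]}
oco_matches = matches(query, oco)
seawinds_matches = matches(query, seawinds)
```

Postfix order needs no parentheses or precedence rules, which is one reason it suits a format that must be passed unchanged between servers, clients, and other components.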
Framework • Catalog & Archive • Utilities • Grid • Agility
Grid • Profile (metadata) and Product (data) services • Product • Retrieves resources (products) in platform-neutral formats • Profile • Describes and discovers resources using extensible metadata called "profiles" • Web Grid • Provides profile and product services over a RESTful interface • XML Product/Profile handlers • Provide XML-configurable, database-backed profile and product handlers
Product • Provide access to data products • datasets, images, documents, or anything with an electronic representation • Accept standard query expressions and return zero or more matching products • Transform products from proprietary formats into Internet standard formats without impacting local stores or operations
Profile • Describe and locate resources using metadata descriptions • resource's inception, composition, and location • Catalog metadata descriptions and support creating, updating, and querying them
Framework • Catalog & Archive • Utilities • Grid • Agility
Agility • Re-implementation of Grid in Python with a focus on high performance in the face of gargantuan data sets as well as accelerated development and integration into existing systems.