Distributed data access: THREDDS, OAI, CDP Presented By: Michael Burek Acknowledgments: CDP staff: Dave Brown, Luca Cinquini, Don Middleton, Rob Markel, Scott Nixon, Nate Wilhelmi
Outline • Community Data Portal (CDP) • Introduction to THREDDS in the CDP • THREDDS in detail • THREDDS applied in the CDP, some details • OAI -- Open Archives Initiative • Demo • Thoughts about future developments
Introduction to the CDP Community Data Portal (CDP) Project • UCAR wide, uniform, community resource for discovery (search and browse) across the organization • Search/browse: • Supports free or structured queries to find data • Boolean combinations • Keyword, controlled vocabularies • Creator, Publisher, Science Keyword (GCMD), Variable name (CF) • Data Format, Data Type, Data Delivery Service • Geographic, Time, Altitude • Data delivery Services • aggregation, subsetting, FTP, HTTP, Mass Store, LAS/FERRET, OPEnDAP
Introduction to the CDP, cont. • The CDP serves a diverse range of data providers: • Project based archives -- small, often with limited resources • Multi-institutional teams -- geographically separated • Multiple data types within a project: measurements, models, images • The CDP cooperates with existing NCAR data organizations • A few unusual datasets -- HAO division • Model software. Visualizations.
CDP, Technologies • The CDP began in 2001 • Uses THREDDS* catalogs to describe data content and structure • Uses Lucene as the search/discovery back end • Uses the Open Archives Initiative (OAI) to share metadata • Uses SRM to access deep archive data and to share data externally (ESG project) • Experimental use of SRB for intra-institution sharing • Sister site, Earth System Grid (ESG), uses grid technology to share data • Uses DODS/OPEnDAP for aggregating and subsetting data sets • Uses a distributed model for accessing data and metadata https://cdp.ucar.edu/ *Thematic Realtime Environmental Distributed Data Services
Introduction, THREDDS in the CDP • THREDDS is a schema used for DATA DELIVERY • Can also be used for geoscience data search and discovery THREDDS catalogs: • Are ingested into Lucene and GEO extent searching tools for search and discovery • Are used to supply data for search results and browse pages • Specify data access mechanisms • http, http restricted, OPEnDAP, MSS, TDS, LAS, GDS, CDP/agg • Point to and use non-THREDDS metadata • ESG, DC, NcML, GML, DIF • Can interoperate with WMO metadata when available
Introduction, THREDDS in the CDP, cont. • The CDP federates directly with other sites that use THREDDS catalogs • NCAR DSS, NCAR EOL, UCAR UNIDATA • THREDDS catalogs are used inside DODS/OPEnDAP, GDS, and the forthcoming THREDDS Data Server (TDS) • THREDDS will support a data access control system, both local and distributed
THREDDS Background • THREDDS v0.6 • Support for describing the hierarchical structure of datasets • Support for describing data delivery services • Some very basic descriptive metadata • Support for extensible and distributed catalogs • Support for “inheritance” of metadata and services • Allows other descriptive schemas to be part of the catalog • Emphasizes the hierarchical relationships between data items, containing datasets and groups of datasets
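The "inheritance" bullet above can be illustrated with a minimal sketch. It assumes a simplified reading of the rule: child elements of a `<metadata inherit="true">` block apply to the dataset and to every nested dataset, while a plain `<metadata>` block applies only to the dataset that holds it. The function name `effective_metadata`, the sample IDs, and the namespace-free XML are hypothetical; the actual THREDDS specification may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free catalog fragment used only for illustration.
SAMPLE = """<dataset ID="abc">
  <metadata inherit="true"><creator>NCAR</creator></metadata>
  <metadata><description>container only</description></metadata>
  <dataset ID="abc.item1">
    <metadata><description>first item</description></metadata>
  </dataset>
  <dataset ID="abc.item2"/>
</dataset>"""


def effective_metadata(dataset, inherited=None):
    """Return {dataset ID: {element name: text}} for a dataset and its descendants."""
    inherited = dict(inherited or {})
    own = {}                       # metadata declared directly on this dataset
    passed_down = dict(inherited)  # what nested datasets will receive
    for md in dataset.findall("metadata"):
        fields = {child.tag: (child.text or "").strip() for child in md}
        own.update(fields)
        if md.get("inherit") == "true":
            passed_down.update(fields)
    # A dataset's own values override anything it inherited.
    result = {dataset.get("ID"): {**inherited, **own}}
    for child in dataset.findall("dataset"):
        result.update(effective_metadata(child, passed_down))
    return result


resolved = effective_metadata(ET.fromstring(SAMPLE))
for dataset_id, fields in resolved.items():
    print(dataset_id, fields)
```

In this sketch the creator is declared once on the container and reaches both items, while the container's non-inherited description stays with the container.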
THREDDS V1.0 • THREDDS v1.0 • Added descriptive “minimal” metadata tuned for Earth Science search/discovery • “Minimal” defined -- Metadata sized for search/discovery • Again, Metadata can be inherited within the hierarchy • Design goal was to interoperate with core elements of DIF, ISO-19115, DC metadata • UNIDATA looking at incorporating THREDDS metadata in NetCDF* and forthcoming TDS** • Exploring possibly interoperating with BADC model extensions • V1.0x will have access control elements URL: http://my.unidata.ucar.edu/content/projects/THREDDS/index.htm *NetCDF UNIDATA defined binary data format for gridded and other geoscience data. Includes metadata that describes the data in the file header **TDS THREDDS data server -- will handle GRIB and NetCDF, will have WCS
THREDDS -- CDP • CDP THREDDS design choices • Use THREDDS descriptive metadata for search/discovery • Use GCMD DIF controlled vocabularies for science keyword hierarchies, creator, publisher, project • Use Climate and Forecast (CF) conventions for variable names when applicable • Mandate use of a unique identifier to identify data • Use forthcoming THREDDS elements for data access control • Use OAI to import DIF records from BADC and GCMD, and transform these records into equivalent THREDDS for use in the CDP • Import ESG (CCSM) records (THREDDS, ESG), and extract a subset of descriptive metadata for search and discovery
THREDDS, the details
General structure of a simple THREDDS catalog

<catalog>
  <service name="httpService" type="HTTP" base="http://dataportal.ucar.edu/data/abcData/"/>
  <service name="mssService" type="MSS" base="/mssRoot/abcData/"/>
  <dataset name="abc" ID="ucar.scd.cdp.datasetName">  <!-- container dataset -->
    <metadata inherit="true">  <!-- descriptive metadata -->
      <description type="summary"/>
      <creator/>
      <geospatialCoverage/>  <!-- geographic location -->
      <!-- ... other metadata (13 total) -->
    </metadata>
    <dataset ID="ucar.scd.cdp.datasetName.item1">  <!-- describes a data item -->
      <dataSize units="Kbytes">123</dataSize>
      <access serviceName="httpService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
      <access serviceName="mssService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
    </dataset>
    <!-- more dataset items -->
  </dataset>  <!-- close enclosing dataset -->
</catalog>

Dataset URL = service base + access urlPath. Here both services point to a local server or local service.
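The URL rule on this slide (dataset URL = service base + access path) can be sketched in a few lines. This is a minimal illustration, assuming a namespace-free catalog that uses the slide's attribute names (`name`, `base`, `serviceName`, `urlPath`); the function name `access_urls` is hypothetical, and real THREDDS catalogs declare an XML namespace that would need to be handled.

```python
import xml.etree.ElementTree as ET

# Well-formed, trimmed-down version of the slide's catalog (no namespace; an assumption).
CATALOG_XML = """<catalog>
  <service name="httpService" type="HTTP" base="http://dataportal.ucar.edu/data/abcData/"/>
  <service name="mssService" type="MSS" base="/mssRoot/abcData/"/>
  <dataset name="abc" ID="ucar.scd.cdp.datasetName">
    <dataset ID="ucar.scd.cdp.datasetName.item1">
      <access serviceName="httpService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
      <access serviceName="mssService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
    </dataset>
  </dataset>
</catalog>"""


def access_urls(catalog_xml):
    """Return (dataset ID, service name, resolved URL) for every access element."""
    root = ET.fromstring(catalog_xml)
    # Map each service name to its base path.
    bases = {svc.get("name"): svc.get("base") for svc in root.iter("service")}
    resolved = []
    for dataset in root.iter("dataset"):
        for access in dataset.findall("access"):
            service = access.get("serviceName")
            resolved.append((dataset.get("ID"), service, bases[service] + access.get("urlPath")))
    return resolved


urls = access_urls(CATALOG_XML)
for dataset_id, service, url in urls:
    print(dataset_id, service, url)
```

The same data item resolves to one URL per service, so a client can choose HTTP or the MSS path without the catalog repeating the base for every item.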
THREDDS, simple catalog catalog service service HTTP data service Local data access/ local MSS service MSS data service dataset (container) metadata description creator geospatialCoverage other elements dataset (data item) access, size, extent dataset access, size, extent dataset access, size, extent dataset access, size, extent
THREDDS, distributed catalogs example dataset.thredds.xml 1. Descriptive metadata is in a separate file, which could be on another server. 2. Dataset contains references to remote catalogs. 3. Catalog-level access control elements catalog metadata description creator geospatialCoverage other elements dataset (container) metadata link catalogRef ACCESS CONTROL Remote Server catalog (remote) ACCESS CONTROL service metadata description … datasets catalogRef Remote data services catalog (remote) service metadata description … datasets
THREDDS, database application example Virtual catalog service External HTTP data service Arbitrary Metadata Database External Server Database to THREDDS catalog builder (web service) metadata External Data hosting dataset (data item) access, size, extent dataset access, size, extent dataset access, size, extent dataset access, size, extent
THREDDS, distributed data example catalog service service 1. Data is not on CDP, service is external, service can implement access control if required 2. Descriptive metadata is in a separate file, does not have to be THREDDS External HTTP data service MSS data service metadata description creator geospatialCoverage other elements External Server dataset (container) Metadata external reference Metadata external reference ISO-19115 iso-19115 elements External Data hosting dataset (data item) access, size, extent dataset access, size, extent dataset access, size, extent dataset access, size, extent
CDP - distributed datasets, overview [Diagram. Components shown: Community Data Portal with its top-level THREDDS catalog; Boston University (SRB); NCAR Data Support Section; NCAR Atmospheric Chemistry; NCAR EOL section; LANL, ORNL, LBNL (SRM, ESG); metadata database; NCAR MSS (Mass Store, SRM); CDP data storage: WACCM, ACD, CME, CGD, ….; BADC DIF records exchanged through an OAI server and OAI client, with XSLT] Legend: T = THREDDS catalog, D = Data Archive, M = MSS deep archive, A = Access control
THREDDS review/summary • THREDDS is a schema used for DATA DELIVERY • Contains basic geoscience discovery data • Is designed to work with distributed data, distributed metadata • Contains elements for data access restriction • Can work with real time data • Can be a container for non-THREDDS descriptive metadata • Defines the hierarchical relationships of datasets • Defines data delivery services • Supports a hierarchical view of metadata • Integrated with many data delivery and visualization services
Distributed Descriptive Metadata with OAI • Metadata is immediately “distributed” if metadata is contained in or is pointed to by THREDDS catalogs • Metadata can also be shared using OAI technology • OAI -- Open Archives Initiative from the Digital Library (DL) community • OAI is a web service definition for sharing metadata • OAI uses six verbs to define the service • OAI uses Dublin Core, DC, as the baseline schema • OAI can specify other XML schemas -- we use this capability • OAI can be used as a gateway to send information to an established DL community -- THREDDS -> DC => DL community via OAI • OAI disadvantage -- hierarchical relationships are lost
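Because OAI is defined as a web service with six verbs, a request is just a base URL plus a `verb` parameter and its arguments. The sketch below builds such request URLs; the helper name `oai_request_url` and the repository address `http://example.org/oai` are hypothetical, while the verb names and the `oai_dc` prefix for Dublin Core come from the OAI-PMH protocol.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url, verb, **params):
    """Build an OAI-PMH request URL; raises ValueError for an unknown verb."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    query = {"verb": verb, **params}
    return base_url + "?" + urlencode(query)


# Hypothetical repository address, used only for illustration.
BASE = "http://example.org/oai"
list_url = oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc")
identify_url = oai_request_url(BASE, "Identify")
print(list_url)
print(identify_url)
```

A repository that also offers another schema, as described above, would be queried the same way with a different `metadataPrefix` value advertised by its `ListMetadataFormats` response.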
Distributed Metadata with OAI -- CDP • THREDDS records are “flattened” (hierarchy collapsed): one record -> one dataset • Flattened records are shared using OAI • For a test, the THREDDS records were transformed into DIF using XSLT • DIF records were ingested from BADC, transformed into THREDDS catalogs, and ingested into CDP search and browse
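The "flattening" step can be pictured as a walk over the dataset hierarchy that emits one standalone record per dataset. This is only a sketch under stated assumptions: the nested-dictionary layout, the field names, and the function name `flatten` are hypothetical and are not the CDP's actual implementation, which the slides describe only at this level of detail.

```python
def flatten(dataset):
    """Return one flat record per dataset in the hierarchy, parents before children."""
    # Each record stands alone: the parent/child links are deliberately not carried over.
    records = [{"id": dataset["id"], "name": dataset["name"]}]
    for child in dataset.get("datasets", []):
        records.extend(flatten(child))
    return records


# Hypothetical hierarchy: one container dataset holding two data items.
hierarchy = {
    "id": "ucar.scd.cdp.datasetName",
    "name": "abc",
    "datasets": [
        {"id": "ucar.scd.cdp.datasetName.item1", "name": "item1"},
        {"id": "ucar.scd.cdp.datasetName.item2", "name": "item2"},
    ],
}

flat_records = flatten(hierarchy)
for record in flat_records:
    print(record)
```

The output is a plain list with no nesting, which is what makes the records easy to share one at a time and also why the hierarchical relationships do not survive the exchange.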
CDP metadata architecture external metadata Web Interface/Web Service Metadata Conversion Catalog Parsing THREDDS catalog invokes write DIF metadata parse THREDDS records THREDDS records Metadata Processing THREDDS records read DC metadata index into Metadata repository XML viewer web application XML results Metadata DB (Lucene) passed to OAI client OAI server free-text Search Query UI Structured, Geospatial, Temporal Query UI import export THREDDS catalogs browser Web UI remote Data Center or Digital Library
Data publication on the CDP Is ingested Metadata indexing application Lucene Index THREDDS descriptive metadata Creates Dataset Disk, HTTP, Database, … Catalog crawler application Creates THREDDS hierarchy metadata XSLT rendering Allows Access control Edits HTTP Metadata Authoring tool Creates BROWSER CDP Catalog Presentation Starts link
Demo • Data searching: controlled vocabularies, GEO searching • Data browsing: access control • BADC shared metadata directory • Metadata editing • IDV Bundle showing integrated data source
Experimental topology to share data? GISC -> CDP 1. OAI transfers of WMO records 2. CDP crawls the data hierarchy -- no metadata 3. GISC creates a Web interface to produce virtual THREDDS catalogs (embedded WMO descriptive metadata) [Diagram. Components shown: CDP (CDP Search, crawler, THREDDS catalog, XSLT); WMO GISC (DB, THREDDS catalog, NetCDF, GRIB, … data over HTTP); WMO DCPC; OAI links] Legend: W = WMO metadata, T = THREDDS metadata
Experimental topology to share data, CDP -> GISC [Diagram. Components shown: CDP (THREDDS catalog, NetCDF, GRIB, … data, CDP Search, XSLT); WMO GISC (WMO Search); OAI links] Legend: W = WMO metadata, T = THREDDS metadata