CMIP5 / ESG-CET Publication Tutorial

CMIP5 / ESG-CET Publication Tutorial. Bob Drach, PCMDI / LLNL March 14, 2011. ESG-CET Architecture. Gateways support centralized services: Portal Authn / Authz Search Metadata harvesting Web services Nodes are close to the data Publishing Data servers THREDDS DAP: Hyrax, PyDAP


  1. CMIP5 / ESG-CET Publication Tutorial Bob Drach, PCMDI / LLNL March 14, 2011

  2. ESG-CET Architecture • Gateways support centralized services: • Portal • Authn / Authz • Search • Metadata harvesting • Web services • Nodes are close to the data • Publishing • Data servers • THREDDS • DAP: Hyrax, PyDAP • Visualization and computation • LAS • CDAT, NCL, Ferret, … • Gateways and nodes can be co-resident [Architecture diagram: Portal, Gateway, Gateway Database, MyProxy, Web Services, Harvester, THREDDS, LAS, Data Node, THREDDS Catalogs, Publisher, Node Database, gridFTP, Proxy Certificate, Data Archive]

  3. Terminology • Publication makes datasets visible on the gateway (portal). • Only metadata is transferred • A dataset is a collection of files. • Datasets have versions. • The ‘unit of publication’ is a version of a dataset. • Versions: monotonically increasing integers, may be YYYYMMDD • Datasets have string dataset identifiers that are unique system-wide. • A category is a field that is searched on the gateway. • Ex: time_frequency, realm, experiment, … • Projects are activities that generate datasets. • Ex: CMIP5, CMIP3 • Associated with a set of categories • An experiment describes the input conditions (initial conditions, forcing, time period …) of a climate model experiment. • Data is generated by a climate model or from observations, reanalyses. • Other project-specific metadata may be associated with datasets.

  4. Node Architecture • Publisher • Scans data archive • Generates metadata catalogs: one catalog per dataset version • Notifies the gateway when new catalogs are available • Written in Python • Node database • Persistent store of publication information • Dataset contents, version history • File metadata • Catalog locations • Publication status • Current implementation in Postgres • May co-exist with gateway DB [Diagram: Harvester, Gateway, Web Services, THREDDS, LAS, THREDDS Catalogs, Publisher, Node Database, gridFTP, Proxy Certificate, Data Archive]

  5. Publication Process • Scan directories • Create a list of files to be published • Associate each file with a dataset • Generate a mapfile (optional) • Scan data • Read metadata from files • Populate node database • Generate metadata THREDDS catalogs • Publish datasets • Requires valid proxy certificate, obtained with myproxy-logon • Notifies gateway to harvest metadata [Diagram: Harvester, Gateway, Web Services, THREDDS, LAS, THREDDS Catalogs, Publisher, Node Database, gridFTP, Proxy Certificate, Data Archive]

  6. Publisher components • Scan directories / files to produce a mapfile: % esgscan_directory [--read-files] [options] directory [directory ...] • Extract metadata from files, populate node database: % esgpublish --map mapfile --project cmip5 • Generate THREDDS catalogs: % esgpublish --map mapfile --noscan --thredds • Notify gateway: % esgpublish --map mapfile --noscan --publish • Publisher GUI includes all publisher functionality • All scripts have --help options [Diagram: publisher scripts (esgscan_directory, esgpublish, esgunpublish, esglist_datasets, esglist_files, esgquery_gateway, esginitialize, esgsetup, myproxy-logon) shown against the node components they act on: Web Services, THREDDS, THREDDS Catalogs, Proxy Certificate, Node Database, mapfile, Data Archive]
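
The four steps above can be sketched as a small Python helper that only assembles the command lines; it does not run them. The flags are the ones shown on this slide and on the Mapfiles slide (-o); the function name, the mapfile name and the data directory are hypothetical.

```python
import shlex


def publication_commands(mapfile, project, directories):
    """Return the four publication steps as argv lists, in the order they run."""
    return [
        # 1. Scan directories / files to produce a mapfile.
        ["esgscan_directory", "--read-files", "-o", mapfile] + list(directories),
        # 2. Extract metadata from files, populate the node database.
        ["esgpublish", "--map", mapfile, "--project", project],
        # 3. Generate THREDDS catalogs without rescanning.
        ["esgpublish", "--map", mapfile, "--noscan", "--thredds"],
        # 4. Notify the gateway (needs a valid proxy certificate).
        ["esgpublish", "--map", mapfile, "--noscan", "--publish"],
    ]


if __name__ == "__main__":
    # Hypothetical mapfile name and data directory, for illustration only.
    for argv in publication_commands("cmip5.map", "cmip5", ["/data/cmip5/output"]):
        print("% " + " ".join(shlex.quote(arg) for arg in argv))
```

Each list could be passed to subprocess.run once myproxy-logon has produced a valid proxy certificate.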

  7. Deleting datasets • Order of operations is reverse of publication: • Delete from gateway • Remove TDS catalog • (optional) delete from node DB • Delete a dataset from the gateway: % esgunpublish --skip-thredds cmip5.foo.bar • Delete a TDS catalog: % esgunpublish --skip-gateway cmip5.foo.bar • Delete a dataset entirely, including the node database: % esgunpublish --database-delete cmip5.foo.bar [Diagram: publisher scripts and node components, as on slide 6]
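
A minimal sketch of how the three deletion variants map to esgunpublish flags, assuming the flags exactly as listed on this slide. The target names ("gateway", "thredds", "all") and the function name are hypothetical labels chosen for the example.

```python
# Hypothetical target labels mapped to the esgunpublish flags from the slide.
UNPUBLISH_FLAGS = {
    "gateway": ["--skip-thredds"],    # delete from the gateway only
    "thredds": ["--skip-gateway"],    # delete the TDS catalog only
    "all": ["--database-delete"],     # delete everywhere, including node DB
}


def unpublish_command(dataset_id, target):
    """Build the esgunpublish argv list for one deletion target."""
    if target not in UNPUBLISH_FLAGS:
        raise ValueError("unknown target: %r" % (target,))
    return ["esgunpublish"] + UNPUBLISH_FLAGS[target] + [dataset_id]


if __name__ == "__main__":
    # Reverse of publication: gateway first, then the TDS catalog.
    for target in ("gateway", "thredds"):
        print(" ".join(unpublish_command("cmip5.foo.bar", target)))
```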

  8. Querying • List all CMIP5 datasets in the node database: % esglist_datasets cmip5 • List all files in a dataset: % esglist_files cmip5.output.PCMDI.pcmdi-test.historical.fx.atmos.fx. • List all datasets in a directory on a gateway: % esgquery_gateway [--service-url gateway_service] --list pcmdi.CCCMA • List all files in a gateway dataset: % esgquery_gateway --files cmip5.output2.CCCma.CanESM2.rcp85.mon.land.Lmon.r5i1p1 [Diagram: publisher scripts and node components, as on slide 6]

  9. THREDDS catalogs • Layout: THREDDS Master Catalog, ESG Root Catalog, then catalog subdirectories /1, /2, /3, … under thredds_root = $ESGF_HOME/content/thredds/esgcet • Reinitialization loads all catalogs into the TDS: thredds_reinit_url = https://localhost:443/thredds/admin/debug?catalogs/reinit [Diagram: publisher scripts and node components, as on slide 6]

  10. CMIP5 Metadata • CMIP5 DRS (Data Reference Syntax) defines the naming system for CMIP5 dataset identifiers, files, directories, URLs, metadata, … • CMIP5 controlled vocabulary is derived from the DRS document • Permitted values for experiments, models, institutions, … • Should be consistent with publisher configuration • esg.ini • esgcet_models_table.txt • cf-standard-name-table.xml • CMOR (Climate Model Output Rewriter) • Generates CMIP5-compliant data in netCDF format • Fortran-90, C, Python interfaces

  11. Publisher configuration, setup • Publisher locates esg.ini in this order: • Environment variable ESGINI • $HOME/.esgcet/esg.ini • <PYTHON>/lib/python2.X/site-packages/esgcet/config/etc/esg.ini • esg.ini in working directory • esgsetup creates esg.ini • Run by esg-node installation script • If no existing configuration, starts with <PYTHON>/lib/python2.X/site-packages/esgcet/config/etc/template.ini • Created in $HOME/.esgcet/esg.ini • Otherwise updates existing esg.ini • Options: • --config: create initial configuration • --db: initialize database • --thredds: initialize THREDDS server • --publish: configure gateway-related options (service URLs, myproxy, etc.) • --handler: create customized handlers
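
The lookup order above can be illustrated with a short sketch. This is not the publisher's own code: the function name and parameters are hypothetical, and the site-packages directory is passed in because its exact path depends on the Python installation.

```python
import os


def find_esg_ini(site_config_dir, environ=None, home=None, cwd=None,
                 exists=os.path.isfile):
    """Return the first existing esg.ini in the order listed on the slide."""
    environ = os.environ if environ is None else environ
    home = os.path.expanduser("~") if home is None else home
    cwd = os.getcwd() if cwd is None else cwd

    candidates = []
    if environ.get("ESGINI"):                                  # 1. ESGINI
        candidates.append(environ["ESGINI"])
    candidates.append(os.path.join(home, ".esgcet", "esg.ini"))  # 2. $HOME
    candidates.append(os.path.join(site_config_dir, "esg.ini"))  # 3. package
    candidates.append(os.path.join(cwd, "esg.ini"))              # 4. cwd

    for path in candidates:
        if exists(path):
            return path
    return None
```

The exists parameter is injectable only so the search order can be checked without touching the file system.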

  12. Configuration layout • Section headers • [DEFAULT]: Options apply to all sections • [project:foo]: Specific to project foo • Project name(s) are listed in project_options in [DEFAULT] • [initialize]: Locations of model and standard name tables • [extract]: File scan phase (metadata extraction) • Enable detailed logging • [srmls]: Listing SRM files • [hsi]: Listing HSI (HPSS mass store)

  13. Dataset roots, services • Dataset roots affect TDS access control, data hiding • thredds_dataset_roots = one "root_path | location" entry per line • Every published file must be under a root location, is protected by ESG (by default) • Unpublished files under root location(s) are potentially accessible, but are not visible in TDS or the gateway • Do not store sensitive unpublished data under a dataset root! • Services configure access to files or aggregations • Simple or compound
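
As a rough illustration of the rule that every published file must sit under a root location, the sketch below parses a thredds_dataset_roots value and checks a file path against it. The root names and directories are made up, and the check is an assumption about what "under a root" means, not the publisher's implementation.

```python
import posixpath


def parse_dataset_roots(option_value):
    """Parse 'root_path | location' entries, one per line, into a dict."""
    roots = {}
    for line in option_value.splitlines():
        if not line.strip():
            continue
        root_path, location = [part.strip() for part in line.split("|", 1)]
        roots[root_path] = location
    return roots


def root_for(file_path, roots):
    """Return the root_path whose location contains file_path, else None."""
    file_path = posixpath.normpath(file_path)
    for root_path, location in roots.items():
        location = posixpath.normpath(location)
        # Compare whole path components so /data/cmip5_private is not
        # mistaken for a child of /data/cmip5.
        if file_path == location or file_path.startswith(location + "/"):
            return root_path
    return None


if __name__ == "__main__":
    # Hypothetical root names and locations.
    roots = parse_dataset_roots("esg_dataroot | /data/cmip5\nobs | /data/obs")
    print(root_for("/data/cmip5/output/tas.nc", roots))
```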

  14. Project configuration • experiment_options defines experiments for the project • Categories • Metadata fields that will be associated with each dataset • Each project has a different set of categories • May be mandatory: error if not found during the scan • XX_options if enumerated • TDS catalog <property> element may be created • Basis of gateway search

  15. Project configuration • project handler encapsulates logic associated with reading / setting metadata values • ipcc5_builtin for CMIP5 • May be customized • Format strings • %(option)s • Option may be defined: • Config file • By handler (dynamically) • Example: %(model_description)s • dataset_id: template for TDS dataset identifiers • Fields used in the format string should be mandatory • Version added by the publisher
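
The %(option)s syntax is the same one Python's % operator accepts with a mapping, so the expansion of a dataset_id template can be sketched directly. The template is the CMIP5 one quoted elsewhere in this tutorial and the field values come from the dataset identifier on the Querying slide; the function name is hypothetical and this is not the publisher's own code path.

```python
DATASET_ID = ("cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s."
              "%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s")


def expand(template, fields):
    """Fill a %(option)s template; a missing field is reported by name."""
    try:
        return template % fields
    except KeyError as err:
        raise ValueError("missing mandatory field: %s" % err.args[0])


if __name__ == "__main__":
    fields = {
        "product": "output2", "institute": "CCCma", "model": "CanESM2",
        "experiment": "rcp85", "time_frequency": "mon", "realm": "land",
        "cmor_table": "Lmon", "ensemble": "r5i1p1",
    }
    print(expand(DATASET_ID, fields))
```

A field that is absent raises an error here, which is one way to see why fields used in the template should be mandatory.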

  16. Project configuration • Maps • Mapping (association) from a set of independent fields to a dependent field • The dependent field can be used in a format string • Form: map_name = map(variable_1[, variable_2[, ...]] : variable_n), followed by one entry per line of the form value_1 [ | value_2 [...]] | value_n • Data file structure • One variable per file (CMIP5 standard): variable_per_file = true • Multiple variables per file: variable_per_file = false • Version • vYYYYMMDD: version_by_date = true • vN

  17. Offline datasets • Offline datasets: can be listed but not opened for metadata extraction • Published with minimal description: location and size • No associated aggregations • Example: tertiary storage • Lister: program that generates metadata for offline datasets • hsils.py: HPSS • srmls.py: SRM • msls.py: MSS • Listers can be customized • Configuration: • thredds_offline_services: generate TDS catalog <service> element • offline_lister: associate service name with [lister] section • [lister] section • Ex: section [hsi] with offline_lister_executable = %(pythonbin)s/hsils.py and hsi = /usr/local/bin/hsi • Use --offline, --service options with esgpublish, esgscan_directory

  18. Mapfiles • Describes file contents of one or more datasets • Generate with: % esgscan_directory [--read-files] [options] -o mapfile directory [directory ...] • File-specific fields • Size • Modification time: epochal time • Checksum (if checksum configuration option set) • Checksum type: MD5 (recommended for CMIP5) or SHA1 • Format: one line per file: dataset_name | absolute_path | byte_length [ | property=value [ | property=value ...]] where properties are: mod_time=<epochal_time>, checksum=<checksum_value>, checksum_type=<checksum_type> (either MD5 or SHA1)
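
A minimal sketch of reading one mapfile line in the format described above. The function name, the dictionary layout and the sample line are illustrative assumptions; only the field order and property names come from the slide.

```python
def parse_mapfile_line(line):
    """Split 'dataset | path | size [| property=value ...]' into a dict."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) < 3:
        raise ValueError("expected at least dataset | path | size: %r" % line)
    entry = {
        "dataset": parts[0],
        "path": parts[1],
        "size": int(parts[2]),
        "properties": {},
    }
    for prop in parts[3:]:
        key, _, value = prop.partition("=")
        entry["properties"][key.strip()] = value.strip()
    return entry


if __name__ == "__main__":
    # Hypothetical line; path, size and times are made up.
    sample = ("cmip5.foo.bar | /data/foo/bar/tas.nc | 1024 | "
              "mod_time=1300060800.0 | checksum_type=MD5")
    print(parse_mapfile_line(sample))
```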

  19. Directory Scan Modes • esgscan_directory [--read-files | --read-directories] … : • Associate dataset identifier(s) with files • Create listing of files with sizes, modification times, checksums, etc. • To generate dataset identifiers, must obtain metadata from either: • Directory names (--read-directories), or • File metadata (--read-files) • Example: dataset_id = cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s.%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s • File metadata: recommended for CMIP5 • For each file, read metadata from file and generate dataset_id • Directory names: recommended if file metadata is incomplete • For each directory: • Match directory_format to the directory path to generate metadata • If directory does not match, no output for that directory • Somewhat faster, but harder to debug
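
To make the directory-name mode more concrete, the sketch below turns a directory_format string into a regular expression with one named group per %(field)s and matches it against a path. This is an assumed reading of "match directory_format to directory", not the publisher's code: the prefix-match behaviour, the function names and the sample format are all illustrative, and each field is assumed to appear only once in the format.

```python
import re

FIELD = re.compile(r"%\((\w+)\)s")


def format_to_regex(directory_format):
    """Convert '/x/%(model)s/...' into a regex with named groups."""
    pattern, pos = "", 0
    for match in FIELD.finditer(directory_format):
        pattern += re.escape(directory_format[pos:match.start()])
        pattern += "(?P<%s>[^/]+)" % match.group(1)   # one path component
        pos = match.end()
    pattern += re.escape(directory_format[pos:])
    return re.compile(pattern)


def match_directory(directory_format, directory):
    """Return the extracted fields, or None if the directory does not match."""
    found = format_to_regex(directory_format).match(directory)
    return found.groupdict() if found else None


if __name__ == "__main__":
    fmt = "/data/%(project)s/%(model)s/%(experiment)s"   # hypothetical format
    print(match_directory(fmt, "/data/cmip5/CanESM2/rcp85/tas"))
    print(match_directory(fmt, "/archive/other"))        # no match: None
```

A None result corresponds to the "no output for that directory" case, which is a common reason esgscan_directory appears to produce nothing.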

  20. Publishing checksums: two approaches • First approach: enable checksum generation by default. In esg.ini [DEFAULT] section: checksum = md5sum | MD5 • Problem with first approach: publication may slow significantly. • Second approach (V2.9.0+): disable checksum option, then: • Publish without checksums, initially • Generate checksums independently, add to a ‘mapfile’ foo.txt with one line per file of the form: dataset_name | absolute_path | byte_length | checksum=value | checksum_type=MD5 • Add the checksums to the node database: % esgupdate_metadata foo.txt • Republish: % esgpublish --noscan --map foo.txt --project cmip5 --thredds --publish • Assumes that the dataset has not changed since initial publication • Query to list checksums: % esgquery_gateway --urls dataset_name
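
For the second approach, the "generate checksums independently" step could look like the following sketch, which computes an MD5 digest with Python's hashlib and formats a line in the form given above. The function names are hypothetical, and any checksum tool that yields the same MD5 value would serve equally well.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_mapfile_line(dataset_name, path):
    """Format 'dataset | path | size | checksum=... | checksum_type=MD5'."""
    return "%s | %s | %d | checksum=%s | checksum_type=MD5" % (
        dataset_name, path, os.path.getsize(path), md5_of_file(path))
```

Writing one such line per file to foo.txt gives the input expected by esgupdate_metadata in the step that follows.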

  21. Publishing replica datasets • Differs from non-replica datasets: • Maintains the replica version. (By default the publisher generates the dataset version) • Sets catalog properties to flag replicated status • Currently sets master_gateway property • Form of publication command for replication: % esgpublish --replica origin_host_id [--version-list versions.txt] other_options … • --version-list (V2.9.0+): • Text file with one line per dataset of the form: dataset_name | version • Proposed: add properties to the catalog for origin_host and publishing_host
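
A small sketch of reading and writing a version list in the "dataset_name | version" form described above. The function names and dataset names are made up, and versions are treated as integers as stated on the Terminology slide.

```python
def parse_version_list(text):
    """Read 'dataset_name | version' lines into a {name: int} dict."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, version = [part.strip() for part in line.split("|")]
        versions[name] = int(version)
    return versions


def format_version_list(versions):
    """Write a {name: version} dict back out, one dataset per line."""
    return "".join("%s | %d\n" % (name, versions[name])
                   for name in sorted(versions))


if __name__ == "__main__":
    # Hypothetical datasets with date-style versions.
    text = "cmip5.foo.bar | 20110314\ncmip5.foo.baz | 20110201\n"
    print(parse_version_list(text))
```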

  22. Publisher GUI • % esgpublish_gui & • Uses Tcl/Tk • Function menu • All functionality of publisher scripts • Dataset window • Listing of datasets being processed • Select dataset to display / edit metadata • Output window • Standard output, error messages • Status bar

  23. Publisher GUI • Metadata editor • Display / edit dataset-level (global) metadata • Fields are defined in esg.ini: • categories option: name | category_type | is_mandatory | is_thredds_property | display_order. Ex: experiment | enum | true | true | 1 • Querying • Select datasets based on categories (model, experiment, …) • Categories are project-dependent
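
The five-field category entry can be parsed as in the sketch below. The experiment line is the example from the slide; the second line in the usage example, the Category type and the function name are hypothetical.

```python
from collections import namedtuple

Category = namedtuple(
    "Category",
    "name category_type is_mandatory is_thredds_property display_order")


def parse_categories(option_value):
    """Parse categories lines and return them sorted by display_order."""
    categories = []
    for line in option_value.splitlines():
        if not line.strip():
            continue
        name, ctype, mandatory, thredds, order = [
            field.strip() for field in line.split("|")]
        categories.append(Category(
            name, ctype, mandatory.lower() == "true",
            thredds.lower() == "true", int(order)))
    return sorted(categories, key=lambda cat: cat.display_order)


if __name__ == "__main__":
    # First line is the slide's example; the second is made up.
    value = "experiment| enum| true| true| 1\ndescription | text | false | false | 2"
    for category in parse_categories(value):
        print(category)
```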

  24. Frequent Questions • How do I add a new model identifier? • Default models table in: <PYTHON_HOME>/lib/python2.X/site-packages/esgcet-2.Y.Z-py2.Z.egg/esgcet/config/etc/esgcet_models_table.txt • Copy default table to $HOME/.esgcet/esgcet_models_table.txt, add entry • In esg.ini [initialize] section: initial_models_table = %(home)s/.esgcet/esgcet_models_table.txt • % esginitialize -c • Similar process for standard names: cf-standard-name-table.xml • esgscan_directory generates no output • Try --read-files option for CMIP5 • Check directory_format option in esg.ini • Cannot reinitialize TDS • Check thredds_reinit_url, thredds_username, thredds_password in esg.ini • Verify directly in browser

  25. Frequent Questions • Publication • Access denied • Need publisher privilege for group owning the parent dataset • Granted by gateway administrator • Logging • Publication • Logging to standard output by default • Define log_filename for file output • TDS • <TDS_CONTENT>/content/thredds/logs/ • Typically <TDS_CONTENT> = /esg or /usr/local/tomcat • Tomcat • $CATALINA_HOME/logs

  26. Resources • Data node documentation: http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/ • Publisher configuration reference: http://www2-pcmdi.llnl.gov/Members/bdrach/.personal/esg-publisher-configuration/ • CMIP5 controlled vocabulary: http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/cmip5_controlled_vocab.txt/view • CMIP5 publication best practices: http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/cmip5-best-practices/ • CMIP5 documentation: http://cmip-pcmdi.llnl.gov/cmip5/ • Data Reference Syntax (DRS): http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf • ESGF: Earth System Grid Federation: http://esgf.org/ • Node wiki has troubleshooting help: http://esgf.org/wiki/Cmip5DataNode

  27. Handlers, Customization • Handler: Python class that encapsulates project-specific logic for: • Controlling what metadata is associated with a project, how it is read (project handler) • basic_builtin, ipcc4_builtin, ipcc5_builtin • project_handler_name = ipcc5_builtin • I/O for specific formats (format handler) • netcdf_builtin reads netCDF files • format_handler_name = netcdf_builtin • Metadata standards • metadata_handler_name = cf_builtin • THREDDS catalog hook: user-supplied function modifies TDS catalog • Each type of handler may be customized • esgsetup --handler creates skeleton package • Fill in required classes, functions • Creates a Python package, independent of esgcet • Requires knowledge of Python • http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/customizing-the-esg-publisher-with-handlers/

  28. CMIP5-Specific Publication • Follow the DRS specification • dataset_id = cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s.%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s • Directory layout • Publisher allows any layout • Good idea to follow DRS-recommended layout if possible • Drslib has tools to manage DRS-style layout: http://esgf.org/esgf-drslib-site/ • Use date-style versions • version_by_date = true • Generate data with CMOR • Make sure esg.ini is up-to-date with CMIP5 controlled vocabulary

  29. Publication to PCMDI Gateway • Install the latest esgcet package • Check that the CMIP5 project configuration is up-to-date • Each publishing institution for the PCMDI gateway has an associated group: • BCC, CCCMA, CMCC, CNRM, DIAS, GFDL, NCCS • Different from data-producing institution: DIAS publishes MIROC and MRI data • Each institution has publication (write) access, optional administrative access • Publishing institution chooses group name • A top-level dataset exists for each group: • pcmdi.BCC, pcmdi.CCCMA, … • Initial read access is restricted, for test publications. • When datasets are ready for distribution, read access will be granted to the CMIP Research group.
