CMIP5 / ESG-CET Publication Tutorial
Bob Drach, PCMDI / LLNL
March 14, 2011
ESG-CET Architecture
• Gateways support centralized services:
  • Portal
  • Authn / Authz
  • Search
  • Metadata harvesting
  • Web services
• Nodes are close to the data
  • Publishing
  • Data servers
    • THREDDS
    • DAP: Hyrax, PyDAP
  • Visualization and computation
    • LAS
    • CDAT, NCL, Ferret, …
• Gateways and nodes can be co-resident
Diagram labels: Portal, Gateway, Gateway Database, MyProxy, Web Services, Harvester, THREDDS, LAS, Data Node, THREDDS Catalogs, Publisher, Node Database, gridFTP, Proxy Certificate, Data Archive
Terminology
• Publication makes datasets visible on the gateway (portal).
  • Only metadata is transferred
• A dataset is a collection of files.
• Datasets have versions.
  • The 'unit of publication' is a version of a dataset.
  • Versions: monotonically increasing integers, may be YYYYMMDD
• Datasets have string dataset identifiers that are unique system-wide.
• A category is a field that is searched on the gateway.
  • Ex: time_frequency, realm, experiment, …
• Projects are activities that generate datasets.
  • Ex: CMIP5, CMIP3
  • Associated with a set of categories
• An experiment describes the input conditions (initial conditions, forcing, time period, …) of a climate model experiment.
• Data is generated by a climate model, or comes from observations or reanalyses.
• Other project-specific metadata may be associated with datasets.
Node Architecture
• Publisher
  • Scans data archive
  • Generates metadata catalogs: one catalog per dataset version
  • Notifies the gateway when new catalogs are available
  • Written in Python
• Node database
  • Persistent store of publication information
    • Dataset contents, version history
    • File metadata
    • Catalog locations
    • Publication status
  • Current implementation in Postgres
  • May co-exist with gateway DB
Diagram labels: Harvester, Gateway, Web Services, THREDDS, LAS, THREDDS Catalogs, Publisher, Node Database, gridFTP, Proxy Certificate, Data Archive
Publication Process
• Scan directories
  • Create a list of files to be published
  • Associate each file with a dataset
  • Generate a mapfile (optional)
• Scan data
  • Read metadata from files
  • Populate node database
• Generate metadata THREDDS catalogs
• Publish datasets
  • Requires valid proxy certificate, obtained with myproxy-logon
  • Notifies gateway to harvest metadata
Publisher components
• Scan directories / files to produce a mapfile
  % esgscan_directory [--read-files] [options] directory [directory ...]
• Extract metadata from files, populate node database
  % esgpublish --map mapfile --project cmip5
• Generate THREDDS catalogs
  % esgpublish --map mapfile --noscan --thredds
• Notify gateway
  % esgpublish --map mapfile --noscan --publish
• Publisher GUI includes all publisher functionality
• All scripts have --help options
Diagram labels: Web Services, esgquery_gateway, THREDDS, esgpublish --publish, esgunpublish, esgpublish --thredds, THREDDS Catalogs, Proxy Certificate, Node Database, esgpublish, myproxy-logon, esginitialize, esgsetup, mapfile, esglist_datasets, esglist_files, esgscan_directory, Data Archive
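The four command-line steps above can be sketched as a small Python helper that only builds the argument lists and does not run anything. The function name publication_commands and its parameters are hypothetical; the commands and flags are the ones shown on the slides (the -o option comes from the Mapfiles slide), and a real installation may need further options.

```python
import shlex


def publication_commands(mapfile, directories, project="cmip5"):
    """Return the publication steps, in order, as argv lists (nothing is executed)."""
    return [
        # 1. Scan directories / files to produce a mapfile.
        ["esgscan_directory", "--read-files", "-o", mapfile] + list(directories),
        # 2. Extract metadata from files and populate the node database.
        ["esgpublish", "--map", mapfile, "--project", project],
        # 3. Generate THREDDS catalogs without rescanning.
        ["esgpublish", "--map", mapfile, "--noscan", "--thredds"],
        # 4. Notify the gateway (needs a valid proxy certificate).
        ["esgpublish", "--map", mapfile, "--noscan", "--publish"],
    ]


if __name__ == "__main__":
    for argv in publication_commands("mapfile.txt", ["/data/cmip5"]):
        print("% " + " ".join(shlex.quote(arg) for arg in argv))
```

Keeping the steps as data makes it easy to review the order before handing each list to something like subprocess.run.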
Deleting datasets
• Order of operations is the reverse of publication:
  • Delete from gateway
  • Remove TDS catalog
  • (Optional) delete from node DB
• Delete a dataset from the gateway
  % esgunpublish --skip-thredds cmip5.foo.bar
• Delete a TDS catalog
  % esgunpublish --skip-gateway cmip5.foo.bar
• Delete a dataset entirely, including the node database
  % esgunpublish --database-delete cmip5.foo.bar
Querying
• List all CMIP5 datasets in the node database
  % esglist_datasets cmip5
• List all files in a dataset
  % esglist_files cmip5.output.PCMDI.pcmdi-test.historical.fx.atmos.fx.
• List all datasets in a directory on a gateway
  % esgquery_gateway [--service-url gateway_service] --list pcmdi.CCCMA
• List all files in a gateway dataset
  % esgquery_gateway --files cmip5.output2.CCCma.CanESM2.rcp85.mon.land.Lmon.r5i1p1
THREDDS catalogs
• Layout
  • THREDDS Master Catalog
  • ESG Root Catalog
    thredds_root = $ESGF_HOME/content/thredds/esgcet
  • Under thredds_root: /1 /2 /3 …
• Reinitialization loads all catalogs into the TDS
  thredds_reinit_url = https://localhost:443/thredds/admin/debug?catalogs/reinit
CMIP5 Metadata
• CMIP5 DRS (Data Reference Syntax) defines the naming system for CMIP5 dataset identifiers, files, directories, URLs, metadata, …
• CMIP5 controlled vocabulary is derived from the DRS document
  • Permitted values for experiments, models, institutions, …
  • Should be consistent with publisher configuration
    • esg.ini
    • esgcet_models_table.txt
    • cf-standard-name-table.xml
• CMOR (Climate Model Output Rewriter)
  • Generates CMIP5-compliant data in netCDF format
  • Fortran-90, C, Python interfaces
Publisher configuration, setup
• Publisher locates esg.ini in this order:
  • Environment variable ESGINI
  • $HOME/.esgcet/esg.ini
  • <PYTHON>/lib/python2.X/site-packages/esgcet/config/etc/esg.ini
  • esg.ini in working directory
• esgsetup creates esg.ini
  • Run by esg-node installation script
  • If no existing configuration, starts with <PYTHON>/lib/python2.X/site-packages/esgcet/config/etc/template.ini
    • Created in $HOME/.esgcet/esg.ini
  • Otherwise updates existing esg.ini
  • Options:
    • --config: create initial configuration
    • --db: initialize database
    • --thredds: initialize THREDDS server
    • --publish: configure gateway-related options (service URLs, myproxy, etc.)
    • --handler: create customized handlers
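The lookup order above can be illustrated with a short Python sketch. This is not the publisher's own code: find_esg_ini and its parameters are hypothetical, and the existence check is injectable so the order can be tried without touching the file system.

```python
import os


def find_esg_ini(env, home, site_packages, cwd, exists=os.path.isfile):
    """Return the first existing esg.ini candidate, in the order listed on the slide."""
    candidates = []
    if env.get("ESGINI"):
        # 1. Explicit path from the ESGINI environment variable.
        candidates.append(env["ESGINI"])
    # 2. Per-user configuration.
    candidates.append(os.path.join(home, ".esgcet", "esg.ini"))
    # 3. Copy installed with the esgcet package.
    candidates.append(os.path.join(site_packages, "esgcet", "config", "etc", "esg.ini"))
    # 4. esg.ini in the working directory.
    candidates.append(os.path.join(cwd, "esg.ini"))
    for path in candidates:
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    # The site-packages path below is a made-up example value.
    print(find_esg_ini(os.environ, os.path.expanduser("~"),
                       "/usr/lib/python2.7/site-packages", os.getcwd()))
```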
Configuration layout
• Section headers
  • [DEFAULT]: Options apply to all sections
  • [project:foo]: Specific to project foo
    • Project name(s) are listed in project_options in [DEFAULT]
  • [initialize]: Locations of model and standard name tables
  • [extract]: File scan phase (metadata extraction)
    • Enable detailed logging
  • [srmls]: Listing SRM files
  • [hsi]: Listing HSI (HPSS mass store)
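The way [DEFAULT] options apply to every section can be seen with Python's standard configparser module, which reads this INI style. The sample below is a minimal sketch: the section names follow the slide, but the log_level option and all values are illustrative assumptions, not a real esg.ini.

```python
import configparser

# Illustrative configuration text; the option values are made up for the example.
SAMPLE = """
[DEFAULT]
log_level = WARNING

[project:cmip5]
version_by_date = true

[extract]
log_level = DEBUG
"""


def load_config(text):
    """Parse INI text and return the ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


if __name__ == "__main__":
    config = load_config(SAMPLE)
    # Defined in the project section itself.
    print(config["project:cmip5"]["version_by_date"])
    # Not defined in the project section, so the [DEFAULT] value applies.
    print(config["project:cmip5"]["log_level"])
    # Overridden locally in [extract].
    print(config["extract"]["log_level"])
```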
Dataset roots, services
• Dataset roots affect TDS access control, data hiding
  • thredds_dataset_roots =
      root_path | location
      root_path | location
      …
  • Every published file must be under a root location, and is protected by ESG by default
  • Unpublished files under root location(s) are potentially accessible, but are not visible in TDS or the gateway
  • Do not store sensitive unpublished data under a dataset root!
• Services configure access to files or aggregations
  • Simple or compound
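The rule that every published file must sit under a root location can be checked with a sketch like the following. It assumes that the second field of each thredds_dataset_roots entry is an absolute directory on disk; the helper names and the sample roots are hypothetical.

```python
import posixpath


def parse_dataset_roots(value):
    """Parse 'root_path | location' lines into (root_path, location) pairs."""
    roots = []
    for line in value.splitlines():
        line = line.strip()
        if not line:
            continue
        root_path, location = [part.strip() for part in line.split("|", 1)]
        roots.append((root_path, posixpath.normpath(location)))
    return roots


def find_root(file_path, roots):
    """Return the root_path whose location contains file_path, or None."""
    file_path = posixpath.normpath(file_path)
    for root_path, location in roots:
        # commonpath compares whole path components, so /esg/database
        # is not mistaken for a child of /esg/data.
        if posixpath.commonpath([file_path, location]) == location:
            return root_path
    return None


if __name__ == "__main__":
    # Made-up example roots.
    roots = parse_dataset_roots("""
        esg_dataroot | /esg/data
        archive_root | /archive/cmip5
    """)
    print(find_root("/esg/data/cmip5/tas.nc", roots))
    print(find_root("/esg/database/tas.nc", roots))
```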
Project configuration
• experiment_options defines experiments for the project
• Categories
  • Metadata fields that will be associated with each dataset
  • Each project has a different set of categories
  • May be mandatory: error if not found during the scan
  • Enumerated categories list their permitted values in XX_options
  • TDS catalog <property> element may be created
  • Basis of gateway search
Project configuration
• Project handler encapsulates logic associated with reading / setting metadata values
  • ipcc5_builtin for CMIP5
  • May be customized
• Format strings
  • %(option)s
  • Option may be defined:
    • In the config file
    • By the handler (dynamically)
  • Example: %(model_description)s
• dataset_id: template for TDS dataset identifiers
  • Fields used in format strings should be mandatory
  • Version added by the publisher
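The %(option)s syntax is the same as Python's dictionary-based string formatting, so the expansion can be sketched directly. The template is the CMIP5 dataset_id shown elsewhere in this tutorial; the expand helper is hypothetical, and the field values are taken from the CanESM2 identifier on the Querying slide.

```python
DATASET_ID = (
    "cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s."
    "%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s"
)


def expand(template, fields):
    """Fill a %(name)s template, reporting any field that is not defined."""
    try:
        return template % fields
    except KeyError as missing:
        # A field used in the template but never set cannot be expanded,
        # which is why such fields are best declared mandatory.
        raise ValueError("undefined field in format string: %s" % missing.args[0])


if __name__ == "__main__":
    fields = {
        "product": "output2", "institute": "CCCma", "model": "CanESM2",
        "experiment": "rcp85", "time_frequency": "mon", "realm": "land",
        "cmor_table": "Lmon", "ensemble": "r5i1p1",
    }
    print(expand(DATASET_ID, fields))
```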
Project configuration
• Maps
  • Mapping (association) from a set of independent fields to a dependent field
  • The dependent field can be used in a format string
  • Form:
    map_name = map(variable_1[, variable_2[, ...]] : variable_n)
        value_1 [ | value_2 [...]] | value_n
        value_1 [ | value_2 [...]] | value_n
• Data file structure
  • One variable per file (CMIP5 standard)
    variable_per_file = true
  • Multiple variables per file
    variable_per_file = false
• Version
  • vYYYYMMDD
    version_by_date = true
  • vN
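To make the map form concrete, here is a minimal parser sketch for the text that follows "map_name =". It is an illustration of the syntax shown above, not the publisher's implementation; parse_map is a hypothetical helper, and the model-to-institute map in the demo is an invented example.

```python
import re


def parse_map(text):
    """Parse a map definition into (independent_fields, dependent_field, table)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header = re.match(r"map\(\s*(.*?)\s*:\s*(\w+)\s*\)$", lines[0])
    if header is None:
        raise ValueError("not a map definition: %r" % lines[0])
    independent = [name.strip() for name in header.group(1).split(",")]
    dependent = header.group(2)
    table = {}
    for row in lines[1:]:
        values = [value.strip() for value in row.split("|")]
        if len(values) != len(independent) + 1:
            raise ValueError("wrong number of values: %r" % row)
        # Key: the independent values; value: the dependent field's value.
        table[tuple(values[:-1])] = values[-1]
    return independent, dependent, table


if __name__ == "__main__":
    # Invented example map, for illustration only.
    independent, dependent, table = parse_map("""
        map(model : institute)
        CanESM2 | CCCma
    """)
    print(independent, dependent, table[("CanESM2",)])
```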
Offline datasets
• Offline datasets: can be listed but not opened for metadata extraction
  • Published with minimal description: location and size
  • No associated aggregations
  • Example: tertiary storage
• Lister: program that generates metadata for offline datasets
  • hsils.py: HPSS
  • srmls.py: SRM
  • msls.py: MSS
  • Listers can be customized
• Configuration:
  • thredds_offline_services: generate TDS catalog <service> element
  • offline_lister: associate service name with [lister] section
  • [lister] section
    • Ex:
      [hsi]
      offline_lister_executable = %(pythonbin)s/hsils.py
      hsi = /usr/local/bin/hsi
• Use --offline, --service options with esgpublish, esgscan_directory
Mapfiles
• A mapfile describes the file contents of one or more datasets
• Generate with:
  % esgscan_directory [--read-files] [options] -o mapfile directory [directory ...]
• File-specific fields
  • Size
  • Modification time: epoch time
  • Checksum (if checksum configuration option set)
  • Checksum type: MD5 (recommended for CMIP5) or SHA1
• Format: one line per file:
  dataset_name | absolute_path | byte_length [ | property=value [ | property=value ...]]
  where properties are:
    mod_time=<epochal_time>
    checksum=<checksum_value>
    checksum_type=<checksum_type>, either MD5 or SHA1
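The one-line-per-file format can be read with a few lines of Python. This is a sketch under the assumption that paths and values never contain "|"; parse_mapfile_line is a hypothetical helper, and the sample line is made up.

```python
def parse_mapfile_line(line):
    """Split one mapfile line into dataset name, path, size and optional properties."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 3:
        raise ValueError("expected at least dataset | path | size: %r" % line)
    dataset_name, absolute_path, byte_length = fields[:3]
    # Remaining fields are property=value pairs (mod_time, checksum, checksum_type).
    properties = dict(field.split("=", 1) for field in fields[3:])
    return {
        "dataset": dataset_name,
        "path": absolute_path,
        "size": int(byte_length),
        "properties": properties,
    }


if __name__ == "__main__":
    # Made-up sample line.
    sample = ("cmip5.foo.bar | /data/cmip5/foo/bar.nc | 1024 | "
              "mod_time=1300000000.0 | checksum=abc123 | checksum_type=MD5")
    print(parse_mapfile_line(sample))
```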
Directory Scan Modes
• esgscan_directory [--read-files | --read-directories] … :
  • Associate dataset identifier(s) with files
  • Create listing of files with sizes, modification times, checksums, etc.
• To generate dataset identifiers, must obtain metadata from either:
  • Directory names (--read-directories), or
  • File metadata (--read-files)
  • Example:
    dataset_id = cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s.%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s
• File metadata: recommended for CMIP5
  • For each file, read metadata from file and generate dataset_id
• Directory names: recommended if file metadata is incomplete
  • For each directory:
    • Match directory_format to directory to generate metadata
    • If directory does not match, no output for that directory
  • Somewhat faster, but harder to debug
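One way to picture the directory-name mode is to turn a directory_format template into a regular expression with one named group per field. This is only a sketch of the idea: the publisher's actual matching rules may differ, the helper names are hypothetical, the directory_format value is invented, and each field name is assumed to appear once in the template.

```python
import re


def format_to_regex(directory_format):
    """Convert a %(name)s template into a regex with one named group per field."""
    # re.split with a capturing group alternates literal text and field names.
    parts = re.split(r"%\((\w+)\)s", directory_format)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)
        else:
            pattern += "(?P<%s>[^/]+)" % part
    return re.compile(pattern + "$")


def match_directory(directory_format, directory):
    """Return the metadata fields for a directory, or None if it does not match."""
    match = format_to_regex(directory_format).match(directory)
    return match.groupdict() if match else None


if __name__ == "__main__":
    fmt = "/data/%(project)s/%(model)s/%(experiment)s"  # invented example
    print(match_directory(fmt, "/data/cmip5/CanESM2/rcp85"))
    print(match_directory(fmt, "/data/cmip5/CanESM2"))  # no match, so no output
```

A directory that yields None here corresponds to the "no output for that directory" case, which is one reason this mode is harder to debug.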
Publishing checksums: two approaches
• First approach: Enable checksum generation by default. In esg.ini [DEFAULT] section:
  checksum = md5sum | MD5
• Problem with first approach: publication may slow significantly.
• Second approach (V2.9.0+): disable checksum option, then:
  • Publish without checksums, initially
  • Generate checksums independently, add to a 'mapfile' foo.txt of the form:
    dataset_name | absolute_path | byte_length | checksum=value | checksum_type=MD5
    …
  • Add the checksums to the node database:
    % esgupdate_metadata foo.txt
  • Republish:
    % esgpublish --noscan --map foo.txt --project cmip5 --thredds --publish
  • Assumes that the dataset has not changed since initial publication
• Query to list checksums:
  % esgquery_gateway --urls dataset_name
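For the second approach, the "generate checksums independently" step could look like the sketch below, which computes MD5 with Python's hashlib (a stand-in for the md5sum program) and emits lines in the form shown above. The helper names are hypothetical, and the dataset name in the demo is a placeholder.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mapfile_line(dataset_name, path):
    """Build 'dataset | path | size | checksum=... | checksum_type=MD5' for one file."""
    path = os.path.abspath(path)
    return " | ".join([
        dataset_name,
        path,
        str(os.path.getsize(path)),
        "checksum=" + md5_of_file(path),
        "checksum_type=MD5",
    ])


if __name__ == "__main__":
    # Demo on a throwaway file; a real run would walk the published files.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.nc")
        with open(sample, "wb") as handle:
            handle.write(b"not really netCDF")
        print(mapfile_line("cmip5.foo.bar", sample))
```

Reading in chunks keeps memory use flat for large model output files.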
Publishing replica datasets
• Replica publication differs from non-replica publication:
  • Maintains the replica version. (By default the publisher generates the dataset version.)
  • Sets catalog properties to flag replicated status
    • Currently sets master_gateway property
• Form of publication command for replication:
  % esgpublish --replica origin_host_id [--version-list versions.txt] other_options …
• --version-list (V2.9.0+):
  • Text file of the form:
    dataset_name | version
    dataset_name | version
    …
• Proposed: add properties to the catalog for origin_host and publishing_host
Publisher GUI
• % esgpublish_gui &
• Uses Tcl/Tk
• Function menu
  • All functionality of publisher scripts
• Dataset window
  • Listing of datasets being processed
  • Select dataset to display / edit metadata
• Output window
  • Standard output, error messages
• Status bar
Publisher GUI
• Metadata editor
  • Display / edit dataset-level (global) metadata
  • Fields are defined in esg.ini:
    • categories option
      name | category_type | is_mandatory | is_thredds_property | display_order
      Ex: experiment | enum | true | true | 1
• Querying
  • Select datasets based on categories (model, experiment, …)
  • Categories are project-dependent
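A categories entry has five pipe-separated fields, which a sketch like this can turn into a typed record. The Category tuple and parse_category are hypothetical names; the sample line is the experiment example from the slide.

```python
from collections import namedtuple

Category = namedtuple(
    "Category",
    "name category_type is_mandatory is_thredds_property display_order",
)


def parse_category(line):
    """Parse 'name | category_type | is_mandatory | is_thredds_property | display_order'."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 5:
        raise ValueError("expected 5 fields: %r" % line)
    name, category_type, mandatory, thredds_property, order = parts
    return Category(
        name,
        category_type,
        mandatory.lower() == "true",
        thredds_property.lower() == "true",
        int(order),
    )


if __name__ == "__main__":
    print(parse_category("experiment | enum | true | true | 1"))
```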
Frequent Questions
• How do I add a new model identifier?
  • Default models table in:
    <PYTHON_HOME>/lib/python2.X/site-packages/esgcet-2.Y.Z-py2.Z.egg/esgcet/config/etc/esgcet_models_table.txt
  • Copy default table to $HOME/.esgcet/esgcet_models_table.txt, add entry
  • In esg.ini [initialize] section:
    initial_models_table = %(home)s/.esgcet/esgcet_models_table.txt
  • % esginitialize -c
  • Similar process for standard names: cf-standard-name-table.xml
• esgscan_directory generates no output
  • Try --read-files option for CMIP5
  • Check directory_format option in esg.ini
• Cannot reinitialize TDS
  • Check thredds_reinit_url, thredds_username, thredds_password in esg.ini
  • Verify directly in browser
Frequent Questions
• Publication
  • Access denied
    • Need publisher privilege for group owning the parent dataset
    • Granted by gateway administrator
• Logging
  • Publication
    • Logging to standard output by default
    • Define log_filename for file output
  • TDS
    • <TDS_CONTENT>/content/thredds/logs/
    • Typically <TDS_CONTENT> = /esg or /usr/local/tomcat
  • Tomcat
    • $CATALINA_HOME/logs
Resources
• Data node documentation:
  http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/
• Publisher configuration reference:
  http://www2-pcmdi.llnl.gov/Members/bdrach/.personal/esg-publisher-configuration/
• CMIP5 controlled vocabulary:
  http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/cmip5_controlled_vocab.txt/view
• CMIP5 publication best practices:
  http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/cmip5-best-practices/
• CMIP5 documentation:
  http://cmip-pcmdi.llnl.gov/cmip5/
• Data Reference Syntax (DRS):
  http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf
• ESGF: Earth System Grid Federation:
  http://esgf.org/
• Node wiki has troubleshooting help:
  http://esgf.org/wiki/Cmip5DataNode
Handlers, Customization
• Handler: Python class that encapsulates project-specific logic for:
  • Controlling what metadata is associated with a project, how it is read (project handler)
    • basic_builtin, ipcc4_builtin, ipcc5_builtin
    • project_handler_name = ipcc5_builtin
  • I/O for specific formats (format handler)
    • netcdf_builtin reads netCDF files
    • format_handler_name = netcdf_builtin
  • Metadata standards
    • metadata_handler_name = cf_builtin
• THREDDS catalog hook: user-supplied function modifies TDS catalog
• Each type of handler may be customized
  • esgsetup --handler creates skeleton package
  • Fill in required classes, functions
  • Creates a Python package, independent of esgcet
  • Requires knowledge of Python
  • http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/customizing-the-esg-publisher-with-handlers/
CMIP5-Specific Publication
• Follow the DRS specification
  • dataset_id = cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s.%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s
• Directory layout
  • Publisher allows any layout
  • Good idea to follow DRS-recommended layout if possible
  • Drslib has tools to manage DRS-style layout:
    http://esgf.org/esgf-drslib-site/
• Use date-style versions
  • version_by_date = true
• Generate data with CMOR
• Make sure esg.ini is up-to-date with CMIP5 controlled vocabulary
Publication to PCMDI Gateway
• Install the latest esgcet package
• Check that the CMIP5 project configuration is up-to-date
• Each publishing institution for the PCMDI gateway has an associated group:
  • BCC, CCCMA, CMCC, CNRM, DIAS, GFDL, NCCS
  • Different from data-producing institution: DIAS publishes MIROC and MRI data
  • Each institution has publication (write) access, optional administrative access
  • Publishing institution chooses group name
• A top-level dataset exists for each group:
  • pcmdi.BCC, pcmdi.CCCMA, …
  • Initial read access is restricted, for test publications.
  • When datasets are ready for distribution, read access will be granted to the CMIP Research group.