File Formats, Conventions, and Data Level Interoperability
ESDSWG New Orleans, Oct 20, 2010
Joe Glassy, Chris Lynnes, ESDSWG Tech Infusion
Introduction & overview
• Outline of objectives:
  • Discuss the role of standard, self-describing "file formats" in data-level interoperability
  • Summarize common file formats in use, their properties, and benefits ("data life-cycle economics")
  • Discuss criteria for choosing a file format and matching it to the needs of consumers and producers
  • Discuss the critical role of conventions: any file format needs good recipes to make it interoperable!
  • Examples: NASA Measures F/T, SMAP, AIRS, Aura
Role(s) of File Formats in Interoperability
• File formats represent versatile "packages" for multi-dimensional science data and metadata.
• They offer self-describing "well-known structures" to codify desired, common conventions and practices.
• They offer well-documented reference cases to encapsulate specific data models.
• Standard file formats dock with format-aware tools to offer users a seamless end-to-end experience and platform portability.
• They enhance mission-to-mission continuity.
Why (and how) are file formats important?
• Standard formats:
  • Come with thorough documentation
  • Provide good reference implementations
• Common formats:
  • More datasets in a format → more tools that read that format
  • Canonical structures and names → general-purpose handlers for coordinates, etc. → smarter tools
A generic workflow…
• Consider user community needs and culture, fit within the architecture, and institutional policies and preferences
• Choose a standard file format (or sub-variant)
• Design a convention-enabled, specific internal layout with metadata interfaces
• Prototype: implement a prototype and evaluate it
• Implement in a production context
• Integrate within discovery and catalog environments (catalog interoperability…)
Examples of standard file formats
• HDF5: a file format in its own right, as well as a broad foundation for others
• netCDF v4 (stable at v4.1.1; newest: v4.1.2-beta1)
  • v4 Classic (widespread adoption, some limitations…)
  • v4 Enhanced (supports groups, user-defined and variable-length types, and more)
• netCDF v3 Classic (long legacy and broad tool support, but limited)
• HDFEOS2, HDFEOS5: EOS Terra, Aqua, Aura…
• HDF4: legacy, extensive use by MODIS Terra and Aqua
• Many other domain-specific, less generic formats abound… (and may need transform tools to/from HDF)
Some selection criteria…
• Do the file format's capabilities support the required functionality?
• What is the breadth of acceptance and adoption within the larger community? (And/or does institutional policy dictate a specific format?)
• What are the presence and quality of documentation (reference, examples, and especially tutorials), API software, and community support?
• What is the contribution to investment and data life-cycle economics?
• What is the level of standardization?
• How adaptable is the format to widely used conventions such as CF 1.x or other accepted conventions?
Internal layout / design (once a format is chosen and adopted…)
• Define and refine the high-level organization/structure:
  • /DATA
  • /METADATA
• Distinguish 'data' from 'metadata', core structure vs. 'attributes'
• Dimensions, coordinate variables, projection attributes
• missing_data, _FillValue vs. internal fill value
• Units, gain, offset, min, max, range, etc.
• Prototype it! (see the sketch below)
  • Leverage scripting environments (Python h5py, PyTables, etc.)
  • Panoply and HDFView are also quick and useful for prototyping and feedback
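A minimal prototyping sketch in Python with h5py, along the lines of the /DATA and /METADATA layout above; the file name, dataset name, array shape, and attribute values are illustrative only, not a fixed product specification:

  import numpy as np
  import h5py

  # Prototype the high-level layout: separate groups for data and metadata
  with h5py.File("prototype_layout.h5", "w") as f:
      data_grp = f.create_group("DATA")
      meta_grp = f.create_group("METADATA")

      # A 2-D data variable with a fill value and common per-variable attributes
      ft = data_grp.create_dataset("FT_example", shape=(720, 1440),
                                   dtype="uint8", fillvalue=255)
      ft.attrs["units"] = "dimensionless"
      ft.attrs["_FillValue"] = np.uint8(255)
      ft.attrs["valid_range"] = np.array([0, 3], dtype=np.uint8)

      # Collection-level attributes kept apart from the science data
      meta_grp.attrs["title"] = "Prototype layout (illustrative)"
      meta_grp.attrs["Conventions"] = "CF-1.4"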
Using "Groups"
• HDF5 (and netCDF v4 Enhanced) support full use of groups, e.g. /DATA vs. /METADATA, etc.
• Groups are useful for partitioning out functionally related sets of data or attributes; the hierarchical view mimics a file system (see the listing sketch below)
• Facilitates appropriate information hiding: highlight the needed information, shield the rest (principle of least privilege…)
• Well supported by modern tools (Panoply, HDFView, PyTables, h5py) and low-level APIs
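As a quick check that a grouped layout comes out as intended, a small h5py traversal (a hedged sketch; the file name refers to the prototype file above) prints the same hierarchical, file-system-like view that Panoply or HDFView would show:

  import h5py

  def show(name, obj):
      # "name" is the path of each group or dataset relative to the root
      kind = "group" if isinstance(obj, h5py.Group) else "dataset"
      print("/" + name, "(" + kind + ")")

  with h5py.File("prototype_layout.h5", "r") as f:
      f.visititems(show)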
Example(s) of File Formats In Action
• HDF5: NASA Measures
  • NASA Measures Freeze/Thaw (soon available at NSIDC)
  • http://measures.ntsg.umt.edu/sample_2007_day180.zip
• AQUA AIRS Level 2 (from an earlier talk):
  • http://airspar1u.ecs.nasa.gov/opendap/Aqua_AIRS_Level2/AIRX2RET.005/2010/285/AIRS.2010.10.12.090.L2.RetStd.v5.2.2.0.G10286064818.hdf
• Aura TES (TES-Aura_L3-CH4_r0000002135_F01_05.he5)
Example: NASA Measures Freeze/Thaw, Daily, in HDF5 (screenshot: metadata block and its attributes)
Example: NASA Measures Daily Freeze/Thaw in HDF5 (screenshot: data variable FT_SSMI and its attributes)
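A hedged sketch of how the attribute views in the two Freeze/Thaw screenshots could be reproduced programmatically with h5py; the local file name is a placeholder for the unzipped sample granule, and the exact path of the FT_SSMI variable inside the file may differ:

  import h5py

  with h5py.File("sample_2007_day180.h5", "r") as f:
      # Granule-level metadata block: attributes on the root (or a metadata group)
      for name, value in f.attrs.items():
          print("global:", name, "=", value)

      # Attributes attached to the data variable itself
      ft = f["FT_SSMI"]
      for name, value in ft.attrs.items():
          print("FT_SSMI:", name, "=", value)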
Example: netCDF sea surface temperature (tos), collected by PCMDI for use by the IPCC, illustrating the CF v1.0 layout
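A sketch of reading such a CF-style file with the netCDF4 Python module; the file name and the printed values are placeholders patterned on the PCMDI "tos" sample, not guaranteed exact:

  from netCDF4 import Dataset

  nc = Dataset("tos_sample.nc", "r")
  print(nc.Conventions)                 # e.g. "CF-1.0"
  tos = nc.variables["tos"]
  print(tos.standard_name, tos.units)   # e.g. sea_surface_temperature, K
  print(tos.dimensions)                 # e.g. ('time', 'lat', 'lon')
  nc.close()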
CF Conventions & file formats: how they contribute to interoperability
• CF v1.4.x: the term "CF" is now broader than just climate forecasting!
• The Standard Name Table is a step toward wider adoption of names, controlled vocabularies, and units terminology
• CF v1.4.x provides tool makers with helpful "lingua franca" guidance
• Within a file format, adopting conventions like CF promotes common layout, names, and semantics for dataset-to-dataset compatibility: a key to wider data-level interoperability
Attributes vs. metadata? (one man's ceiling is another man's floor…)
• Collection level vs. data set vs. granule level
• Structural vs. science content
• Swath vs. grid vs. point
• Commonly used attributes (see the sketch below):
  • The Conventions attribute, which communicates which convention was used
  • Basic globals: title, history, institution, source, references
  • Coordinate variables, axis, formula_terms
  • units, _FillValue, missing_data, valid_range
  • short_name, long_name, other provenance
  • (gain, offset / scale_factor, add_offset), etc.
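A minimal sketch of attaching the commonly used global and variable attributes listed above, using the netCDF4 Python module; all names, dimensions, and values are illustrative:

  import numpy as np
  from netCDF4 import Dataset

  nc = Dataset("example_cf.nc", "w", format="NETCDF4_CLASSIC")

  # Basic global attributes
  nc.Conventions = "CF-1.4"
  nc.title = "Illustrative CF-style product"
  nc.history = "2010-10-20: created for this ESDSWG example"
  nc.institution = "Example institution"

  nc.createDimension("lat", 180)
  nc.createDimension("lon", 360)

  # Per-variable attributes: names, units, packing, and valid range
  var = nc.createVariable("tos", "i2", ("lat", "lon"), fill_value=-999)
  var.long_name = "sea surface temperature"
  var.units = "K"
  var.valid_range = np.array([0, 4000], dtype=np.int16)
  var.scale_factor = 0.01
  var.add_offset = 273.15

  nc.close()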
Challenges? (just a few remain…)
• Evolution, bifurcation, and asymmetric support can result in occasional user confusion:
  • HDF5 v1.8.x vs. v1.6.x families?
  • netCDF v4 Enhanced vs. netCDF v4 Classic vs. v3?
  • HDFEOS5 vs. HDFEOS2?
• Both GUI tool and API support tend to vary by platform (Linux, Mac, Win7) and sub-flavor…
• Multi-library dependency stacks beg for a fully bundled, version-matched, end-to-end install package!
• The conventions community (CF v1.4.x) and metadata standards communities are also in motion (but that's good too…)
Resources: URLs
• Climate and Forecast (CF) Conventions (now at 1.4.x):
  • http://cf-pcmdi.llnl.gov/
  • http://cf-pcmdi.llnl.gov/documents/cf-conventions
• HDF:
  • http://www.hdfgroup.org/HDF5/doc/index.html
• HDFEOS:
  • http://www.hdfgroup.org/hdfeos.html
  • http://hdfeos.org/software/aug_hdfeos5.php
• NetCDF:
  • http://www.unidata.ucar.edu/software/netcdf/
  • http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html
• General:
  • http://www.oceanteacher.org/OTMediawiki/index.php/Self-Describing_Formats
  • http://en.wikipedia.org/wiki/List_of_file_formats
Resources: file-format-related tools
• Panoply: http://www.giss.nasa.gov/tools/panoply/
• HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/
• OPeNDAP: http://opendap.org
• IDV: http://www.unidata.ucar.edu/software/idv/
• McIDAS: http://www.unidata.ucar.edu/software/mcidas/
• Python:
  • h5py: http://code.google.com/p/h5py/, http://h5py.alfven.org/
  • PyTables: http://www.pytables.org/moin
• Perl: PDL-IO-HDF5 (and BioHDF?)
• Many others: HEG, MTD, the HDFEOS plug-in for HDFView, HDFLook, (ncdump, h5dump, and cousins), GrADS, MATLAB, binary APIs
A provisional DOI, UUID strategy
• What we used for NASA Measures Freeze/Thaw, daily (v2), just delivered:
• DOI: assigned to our reference paper by IEEE Transactions on Geoscience and Remote Sensing
• UUID recipe: seedString = www.our.url/GranuleName/Datetime8601Stamp
  (uuid.uuid5 needs a namespace as well as a name; NAMESPACE_URL is a natural choice for a URL-style seed)

  import uuid
  granule_uuid = uuid.uuid5(uuid.NAMESPACE_URL, seedString)
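A runnable version of the recipe above; the URL, granule name, and timestamp in the seed string are placeholders following the www.our.url/GranuleName/Datetime8601Stamp pattern from the slide:

  import uuid

  # Build the seed string from the data URL, granule name, and an ISO 8601 timestamp
  seed_string = "www.our.url/sample_2007_day180/2007-06-29T00:00:00Z"

  # uuid5 gives a deterministic, name-based UUID: the same seed always yields the same ID
  granule_uuid = uuid.uuid5(uuid.NAMESPACE_URL, seed_string)
  print(granule_uuid)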