EXPRESS/Binary Report

EXPRESS/Binary Report David Price ISO TC184 SC4 Toulouse June 2006

Agenda • Status since last ISO STEP meeting in Italy (added) • Walkthrough of current EXPRESS/HDF5 mapping • Presentation of prototypes and testing results • Issue discussion for next draft of mapping • Next actions and plans for testing

March 2006 Italy STEP Meeting Report Items • Workshop hosted by HDF Group • Workshop Dec 6-8, 2005 • Champaign, Illinois, USA • STEP, ESA, commercial, EXPRESS/Binary and HDF 5 developer attendees • Agenda was • Introduced HDF Group to EXPRESS language and STEP information models • HDF developers provided overview of HDF 5 Concepts and Structures • Walkthrough of EXPRESS/HDF Mapping Draft 0.2 • Presentation by domain experts : AP209 Analysis, STEP TAS, SINDA/G, Ship AP Analysis Needs • Issues/requirements around APIs, programming languages, etc.

Summary Reported at March 2006 Italy STEP Meeting • Many core issues on V0.2 spec addressed at the Dec 2005 workshop at HDF Group US facilities • The basic approach was flawed, V0.2 did not use enough of the HDF capability • V0.3 will be an improvement and should allow better control of efficiency by the application • http://www.exff.org/express_binary • Prototyping will follow V0.3

March 2006 Italy STEP Meeting Action Items • David Price – Publish EXPRESS/HDF Mapping V0.3 due March 24 • Mats Lindeblad – Create New Work Item for June SC4 meeting • David Price - contact Hans-Peter about linking a one-day workshop with the NASA/ESA PDE at the end of April (a day before Monday?) • Keith Hunten – plan session at Eng Analysis sessions at PDES, Inc. Offsite end of March • David/Mats – plan for technical work at June SC4 meeting

Progress Since March • V0.3 published • Short requirements session at PDES, Inc. Offsite where the EA team prioritized • Add SELECT • Add redefined attributes (does HDF support this?) • Add schema version attribute (may use URN) • What kind of metadata does NARA require? • National archives project • Also, need EXPRESS-to-C software to lower the barrier to participating in prototyping

Progress Since March (2) • One-day workshop held with pyEXPRESS prototype team led by Alain Fagot and Hans-Peter • David Price Slides/Notes are available • Post-workshop plan to produce V0.4 • EA requirements • better examples • Incorporate feedback/issues from pyEXPRESS • Editor (i.e. David Price) could not provide sufficient time to the project to produce V0.4 or the EXPRESS-to-C software before June vacation • V0.31 was published June 9, adding a proposal for a subset of SELECT types (one of the EA team priorities)

Current Mapping Walkthrough

Prototypes and Testing results • pyEXPRESS testing (slides from PDE workshop) • Subset of EXPRESS (e.g. no complex instances) • Based on pyTables 1.3, HDF 1.6.5, Python 2.4 • Using same EXPRESS-based API for P21 and HDF access • HDF is just another backend to the pyEXPRESS API • This is a different approach from what is assumed by the EXPRESS/Binary team, where direct HDF API access was assumed (is “programmer ease of use” a very high priority?) • Compression (using ZLIB) and chunking make files smaller and more efficient for read/write • Even PC processors are powerful enough that decompression is faster than file access, as HDF lets you read into memory only what you need at any given time • Benchmarks show good results (e.g. 10-50% file size and 75% access times), but also identify areas in the mapping that need improvement (e.g. small HDF files are bigger than P21 and sometimes slower) • STEP TAS will be a NWI in SC4 starting soon
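The chunking and compression point can be illustrated with a minimal sketch. It uses only the Python standard library (zlib and struct) as a stand-in for HDF chunked datasets; it is not the pyEXPRESS or pyTables code, and the chunk size and function names are assumptions made for illustration.

```python
import struct
import zlib

CHUNK_ROWS = 1000  # hypothetical chunk size: number of values per chunk


def write_chunks(values, chunk_rows=CHUNK_ROWS):
    """Compress a list of floats chunk by chunk and return the compressed chunks."""
    chunks = []
    for start in range(0, len(values), chunk_rows):
        block = values[start:start + chunk_rows]
        raw = struct.pack("<%dd" % len(block), *block)  # 8 bytes per value
        chunks.append(zlib.compress(raw))
    return chunks


def read_value(chunks, index, chunk_rows=CHUNK_ROWS):
    """Decompress only the chunk that holds `index` and return that value."""
    chunk_no, offset = divmod(index, chunk_rows)
    raw = zlib.decompress(chunks[chunk_no])
    return struct.unpack_from("<d", raw, offset * 8)[0]


values = [float(i % 7) for i in range(10000)]
chunks = write_chunks(values)
raw_size = 8 * len(values)
compressed_size = sum(len(c) for c in chunks)
```

Reading one value touches a single chunk rather than the whole array, which is the access pattern the benchmark bullet describes. The data here is artificially repetitive, so the size ratio says nothing about real STEP files.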

Issue discussion for next draft of mapping • <Technical work goes here> • David can edit source XML for V0.4 draft to include issue resolution we develop today • EA needs • Check V0.31 SELECT support (DONE) • Add redefined attributes (does HDF support this?) (DONE) • Add schema version attribute (may use URN) • pyEXPRESS Cannes issues • Object ID (i.e. pointers) handling code ID = Integer + string (string is pyTable name, generated from EXPRESS name) (DONE) • Unset values for each datatype within the file (DONE)

Issue discussion for next draft of mapping (2) • Issues • Complex/partial entity instances (ANDOR) (DONE) • David Issue = (Multiple) Inheritance? Had something to do with select types. (DONE) • Defined type of array “TYPE x = aggregate of whatever” (TODO) • Complicated types for array values e.g. SELECT (REAL, INTEGER, ENTITY INSTANCE) (DONE) • We will use the same generic object identifier approach to handle these as to handle complicated SELECT types. • Variable length string • HPdK thinks that these cannot be put in an HDF Compound Datatype. Georg found where the UG seems to say this is allowed (7.1 Complex combinations of datatypes). Maybe it’s a limitation of pyTables? • The current mapping says use Variable length datatypes but it’s not clear if that’s allowed in a Compound Datatype. • We may have to use the general purpose object id capability and have a dataset somewhere containing varying length strings (or find another solution). It does look like you may have to specify the maximum length of the varying length strings. • (DEFER TO EMAIL WITH HDF)

Instance identifiers • Every hdf5 link and hdf5 dataset has an hdf5 object id that is an unsigned 32/64 bit integer • Issue : Is there a problem with using a 64 bit integer as part of entity instance ids on a 32 bit platform (i.e. does this place a limit on file size or interoperability?) • H-P thinks the object ids are managed inside a hash table in HDF • Also thinks the object id is not exposed in the hdf API everywhere that we need it • Proposal is to use a tuple of integers that can be used for both an entity instance id and a pointer into the aggregates • (hdf object id, row index)
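A minimal sketch of the proposed (hdf object id, row index) tuple, assuming both parts are stored as unsigned 64-bit little-endian integers. The names InstanceRef, pack_ref and unpack_ref and the fixed 64-bit layout are illustrative assumptions; the slide leaves the width question open.

```python
import struct
from typing import NamedTuple


class InstanceRef(NamedTuple):
    """Hypothetical instance identifier: which dataset, and which row in it."""
    object_id: int  # stand-in for the hdf5 object id of the dataset
    row_index: int  # row of the instance (or aggregate member) in that dataset


# Assumption: two little-endian unsigned 64-bit integers, so the stored form
# has the same size and byte order on 32-bit and 64-bit platforms.
_FORMAT = "<QQ"


def pack_ref(ref):
    """Encode a reference as 16 bytes."""
    return struct.pack(_FORMAT, ref.object_id, ref.row_index)


def unpack_ref(data):
    """Decode 16 bytes back into a reference."""
    return InstanceRef(*struct.unpack(_FORMAT, data))


ref = InstanceRef(object_id=2**40 + 7, row_index=12)
encoded = pack_ref(ref)
```

The same tuple can point at an entity instance row or at a row in an aggregate dataset, which is the dual use the proposal describes.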

Complicated Select types • TYPE x = SELECT OF (REAL, INTEGER, LIST OF BOOLEAN, e2); • Proposal is to have each base type in a separate HDF dataset in a separate group • Group for REAL, Group for INTEGER, Group for LIST OF BOOL, etc. • It could be configurable • May have a single dataset for ALL integers in the file used in this way • May have a dataset for each attribute used in this way (similar to how the mapping for aggregate attribute values works now) • For cases where every entity instance that has TYPE x as its domain uses the same simple type, you might use the simple type instead of the complicated mapping
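The per-base-type proposal can be sketched as follows. Plain Python lists stand in for the HDF datasets, and the SelectStore class and its (base type, row) reference shape are assumptions for illustration, not part of the mapping draft.

```python
class SelectStore:
    """Hypothetical store with one 'dataset' (a list) per SELECT base type."""

    def __init__(self):
        # base type name -> list of values; stands in for one HDF dataset
        # per base type, each in its own group
        self.datasets = {}

    def put(self, base_type, value):
        """Append a value to the dataset for its base type; return a reference."""
        rows = self.datasets.setdefault(base_type, [])
        rows.append(value)
        return (base_type, len(rows) - 1)

    def get(self, ref):
        """Resolve a (base type, row index) reference back to the value."""
        base_type, row = ref
        return self.datasets[base_type][row]


store = SelectStore()
r_real = store.put("REAL", 1.5)
r_int = store.put("INTEGER", 42)
r_list = store.put("LIST OF BOOLEAN", [True, False])
r_int2 = store.put("INTEGER", 7)
```

An attribute whose domain is TYPE x would then hold only the reference, while the value lives in the dataset for its actual base type. Whether there is one dataset per file or one per attribute is the configurable choice listed on the slide.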

Redeclared attributes • Redeclaration things we can address • specialize the attribute domain • Write the encoding of the specialized value in the HDF compound type representing the subtype • type is subtype of original • We only use the object identifier everywhere so this is no problem • rename of attribute • Use new name in HDF compound data type for the subtype • Explicit to derived • Do not put the attribute in the HDF5 compound data type and do not store a value
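The rename and explicit-to-derived rules can be sketched as a function that computes the field names of the compound type for a subtype. The function name, its parameters and the example attribute names are hypothetical.

```python
def subtype_fields(supertype_fields, own_fields, renamed=None, derived=()):
    """Return the compound-type field names for a subtype.

    supertype_fields: attribute names inherited from the supertype
    own_fields:       attribute names declared by the subtype itself
    renamed:          mapping old name -> new name (redeclaration with rename)
    derived:          inherited explicit attributes redeclared as derived
    """
    renamed = renamed or {}
    fields = []
    for name in supertype_fields:
        if name in derived:
            # Explicit to derived: no field in the compound type, no stored value.
            continue
        # Rename: use the new name in the compound type for the subtype.
        fields.append(renamed.get(name, name))
    fields.extend(own_fields)
    return fields


fields = subtype_fields(
    ["name", "size"],            # hypothetical supertype attributes
    ["age"],                     # hypothetical subtype attribute
    renamed={"name": "label"},
    derived={"size"},
)
```

Specializing the attribute domain needs no change to the field list in this sketch, since the slide notes that an object identifier is stored in either case.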

ANDOR
SCHEMA test;
  ENTITY a; name : STRING;
  ENTITY b SUBTYPE OF a; age : INTEGER; x : REAL;
  ENTITY c SUBTYPE OF a; height : REAL; x : BOOLEAN;
Results in
  test/a
  test/a/name
  test/b
  test/b/name
  test/b/age
  test/c
  test/c/height
  test/b__c
  test/b__c/name
  test/b__c/age
  test/b__c/height
  test/b__c/b__x
  test/b__c/c__x
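The naming of the complex instance group in the example above can be sketched as a small function. The rule it implements is inferred from the listing (entity names joined with "__", and attribute names that clash between subtypes prefixed with their entity name), so treat it as an assumption rather than the text of the mapping.

```python
from collections import Counter


def complex_instance_paths(schema, inherited, subtypes):
    """Build HDF paths for a complex (ANDOR) instance.

    schema:    schema name, e.g. "test"
    inherited: attribute names from the shared supertype
    subtypes:  dict of entity name -> list of its own attribute names
    """
    group = schema + "/" + "__".join(subtypes)
    # Count how many subtypes declare each attribute name.
    counts = Counter(attr for attrs in subtypes.values() for attr in attrs)
    paths = [group]
    paths += [f"{group}/{attr}" for attr in inherited]
    # Attribute names declared by only one subtype keep their plain name.
    for attrs in subtypes.values():
        paths += [f"{group}/{attr}" for attr in attrs if counts[attr] == 1]
    # Clashing names are qualified with the declaring entity name.
    for entity, attrs in subtypes.items():
        paths += [f"{group}/{entity}__{attr}" for attr in attrs if counts[attr] > 1]
    return paths


paths = complex_instance_paths(
    "test", ["name"], {"b": ["age", "x"], "c": ["height", "x"]}
)
```

For the schema on the slide this yields the test/b__c group with name, age, height, b__x and c__x, matching the last six lines of the listing.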

Next actions and plans for testing • pyEXPRESS testing is based on pyTables; there is a C Tables API … Should our other testing be based on that? • Can/should we set up another workshop with HDF Group to complete mapping? • DP Action to talk to Mike Folk about doing something prior to the ISO in October (we remember him saying there was a workshop in DC) • What do testers need to help get them started? • EXPRESS-to-C has been mentioned (if we use C Tables API that’s not useful) • Training? • Test data? • Schemas? • Closing plenary slides for Friday • NWI – Will be created and circulated via telecon before the next ISO STEP meeting.

Notes from Meeting • Are there other sources of MetaData? • Are there other archiving (e.g. NARA) or LTDR standards (e.g. LOTAR)? • If you treat HDF as a “database” what is needed? • What about internal company meta-data? • What about Web-based standards (e.g. Dublin Core)? • Should we just include a generic meta-data “name-value pair” capability? • What about non-STEP data in the same file that the STEP data references (e.g. jpegs)? • Where multiple mappings are still being tested, it is OK to include more than one in the specification. • The specification is currently a guide for prototype testers, not a draft standard. • What are the highest priority requirements? “Performance”, but performance and efficiency of exactly what?

Notes from Meeting (2) • We may need to add some HDF attributes to the Groups and Datasets when they are written to help readers (e.g. number of instances of an entity type that were written) • C Tables API uses this approach so we should look at that to see if we can learn anything for our use. • We need to have more discussion about whether to allow or require writing inverse attribute values into the file, nothing is done there now. • For “read-only files” inverses could be a nice optimization. • Would we need to allow this to be configured? If so, how? • What about the “unnamed inverse” that EXPRESS says exists?
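Both ideas on this slide, per-type instance counts and stored inverses, can be computed in one pass at write time. The sketch below assumes a simple in-memory instance list; the summarize function and its tuple layout are hypothetical, and how the results would be written as HDF attributes or datasets is left open, as in the notes.

```python
from collections import Counter, defaultdict


def summarize(instances):
    """Compute reader hints from a list of instances.

    instances: list of (entity_type, instance_id, [ids this instance refers to])
    Returns (count per entity type, inverse map: target id -> referring ids).
    """
    counts = Counter(entity for entity, _, _ in instances)
    inverse = defaultdict(list)
    for _, instance_id, refs in instances:
        for target in refs:
            inverse[target].append(instance_id)
    return dict(counts), dict(inverse)


counts, inverse = summarize([
    ("a", 1, []),
    ("b", 2, [1]),
    ("b", 3, [1, 2]),
])
```

For a read-only file the inverse map could be written once and never updated, which is why the notes call it an optimization for that case.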

Action Items • HPdK – Find out how to implement the object id using the HDF 5 API • DP – Find email thread on entity instance identifiers from a year ago, it might be useful for the new proposal • AF – Write text to describe the multi-dataset approach to Aggregate Instances, email to DP who will add to spec V0.4 • DP – Read the “fixme” items from the meeting and fix them. • HPdK – Put example HDF5 files on the Web somewhere for others to view. Mapping document too. • ML – Look at what Vivace stuff can be published publicly. • ML – Look at what can be published to the Vivace Forum 2 (unfortunately, these are the same dates as Hershey).
