ILDG File Format
Chip Watson, for Middleware & MetaData Working Groups
Outline
• The (Real) Requirements
• Soft Requirements
• Issues
• Options
• Status
• Proposal
ILDG 5 Workshop, Chip Watson
The Real File Format Requirements
• Must be able to share configuration files
  • Find and retrieve the files
    • Addressed by meta data catalog (MDC), middleware components
  • Consume (use) foreign files
    • Potential implications on how to produce files & meta data
• Must have a (recommended) way to keep correspondence between binary data in files and the full meta data in the MDC
• Must not keep mutable (changeable) meta data within the binary files
  • Otherwise maintenance is too painful
Soft Requirements
Making foreign files usable: format should…
• Adapt easily to variability in binary data type
  • single / double precision
  • byte ordering (consensus seems to be big endian)
  • 3x3 or 3x2 matrix storage (consensus seems to be 3x3)
• Support data integrity checks
  • CRC, plaquette
• Allow additional (collaboration specific) data to be included
• Make it easy to skip over uninteresting pieces
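As a rough illustration of what "adapting to variability in binary data type" could mean for a reader, the sketch below decodes a buffer under a stated precision and byte order, completes a 3x2 matrix to 3x3, and computes a checksum. The function names are hypothetical, the choice of CRC-32 via zlib is an assumption (the slide only says "CRC"), and the third-row reconstruction relies on the standard SU(3) identity that the third row is the complex conjugate of the cross product of the first two.

```python
import struct
import zlib


def decode_reals(buf, precision=32, endian="big"):
    """Unpack a byte string into Python floats for the given precision and byte order."""
    code = {32: "f", 64: "d"}[precision]
    order = {"big": ">", "little": "<"}[endian]
    count = len(buf) // (precision // 8)
    return list(struct.unpack(f"{order}{count}{code}", buf))


def to_complex(reals):
    """Pair up interleaved (re, im) values (an assumed layout) into complex numbers."""
    return [complex(re, im) for re, im in zip(reals[0::2], reals[1::2])]


def complete_su3(two_rows):
    """Rebuild a 3x3 SU(3) matrix from its first two rows (6 complex numbers)."""
    a, b = two_rows[:3], two_rows[3:6]
    # Third row = conj(a x b), which holds for any SU(3) matrix.
    c = [
        (a[1] * b[2] - a[2] * b[1]).conjugate(),
        (a[2] * b[0] - a[0] * b[2]).conjugate(),
        (a[0] * b[1] - a[1] * b[0]).conjugate(),
    ]
    return [a, b, c]


def crc32_of(buf):
    """CRC-32 of the raw bytes (one possible integrity check, not a mandated one)."""
    return zlib.crc32(buf) & 0xFFFFFFFF
```

A reader built this way takes precision and byte order from the file's meta data rather than hard-coding them, which is the kind of adaptability this slide asks for.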
Issues
• How to incorporate legacy data?
  • Convert & re-store?
  • Provide conversion utility (convert at use)?
• How to include collaboration specific preferences or standards?
  • Certainly want to avoid double storing data (collaboration specific format, and ILDG format)
• Simplicity vs flexibility…
  • Flexibility (to address everyone’s desires) comes at a price; can the price be kept low enough?
General Approaches
• Virtual shared format (different formats, common way to read, hide actual storage format)
  • binX as universal reader
    • Collaborations provide binX description
  OR
  • C code as reader
    • Collaborations provide C code
    • Need to develop a common calling convention (API)
• Physical shared format
  • Data retrieved within ILDG is in this format
  • May require double storage, or conversion on the fly
  OR
  • Translation tools are provided by each group
Option 1: Binary-only Files
Implications:
• Meta data exists only in the MDC
• Users must keep the correspondence between the file copy and the meta data
  • File naming conventions (Global File Name, GFN)
  OR
  • Local database to track correspondence file : GFN
Option 2: NERSC style
Meaning:
• ASCII header containing essential meta data
Implications:
• Develop new standard for header
• Can include GFN, to allow retrieval of other meta data from MDC
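To make the header idea concrete, here is a minimal sketch of reading an ASCII header that precedes the binary data. Since the slide says a new header standard would still have to be developed, everything about the layout here is an assumption: the BEGIN_HEADER / END_HEADER markers, the "KEY = VALUE" lines, and the GFN key and its example value are illustrative only.

```python
BEGIN = b"BEGIN_HEADER\n"  # assumed start marker
END = b"END_HEADER\n"      # assumed end marker


def read_ascii_header(blob):
    """Split a file image into (header dict, remaining binary payload)."""
    end = blob.find(END)
    if not blob.startswith(BEGIN) or end < 0:
        raise ValueError("no ASCII header found")
    text = blob[len(BEGIN):end].decode("ascii")
    header = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            header[key.strip()] = value.strip()
    # Everything after the end marker is treated as the binary data.
    return header, blob[end + len(END):]
```

With a GFN stored in the header, a user holding only the file could look up the rest of the meta data in the MDC, for example via `header["GFN"]`.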
Option 3: Structured File Format
• Goal: encapsulate, in an extensible way, binary data and meta data within a single file
• Good Candidate: LIME / SciDAC-derived format
  • SciDAC software committee considered several possibilities for encapsulation (including tar, cpio)
  • DIME (Microsoft Direct Internet Message Encapsulation), similar in approach to MIME (used for e-mail attachments), was considered a good fit
    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnservice/html/service01152002.asp
  • LIME == LQCD modification of DIME to be a bit simpler, and to support 64 bit sizes for records
  • Software implementation (library) exists
Option 3 (cont): LIME
Details:
• File has multiple messages, messages have multiple records
• Record format:
  • 32 bits: 3 flags, id-length (13), type-format (3), type-length (13)
  • record id (variable length, round up to 4 byte multiple)
  • record type (variable length, round up)
  • data length (64 bit – DIME was 32)
  • payload (round up)
• SciDAC Records contain either XML meta data (string), or binary
• Possible records:
  • ILDG meta data (XML)
  • binX descriptor for binary layout
  • Collaboration specific extensions
  • Binary data (stored using NERSC conventions)
• ILDG meta data record options:
  • Existing configuration schema (subset, non-mutable)
  • OR, new, simpler (flat) schema
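The record layout listed above can be sketched as a small pack/unpack pair. This follows only the field list on the slide and is not the LIME library's actual on-disk format: the bit positions inside the 32-bit word, big-endian byte order, zero padding, and the record type string "ildg-format" are all assumptions made for illustration.

```python
import struct


def pad4(n):
    """Round a length up to the next multiple of 4 bytes."""
    return (n + 3) & ~3


def pack_record(flags, rec_id, type_format, rec_type, payload):
    """Serialize one record: 32-bit word, id, type, 64-bit length, payload."""
    if not (0 <= flags < 8 and 0 <= type_format < 8):
        raise ValueError("flags and type_format are 3-bit fields")
    if len(rec_id) >= 1 << 13 or len(rec_type) >= 1 << 13:
        raise ValueError("id and type lengths are 13-bit fields")
    # Assumed bit order: flags | id-length | type-format | type-length.
    word = (flags << 29) | (len(rec_id) << 16) | (type_format << 13) | len(rec_type)
    out = struct.pack(">I", word)
    out += rec_id.ljust(pad4(len(rec_id)), b"\0")
    out += rec_type.ljust(pad4(len(rec_type)), b"\0")
    out += struct.pack(">Q", len(payload))  # 64-bit data length
    out += payload.ljust(pad4(len(payload)), b"\0")
    return out


def unpack_record(buf, offset=0):
    """Parse one record at offset; return (record dict, offset of next record)."""
    (word,) = struct.unpack_from(">I", buf, offset)
    id_len = (word >> 16) & 0x1FFF
    type_len = word & 0x1FFF
    pos = offset + 4
    rec_id = buf[pos:pos + id_len]
    pos += pad4(id_len)
    rec_type = buf[pos:pos + type_len]
    pos += pad4(type_len)
    (data_len,) = struct.unpack_from(">Q", buf, pos)
    pos += 8
    payload = buf[pos:pos + data_len]
    pos += pad4(data_len)
    record = {
        "flags": word >> 29,
        "type_format": (word >> 13) & 0x7,
        "id": rec_id,
        "type": rec_type,
        "payload": payload,
    }
    return record, pos
```

Because every record carries its own type and length, a reader can check the type and jump straight to the next offset, which is how this design meets the "easy to skip over uninteresting pieces" requirement.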
ILDG record idea (minimalist, from Carlton):
<?xml version="1.0" encoding="UTF-8"?>
<ildgFormat>
  <version> 1.0 </version>
  <endian> big </endian>
  <precision> 32 </precision>
  <lx> 20 </lx>
  <ly> 20 </ly>
  <lz> 20 </lz>
  <lt> 64 </lt>
</ildgFormat>
This is a bit more verbose than the NERSC ASCII header, but is completely extensible (add new fields without breaking old applications), and the string can be parsed by standard XML libraries (which are already planned to be used for ILDG meta data).
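A short sketch of how an application might read this record with a standard XML library, here Python's `xml.etree.ElementTree`. The function names and the returned dictionary shape are hypothetical. The size estimate additionally assumes 4 link matrices per site, each stored as 3x3 complex numbers, which the record itself does not state.

```python
import xml.etree.ElementTree as ET

KNOWN = ("version", "endian", "precision", "lx", "ly", "lz", "lt")


def parse_ildg_format(xml_bytes):
    """Parse the minimalist ildgFormat record into a plain dictionary."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "ildgFormat":
        raise ValueError("not an ildgFormat record")
    fields = {child.tag: (child.text or "").strip() for child in root}
    return {
        "version": fields["version"],
        "endian": fields["endian"],
        "precision": int(fields["precision"]),
        "dims": tuple(int(fields[k]) for k in ("lx", "ly", "lz", "lt")),
        # Unknown tags are kept rather than rejected, so new fields
        # do not break an older reader.
        "extra": {k: v for k, v in fields.items() if k not in KNOWN},
    }


def expected_payload_bytes(info, links_per_site=4, reals_per_link=18):
    """Estimate the binary size (assumes 4 links per site, 3x3 complex each)."""
    sites = 1
    for extent in info["dims"]:
        sites *= extent
    return sites * links_per_site * reals_per_link * (info["precision"] // 8)
```

Collecting unrecognized tags under `"extra"` is one way to get the extensibility described above: a record with an added field still parses with this older reader.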
Current Status
• ILDG board mandated that a file format solution be found prior to this workshop (goal missed)
• There is a wide range of opinions on the best path forward (XML, NERSC format, pure binary)
• There may be a current movement towards accepting XML and LIME
Proposal
• Current ad-hoc committee to work out implications of adopting LIME (January 2005)
  • Standardize ILDG record XML schema
  • Produce doc, simple test codes to show usage
• Compare to virtual file format, and to pure binary and NERSC-like (pros and cons), and select a path forward (January 2005)
• Refine selected approach, reaching version 1.0 by the end of February 2005
  • Documentation of schema, code
  • C library (if appropriate), test codes available for download