Mapping Physical Formats to Logical Models to Extract Data and Metadata
Tara Talbott
IPAW '06
The problem & solutions
• Wide range of files and formats
  • Standard formats
  • Prescriptive parsers
  • Arbitrary formats
• Machines need to merge, parse, and generally comprehend these various formats
• Potential solutions:
  • Data must adhere to a pre-specified format
  • Customized programs are written for each format and version
  • Users describe the format of their data and use tools to convert the data to a widely used, machine-understandable format (e.g. XML)
Descriptive parser solution: DFDL
• Data Format Description Language (DFDL)
• Uses XML Schema with DFDL-specific annotations to describe the underlying data and how to transform it into a logical model
• Example: the text record
    5, 9.35091E+02, 2.63227E+02, -6.20633E+07
  maps to the logical model
    <step id="5">
      <density unit="kg/m**3">935.091</density>
      <temp>263.227</temp>
      <pressure>-6.20633E7</pressure>
    </step>
Example DFDL Schema
    <element name="step">
      <xs:annotation>
        <xs:appinfo>
          <dfdl:repType>text</dfdl:repType>
          <dfdl:charset>UTF-8</dfdl:charset>
          <dfdl:separator>,</dfdl:separator>
        </xs:appinfo>
      </xs:annotation>
      <complexType>
        <sequence>
          <element name="density">
            <!-- float content carrying a fixed unit attribute -->
            <complexType>
              <simpleContent>
                <extension base="xs:float">
                  <attribute name="unit" type="xs:string" fixed="kg/m**3"/>
                </extension>
              </simpleContent>
            </complexType>
          </element>
          <element name="temp" type="xs:float"/>
          <element name="pressure" type="xs:float"/>
        </sequence>
        <!-- XML Schema requires attributes to follow the content model -->
        <attribute name="id" type="xs:integer" use="required"/>
      </complexType>
    </element>
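To make the physical-to-logical mapping concrete, the following is a minimal hand-written sketch in Python of what the schema describes for the one example record. It is not Defuddle and does not read the DFDL schema; the function name `parse_step` and the hard-coded field order are assumptions made purely for illustration. A descriptive parser would derive the same behaviour from the annotations (text representation, "," separator) instead of from code.

```python
import xml.etree.ElementTree as ET

# Separator taken from the slide's <dfdl:separator>; everything else here
# (function name, hard-coded field order) is an illustrative assumption.
SEPARATOR = ","


def parse_step(line: str) -> ET.Element:
    """Map one separated text record onto the logical <step> element."""
    fields = [field.strip() for field in line.split(SEPARATOR)]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    step_id, density, temp, pressure = fields

    # The first field becomes the required integer "id" attribute.
    step = ET.Element("step", {"id": str(int(step_id))})

    # The remaining fields become float-valued child elements, in order.
    density_el = ET.SubElement(step, "density", {"unit": "kg/m**3"})
    density_el.text = repr(float(density))
    ET.SubElement(step, "temp").text = repr(float(temp))
    ET.SubElement(step, "pressure").text = repr(float(pressure))
    return step


if __name__ == "__main__":
    record = "5, 9.35091E+02, 2.63227E+02, -6.20633E+07"
    print(ET.tostring(parse_step(record), encoding="unicode"))
```

Python writes the pressure in plain decimal notation rather than as -6.20633E7; the lexical form differs from the slide, but the float value is the same.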
Defuddle Parser Design
• An implementation of the DFDL specification
Capabilities
• Basic
  • Binary/text parsing of simple types
  • Basic math operations
  • Looping
  • Conditional logic
  • Use of regular expressions for separators and terminators
  • Input from multiple data sources
• Advanced
  • External translators
  • Specify intermediate layers in the data that can be used for processing but are not reflected in the output
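As an illustration of why regular-expression separators and terminators are useful, here is a small Python sketch that splits loosely formatted text into records and fields. The format (records ending in ";" or a newline, fields separated by a comma or whitespace) and the names `TERMINATOR`, `SEPARATOR`, and `split_records` are hypothetical; this is not Defuddle's syntax or implementation.

```python
import re

# Hypothetical format: a record ends at ";" or a newline, and fields are
# separated by a comma (with optional padding) or by a run of whitespace.
TERMINATOR = re.compile(r"\s*(?:;|\n)\s*")
SEPARATOR = re.compile(r"\s*,\s*|\s+")


def split_records(text: str) -> list[list[str]]:
    """Split text into records by terminator, then into fields by separator."""
    records = []
    for raw in TERMINATOR.split(text.strip()):
        if raw:  # skip the empty piece left after a trailing terminator
            records.append(SEPARATOR.split(raw))
    return records


if __name__ == "__main__":
    print(split_records("1, 2.5\t3;4 5,6\n7,8 ;"))
```

A single pattern per delimiter covers inputs that mix commas, tabs, and spaces, which a fixed one-character separator could not.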
Parsing Complex Formats
• Scientific formats on which Defuddle has been demonstrated:
  • CHEMKIN solution file
  • NWChem molecular dynamics property file
  • NWChem electronic structure output file
  • Microarray and protein-protein interaction spreadsheets
• Transformations within scientific workflows to avoid custom programming
• Other formats that we would like to see handled in the future: HDF, JPEG, etc.
What problems does Defuddle address?
• Integrating different data formats, to enable collaboration on data generated before or without standardization
• Naming/identification of arbitrary file sub/super-structures
• Long-term preservation and reading of data when the applications used to create it are no longer available
• Efficient, general data access capabilities
  • Random access
  • Data virtualization
  • Multiple descriptions of the same data
  • Using DFDL and DFDL⁻¹ as a general subsetting/transformation mechanism
• Metadata extraction
Extracting metadata
• SAM
• DFDL+XSLT
• Benefits of automatic provenance/annotation capture
• Example use: microarray data, extracting header information
• Application to provenance
Discussion
• Challenges
  • Efficient and generic: is it possible?
    • Size
    • Variable-length text
• Data virtualization: providing an abstract view of the data, independent of the underlying storage system
• Naming of data subsets: map a name to a reference into the logical model, not the physical one. E.g. //step[5]/pressure refers to
    <step id="5">
      …
      <pressure>-6.20633E7</pressure>
    </step>
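The naming idea can be sketched with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The document below is a hypothetical logical model: only step 5's pressure comes from the slides, and the other values, the `<run>` wrapper, and the `resolve` helper are invented for illustration. ElementTree does not accept absolute paths on an element, so the slide's `//step[5]/pressure` is written with a leading `.`.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical model; only step 5's pressure is from the slides.
LOGICAL_MODEL = """
<run>
  <step id="1"><pressure>-1.0E7</pressure></step>
  <step id="2"><pressure>-2.0E7</pressure></step>
  <step id="3"><pressure>-3.0E7</pressure></step>
  <step id="4"><pressure>-4.0E7</pressure></step>
  <step id="5"><pressure>-6.20633E7</pressure></step>
</run>
"""


def resolve(root: ET.Element, name: str) -> list[str]:
    """Resolve a logical name (an XPath expression) to the matching values."""
    return [element.text for element in root.findall(name)]


if __name__ == "__main__":
    root = ET.fromstring(LOGICAL_MODEL)
    # Positional form, as on the slide: the fifth <step> under its parent.
    print(resolve(root, ".//step[5]/pressure"))
    # Attribute form: the <step> whose id attribute is "5".
    print(resolve(root, ".//step[@id='5']/pressure"))
```

Both names resolve to the same element here because the steps are stored in id order. `step[5]` selects by position while `step[@id='5']` selects by attribute, so the two would diverge if steps were missing or reordered. Neither name depends on byte offsets or line numbers in the physical file.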
Questions?
• http://sdg.pnl.gov
• http://defuddle.pnl.gov
• http://forge.gridforum.org/projects/dfdl-wg
• Tara.Talbott@pnl.gov