Learn about XML in scientific computing through several case studies, including XSIL, XDMF, ChemicalML, and Gateway Application Descriptors. This overview explores the usage and manipulation of XML data in scientific computing.
XML for Scientific Computing Several case studies for XML data in scientific computing
Overview • We will present case studies of the following systems • XSIL: Extensible Scientific Interchange Language • XDMF: Extensible Data Model and Format • Discipline Specific XML: ChemicalML • Gateway Application Descriptors (plus Castor) • XML by itself is just markup, like HTML without a browser. Each of the above uses a related set of software to manipulate the XML data. • We present several examples of XML to give you an overview. • These are not tutorials, just examples of how others are using XML. • We conclude with some remarks about standards for science applications.
Overview of Case Studies • XSIL and XDMF are examples of representing (meta)data for scientific computing. • Concentrate on data structures, data I/O. • Meaning of data not described. • ChemicalML marks up domain specific data. • Meaningfully describes data content. • Gateway application XML metadata describes science codes themselves. • Codes, host computers, queues • All possess a data object model. • Object oriented data descriptions guide the markup tag definitions.
XSIL XML tags for generic scientific data markup, with related Java software.
XSIL • Developed in support of several projects involving CACR at Caltech. • Example: LIGO, Digital Sky • Roy Williams, Caltech. • See http://www.cacr.caltech.edu/SDA/xsil/ for more information and free software. • XSIL was developed for the astronomical and gravitational wave communities. • But provides general purpose tags. • Also comes with software for building Java applications that manipulate and display XSIL documents.
XSIL Tags • XSIL defines a small number of tags • XSIL: base container for the object model. • Comment: • Param: an arbitrary name/value pair • Time: describes time, plus format • Table: data in columns and rows • Array: table data with specific size • URL and Stream: for handling data • We’ll now go over some of these in detail.
The XSIL Tag I • XSIL documents map to a document object model with associated handling code. • The root tag for XSIL is <XSIL>:
<XSIL Name="Example" Type="Examples.MyExample">
  …
</XSIL>
• Type points to the Java code that should process this file. • Here, that is the class in MyExample.java in the package Examples.
The XSIL Tag II • XSIL tags can be nested if different parts of the XSIL document need to be handled by different codes.
<XSIL Name="Example" Type="Examples.MyExample">
  …
  <XSIL Name="Subsection" Type="Examples.Subsection">
    …
  </XSIL>
</XSIL>
• XSIL tags thus are the base container in a generic object hierarchy. • MyExample object “has a” Subsection object
More On Object Containers • Consider an Electromagnetics example: • A target is represented as a grid for finite difference integration of Maxwell’s eqns. • The base input file contains one or more materials. • Each material has specific EM properties. • If translated to XSIL, could look like this:
<XSIL Name="EMRoot" Type="CEA.Root">
  <!-- Some general parameters -->
  <XSIL Name="EMMaterial" Type="CEA.Material">
    <!-- Some info describing the material. -->
  </XSIL>
</XSIL>
Parameters • Each XSIL tag can contain one or more parameters. • Params are arbitrary name/value pairs. • Params optionally have units.
<XSIL …>
  <Param Name="Color">Red</Param>
  <Param Name="Weight" Unit="kg">3.14</Param>
</XSIL>
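The Param markup above can be read with nothing more than the DOM parser that ships with the JDK. The sketch below is not the XSIL library; the class name ParamReader and its readParams method are hypothetical, and it assumes the document is available as a string with plain ASCII quotes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads <Param> elements with the JDK DOM parser,
// not with the XSIL Java classes.
public class ParamReader {

    // Maps each Param name to its value, with the unit appended when present.
    public static Map<String, String> readParams(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Note: this collects Params at every nesting level of the document.
            NodeList params = doc.getElementsByTagName("Param");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < params.getLength(); i++) {
                Element p = (Element) params.item(i);
                String value = p.getTextContent().trim();
                String unit = p.getAttribute("Unit"); // empty string when absent
                result.put(p.getAttribute("Name"),
                        unit.isEmpty() ? value : value + " " + unit);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XSIL text", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<XSIL>"
                + "<Param Name=\"Color\">Red</Param>"
                + "<Param Name=\"Weight\" Unit=\"kg\">3.14</Param>"
                + "</XSIL>";
        System.out.println(readParams(xml));
    }
}
```

Keeping the unit next to the value is only one possible choice; a real application would more likely keep them in separate fields.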
Tables • Params associate one value per name • Tables support multiple values • A Table row can have any number of values. • Each table contains column definitions followed by an arbitrary number of entries. • Tables get data from Streams (discussed later).
Example Table
<XSIL …>
  …
  <Table>
    <Column Name="Color" Type="string" />
    <Column Name="Weight" Type="float" Unit="kg" />
    <Column Name="Length" Type="float" Unit="meter" />
    <Stream Type="Local" Delimiter=",">
      "Red",100.2,0.2
      "Green",21.7,1.2
    </Stream>
  </Table>
</XSIL>
XSIL Arrays • XSIL arrays are similar to Fortran and C arrays. • For mixed type data, use Tables. • If all data is of the same type (integers, floats), use Arrays.
<Array Type="int">
  <Dim Name="x-dim">2</Dim>
  <Dim Name="y-dim">2</Dim>
  <Stream Type="Local" Delimiter=",">
    137,42
    8,13
  </Stream>
</Array>
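To make the Array example concrete, here is a minimal sketch of turning a local, delimited stream into a 2-D int array. It is plain Java rather than the XSIL library, the class and method names are hypothetical, and it assumes the first Dim varies slowest (row-major order), which the slides do not state.

```java
import java.util.Arrays;

// Hypothetical helper: fills a rows x cols int array from the text of a
// local <Stream>, assuming row-major ordering of the values.
public class LocalArrayStream {

    public static int[][] parse2D(String streamText, String delimiter, int rows, int cols) {
        // Treat the delimiter and any whitespace (including newlines) as separators.
        String[] tokens = streamText.replace(delimiter, " ").trim().split("\\s+");
        if (tokens.length != rows * cols) {
            throw new IllegalArgumentException(
                    "Expected " + (rows * cols) + " values but found " + tokens.length);
        }
        int[][] data = new int[rows][cols];
        for (int i = 0; i < tokens.length; i++) {
            data[i / cols][i % cols] = Integer.parseInt(tokens[i]);
        }
        return data;
    }

    public static void main(String[] args) {
        int[][] a = parse2D("137,42\n8,13", ",", 2, 2);
        System.out.println(Arrays.deepToString(a));
    }
}
```

Checking the value count against the declared dimensions catches a mismatch between the Dim tags and the stream before any data is used.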
XSIL Streams • XSIL Streams can be used to load data • Data sources can be • In the file itself (as shown in previous examples). • From files on disk • From URLs (http://, ftp://, and file:// supported) • Loading data from disk
<Stream Type="Remote" Encoding="Littleendian">
  /home/user1/data/datafile.dat
</Stream>
• Loading data from URLs
<Stream Type="Remote">
  http://my.server.edu/XSILdata/datafile.dat
</Stream>
Ex: Use XSIL to describe input data
<XSIL Name="InputData" Type="Examples.InDataHandler">
  <XSIL Name="Target 1" Type="Examples.Target">
    <Param Name="Target">Scud</Param>
    <Param Name="dx">0.1</Param>
    <Array>
      <Dim Name="X-Dimension">100</Dim>
      <Dim Name="Y-Dimension">100</Dim>
      <Stream Type="Remote">
        /home/mpierce/data/mydata.dat
      </Stream>
    </Array>
  </XSIL>
  <XSIL Name="Target 2" Type="Examples.Target">
    <!-- Another target -->
  </XSIL>
</XSIL>
Table and Array Types • Table and Array data can have the following types (size in bits): • boolean (1) • byte (8) • short (16) • int (32) • long (64) • float (32) • double (64) • floatComplex (64) • doubleComplex (128) • string (arbitrary length)
Using XSIL • The previous example just marks up data. • XSIL also comes with Java bindings that • Read the file and parse it. • Extract parameter values, units, etc. • Read in and manipulate tables, arrays • Central ideas: • Each XSIL tag corresponds to a Java class • XSIL’s Type points to your custom driver code that uses the XSIL classes.
XSIL Coding Example • Consider the following small XSIL example
<XSIL Type="Examples.MyExample">
  <Param Name="x0">12.0</Param>
  <Param Name="dx">0.1</Param>
</XSIL>
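XSIL Java Code Example
package extensions.Examples;

// Assumes the XSIL and Param classes live in the org.escience.XSIL package.
import org.escience.XSIL.*;

public class MyExample {
    String x0, dx;   // parameter values, kept as text
    XSIL root;       // root of the parsed XSIL document

    public MyExample(String xsilFileName) {
        root = new XSIL(xsilFileName);   // XSIL handles the parsing
    }

    // Walk the root's children and pick out the Params of interest.
    public void construct() {
        for (int i = 0; i < root.getChildCount(); i++) {
            XSIL x = root.getChild(i);
            if (x instanceof Param) {
                Param p = (Param) x;
                if (p.getName().equals("x0")) x0 = p.getText();
                if (p.getName().equals("dx")) dx = p.getText();
            }
        }
    }
}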
Code Notes • All classes (Param, Table, etc.) extend the XSIL class. • Pass the path of the XSIL file to the root XSIL object through its constructor. • XSIL handles all parsing • XSIL class defines getChildCount(), getChild() methods. • Param class defines getName() and getText() methods.
XSIL Summary • Defines a small set of general purpose tags for scientific data. • Data itself is not directly marked up. • Read in through streams • XSIL software maps Java classes to XSIL tags. • Convenient for working with XSIL docs. • DOM classes are much more cumbersome to use.
XDMF A data model geared toward finite element codes, with associated software in C++, Java, and TCL
ICE and XDMF • ICE (Interdisciplinary Computing Environment) is a comprehensive project at ARL MSRC for providing a common software platform for DoD scientific codes. • Jerry Clarke, lead developer • XDMF (Extensible Data Model and Format) provides a common data format for several different codes • Primary focus: finite element codes for fluid dynamics and structural mechanics. • XDMF and related software provides the backbone for loosely coupling applications and visualization.
XDMF Design • XDMF divides data into “light” and “heavy” types. • Light data, or metadata, is formatted in XML and will be described in more depth. • Heavy data is in HDF5 and not presented here.
XDMF Basic Concepts • XDMF basic tags are <DataStructure> and <DataTransform> • <DataStructure> defines the actual data. • <DataTransform> defines the area of interest (AOI) in the data. • AOI defined by coordinates, a function, or a hyperslab. • <DataTransform> contains one or more <DataStructure> elements • The transform defines how the data structure will be filtered.
Simple Data Structure • The example below is for 655 XYZ values in the indicated HDF5 file.
<DataStructure Name="Some XYZ Data" Type="Float" Dimensions="655 3">
  MyData.h5:/MyXYZdata
</DataStructure>
• Simple character data can also be included directly in the XML document.
Data Structure for Mesh Connections and Pressures
<DataStructure Name="Connections" Type="Int" Precision="8" Dimensions="100 8">
  MyData.h5:/MyConns
</DataStructure>
<DataStructure Name="Pressure" Type="Float" Precision="8" Dimensions="100">
  MyData.h5:/MyPressure
</DataStructure>
Data Structure Attribute Summary
<DataStructure
  Name="Any name"                         Some name meaningful to the owner
  Rank="NumberOfDimensions"               Redundant information
  Dimensions="Kdim Jdim Idim"             The slowest varying dimension is listed first
  Type="Char | Float | Int | Compound"    Default is Float
  Precision="BytesPerElement"             Default is 4
  Format="XML | HDF"                      Default is XML
>
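As a small worked example of how Dimensions and Precision combine, the sketch below computes the element count and storage size of a DataStructure. The class and method names are hypothetical and are not part of the XDMF API; Precision is taken as bytes per element, as in the summary above.

```java
// Hypothetical helper: sizes a DataStructure from its Dimensions and
// Precision attribute values.
public class DataStructureSize {

    // Multiplies the space-separated extents in a Dimensions attribute.
    public static long elementCount(String dimensions) {
        long count = 1;
        for (String extent : dimensions.trim().split("\\s+")) {
            count *= Long.parseLong(extent);
        }
        return count;
    }

    // Total bytes needed when each element occupies `precision` bytes.
    public static long byteSize(String dimensions, int precision) {
        return elementCount(dimensions) * precision;
    }

    public static void main(String[] args) {
        // "655 3" with the default precision of 4 bytes
        System.out.println(byteSize("655 3", 4));
        // "100 8" with Precision="8"
        System.out.println(byteSize("100 8", 8));
    }
}
```

For Dimensions="655 3" at the default precision this gives 1965 elements and 7860 bytes; for the 100 x 8 connection list at Precision="8" it gives 6400 bytes.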
XDMF Array Types • XDMF array entries can have these types: • Integer • Float • Char • All are 4 bytes by default, can be increased to 8 bytes.
DataTransform • DataTransform defines a way for the raw data to be filtered • Gives a certain Area of Interest in the data set. • Possible transforms: • Coordinate: Select a particular area • Function: Define a simple algorithm for selecting an area • Hyperslab: Define start, stride, and count for each dimension of an array.
Hyperslab Transform Example • The following markup instructs the processing code to apply a Hyperslab transform to a 4-D array. • The first data structure defines the hyperslab: • 0 0 0 0 are the starting points for each dim • 2 2 2 1 are strides (step sizes) for each dim • 25 50 75 3 are the number of steps for each dim • The second data structure gives the raw data, a 100x200x300x3 array in the noted HDF5 file. • The resulting region starts at [0,0,0,0], has its last selected index at [48,98,148,2] (start + stride × (count − 1) in each dim), and includes every other plane of the untransformed data along the first three dims.
Hyperslab Transform Example
<DataTransform Dimensions="25 50 75 3" Type="HyperSlab">
  <DataStructure Dimensions="3 4" Format="XML">
    0 0 0 0
    2 2 2 1
    25 50 75 3
  </DataStructure>
  <DataStructure Name="Points" Dimensions="100 200 300 3" Format="HDF">
    MyData.h5:/XYZ
  </DataStructure>
</DataTransform>
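The start/stride/count arithmetic behind this hyperslab can be checked with a few lines of Java. This is a sketch with hypothetical names, not the XDMF processing code, and it assumes the common convention that the k-th selected index along a dimension is start + k × stride.

```java
import java.util.Arrays;

// Hypothetical helper: works out which indices a hyperslab selects.
public class HyperslabSketch {

    // Last selected index along each dimension: start + stride * (count - 1).
    public static int[] lastIndex(int[] start, int[] stride, int[] count) {
        int[] last = new int[start.length];
        for (int i = 0; i < start.length; i++) {
            last[i] = start[i] + stride[i] * (count[i] - 1);
        }
        return last;
    }

    // True when every selected index lies inside an array of the given extents.
    public static boolean fitsWithin(int[] start, int[] stride, int[] count, int[] dims) {
        int[] last = lastIndex(start, stride, count);
        for (int i = 0; i < dims.length; i++) {
            if (start[i] < 0 || last[i] >= dims[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] start = {0, 0, 0, 0};
        int[] stride = {2, 2, 2, 1};
        int[] count = {25, 50, 75, 3};
        int[] dims = {100, 200, 300, 3};
        System.out.println(Arrays.toString(lastIndex(start, stride, count)));
        System.out.println(fitsWithin(start, stride, count, dims));
    }
}
```

Under that convention the selection covers indices 0, 2, ..., 48 in the first dimension, so the output dimensions 25 50 75 3 match the count row of the transform.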
Function Example
<DataTransform Type="Function" Function="( $0 + .022 ) * ( $1 / 2.0 )" Dimensions="2 3">
  <DataStructure Dimensions="10 20" Format="XML">
    1.1 1.2 1.3 2.1 2.2 2.3
  </DataStructure>
  <DataStructure Dimensions="2 3" Format="XML">
    2 3 4 4 3 2
  </DataStructure>
</DataTransform>
Explanation of Function Example • The function defines a simple data transform that creates a new data set from the existing ones. • In the example, the function takes elements one at a time from the first ($0) and second ($1) sets. • With the constant .022 from the Function attribute, the first resulting value is: • (1.1+0.022)*(2/2.0)=1.122 • Second resulting value: • (1.2+0.022)*(3/2.0)=1.833
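The element-by-element evaluation described here can be sketched in plain Java. The class and method names are hypothetical and this is not how the XDMF engine is implemented; the expression uses the constant .022 exactly as it appears in the Function attribute of the markup.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

// Hypothetical helper: applies a two-input function element by element,
// pairing values from the first ($0) and second ($1) data sets.
public class FunctionTransformSketch {

    // The expression from the Function attribute: ( $0 + .022 ) * ( $1 / 2.0 )
    public static final DoubleBinaryOperator EXAMPLE =
            (a, b) -> (a + 0.022) * (b / 2.0);

    public static double[] apply(double[] first, double[] second, DoubleBinaryOperator f) {
        if (first.length != second.length) {
            throw new IllegalArgumentException("Input sets must have the same length");
        }
        double[] result = new double[first.length];
        for (int i = 0; i < first.length; i++) {
            result[i] = f.applyAsDouble(first[i], second[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] set0 = {1.1, 1.2, 1.3, 2.1, 2.2, 2.3};
        double[] set1 = {2, 3, 4, 4, 3, 2};
        System.out.println(Arrays.toString(apply(set0, set1, EXAMPLE)));
    }
}
```

Passing the expression in as a DoubleBinaryOperator keeps the loop independent of any one formula, which mirrors how the markup separates the function from the data it is applied to.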
Data Organization • DataStructure and DataTransform constitute XDMF’s data representation. • These specify the raw data up to the array structure. • XDMF Domain tags are used as arbitrary containers. • Domains contain Grids, which specify the data model. • Grids contain Topologies, Geometries, and Attributes, as well as DataStructures. • Topology specifies connectivity between points • Geometry specifies points • Attributes include Scalars, Vectors, Tensors and specify field values
A Full XDMF Example
<Domain Name="Example #1">
  <Grid Name="My Hex Grid with Pressure">
    <Topology Type="Hexahedron" Dimensions="100" Order="7 6 5 4 3 2 1 0">
      <DataStructure Name="Connections" Type="Int" Precision="8" Dimensions="100 8">
        MyData.h5:/MyConns
      </DataStructure>
    </Topology>
    <Geometry Type="XYZ">
      <DataStructure Name="XYZ Data" Type="Float" Dimensions="655 3">
        MyData.h5:/MyXYZdata
      </DataStructure>
    </Geometry>
    <Attribute Type="Scalar" Center="Cell">
      <DataStructure Name="Pressure" Type="Float" Precision="8" Dimensions="100">
        MyData.h5:/MyPressure
      </DataStructure>
    </Attribute>
  </Grid>
</Domain>
Review of Example • Recall XDMF is primarily for structured and unstructured finite element grids. • Input data includes grid connectivity info, grid geometry, and pressure values • The Domain contains a Grid • The Grid is defined by Topology, Geometry, and Attributes. • Topology, Attributes, and Geometry contain data sources and structure info.
XDMF API • Like XSIL, XDMF treats the XML markup as a set of instructions to be processed by actual programs. • XDMF defines an API of document processing engines. • Core is in C++ • ICE also provides Java and TCL APIs through wrappers around core. • See http://www.arl.hpc.mil/ice/Examples/CodeIntegration/DemoIceRt.cxx for code example.
XDMF Summary • Provides a few general purpose tags • Again, data is not directly marked up. • Stored in HDF5 • XDMF handled programmatically with APIs in C++, Java, Tcl. • More information: • http://www.arl.hpc.mil/ice/
Comparison of XSIL and XDMF
XSIL • Larger tag set • Java API • Can read data that is in the document, on disk, or from a URL • Questionable performance and memory efficiency for very large data sets • Free and open source
XDMF • Uses HDF5 for large data sets • C++, Java, Tcl APIs • Defines both data structures and transform instructions • Supports arrays, but not mixed data types (such as XSIL Tables) • Integrated with ICE
Chemical Markup Language A domain specific XML markup language.
CML Introduction • XSIL and XDMF use XML to describe code input files and give simple processing instructions. • Tags describe data structure, not content. • We now examine a domain specific example, the Chemical Markup Language. • Other domain markup languages: • Mathematical Markup Language (MathML) • Geography Markup Language (GML)
XML for Chemistry • Goal: provide a common chemical data format that is an open, universal standard. • Data representation is platform independent • Support structured searches of data banks. • Provide a common format for software (particularly visualization). • Support multidisciplinary data formats (biology, math) through XML namespaces. • Provide a data object hierarchy suitable for object oriented programming.
CML Structure • Chemistry lends itself to object container structure • Atoms have protons, neutrons, electrons • Molecules have atoms • Complex molecules and compounds are composed of molecules, molecular pieces (benzene rings, for example) • CML defines these as data objects with property fields
A Simple Example: Glycine
<molecule convention="MDLMol" id="glycine" title="GLYCINE">
  <date day="22" month="11" year="1995"></date>
  <atomArray>
    <atom id="a1">
      <string builtin="elementType">C</string>
      <float builtin="x2">0.6424</float>
      <float builtin="y2">0.4781</float>
    </atom>
    ….
  </atomArray>
  <bondArray>
    <bond id="b1">
      <string builtin="atomRef">a1</string>
      <string builtin="atomRef">a2</string>
      <string builtin="order">1</string>
    </bond>
    ….
  </bondArray>
</molecule>
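Because CML tags describe content, a program can ask for atoms directly instead of interpreting anonymous arrays. The sketch below reads the atom entries of the glycine fragment with the JDK DOM parser. It is not the CML or Jumbo software; the class names are hypothetical, and only the one atom shown on the slide is included because the rest of the molecule is elided.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: extracts atoms from CML markup of the form shown above.
public class CmlAtomReader {

    public static class Atom {
        public final String id;
        public final String elementType;
        public final double x2;
        public final double y2;

        public Atom(String id, String elementType, double x2, double y2) {
            this.id = id;
            this.elementType = elementType;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    public static List<Atom> readAtoms(String cml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(cml.getBytes(StandardCharsets.UTF_8)));
            NodeList atomNodes = doc.getElementsByTagName("atom");
            List<Atom> atoms = new ArrayList<>();
            for (int i = 0; i < atomNodes.getLength(); i++) {
                Element atom = (Element) atomNodes.item(i);
                String type = "";
                double x = Double.NaN;
                double y = Double.NaN;
                NodeList fields = atom.getChildNodes();
                for (int j = 0; j < fields.getLength(); j++) {
                    Node n = fields.item(j);
                    if (!(n instanceof Element)) {
                        continue; // skip whitespace text nodes
                    }
                    Element field = (Element) n;
                    String text = field.getTextContent().trim();
                    // The builtin attribute says what each value means.
                    switch (field.getAttribute("builtin")) {
                        case "elementType": type = text; break;
                        case "x2": x = Double.parseDouble(text); break;
                        case "y2": y = Double.parseDouble(text); break;
                        default: break;
                    }
                }
                atoms.add(new Atom(atom.getAttribute("id"), type, x, y));
            }
            return atoms;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse CML text", e);
        }
    }

    public static void main(String[] args) {
        String cml = "<molecule id=\"glycine\"><atomArray><atom id=\"a1\">"
                + "<string builtin=\"elementType\"> C</string>"
                + "<float builtin=\"x2\">0.6424</float>"
                + "<float builtin=\"y2\">0.4781</float>"
                + "</atom></atomArray></molecule>";
        for (Atom a : readAtoms(cml)) {
            System.out.println(a.id + " " + a.elementType + " " + a.x2 + " " + a.y2);
        }
    }
}
```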
Previous Slide • Browser tool, Jumbo-3.0 • User can display dozens of CML’d molecules. • Molecules can be rotated in the display. • Display is rendered in SVG (Adobe plugin for XML based 2D graphics). • Molecule displayed is cholesterol. They also have glycine in the database, but it is not as exciting to look at.
Gateway Application Descriptors Describing scientific applications themselves with XML and mapping to Java with Castor. http://www.gatewayportal.org
Gateway Application Descriptors • Gateway is a computational web portal for securely submitting and monitoring jobs, transferring files, and archiving information. • Gateway describes scientific applications and host computers with XML metadata. • This is used to provide general purpose tools that can be used to build portals for specific applications.
Application Descriptors • Gateway describes scientific applications and host machines in XML. • These descriptions are used to generate the HTML forms that collect the information needed to create batch queuing scripts and submit jobs. • The general object container scheme is • Portals contain applications • Applications contain hosts • Each also has a set of descriptive parameters.
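The container scheme above maps naturally onto nested Java objects, which is the kind of structure a binding tool such as Castor would populate from the XML. The sketch below is illustrative only: every class, field, and parameter name is hypothetical, it does not use the Castor API, and the formFields method is just one guess at how descriptors might drive form generation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical object model: Portal -> Application -> Host, each with
// its own descriptive parameters.
public class GatewayDescriptorSketch {

    public static class Host {
        public final String name;
        public final Map<String, String> params = new LinkedHashMap<>();
        public Host(String name) { this.name = name; }
    }

    public static class Application {
        public final String name;
        public final Map<String, String> params = new LinkedHashMap<>();
        public final List<Host> hosts = new ArrayList<>();
        public Application(String name) { this.name = name; }
    }

    public static class Portal {
        public final String name;
        public final Map<String, String> params = new LinkedHashMap<>();
        public final List<Application> applications = new ArrayList<>();
        public Portal(String name) { this.name = name; }
    }

    // Lists the input fields a job submission form would need for one
    // application: its own parameters, then each host's parameters
    // prefixed with the host name.
    public static List<String> formFields(Application app) {
        List<String> fields = new ArrayList<>(app.params.keySet());
        for (Host host : app.hosts) {
            for (String key : host.params.keySet()) {
                fields.add(host.name + "." + key);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Portal portal = new Portal("ExamplePortal");
        Application app = new Application("ExampleCode");
        app.params.put("inputFile", "");
        Host host = new Host("examplehost");
        host.params.put("queue", "");
        host.params.put("numProcessors", "");
        app.hosts.add(host);
        portal.applications.add(app);
        System.out.println(formFields(app));
    }
}
```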