SimDAP Simulation Data Access Protocol Claudio Gheller CINECA (c.gheller@cineca.it)
SimDAP in a nutshell Simulation Data Access Protocol, hereafter SimDAP, defines a standard to access numerical simulation outputs (theoretical data), hereafter Snapshots. The goal of the SimDAP protocol is to preview and retrieve data found in a previous search phase. Since data can be huge, a SimDAP service can provide ways to download ONLY the data of interest, reducing the volume of data transferred. The SimDAP protocol describes the interface to these “data shrink” services. The result of any SimDAP operation is a reference to one or more data files.
SimDAP examples [Figure: workflow diagram. Search for simulations with Lambda>0.7; the metadata is returned as a VOTable; the binary data file is too large, so a sub-region is selected; the extracted sub-region is still large, so the analysis is performed on-site; the final result is a JPEG, which cannot be too large.]
SimDAP target data • SimDAP deals with Snapshots • Generally speaking, Snapshots are RAW data produced by a numerical model • In principle any set of M physical parameters in an N-dimensional phase space can be the object of SimDAP • For simplicity, we have started by considering data which represent a spatial distribution of physical quantities at different time steps. Support for space and time is therefore assumed by default. • E.g.: • x, y and z coordinates of a set of particles at various evolutionary times • The temperature over a computational mesh • The x-ray luminosity derived from temperature and density (direct outcome of the simulations)
SimDAP data model • SimDAP adopts SimDB as the standard data model. • SimDB is essential in the discovery phase (not part of SimDAP), which provides the basic input parameters to SimDAP: • The experiment id (the simulation) • The result id (the snapshots) • The data provider id (as registered) • The result of a SimDAP operation is a reference to one or more files. • The result may be delivered as a VOTable describing, in terms of SimDB, the outcome of the SimDAP operation and containing the references to the data files • There is currently no precise standard for the data files. Exploratory work is in progress on this topic.
SimDAP protocol • SimDAP does NOT specify anything about the implementation of the related services. This is up to the service provider. • SimDAP defines only the (standard) interface to the (web) service. This means that the following items will be standard: • Service goal (what it does) • Input parameters (what is needed to run the service) • Results (what is returned by the service) • Custom services are supported, BUT they must be fully described, possibly via the registry
SimDAP services • At present the following services are expected to be part of the protocol: • Preview • Download • Cutout • Custom • Each service MUST support a METADATA function which returns the input parameters supported by the service. • This information can be used either by client applications (in particular for custom services) or by the registry, for users searching for services according to their capabilities.
Preview: goals • The result of the SimDB search is a list of simulations and/or snapshots. • There is NO easy and/or standard way to understand whether the content of a snapshot suits your needs. • However, you cannot download all the hits to check them. • The PREVIEW service gives you a pre-defined view of one or more snapshots. • Possible preview services can be based on: • Selection and download of a subset of the whole snapshot (randomized, decimated…) • Visualization of the data, by 3D interactive rendering of sampled data or by orthogonal projections • Statistical analysis • Object catalogues (e.g. clusters of galaxies identified in a cosmological simulation) • … • All these functionalities can act either on precalculated information or interactively.
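As an illustration of the first preview option (a decimated or randomized subset), a minimal sketch follows. The function names and the in-memory list of particles are assumptions made for the example; SimDAP does not prescribe how a provider implements a preview.

```python
import random

def decimate(points, step):
    """Regular decimation: keep every `step`-th element."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return points[::step]

def random_subset(points, fraction, seed=0):
    """Randomized selection: keep a fraction of the elements.

    A fixed seed makes the preview reproducible between calls.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    count = max(1, round(len(points) * fraction))
    return rng.sample(points, count)

# Hypothetical snapshot: 1000 particles placed along the x axis.
particles = [(float(i), 0.0, 0.0) for i in range(1000)]
preview_regular = decimate(particles, 10)         # every 10th particle
preview_random = random_subset(particles, 0.05)   # 5% of the particles
```

Regular decimation preserves the storage order, while the randomized subset avoids aliasing with any regular pattern in the data; which one suits a given archive is a provider's choice.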
Preview: input parameters, result • The only mandatory input parameters are: • Simulation id • Snapshot id • Further parameters can be specified and published by the service. They allow the user to customize the preview service. • If multiple preview functionalities are implemented, each is treated as a separate service. • The output is heterogeneous. If it is a file (decimated/reduced dataset), it must have the standard TVO format (VOTable+binary).
Download: goals • Once the snapshots of interest have been identified, the user can decide to download them. • Two possible solutions: • Direct download: the user gets the data file as it is. This is part of the SimDB protocol. No further actions are required on the data. • SimDAP download: the user gets the snapshot back in the standard TVO file format (VOTable+binary). A further operation may be supported and applied: fields selection. This operation allows the user to download only the physical quantities they are actually interested in.
Download: input parameters, result • If only the direct download is available, the reference to the file is enough. However this is not strictly part of the SimDAP protocol. • The only mandatory input parameters are: • Simulation id • Snapshot id • The FIELD parameter has to be supported if fields selection is available • Further parameters can be specified and published by the service. They allow the user to customize the download service (e.g. automatic format or endianness conversions). • The output is always a file. The expected format is the TVO format (VOTable+binary), unless explicitly specified otherwise.
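The parameters listed above could be assembled into a request as in the following sketch. Only FIELD is named in the slides; the parameter names SIMULATION_ID and SNAPSHOT_ID, the comma-separated encoding of multiple fields, and the example.org endpoint are all assumptions made for illustration.

```python
from urllib.parse import urlencode

def build_download_url(base_url, simulation_id, snapshot_id, fields=None):
    """Build a download request URL for a hypothetical SimDAP service."""
    params = [
        ("SIMULATION_ID", simulation_id),  # hypothetical parameter name
        ("SNAPSHOT_ID", snapshot_id),      # hypothetical parameter name
    ]
    if fields:
        # Assumption: several fields are sent as one comma-separated value.
        params.append(("FIELD", ",".join(fields)))
    return base_url + "?" + urlencode(params)

url = build_download_url(
    "http://example.org/simdap/download",  # placeholder endpoint
    "sim42",
    "snap007",
    fields=["temperature", "density"],
)
```

Omitting `fields` yields a request for the whole snapshot, which matches the case where the service does not offer fields selection.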
Cutout: goals • Data could be too large to be moved from the server. • The user could be interested in only a small fraction of the data. • The cutout service lets the user focus on a region of interest, extracting the corresponding data and downloading the resulting file. • In principle the cutout could be of any shape. For simplicity, SimDAP only deals with 3D rectangular selections, identified by a 3D point (a vertex or the center of the selection region) and the size of the selection box in the 3 coordinate directions. • The cutout operation can differ completely depending on the kind of data: regular meshes, AMR, point-like/unstructured data.
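For point-like data, the rectangular selection described above reduces to a per-axis range test. The sketch below is one possible server-side implementation under that assumption; the function names are hypothetical, and regular meshes or AMR data would need a different treatment.

```python
def box_from_center(center, size):
    """Convert a (center, size) box into the equivalent (vertex, size) box."""
    vertex = tuple(c - s / 2.0 for c, s in zip(center, size))
    return vertex, tuple(size)

def select_in_box(points, vertex, size):
    """Return the points p with vertex <= p < vertex + size on every axis."""
    upper = tuple(v + s for v, s in zip(vertex, size))
    return [
        p for p in points
        if all(lo <= x < hi for x, lo, hi in zip(p, vertex, upper))
    ]

# Hypothetical particle positions.
points = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 2.5, 2.5), (0.9, 1.9, 0.1)]

# A box of side 2 centered on (1, 1, 1) has its lower vertex at the origin.
vertex, size = box_from_center((1.0, 1.0, 1.0), (2.0, 2.0, 2.0))
selected = select_in_box(points, vertex, size)
```

The half-open interval (lower bound included, upper bound excluded) is a design choice of this sketch: it guarantees that adjacent boxes never select the same particle twice.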
Cutout: input parameters, result • The Cutout service requires different classes of inputs: • Source parameters: • Simulation id • Snapshot id • Physical quantities selection parameters: • FIELD • Cutout fields parameters and corresponding units: • COORD_X, COORD_Y, COORD_Z • UNITS • Selection region parameters: • VERT_X, VERT_Y, VERT_Z • SIZE_X, SIZE_Y, SIZE_Z • Further parameters can be specified and published by the service. They allow the user to customize the cutout service.
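The four classes of inputs can be grouped in a small container that validates them before they are sent. This is only a sketch: the class name, the SIMULATION_ID and SNAPSHOT_ID keys, and the reading of COORD_X, COORD_Y, COORD_Z as the names of the fields holding the coordinates are assumptions, not part of the slides.

```python
from dataclasses import dataclass

@dataclass
class CutoutRequest:
    simulation_id: str
    snapshot_id: str
    fields: list          # physical quantities to extract (FIELD)
    coord_fields: tuple   # fields used as x, y, z coordinates (COORD_*)
    units: str            # units of vertex and size (UNITS)
    vertex: tuple         # selection box vertex (VERT_*)
    size: tuple           # selection box size (SIZE_*)

    def to_params(self):
        """Validate the request and return it as a flat parameter dict."""
        if not (len(self.coord_fields) == len(self.vertex) == len(self.size) == 3):
            raise ValueError("coord_fields, vertex and size need three components")
        if any(s <= 0 for s in self.size):
            raise ValueError("box sizes must be positive")
        params = {
            "SIMULATION_ID": self.simulation_id,  # hypothetical key
            "SNAPSHOT_ID": self.snapshot_id,      # hypothetical key
            "FIELD": ",".join(self.fields),       # assumed comma-separated
            "UNITS": self.units,
        }
        for axis, name, v, s in zip("XYZ", self.coord_fields, self.vertex, self.size):
            params["COORD_" + axis] = name
            params["VERT_" + axis] = v
            params["SIZE_" + axis] = s
        return params

request = CutoutRequest("sim42", "snap007", ["temperature"],
                        ("x", "y", "z"), "Mpc",
                        (10.0, 20.0, 30.0), (5.0, 5.0, 5.0))
params = request.to_params()
```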
The UNITS problem • The Cutout function requires knowledge of the units in which the data coordinates are stored… • Example: • The user needs to extract all the data inside a sub-volume of a cosmological simulation and wants to use “natural” units to identify the vertex position and the box size: Mpc • However the data could be stored in different units (e.g. kpc or cm!!!). • In order to make the cutout possible, two basic operations MUST be accomplished: • The server MUST “send” the units to the client (or conversion factors to some “natural” units); • The client, using the units, MUST convert the input parameters.
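The client-side conversion step can be sketched as follows. The conversion table uses the approximate value 1 pc ≈ 3.0857e18 cm; the function names and the idea that the server reports its units as a plain string are assumptions for illustration.

```python
# Length units expressed in centimetres (approximate values).
CM_PER_UNIT = {
    "cm": 1.0,
    "kpc": 3.0857e21,
    "Mpc": 3.0857e24,
}

def convert_length(value, from_unit, to_unit):
    """Convert a length between two of the units listed in CM_PER_UNIT."""
    return value * CM_PER_UNIT[from_unit] / CM_PER_UNIT[to_unit]

def convert_box(vertex, size, client_unit, server_unit):
    """Convert the selection box from the user's units to the stored units."""
    new_vertex = tuple(convert_length(v, client_unit, server_unit) for v in vertex)
    new_size = tuple(convert_length(s, client_unit, server_unit) for s in size)
    return new_vertex, new_size

# The user thinks in Mpc; the server (hypothetically) reports kpc.
vertex_kpc, size_kpc = convert_box((10.0, 20.0, 30.0), (5.0, 5.0, 5.0), "Mpc", "kpc")
```

With these inputs the box of side 5 Mpc becomes a box of side about 5000 kpc, which is the value the client would place in SIZE_X, SIZE_Y and SIZE_Z.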
Cutout tools The Cutout function requires proper tools to select the region of interest. The tools can be the same as (or derived from) those used for the preview. An example using VisIVO…
Cutout results The Cutout result is DATA. The result consists of raw data and metadata. The metadata are organized as a VOTable (in general, an XML file). The VOTable describes the data and contains the acref parameter(s) pointing to one (or more) file(s) containing the raw data. The raw data may not be immediately available (access to secondary storage devices, CPU-demanding operations…). In this case DATA STAGING is necessary.
Custom Services and Service Registration • Custom services are supported. In this case the complete description of the service must be available as a registry entry. • In general, a SimDAP service must be registered. This means: • Publish information about the service name and owner • Publish the URL of the service • Publish the available services (preview, download, cutout, custom…)
VOTables example 1 VOTable for the velocity field of a fluid on a fixed 3D mesh
<VOTABLE>
  <RESOURCE name="myVectorField" type="results">
    <DESCRIPTION>Velocity Field from N-Body run</DESCRIPTION>
    <INFO name="QUERY_STATUS" value="OK"/>
    <TABLE name="VelocityField" ID="Vel" order="sequential" arraysize="41x41x41">
      <FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x" datatype="float" unit="km/s"/>
      <FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y" datatype="float" unit="km/s"/>
      <FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z" datatype="float" unit="km/s"/>
      <DATA>
        <BINARY>
          <STREAM acref="file:///scratch/myhome/test.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
VOTables example 2 VOTable for the temperature field of a mesh-based quantity and the positions of N-Body particles extracted from the same spatial region.
<VOTABLE>
  <RESOURCE name="myMixedData" type="results">
    <INFO name="QUERY_STATUS" value="OK"/>
    <TABLE name="Particles" ID="NBody" order="sequential" arraysize="100000">
      <FIELD name="x" ID="x1" ucd="pos.cartesian;pos.cartesian.x" datatype="float" unit="Mpc"/>
      <FIELD name="y" ID="y1" ucd="pos.cartesian;pos.cartesian.y" datatype="float" unit="Mpc"/>
      <FIELD name="z" ID="z1" ucd="pos.cartesian;pos.cartesian.z" datatype="float" unit="Mpc"/>
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/particles.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
    <TABLE name="Mesh" ID="MeshTemp" order="sequential" arraysize="41x41x41">
      <FIELD name="temperature" ID="temp" ucd="phys.temperature;pos.cartesian" datatype="float" unit="K"/>
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/mesh.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
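A client has to read such a VOTable to learn which variables each binary file contains and where the file is. The sketch below does this with Python's standard XML parser on a shortened copy of the second example. It assumes a document without XML namespaces and accepts either `href` or `acref` on STREAM, since both spellings appear in the examples; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

VOTABLE_XML = """<VOTABLE>
  <RESOURCE name="myMixedData" type="results">
    <INFO name="QUERY_STATUS" value="OK"/>
    <TABLE name="Particles" ID="NBody" order="sequential" arraysize="100000">
      <FIELD name="x" ID="x1" datatype="float" unit="Mpc"/>
      <DATA><BINARY><STREAM href="file:///scratch/myhome/particles.bin"/></BINARY></DATA>
    </TABLE>
    <TABLE name="Mesh" ID="MeshTemp" order="sequential" arraysize="41x41x41">
      <FIELD name="temperature" ID="temp" datatype="float" unit="K"/>
      <DATA><BINARY><STREAM href="file:///scratch/myhome/mesh.bin"/></BINARY></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""

def list_tables(xml_text):
    """Return one dict per TABLE: name, layout, shape, fields, file reference."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("TABLE"):
        stream = table.find("./DATA/BINARY/STREAM")
        ref = None
        if stream is not None:
            ref = stream.get("href") or stream.get("acref")
        tables.append({
            "name": table.get("name"),
            "order": table.get("order"),
            # "41x41x41" -> (41, 41, 41); "100000" -> (100000,)
            "shape": tuple(int(n) for n in table.get("arraysize").split("x")),
            "fields": [(f.get("name"), f.get("unit")) for f in table.findall("FIELD")],
            "ref": ref,
        })
    return tables

tables = list_tables(VOTABLE_XML)
```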
Raw data file formats • Data file formats can differ according to their usage • Archive-side files should be • High performance (fast access) • Standard (portable and persistent) • Result files should be • Simple (no specific I/O libraries required to access them) • Self-descriptive (e.g. XML metadata headers) • Compressed (to minimize transfer effort) • In any case, data size is crucial. ASCII files are “deprecated”. Base64 (or similar) encodings for HTTP transfers are to be avoided: they waste time (for conversions) and “space” (increased size).
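The size penalty of Base64 is easy to quantify: every 3 bytes of binary data become 4 ASCII characters. The following short check uses Python's standard library; the buffer is just zero-filled placeholder data.

```python
import base64

raw = bytes(300_000)               # 300,000 bytes of placeholder binary data
encoded = base64.b64encode(raw)    # Base64 text representation

# Base64 maps each group of 3 input bytes to 4 output characters.
overhead = len(encoded) / len(raw)
```

Here `overhead` is about 1.33, i.e. the transfer grows by roughly one third before any conversion time is counted.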
Result files • A simple solution is represented by raw binary files with the following characteristics: • Several variables can be stored in one file • Each variable represents a scalar quantity • Components of multidimensional quantities are stored as separate variables • Variables have the same number of elements but can have different types • Variables can be stored either as Tabular or as Sequential (see next slide) • A descriptor file (XML) is associated with the binary file to make it self-descriptive • Advantages: (little) standardization, simplicity, no specific I/O libraries required, fast access • Drawbacks: limited portability (endianness problem, data types), little standardization, no compression
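The endianness problem mentioned among the drawbacks can be shown with a single 32-bit float. The sketch uses Python's `struct` module, where `<` and `>` select little-endian and big-endian byte order; the variable names are illustrative.

```python
import struct

value = 1.5
little = struct.pack("<f", value)   # little-endian float32
big = struct.pack(">f", value)      # big-endian float32

# The two encodings contain the same bytes in reverse order.
same_bytes_reversed = little == big[::-1]

# Reading with the correct byte order recovers the value...
right = struct.unpack(">f", big)[0]
# ...while reading with the wrong one silently yields a different number.
wrong = struct.unpack("<f", big)[0]
```

No error is raised in the wrong case, which is why the byte order needs to be recorded somewhere, for instance in the XML descriptor that accompanies the binary file.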
Result files: Tabular vs Sequential Tabular files are closer to observational data, and so more compatible with the standard VOTable idea. If the file contains the 3 variables vx, vy, vz, their Tabular storage is: vx(1), vy(1), vz(1) vx(2), vy(2), vz(2) … vx(N), vy(N), vz(N) This is suitable for variables (like the components of a vector) which are always accessed as an N-tuple, or for data analysis tools which need (and load) all the stored variables. However it leads to poor performance if variables have to be loaded separately into memory: loading one variable requires continuous jumps through the file.
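The Tabular layout and its access pattern can be sketched with the `struct` module. Little-endian 32-bit floats and the function names are assumptions of this example, not something the slides specify.

```python
import struct

def pack_tabular(columns):
    """Interleave equally long float columns, one record per element."""
    out = bytearray()
    for record in zip(*columns):
        out += struct.pack("<%df" % len(record), *record)
    return bytes(out)

def read_column_tabular(data, n_vars, index):
    """Extract a single variable: one strided access per element."""
    record_size = 4 * n_vars   # 4 bytes per float32
    n_elements = len(data) // record_size
    return [
        struct.unpack_from("<f", data, i * record_size + 4 * index)[0]
        for i in range(n_elements)
    ]

vx = [1.0, 2.0, 3.0]
vy = [10.0, 20.0, 30.0]
vz = [100.0, 200.0, 300.0]
data = pack_tabular([vx, vy, vz])

# Reading only vy touches every record of the file.
vy_back = read_column_tabular(data, 3, 1)
```

Each element of `vy` sits 12 bytes away from the next one, so retrieving a single variable means one small read per element, which is the "continuous jumps" cost described above.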
Result files: Tabular vs Sequential (cont.) Sequential files are a common choice for “simulators”. If the file contains the 2 variables rho and press, their Sequential storage is: rho(1) rho(2) … rho(N) press(1) press(2) … press(N) Each variable can be read with a single I/O call. This leads to high-performance access to the file, which is typically required when dealing with large files.
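In the Sequential layout a variable occupies one contiguous block, so a single seek and a single read are enough. The sketch below assumes little-endian 32-bit floats and uses an in-memory buffer in place of a real file; the function names are hypothetical.

```python
import io
import struct

def write_sequential(stream, columns):
    """Write each variable as one contiguous block of float32 values."""
    for column in columns:
        stream.write(struct.pack("<%df" % len(column), *column))

def read_variable_sequential(stream, n_elements, index):
    """Read one variable with a single seek and a single read call."""
    block_size = 4 * n_elements   # 4 bytes per float32
    stream.seek(block_size * index)
    raw = stream.read(block_size)
    return list(struct.unpack("<%df" % n_elements, raw))

rho = [1.0, 2.0, 4.0]
press = [0.5, 0.25, 0.125]

buffer = io.BytesIO()             # stands in for a file opened in binary mode
write_sequential(buffer, [rho, press])

# press is the second block in the file.
press_back = read_variable_sequential(buffer, 3, 1)
```

The offset of a variable depends only on its position in the file and on the number of elements, both of which the accompanying descriptor can provide.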
Archive files • Archive files are not “visible” to the end user. Therefore the data provider can choose any suitable format. • The choice should in general be driven by several properties: • The format should be standard and well supported, in order to ensure the preservation of the data and their portability between different computing platforms, software, compilers... (if the technology changes we don’t want to change the data) • The files should be quickly and efficiently accessible, since the data are large and complex operations could be necessary to handle them (e.g. extracting the particles which fall in a certain region) • Various formats with such features are available.
File formats: HDF5 HDF5 (http://hdf.ncsa.uiuc.edu) represents a possible solution to deal with such data • HDF5 is • Portable between most modern platforms • High performance • Well supported • Well documented • Rich in tools • Flexible and extensible • HDF5 drawbacks • Requires some expertise and skill to be used • Information is difficult to access • Can be subject to major library changes (see HDF4 to HDF5) • HDF5 data files are • Platform independent (portable) • Well organized • Self-describing • Metadata enriched • Efficiently accessible
VisIVO Server: Services for TVO [Figure: TVO archive, Visualization Web Services, customizable data view]
VisIVO Server Visualization Web Service The VisIVO Web Service has been implemented using the Apache Axis SOAP engine. Client applications can be written in Java or C++. The ITVO web portal is one such client application. The service implements a data staging mechanism for the VisIVO Server outputs (.png files).
Developer guidelines: web services The ITVO web portal describes the web service classes using class diagrams and publishes the Java code. ITVO Web Services are free software: you can redistribute them and/or modify them under the terms of the GNU General Public License v3.
Developer guidelines: client side The ITVO web portal includes some Java clients that are easy to use and to include in your application. ITVO Web Services and clients are free software: you can redistribute them and/or modify them under the terms of the GNU General Public License v3.