Data Management and Representations in Ecce and CMCS. Theresa L. Windus Pacific Northwest National Laboratory Environmental Molecular Sciences Laboratory Molecular Science Software Group. Outline. Some “definitions” Data and task representations Ecce CMCS Summary Acknowledgement.
Data Management and Representations in Ecce and CMCS Theresa L. Windus Pacific Northwest National Laboratory Environmental Molecular Sciences Laboratory Molecular Science Software Group
Outline • Some “definitions” • Data and task representations • Ecce • CMCS • Summary • Acknowledgement 2
Data and metadata (one scientist’s data is another scientist’s metadata)
$\Delta H^{\circ}_{\mathrm{atomiz},0}(\mathrm{CH_3OOH}) = 522.09 \pm 2.02$ kcal/mol [calculated, G3//B3LYP, T. Windus, more at http://...]
• data: value and uncertainty
• units: kcal/mol
• quantity: enthalpy of atomization
• species: methylhydroperoxide, CAS# 3031-73-0
• temperature: 0 K
• calculated: G3//B3LYP
• creator: T. Windus using Ecce
• more info: http://avatar.emsl.pnl.gov:8080/Ecce/.../CH3OOH/.../GxEnergy
3
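As a rough illustration of how a value and its metadata can travel together, here is a minimal Python sketch. The class name `AnnotatedValue`, its fields, and the metadata keys are assumptions made for this example; they are not Ecce or CMCS data structures.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedValue:
    """A numeric value together with the metadata that gives it meaning."""
    value: float
    uncertainty: float
    units: str
    metadata: dict = field(default_factory=dict)  # free-form, extensible annotations

    def describe(self) -> str:
        # Render in the "value ± uncertainty units" form used on the slide.
        return f"{self.value:.2f} ± {self.uncertainty:.2f} {self.units}"


# Values and annotations taken from the slide; the key names are hypothetical.
atomization = AnnotatedValue(
    value=522.09,
    uncertainty=2.02,
    units="kcal/mol",
    metadata={
        "quantity": "enthalpy of atomization",
        "species": "methylhydroperoxide",
        "cas": "3031-73-0",
        "temperature": "0 K",
        "method": "G3//B3LYP",
        "creator": "T. Windus",
    },
)
print(atomization.describe(), atomization.metadata["method"])
```

Keeping the annotations in an open dictionary rather than fixed fields mirrors the point of the slide: what counts as metadata depends on who consumes the record.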
Metadata Converts Scientific Data into Knowledge
• Metadata provides identification and documentation for scientific data.
  • Example: Attaching an owner, creation date, abstract, and type to data.
  • Example: Tracking data to program versions, and possibly to bugs in that version.
• Metadata documents the context and value of the data.
  • Example: The theoretical atomization energy of methylhydroperoxide (and its uncertainty) from Ecce (used as input to ATcT) contains information identifying the species and the quantity, units, the theoretical method used, vibrational frequencies and geometry, reference to source file, creator, etc.
• Metadata facilitates cross-scale transfer of data.
  • Example: Can show a chain of inputs, including input parameters and configuration files, across scales.
  • Example: Can retrieve literature references which describe this data.
• Metadata allows users to comment on the data and its quality.
  • Example: Can be used for scientific peer review of data.
• Metadata is necessary for effective collaboration.
  • Example: Scientific data becomes more usable to others when it is documented.
Annotation is another term for metadata. Annotations can be added by either the data owner or a third party. 4
Data Pedigree: A Special Kind of Metadata • Data pedigree or data provenance is a relationship which provides a “line of ancestors”. • Pedigree allows for the categorization and tracing of the scientific data, and for the identification of the data’s ultimate origin, possibly across scales. • Pedigree includes the series of steps necessary to reproduce the data. • Data is linked, for example, to projects, references, inputs, and outputs. 5
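The "line of ancestors" idea can be sketched as a walk over input links. The following Python example is only an illustration: the resource names and the `HAS_INPUT` mapping are hypothetical, loosely modeled on the hasInput relationship named elsewhere in this deck, and do not reflect how Ecce or CMCS store pedigree.

```python
from collections import deque

# Hypothetical pedigree links: each resource maps to the resources it was derived from.
HAS_INPUT = {
    "vinoxy_thermo_table": ["vinoxy_energy"],
    "vinoxy_energy": ["nwchem_output"],
    "nwchem_output": ["nwchem_input"],
    "nwchem_input": ["vinoxy_geometry", "basis_6-31G*"],
}


def ancestors(resource, has_input):
    """Return every resource reachable through input links, nearest first."""
    seen, order = set(), []
    queue = deque(has_input.get(resource, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a resource may feed several descendants; report it once
        seen.add(current)
        order.append(current)
        queue.extend(has_input.get(current, []))
    return order


print(ancestors("vinoxy_thermo_table", HAS_INPUT))
```

A breadth-first walk lists the closest ancestors first, which matches how a reader would trace a result back toward its ultimate origin.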
Knowledge Grid • A set of scalable tools, middleware, and services • For the creation, analysis, dissemination, evaluation, and use • Of data, information, and knowledge • By individuals, groups, and communities …A digital place for performing ‘all’ aspects of science 6
Ecce & NWChem
Ecce – Extensible Computational Chemistry Environment
• comprehensive problem solving environment
• common graphical user interfaces
• scientific modeling management
• seamless transfer of information between applications
• persistent data storage through DAV
• integrated scientific data management
• tools for ensuring efficient use of computing resources across a distributed network
• visualization of multi-dimensional data structures
http://ecce.emsl.pnl.gov
NWChem – massively parallel computational chemistry program
• Energetics, geometries, frequencies, etc. at various levels of theory
http://www.emsl.pnl.gov/docs/nwchem
7
Distributed Authoring and Versioning (DAV) • An early web service (XML commands over HTTP) • A widely adopted standard for metadata/data transport • Put/Get data with arbitrary properties (dynamic) • Properties can be discovered and accessed independently • DASL, Versioning, Transactions, … 10
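To make "XML commands over HTTP" and "arbitrary properties" concrete, the sketch below builds the body of a WebDAV PROPPATCH request that sets two custom properties. The namespace URI and the property names are assumptions taken from the example metadata shown later in the deck; nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"
# Assumed namespace, inferred from the property names on the example metadata slide.
ECCE_NS = "http://www.emsl.pnl.gov/ecce"


def build_proppatch(properties, namespace=ECCE_NS):
    """Build a PROPPATCH request body that sets the given properties."""
    root = ET.Element(f"{{{DAV_NS}}}propertyupdate")
    set_el = ET.SubElement(root, f"{{{DAV_NS}}}set")
    prop_el = ET.SubElement(set_el, f"{{{DAV_NS}}}prop")
    for name, value in properties.items():
        # Each property is its own namespaced element, so new ones need no schema change.
        ET.SubElement(prop_el, f"{{{namespace}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


body = build_proppatch({"theory": "SCF/RHF", "runtype": "ESP"})
print(body)
```

A client would send this body with the HTTP method `PROPPATCH` to the resource URL, and could read the same properties back with `PROPFIND`.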
Ecce Physical Model Calculations are referred to as a “virtual document” because we distribute the structure across many physical objects. Physical collections and resources are URI addressable. Collections are unordered and allow mixed content. 15
Calculation Setup [Diagram. Labels: Builder; Basis Set Tool; Calculation Editor; Geometry; ESP; Basis Set; Theory Details; Runtype Details; Template File; Parameters; .edml File; Perl; Python; Basis Set Reformatting Script; ai.input; Input Deck] 16
Output Parsing [Diagram. Labels: Output; Parse Descriptor; Text Block 1, Text Block 2, ..., Text Block N; Parse Script 1, Parse Script 2, ..., Parse Script N; Perl; Ecce DataBase; Job Monitor; Calculation Viewer] 17
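The labels on this slide suggest that a parse descriptor splits program output into text blocks and hands each block to its own parse script. The Python sketch below illustrates that pattern only; Ecce's parse scripts are Perl, and the markers, function names, property name `TE`, and sample text here are all invented for the example.

```python
import re


def parse_total_energy(block):
    """Hypothetical parse function: pull one energy value out of a text block."""
    match = re.search(r"Total energy\s*=\s*(-?\d+\.\d+)", block)
    return {"TE": float(match.group(1))} if match else {}


def parse_cpu_time(block):
    """Hypothetical parse function: pull the CPU time out of a text block."""
    match = re.search(r"cpu time\s*=\s*(\d+\.\d+)", block)
    return {"CPUSEC": float(match.group(1))} if match else {}


# Hypothetical descriptor: start marker, end marker, and the function for that block.
PARSE_DESCRIPTOR = [
    ("BEGIN ENERGY", "END ENERGY", parse_total_energy),
    ("BEGIN TIMING", "END TIMING", parse_cpu_time),
]


def parse_output(text, descriptor=PARSE_DESCRIPTOR):
    """Cut the output into blocks per the descriptor and merge the parsed properties."""
    properties = {}
    for start, end, parser in descriptor:
        pattern = re.escape(start) + r"(.*?)" + re.escape(end)
        for block in re.findall(pattern, text, flags=re.DOTALL):
            properties.update(parser(block))
    return properties


# Made-up output text, not real NWChem output.
sample = (
    "BEGIN ENERGY\nTotal energy = -40.198704\nEND ENERGY\n"
    "unrelated line\n"
    "BEGIN TIMING\ncpu time = 0.96\nEND TIMING\n"
)
print(parse_output(sample))
```

Keeping the block boundaries in a descriptor, separate from the per-block parsers, means support for a new code can be added by registering new entries rather than changing the driver.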
On the calculation:
http://www.emsl.pnl.gov/ecce:contenttype=ecceCalculation
http://www.emsl.pnl.gov/ecce:resourcetype=VIRTUAL_DOCUMENT
http://www.emsl.pnl.gov/ecce:createdWith=v3.2
http://www.emsl.pnl.gov/ecce:owner=d39974
http://www.emsl.pnl.gov/ecce:application=NWChem
http://www.emsl.pnl.gov/ecce:theory=SCF/RHF
http://www.emsl.pnl.gov/ecce:spinmultiplicity=Singlet
http://www.emsl.pnl.gov/ecce:currentVersion=v3.2
http://www.emsl.pnl.gov/ecce:creationdate=Mon, 22 Mar 2004 17:24:00 GMT
http://www.emsl.pnl.gov/ecce:reviewed=false
http://www.emsl.pnl.gov/ecce:runtype=ESP
http://www.emsl.pnl.gov/ecce:launch_machine=arunta
http://www.emsl.pnl.gov/ecce:launch_nodes=1
http://www.emsl.pnl.gov/ecce:launch_rundir=/home/d39974/ecceruns
http://www.emsl.pnl.gov/ecce:launch_totalprocs=1
http://www.emsl.pnl.gov/ecce:launch_user=d39974
http://www.emsl.pnl.gov/ecce:launch_maxmemory=0
http://www.emsl.pnl.gov/ecce:launch_remoteShell=ssh
http://www.emsl.pnl.gov/ecce:job_jobid=13858
http://www.emsl.pnl.gov/ecce:job_path=/home/d39974/ecceruns/tracebug/esp
http://www.emsl.pnl.gov/ecce:job_clienthost=arunta
http://www.emsl.pnl.gov/ecce:startdate=Mon, 22 Mar 2004 17:25:11 GMT
http://www.emsl.pnl.gov/ecce:version=Thu May 8 13:16:51 PDT 2003 Version 4.5
http://www.emsl.pnl.gov/ecce:state=Complete
http://www.emsl.pnl.gov/ecce:completiondate=Mon, 22 Mar 2004 17:25:14 GMT
DAV:resourcetype=<D:collection/>
DAV:creationdate=2004-03-22T17:24:38Z
DAV:getlastmodified=Mon, 22 Mar 2004 17:24:38 GMT
DAV:getetag="b2805d-1000-926a8180"
DAV:supportedlock=
DAV:getcontenttype=httpd/unix-directory
On the molecule:
http://www.emsl.pnl.gov/ecce:empiricalFormula=H4C
http://www.emsl.pnl.gov/ecce:charge=0.000000
http://www.emsl.pnl.gov/ecce:useSymmetry=false
http://www.emsl.pnl.gov/ecce:symmetrygroup=C1
DAV:creationdate=2004-03-22T17:24:38Z
DAV:getcontentlength=386
DAV:getlastmodified=Mon, 22 Mar 2004 17:24:38 GMT
DAV:getetag="b28064-182-926a8180"
DAV:executable=F
DAV:supportedlock=
DAV:getcontenttype=chemical/x-ecce-mvm
Example metadata 18
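Each property in this listing appears to follow the pattern namespace, colon, name, equals sign, value. A minimal Python sketch for splitting such entries is shown below; the function names are made up for illustration, and the assumption that the last colon before the first `=` separates namespace from name is inferred from the listing, not from Ecce documentation.

```python
def parse_property(line):
    """Split 'namespace:name=value' into its three parts."""
    key, _, value = line.partition("=")       # the value may itself contain '=' or ':'
    namespace, _, name = key.rpartition(":")  # the last ':' in the key ends the namespace
    if namespace == "DAV":
        namespace = "DAV:"                    # the WebDAV namespace URI itself ends in a colon
    return namespace, name, value


def group_properties(lines):
    """Group parsed properties by namespace, skipping lines without '='."""
    grouped = {}
    for line in lines:
        line = line.strip()
        if "=" not in line:
            continue  # e.g. the 'On the calculation:' headings
        namespace, name, value = parse_property(line)
        grouped.setdefault(namespace, {})[name] = value
    return grouped


listing = [
    "On the molecule:",
    "http://www.emsl.pnl.gov/ecce:empiricalFormula=H4C",
    "http://www.emsl.pnl.gov/ecce:symmetrygroup=C1",
    "DAV:getcontentlength=386",
    "DAV:getcontenttype=chemical/x-ecce-mvm",
]
print(group_properties(listing))
```

Grouping by namespace separates the application-defined properties from the standard DAV ones, which is what lets each be discovered and queried independently.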
Example MVM file
title: demo
type: molecule
num_atoms: 1065
atom_info: symbol cart
atom_list:
O -2.37400 -3.09100 13.5210
H -1.91600 -2.20200 14.0480
...
pdb_list:
H O5* RC 1 157D A
H H5T RC 1 157D A
…
attr_list:
-0.622300 1 1 0 0
0.429500 1 1 0 0
…
atom_type_list:
OH
HO
…
num_bonds: 1028
bond_list:
2 1 1.00000
1 3 1.00000
…
19
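A reader for this kind of file could be sketched as follows. This is an assumption-laden illustration: it presumes one `key:` header per line, with the records of any `*_list` key on the lines that follow, which is inferred from the example rather than from a format specification. The sample is shortened to two atoms so that it is self-consistent.

```python
def parse_mvm(text):
    """Parse 'key: value' headers; lines without a header extend the current list."""
    data, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        head, sep, rest = line.partition(":")
        if sep and head.replace("_", "").isalpha():
            # A header line such as 'title: demo' or 'atom_list:'.
            current = head
            data[current] = [] if current.endswith("_list") else rest.strip()
        elif current is not None and isinstance(data[current], list):
            # A record line belonging to the most recent *_list header.
            data[current].append(line.split())
    return data


# Shortened, hypothetical sample in the assumed layout.
sample_mvm = """\
title: demo
type: molecule
num_atoms: 2
atom_info: symbol cart
atom_list:
O -2.37400 -3.09100 13.5210
H -1.91600 -2.20200 14.0480
num_bonds: 1
bond_list:
2 1 1.00000
"""
molecule = parse_mvm(sample_mvm)
print(molecule["title"], len(molecule["atom_list"]), molecule["bond_list"])
```

Values are kept as strings so that the caller decides how to interpret each column.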
XML format for Properties
<?xml version="1.0" encoding="utf-8" ?>
<value name="CPUSEC" units="second">9.60000000000000e-01</value>
<?xml version="1.0" encoding="utf-8" ?>
<vector name="MLKNSHELL" rows="7" units="e" rowLabel="Unknown" rowLabels="1 2 3 4 5 6 7">
1.99199825923126e+00 1.18803456337004e+00 3.08260463820159e+00 9.34340637068915e-01
9.34340635555820e-01 9.34340634042729e-01 9.34340632529639e-01
</vector>
<?xml version="1.0" encoding="utf-8" ?>
<tsvectable name="GEOMTRACE" rows="5" units="Angstrom" columns="3" vectors="1" rowLabel="Atom,Coordinate" rowLabels="0 1 2 3 4" columnLabel="Coordinate" vectorLabel="Coordinate" columnLabels="X Y Z">
<step number="1">
0.000000000000000e+00 0.000000000000000e+00 0.000000000000000e+00
-6.755000000000000e-01 -6.755000000000000e-01 6.755000000000000e-01
6.755000000000000e-01 6.755000000000000e-01 6.755000000000000e-01
6.755000000000000e-01 -6.755000000000000e-01 -6.755000000000000e-01
-6.755000000000000e-01 6.755000000000000e-01 -6.755000000000000e-01
</step>
<step number="2">
6.767628142309400e-15 -6.950100046595310e-09 1.390021315920880e-08
-6.239857395114590e-01 -6.239857464615680e-01 6.239857534116811e-01
6.239857568867110e-01 6.239857499366001e-01 6.239857707869190e-01
6.239857742619920e-01 -6.239857812120860e-01 -6.239857603617700e-01
-6.239857916372510e-01 6.239857846871540e-01 -6.239857777370440e-01
</step>
<step number="3">
6.549446678833860e-15 1.124467050187860e-09 -2.248938851918010e-09
-6.252750669032320e-01 -6.252750631744280e-01 6.252750594456050e-01
6.252750588833910e-01 6.252750626121890e-01 6.252750514257610e-01
6.252750508635410e-01 -6.252750471347340e-01 -6.252750583211300e-01
-6.252750428437061e-01 6.252750465725070e-01 -6.252750503012980e-01
</step>
</tsvectable>
20
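Reading these property documents back is straightforward with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the function names are made up, the sample table is shortened to two atoms, and each property is assumed to arrive as its own XML document.

```python
import xml.etree.ElementTree as ET


def parse_value(xml_text):
    """Return (name, value, units) for a scalar <value> property."""
    element = ET.fromstring(xml_text)
    return element.get("name"), float(element.text), element.get("units")


def parse_tsvectable(xml_text):
    """Return {step number: rows}, each row holding 'columns' floats."""
    table = ET.fromstring(xml_text)
    columns = int(table.get("columns"))
    steps = {}
    for step in table.findall("step"):
        numbers = [float(token) for token in step.text.split()]
        # Regroup the flat list of numbers into rows of 'columns' entries.
        rows = [numbers[i:i + columns] for i in range(0, len(numbers), columns)]
        steps[int(step.get("number"))] = rows
    return steps


cpu = parse_value('<value name="CPUSEC" units="second">9.60000000000000e-01</value>')

# Shortened, hypothetical table with two atoms instead of five.
trace = parse_tsvectable(
    '<tsvectable name="GEOMTRACE" rows="2" units="Angstrom" columns="3">'
    '<step number="1">0.0 0.0 0.0 -0.6755 -0.6755 0.6755</step>'
    '</tsvectable>'
)
print(cpu, trace)
```

Splitting on whitespace only works when the numbers are separated, so a writer should always emit at least one space or newline between values.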
Crossing the Molecular to Thermodynamic Scales
Pedigree is imperative to moving data across scales.
[Diagram: pedigree graph for vinoxy. Labels: Data Model; Input Parameters; Properties; NWChem Input File; NWChem Output File; NWChem Input; NWChem Output; Gaussian Input; Gaussian Output; Optimization and Frequencies; B3LYP; 6-31G*; Vibrational Mode Animated GIF; QCISD; QCISD(T,FC); MP2; MP2(FC); G3MP2large; G3(MP2)B3LYP; Energy; Hf; NASA File. Legend: Ecce; CMCS; Active Tables; Gaussian; NWChem; Pedigree - hasInput; Pedigree - hasOutput] 21
The Multi-scale Challenge for Chemical Science
• Impact of chemical science relies upon flow of information across physical scales
  • Data from smaller scales supports models at larger scales
• Critical science lies at scale interfaces
  • Molecular properties, transport
  • Mechanism validation, reduction
  • Chemistry – fluid interactions
• The pedigree of information matters
  • The propagation of data pedigree across scales is difficult
  • Validation of data and its reliability is often a post-publication process
• Multi-scale science faces barriers
  • Normal publication route is slow
  • Numerous sub-disciplines employ different applications, formats, models
  • Centers of excellence are geographically distributed
23
Multi-scale Chemical Science Data
• Unique terascale reacting flow simulation databases – collection of files @ N x Dt, and experimental data
• Chemical mechanisms – k, MB files in various formats containing collections of reaction rates and transport coefficients. Modeled using theory, validated against experiments
• Kinetic rates – by measurement and computation. Tables collected, reviewed and annotated. NIST WebBook, publications
• Thermo-chemistry – tables of ‘constant’ properties of all molecules (of interest w/data) derived from many experiments, computations, extrapolations
• Quantum chemistry computations of molecular properties – data from one number to large potential energy surfaces; input to thermo-chemistry and reaction rate computations
24
CMCS Spans Scales & Geography Biggest barrier is “language” and informatics 25
Adaptive Informatics Infrastructure
• Infrastructure – a well designed, scalable, reusable, flexible set of tools, middleware, and services
• Informatics – the emerging use of semi-automated means to derive new knowledge from the analysis of (large amounts of) heterogeneous data, annotating existing data with its newly discovered meaning
• Adaptive – able to dynamically change to incorporate new knowledge and support new activities
• Low Barriers
  • Many access points
  • Storage of data in original formats with dynamic metadata extraction and translation
• Powerful
  • Arbitrary formats (binary, ASCII, XML)
  • Integrated data, metadata, pedigree across internal and external tools
• Evolvable
  • Schema can be changed/extended as needed
  • Metadata, translations, viewers, portal, etc. can be dynamically configured
26
CMCS Technical Choices Enable Adaptive, Long-lived Infrastructure • CMCS Data/Metadata services • SAM Translation, Annotation • WebDAV implementation • Notification (JMS, NED) • Search • Pedigree browsing • Core XML schema • Security (JAAS) • Chemical Science Portal • Jetspeed (CHEF) • CMCS Explorer • Application portlets • Community services • Application Integration • Webservices • WebDAV API • Multi-scale data including NIST access A diagram representing the major conceptual elements of the CMCS Informatics Infrastructure. 27
How Metadata is Populated in CMCS
• SAM Metadata Services Layer
  • When data is put into WebDAV, SAM causes XSLTs to be executed to extract metadata from XML files, based on MIME type.
  • Similarly, Binary File Descriptor (BFD) provides an interface to extract metadata from binary files.
  • Other translators can be used as well.
• CMCS data management/pedigree API to facilitate insertion and modification of metadata, in the proper XML format.
  • Java code which allows software developers and scientists to easily write programs to add/edit metadata.
  • Scientists can use these APIs to integrate with existing or new chemical science applications.
  • Uses open source DAV and XML libraries.
• Any WebDAV client application
  • DAVExplorer: Java application
  • CMCSExplorer: Integrated in the CMCS portal
28
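The idea of choosing a metadata extractor by MIME type when data is stored can be sketched as a small registry. This is not SAM's implementation: SAM is described as running XSLTs, whereas this illustration substitutes plain Python functions, and the names `register`, `on_put`, and `extract_mvm` are hypothetical.

```python
EXTRACTORS = {}  # MIME type -> extractor function


def register(mime_type):
    """Decorator that registers an extractor for one MIME type."""
    def decorator(func):
        EXTRACTORS[mime_type] = func
        return func
    return decorator


@register("chemical/x-ecce-mvm")
def extract_mvm(content):
    """Pull a few header fields out of an MVM-style text file (assumed layout)."""
    metadata = {}
    for line in content.decode("utf-8").splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("title", "num_atoms"):
            metadata[key] = value.strip()
    return metadata


def on_put(mime_type, content):
    """Run the registered extractor, if any, when a resource is stored."""
    extractor = EXTRACTORS.get(mime_type)
    return extractor(content) if extractor else {}


print(on_put("chemical/x-ecce-mvm", b"title: demo\nnum_atoms: 1065\n"))
print(on_put("application/octet-stream", b"\x00\x01"))
```

Unknown types are stored without extracted metadata rather than rejected, in keeping with the slide's point that other translators can be added later.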
CMCS Metadata, Annotations, and Pedigree
• Using Dublin Core for some basic pedigree properties of electronic publication: creator, dates, publisher, is-referenced-by, references, etc.
  • Digital library standard for metadata
  • http://www.dublincore.org
• CMCS properties for Chemical Science to enable searching: species name, CAS, chemical properties, and chemical formula.
• CMCS properties for defining scientific data: inputs, outputs, and is-part-of-project.
• CMCS properties for scientific publication and peer review annotations: is-sanctioned-by.
• More than 35 elements are currently defined in the core CMCS pedigree.
• Flexible infrastructure for addition of new metadata. As new metadata is added to the infrastructure, current apps will not break!
CMCS metadata is strongly encouraged, though not required, for all CMCS data, and CMCS metadata is highly extensible. 29
Pedigree Browsing Data is linked to projects, references, inputs, and outputs The Browser enables metadata editing. 31
Automatic Translation and Metadata Extraction Data translations are provided automatically by SAM, using XSLTs previously registered for the file type. 32
Adaptive Infrastructure Enables Application Integration [Diagram. Labels: launch; REACTIONLAB; ELN 5.0; Ecce; DAV+SAM; NS; DAV; NWChem/; GRID RESOURCES; Browser, e-mail; MCS Portal; Portlet API; Shared Data Repository; Active Table; SAM (Mime-type Assignment, Metadata Extraction, Translation, Pedigree Relationships); SAM Web service; CMCS/DAV API; Fitdat; Notification Web service; Notification API; Grid Fabric; Federation; ML; NIST Kinetics DB] 33
Summary • Users just want to have ease of use and flexibility in viewing output – adaptive informatics infrastructure • “Standards” are useful, but it is necessary to be able to translate between diverse “schema” and “ontologies” • Metadata converts scientific data into knowledge 35
Multi-disciplinary Ecce Development Team
Gary Black -- Project lead
Karen Schuchardt -- Software architect lead
Bruce Palmer -- Chemist architect
Todd Elsethagen -- Data management lead
Erich Vorpagel -- Chemist consultant
Michael Peterson -- Operations support
Mahin Hackler -- Operations support
Sue Havre -- Application development
Brett Didier -- Application development
Carina Lansing -- Application development
Steve Matsumoto -- Online help lead
Colleen Winters -- Online help
Doug Rice -- Online help
36
Multi-disciplinary CMCS Team Chemical Science Computer/Information Science Christine Yang, SNL Larry Rahn*, SNL Carmen Pancerella, SNL Renata McCoy, SNL Michael Lee, SNL Wendy Koegler, SNL Ed Walsh, SNL John Hewson, SNL David Montoya*, LANL Lili Xu, LANL Yen-Ling Ho, LANL William H. Green, Jr. *, MIT Michael Frenklach*, UCB William Pitz*, LLNL Michael Minkoff, ANL Thomas C. Allison*, NIST Sandra Bittner, ANL Gregor von Laszewski, ANL David Leahy, SNL Sandeep Nijsure, ANL Al Wagner*, ANL Kaizar Amin, ANL James D. Myers, PNL Branko Ruscic, ANL Brett Didier, PNL Reinhardt Pinzon, ANL Karen Schuchardt, PNL Baoshan Wang, ANL Eric Stephan, PNL Carina Lansing, PNL Theresa Windus*, PNL Elena Mendoza, PNL SAM 37 National Collaboratory Program
Acknowledgements This research was performed in part using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory (PNNL). The MSCF is funded by the Office of Biological and Environmental Research in the U.S. Department of Energy (DOE). PNNL is operated by Battelle for the U.S. Department of Energy under contract DE-AC06-76RLO 1830. Funding is also provided by the Mathematics, Information and Computer Science and Basic Energy Sciences Division of DOE. 38
End 39