Discussion: Dakota Results Database
Brian M. Adams, March 12, 2013
Why a Dakota Results Database?
• Primary driver: Dakota executable users want more uniform, centralized access to output from Dakota iterative studies
• Library mode users want the same, via a C++ interface
• Initially focused on results from an Iterator (method), plus run configuration (reproducibility) information
• Extensions possible to interface, approximation, and transformed evaluations; iteration history and details; metadata
• For memory-limited cases, push data out of core memory after computing, then pull it back in for results reporting (serialization may be more appropriate)
• Broader design notes at https://software.sandia.gov/trac/dakota/wiki/DatabaseDesign
Initial High-level Requirements
• Store results from the most common studies; defer function evaluation data to the restart database
• Include enough metadata for a user to directly locate/extract results
• In-core and file databases; options for when to sync between them
• Initial file format goals: both human-readable and machine-parseable (simple text, HDF5, YAML/XML, SQL)
• Avoid duplication of data
  • In-core database may replace class data
  • Don’t store labels many times
• Avoid re-computation and re-implementation when possible
Progress through Jan. 31, 2013
• Surveyed the various data output by Dakota iterators (see Trac)
• Initial discussion October 2012; design reviews and discussion on December 5, 2012
• Initial implementation delivered in Dakota 5.3:
  • In-core boost::any database, with an option for array-based storage
  • Simple dump to a pseudo-hierarchical annotated text file (illustrated below)
  • Coverage of “most” results output, focused on the most common
  • Option to add metadata with any archived result
• Demonstrated archiving LHS moments at compute time and loading them at print time
• Does not yet address concerns with duplication, out-of-core storage, re-computation, or re-implementation; no YAML or HDF5 support
• Show examples of text results output for hybrid optimization, sampling, PCE, and a helper iterator (PCE, EGO)
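For reference, a purely hypothetical sketch of what the pseudo-hierarchical annotated text dump might look like for the archived LHS moments (every field name and number here is invented for exposition; this is not actual Dakota output):

    # results dump (illustrative only)
    sampling : lhs1 : execution 1
      response_means  (metadata: Row labels = response_fn_1 response_fn_2)
        2.5000e+00  7.1000e-01
      response_std_deviations
        3.2000e-01  9.0000e-02

Entries are grouped by the (method_name, method_id, execution number, data label) key described on the Storage Types slide.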
Current Abstractions
• ResultsManager: manages the in-core and file-based databases under the hood
• Data is posted to the ResultsManager through an API using concrete types
  • Under the hood, it is stored in a boost::any or passed to the file
• ResultsEntry: used to retrieve a result from the database (see the sketch below)
  • If the in-core database is active, manages a reference to the stored data
  • If not, loads from file and manages a reference to a contained data object
  • Allows retrieval of a single entry in an array, to support per-function restore of data
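To make the retrieval behavior concrete, here is a minimal, self-contained sketch of the ResultsEntry idea: alias the stored object when the in-core database is active, otherwise retain a private copy of data loaded from file. The EntryView class and every name in it are hypothetical stand-ins for illustration, not Dakota’s actual interface.

    #include <iostream>
    #include <vector>

    // Hypothetical stand-in for the ResultsEntry concept described above.
    template <typename StoredType>
    class EntryView {
    public:
      // in_core == true:  alias data owned by the in-core database
      // in_core == false: retain a copy of data just loaded from file
      EntryView(const StoredType& data, bool in_core)
        : in_core_(in_core),
          external_(in_core ? &data : 0),
          owned_(in_core ? StoredType() : data) {}

      const StoredType& get() const { return in_core_ ? *external_ : owned_; }

    private:
      bool in_core_;
      const StoredType* external_;  // valid only in the in-core case
      StoredType owned_;            // populated only in the file case
    };

    int main() {
      std::vector<double> in_core_entry(1, 2.5);  // pretend: lives in the in-core DB
      EntryView<std::vector<double> > view_a(in_core_entry, true);

      std::vector<double> from_file(1, 7.0);      // pretend: just parsed from file
      EntryView<std::vector<double> > view_b(from_file, false);

      std::cout << view_a.get()[0] << " " << view_b.get()[0] << std::endl;
      return 0;
    }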
Storage Types: dakota_results_types.hpp
• Data key: method_name, method_id, execution number, data label
    typedef tuple<string, string, size_t, string> ResultsKeyType;
• Data value: boost::any, currently supporting (see the example below):
  • RealMatrix, RealVector, StringVector
  • Array of: RealMatrix, RealVector, StringVector (typically per-function)
• Metadata: metadata label, vector of strings
    typedef map<string, vector<string> > MetaDataType;
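A self-contained sketch of this key/value scheme, using boost::any as on this slide and std::tuple for the key type (the slide leaves the tuple unqualified); standard containers stand in for Dakota’s Real/String types, so this mirrors the described design rather than Dakota’s implementation:

    #include <boost/any.hpp>
    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <string>
    #include <tuple>
    #include <vector>

    // Data key: method_name, method_id, execution number, data label
    typedef std::tuple<std::string, std::string, std::size_t, std::string>
      ResultsKeyType;

    // Metadata: metadata label -> vector of strings
    typedef std::map<std::string, std::vector<std::string> > MetaDataType;

    int main() {
      std::map<ResultsKeyType, boost::any> db;

      // Archive a vector of response means under a fully qualified key.
      std::vector<double> means(2);
      means[0] = 1.2; means[1] = 3.4;
      ResultsKeyType key("sampling", "lhs1", 0, "response_means");
      db[key] = means;

      // Retrieval requires knowing the concrete stored type.
      const std::vector<double>& stored =
        boost::any_cast<const std::vector<double>&>(db.at(key));
      std::cout << "mean[0] = " << stored[0] << std::endl;
      return 0;
    }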
Initial Design: Lessons / Challenges
• Unique identifiers for all methods/instances run, including helper iterators
• Structure/hierarchy vs. flexibility/extensibility
• The best storage layout for data is likely different from the current class member and output organization
• When to use per-function vs. contiguous data sets
• How to handle highly ragged or conditional data (e.g., different moment types per function)
• PCE coefficients or Sobol indices may be stored in a matrix, but we want to be able to write/read them one function at a time (see the sketch after this list)
• Whether to group a best point together with its functions and constraints, or to store variables together in one array and functions together in another
• Dealing with Dakota::String and Boost multi-array of string
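One possible pattern for that matrix challenge: keep coefficients for all response functions in one contiguous, row-major block, but expose single-function write/read accessors. The CoeffMatrix class below is a hypothetical illustration under those assumptions, not Dakota code.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Contiguous, row-major storage (one row per response function)
    // with per-function write/read access.
    class CoeffMatrix {
    public:
      CoeffMatrix(std::size_t num_fns, std::size_t num_coeffs)
        : cols_(num_coeffs), data_(num_fns * num_coeffs, 0.0) {}

      // Write coefficients for a single response function (one row);
      // assumes coeffs.size() >= the number of columns.
      void set_function(std::size_t fn, const std::vector<double>& coeffs) {
        for (std::size_t j = 0; j < cols_; ++j)
          data_[fn * cols_ + j] = coeffs[j];
      }

      // Read back a single function's row without touching the others.
      std::vector<double> get_function(std::size_t fn) const {
        return std::vector<double>(data_.begin() + fn * cols_,
                                   data_.begin() + (fn + 1) * cols_);
      }

    private:
      std::size_t cols_;
      std::vector<double> data_;  // row-major: function index selects the row
    };

    int main() {
      CoeffMatrix pce(2, 3);                  // 2 functions, 3 coefficients each
      pce.set_function(0, std::vector<double>(3, 1.5));
      std::cout << pce.get_function(0)[2] << std::endl;  // prints 1.5
      return 0;
    }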
Discussion: Results DB Next Steps
• What do you want from this capability as a user? As a developer?
• What kinds of queries do you want on this data? Is it important to be able to slice it multiple ways, or can that be done in other tools?
• How do other tools handle this kind of output?
• Should we focus first on just getting the output out, then on efficiency issues, class reorganization, etc., or attempt everything at once?