This presentation argues that the value of an EOFS (Environmental Observation and Forecasting System) is the number of data products it provides, and examines what limits that number. Topics include data product definitions, quality analysis and translation, product generation and documentation, remote computation, and more.
Modeling Data Product Generation
Bill Howe, Dave Maier
Data Product Management
Thesis: The value of an EOFS is the number of products it provides.
Limits on the number of products:
• Amount of oversight for current products
• Time to create a new product
• Resources required to generate products
Modeling Data Products
Data Product Definitions (DPDs), or “recipes”
• initially for documentation
• “blueprint” for manual construction
Beyond Documentation
• Quality Analysis and Translation
  • calculate quality metrics from DPDs (e.g., resolution)
  • translate DPDs into an executable network of Infopipes (meeting a quality standard)
• Product Generation and Documentation
  • management and scheduling of the product suite based on input availability, resources, and dissemination requirements
  • job shop assembly line
  • adaptive eventually; priorities, feedback to sensors and models
• Performance Optimization
  • algebraic optimization
  • common subresults and shared scans on groups of products
Remote Computation
• “product kit”: final product built at consumer site
• remote “product factory”
Exercise: Fill in the Acronym
CORMORANT
• COlumbia
• River
• Modeling,
• Observation,
• Retrieval?? &
• Archive…
Roadmap
• Vision
• Status
  • Past
    • Graphical Diagram
    • Process Modeling
    • Type System
  • Current
    • Abstract Grids
    • Grid Functions
Graphical System Description
• Studied relevant files and codes to model:
  • Producers and consumers
  • Control flow
  • Data flow
• Benefits:
  • understanding within the project
  • communication outside the project
• Drawbacks:
  • only a ‘snapshot’
  • very literal
  • no scheduling help...
Brittle Scheduling
• Contentious codes cause crashes
• Annotate the diagram with cron job information?
• But it would be nice to capture real executions of all system components for careful study
Instrumenting CORIE
• Model the executions of codes using a relational database
• Monitor CORIE activity using SGI’s FAM technology
• Try to identify bottlenecks, problem spots, and resource consumption properties
• Status: we’re poised to perform further testing; some security concerns have been raised
More than just processes...
• The model is too close a fit
• Let’s start at a higher level...
A Candidate Type System
• Relevant types:
  • TimeSeries (TS)
  • ElementField (EF) / NodeField (NF)
  • DepthField (DF)
• Ex:
  salt.63 = TS (EF (DF Salinity))
  fort.21 = EF Depth
  findmax63 = TS (EF (DF a)) → TS (EF a)
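As a rough illustration of how these nested types compose, the sketch below models TS, EF, and DF as plain Python lists and gives a findmax63-like function the signature shown above. The list encoding, the function body (a maximum over depth), and the sample values are assumptions made for illustration; the slide specifies only the types.

```python
from typing import List, TypeVar

A = TypeVar("A")

# Hypothetical encodings: each constructor is just a list over one dimension.
DepthField = List[A]    # DF a: one value per depth level
ElementField = List[A]  # EF a: one value per grid element
TimeSeries = List[A]    # TS a: one value per time step


def findmax(
    ts: TimeSeries[ElementField[DepthField[float]]],
) -> TimeSeries[ElementField[float]]:
    """TS (EF (DF a)) -> TS (EF a): collapse the depth dimension with max."""
    return [[max(depth_field) for depth_field in element_field]
            for element_field in ts]


# Made-up data: two time steps, two elements, two depth levels each.
salt = [[[31.0, 33.5], [28.0, 29.5]],
        [[30.0, 32.0], [27.5, 26.0]]]
column_max = findmax(salt)
```

The result keeps the time and element structure and drops only the depth level, which is what the type TS (EF a) promises.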
Abstract Data Product Recipes
• But consider compute_plumevol:
[Diagram: recipe producing plumevol from Grid, Elev, and Vol via select(sal<30), subgrid(Ocean), and sum(grid)]
• This informal recipe seems appropriate regardless of the specifics of our data representation
• This information should be captured somewhere!
• Currently it’s obfuscated by C codes, and
• tightly coupled with the TS (EF (DF a)) structure
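One way to picture such a representation-independent recipe is as a composition of small operators. The sketch below is an assumption-laden illustration, not the CORIE code: grid functions are plain dicts from cell name to value, the operator bodies are guesses at what select, subgrid, and sum mean, and all data values are made up.

```python
def select(gf, predicate):
    """Keep only the cells whose value satisfies the predicate."""
    return {cell: value for cell, value in gf.items() if predicate(value)}


def subgrid(gf, region):
    """Restrict a grid function to the cells inside a named region."""
    return {cell: value for cell, value in gf.items() if cell in region}


def plumevol(salinity, volume, ocean):
    """Recipe: total volume of ocean cells with salinity below 30."""
    plume = select(subgrid(salinity, ocean), lambda s: s < 30)
    return sum(volume[cell] for cell in plume)


# Hypothetical four-cell grid; c3 lies outside the Ocean region.
salinity = {"c0": 33.0, "c1": 25.0, "c2": 12.0, "c3": 5.0}
volume = {"c0": 10.0, "c1": 4.0, "c2": 2.0, "c3": 1.0}
ocean = {"c0", "c1", "c2"}

total = plumevol(salinity, volume, ocean)
```

Nothing in plumevol mentions time series, element fields, or depth fields, which is the point the slide makes about decoupling the recipe from the storage layout.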
Topological Grid
• A more general grid Gd is a collection of k-cells of dimension k, k in {0..d}
• A grid function GF is a mapping from a k-cell to a value of type T
  GF : k-cell → T
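The two definitions above can be sketched as small data structures. The class and function names here (Cell, Grid, grid_function) and the integer cell identifiers are hypothetical; the sketch only enforces what the slide states: every cell has a dimension k in {0..d}, and a grid function assigns a value to each k-cell.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Cell:
    dim: int    # k: 0 = node, 1 = edge, 2 = face, ...
    ident: int  # hypothetical identifier, unique within a dimension


@dataclass(frozen=True)
class Grid:
    d: int
    cells: FrozenSet[Cell]

    def __post_init__(self):
        # Every cell must have dimension k in {0..d}.
        if any(not 0 <= c.dim <= self.d for c in self.cells):
            raise ValueError("cell dimension outside 0..d")

    def k_cells(self, k: int) -> FrozenSet[Cell]:
        return frozenset(c for c in self.cells if c.dim == k)


def grid_function(grid: Grid, k: int, values: dict) -> dict:
    """Build GF : k-cell -> T, checking it covers exactly the k-cells."""
    if set(values) != set(grid.k_cells(k)):
        raise ValueError("values must be given for exactly the k-cells")
    return dict(values)


# A 1-D grid: three nodes (0-cells) joined by two edges (1-cells).
nodes = [Cell(0, i) for i in range(3)]
edges = [Cell(1, i) for i in range(2)]
grid = Grid(1, frozenset(nodes + edges))

# A grid function over the nodes, with made-up temperatures in degrees C.
temp = grid_function(grid, 0, {nodes[0]: 15.0, nodes[1]: 14.5, nodes[2]: 14.0})
```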
Imagine a big 4D grid representing our current best data
[Figure labels: experimental ELCIRC vers, forecast, hindcast, hindcast, missing]
Grid Functions (GF) map grid locations to values (e.g., 15 °C, 23.4 psu)
Grid Functions
We can derive new grid functions from our original set
[Diagram: original grid functions GF Salt, GF Velocity, GF Temp, GF Elev, GF Neighbors; derived grid functions GF Magnitude, GF Vorticity, GF Velo N’hood]
Benefits
• Say we have recipes that involve
  • a grid,
  • some grid functions, and
  • some operators
• So what? Well,
  • We can reason about data product outputs
  • We can optimize recipe execution
Reasoning about Types
GF Velocity → applytoall(vort) → GF Vorticity
GF Salt → applytoall(vort) → GF ???
High-level recipes can detect this kind of error before wasting compute resources
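A minimal sketch of such a check, assuming each operator declares the value type it consumes and produces. The Operator record, the check_recipe function, and the use of plain strings as type names are all hypothetical; the idea is only that the recipe is walked symbolically, without touching any data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Operator:
    name: str
    input_type: str   # value type of the grid function it accepts
    output_type: str  # value type of the grid function it produces


def check_recipe(source_type: str, operators: List[Operator]) -> str:
    """Return the output type of the recipe, or raise TypeError on a mismatch."""
    current = source_type
    for op in operators:
        if op.input_type != current:
            raise TypeError(
                f"{op.name} expects GF {op.input_type}, got GF {current}"
            )
        current = op.output_type
    return current


VORT = Operator("applytoall(vort)", "Velocity", "Vorticity")

ok_type = check_recipe("Velocity", [VORT])

try:
    check_recipe("Salt", [VORT])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

The second recipe is rejected before any grid function is loaded, which is where the savings in compute resources would come from.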
Reasoning about Schema
GF1 → subgrid(Ocean) → GF2
type(GF1) = type(GF2), but schema(GF1) ≠ schema(GF2), since GF2 is defined over a smaller grid than GF1
[Figure labels: an invalid transect; a valid transect]
• By tracking schema information through complex recipes we can:
  • check for errors
  • estimate resource requirements (big schemas require big buffers)
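The sketch below illustrates one possible reading of this slide, under stated assumptions: the schema of a grid function is taken to be the set of cells it is defined over, a transect is invalid when it visits a cell outside that set, and buffer size is estimated at 8 bytes per value. The names GF, subgrid, transect, and buffer_bytes are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class GF:
    value_type: str
    values: Dict[str, float]

    @property
    def schema(self) -> FrozenSet[str]:
        # Assumed meaning of "schema": the cells the function is defined on.
        return frozenset(self.values)


def subgrid(gf: GF, region) -> GF:
    """Same value type, smaller schema."""
    kept = {c: v for c, v in gf.values.items() if c in region}
    return GF(gf.value_type, kept)


def buffer_bytes(gf: GF, bytes_per_value: int = 8) -> int:
    """Rough resource estimate: one fixed-size value per cell."""
    return len(gf.schema) * bytes_per_value


def transect(gf: GF, cells: List[str]) -> List[float]:
    """Sample the grid function along a path; fail if the path leaves the schema."""
    missing = set(cells) - gf.schema
    if missing:
        raise KeyError(f"transect leaves the grid at {sorted(missing)}")
    return [gf.values[c] for c in cells]


gf1 = GF("Salt", {"c0": 33.0, "c1": 25.0, "c2": 12.0, "c3": 5.0})
gf2 = subgrid(gf1, {"c0", "c1"})  # hypothetical Ocean region
```

With this bookkeeping, a transect through c1 and c2 is valid on gf1 but can be rejected on gf2 from the schema alone.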
Reasoning about Quality
• Say we have operators coarsen and refine, which lower resolution via grouping and raise resolution via interpolation, respectively
GF1 → coarsen → refine → GF2
type(GF1) = type(GF2), schema(GF1) = schema(GF2), but qual(GF1) ≠ qual(GF2)
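To see why the round trip loses quality, here is a small 1-D sketch. It assumes coarsen averages groups of adjacent values and refine uses nearest-neighbour interpolation (the simplest choice, swapped in for whatever interpolation the real operators would use). The effective_cells field is a hypothetical quality metric counting the independent samples behind the values.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GF:
    values: Tuple[float, ...]
    effective_cells: int  # hypothetical quality metric


def coarsen(gf: GF, factor: int = 2) -> GF:
    """Lower resolution by grouping adjacent cells and averaging each group."""
    groups = [gf.values[i:i + factor] for i in range(0, len(gf.values), factor)]
    means = tuple(sum(g) / len(g) for g in groups)
    return GF(means, min(gf.effective_cells, len(means)))


def refine(gf: GF, factor: int = 2) -> GF:
    """Raise resolution by repeating each value; adds cells, not information."""
    values = tuple(v for v in gf.values for _ in range(factor))
    return GF(values, gf.effective_cells)


gf1 = GF((1.0, 3.0, 5.0, 7.0), effective_cells=4)
gf2 = refine(coarsen(gf1))
```

Here gf2 has as many cells as gf1, but its values are (2.0, 2.0, 6.0, 6.0) and its quality metric has dropped from 4 to 2.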
Optimize via Algebraic Manipulations
Different sequences of operators can give equivalent results
[Diagram 1: ... GF Elev, ... GF Area → computevol → subgrid(Ocean) → GF Vol]
[Diagram 2: ... GF Elev → subgrid(Ocean), ... GF Area → subgrid(Ocean) → computevol → GF Vol]
These are equivalent, but the second avoids computing volume over the entire grid
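The rewrite can be checked on a toy example. The sketch assumes computevol is a cell-wise operator (here, elevation times area, which is a stand-in and not the real volume calculation), since cell-wise operators are the case where pushing subgrid ahead of the computation is safe. Each plan also reports how many cells it processed.

```python
def subgrid(gf, region):
    """Restrict a grid function (dict of cell -> value) to a region."""
    return {c: v for c, v in gf.items() if c in region}


def computevol(elev, area):
    """Hypothetical cell-wise volume. Returns (GF Vol, cells processed)."""
    vol = {c: elev[c] * area[c] for c in elev}
    return vol, len(vol)


def plan_late_subgrid(elev, area, region):
    """First plan: compute volume everywhere, then restrict."""
    vol, work = computevol(elev, area)
    return subgrid(vol, region), work


def plan_early_subgrid(elev, area, region):
    """Second plan: restrict the inputs, then compute volume."""
    return computevol(subgrid(elev, region), subgrid(area, region))


# Made-up four-cell grid with a two-cell Ocean region.
elev = {"c0": 2.0, "c1": 3.0, "c2": 1.0, "c3": 4.0}
area = {"c0": 10.0, "c1": 10.0, "c2": 5.0, "c3": 5.0}
ocean = {"c0", "c1"}

late_result, late_work = plan_late_subgrid(elev, area, ocean)
early_result, early_work = plan_early_subgrid(elev, area, ocean)
```

Both plans return the same GF Vol, but the early-subgrid plan touches two cells instead of four. An operator that reads neighbouring cells would need more care before applying the same rewrite.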
Optimize via Choice of Implementation
GF Salt → select(s < 30) → ?
Candidate result representations:
• GF Bool: F T T T
• GF (Maybe Salt): - 23 22 24
• {KCell}: {c1, c2, c3}
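The three candidate representations can be written side by side. This is a sketch with hypothetical function names; the salinity of the first cell (31) is a made-up value chosen only so that it fails the predicate, as the slide's F and - entries suggest.

```python
def select_as_mask(gf, predicate):
    """GF Bool: every cell kept, value replaced by the test result."""
    return {c: predicate(v) for c, v in gf.items()}


def select_as_maybe(gf, predicate):
    """GF (Maybe Salt): every cell kept, failing values replaced by None."""
    return {c: (v if predicate(v) else None) for c, v in gf.items()}


def select_as_cells(gf, predicate):
    """{KCell}: only the identities of the passing cells."""
    return {c for c, v in gf.items() if predicate(v)}


salt = {"c0": 31, "c1": 23, "c2": 22, "c3": 24}


def below_30(s):
    return s < 30


mask = select_as_mask(salt, below_30)
maybe = select_as_maybe(salt, below_30)
cells = select_as_cells(salt, below_30)
```

The mask and Maybe forms keep the schema of the input, which suits a following cell-wise operator; the cell set is the most compact when few cells pass. Which one is cheapest depends on what consumes the result, which is why the choice is left to an optimizer.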
Optimize via Shared Intermediate Results
A node’s neighbors don’t often change, so we can avoid re-computing this result
[Diagram: GF Neighbors feeds both GF Velo N’hood (from GF Velocity, leading to GF Vorticity) and GF Salt N’hood (from GF Salinity, leading to GF Salt Gradient)]
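A minimal sketch of sharing the neighbor computation, using memoization from the Python standard library. The edge-list grid encoding, the CALLS counter, and the function names are assumptions for illustration; the vorticity and gradient operators themselves are omitted.

```python
from functools import lru_cache

CALLS = {"neighbors": 0}  # counts how often the table is actually built


@lru_cache(maxsize=None)
def neighbors(edges):
    """Build the node -> neighbors table once per distinct edge tuple."""
    CALLS["neighbors"] += 1
    table = {}
    for a, b in edges:
        table.setdefault(a, set()).add(b)
        table.setdefault(b, set()).add(a)
    return {node: frozenset(adjacent) for node, adjacent in table.items()}


def neighborhood(gf, edges):
    """For each node, collect the values of gf at its neighbors."""
    table = neighbors(edges)
    return {node: [gf[m] for m in sorted(table[node])] for node in table}


# A three-node chain; edges must be a tuple so the cache can hash it.
EDGES = ((0, 1), (1, 2))
velocity = {0: 0.5, 1: 0.7, 2: 0.2}
salinity = {0: 30.0, 1: 25.0, 2: 12.0}

velo_nhood = neighborhood(velocity, EDGES)
salt_nhood = neighborhood(salinity, EDGES)
```

Both neighborhood products are built, yet the neighbor table is computed only once. A production system would also need to invalidate the cached table when the grid does change.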
Other niceties...
• We don’t have to re-implement everything to realize benefits
• But eventually we’ll want to wag the dog!
• A collection of recipes can help...
  • communicate the product catalog
  • provide provenance
  • derive new recipes from parts of old ones
  • support for product lines
Summary
• Modeling the current CORIE
  • Graphical System Description
  • pmon
• Modeling the future CORIE
  • Grid Functions
  • Recipes
  • Reasoning
  • Optimization
Milestones
• RPE this spring
• Specify existing data products using the model
• Perform checks on existing production plans
  • Type
  • Schema / Resources
  • Quality
A Thorough Experiment Management Schema
A Good Start...
• task definition
• task instance (with parameters)
• task execution
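The three levels above could map onto three relational tables. The sketch below uses an in-memory SQLite database; every table name, column name, and sample row is hypothetical, since the slide names only the three concepts.

```python
import sqlite3

# Hypothetical schema: one table per level named on the slide.
SCHEMA = """
CREATE TABLE task_definition (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE task_instance (
    id            INTEGER PRIMARY KEY,
    definition_id INTEGER NOT NULL REFERENCES task_definition(id),
    parameters    TEXT NOT NULL
);
CREATE TABLE task_execution (
    id          INTEGER PRIMARY KEY,
    instance_id INTEGER NOT NULL REFERENCES task_instance(id),
    exit_status INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up rows: one definition, one parameterized instance, two executions.
conn.execute("INSERT INTO task_definition (id, name) VALUES (1, 'compute_plumevol')")
conn.execute(
    "INSERT INTO task_instance (id, definition_id, parameters) VALUES (1, 1, 'day=1')"
)
conn.executemany(
    "INSERT INTO task_execution (instance_id, exit_status) VALUES (?, ?)",
    [(1, 0), (1, 1)],
)

# How many times has each parameterized instance been executed?
rows = conn.execute(
    """
    SELECT d.name, i.parameters, COUNT(e.id)
    FROM task_definition AS d
    JOIN task_instance AS i ON i.definition_id = d.id
    JOIN task_execution AS e ON e.instance_id = i.id
    GROUP BY d.name, i.parameters
    """
).fetchall()
```

Separating the instance from its executions lets repeated or failed runs of the same parameterized task be studied together.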
pmon Architecture
[Diagram components:]
• pmon (Process Monitor)
• Database
• Web Server
• fam (File Alteration Monitor): imon, dnotify, or polling, depending on kernel patch
• Filesystem
• pacct (stopped process stats)
• /proc (running process info)
• acct (process accounting)
• Process to Monitor
• Linux Kernel