Automatic Data Selection for In Situ Analysis

Automatic Data Selection for In Situ Analysis Science Problem: Cognitive capacity (human/scientist understanding), storage and I/O have not kept up with our capacity to generate massive amounts physics-based simulation data. Data must be reduced, but which data do we keep or look at? Technical Solution: Apply automatic data selection operations using various metrics and criteria to emulate the exploration during physics simulations (in situ) and make the data bandwidth use more “information dense.” Physics simulations are reduced to a representative set based on the selected metric that is used. Science Impact: Scientist and analyst time is more efficient and bandwidth (cognitive, storage and I/O) is “richer,” as more scientifically-relevant simulation information is packed into it. Woodring, Myers, Nouanesengsy, Wendelberger, Patchett, Fasel, Ahrens dynamic computational detectors raw data events flow control (lightweight) triggers data products (heavyweight) discovery

Combining Statistics, Sampling, and Indexing to Quantitatively Scale Massive Scientific Data Science Problem: Climate and cosmological analysis and exploration via queries on their massive data sets may still result in a massive amount of data for the scientist to comb through and/or transfer. Technical Solution: Combine bitmap indexing with stratified random sampling (stratified by bins, random within bins) with precalculated errors to quickly and quantitatively scale data. Science Impact: Climate and cosmological scientists are able to interactively and automatically scale data queries down to smaller samples with automatic error estimates, accelerating their scientific analysis workflow. Su, Agrawal, Woodring, Myers, Wendelberger, and Ahrens. “Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices.” To appear in HPDC 2013.

Metrics and Workflow for Quantifying the Quality of ReductionTransformations on Large-Scale, Scientific Scalar Data Science Problem: Storage and I/O have not kept up with climate and cosmological simulation capacity to generate data. Data must be reduced, but what has been lost in the process of reduction? Technical Solution: After every reducing transformation (T), assess and record the residuals/errors via compact measures (M) comparing reduced (a) to “original” data (A) by reconstructed data (B). Science Impact: Measuring the data uncertainty via a quantitative quality provenance shows that climate and cosmological scientists are able to use reduced data for scientific analysis. Woodring, Shafii, Biswas, Myers, Wendelberger, Hamann, and Shen. “Metrics and Workflow for Quantifying the Quality of Reduction Transformations on Large-Scale, Scientific Scalar Data.” Submitted to LDAV 2013.

Exploration of Exascale In Situ Visualization and Analysis Approaches IMD Exascale and “Big Data” running simulations, running experiments, static repositories, etc. Novel Ideas Perception, cognitive capacity, and computer bandwidth are limited, but the scale of data continues to increase. In order to save both compute cycles and analyst time, we explicitly select and present data to the scientist: • Time, space, and variable selection algorithms • Store and index selected data products • Interactive presentation and query methods • Illustratively and artistically highlight to perceptually drive scientists to key data Data Selection Algorithms (time, space, variable, product, etc.) raw image geo-metry … Analysis with Perceptually-Driven Highlighting Expected Impact Milestones and Status • In our first year, we have developed several data selection algorithms, designed a prototype visualization and analysis system that utilizes selected data, and quantified the effects on selecting scientific data. • Milestone Expected Actual • Data Selection Algorithms 03/13 03/13 • Data Product Explorer 09/13 09/13 • Selection Quality Measurements 09/13 09/13 • Advanced Selection Algorithms 03/14 • Perceptually-Driven Presentation 03/14 • Selection in Data Product Explorer 09/14 • Domain-Driven Selection 03/15 • Bandwidth Utilization Adjustment 09/15 • Advanced Presentation Methods 09/15 • Save compute time as only selected data are stored with limited I/O bandwidth and capacity • Not all data can be saved from an exascale simulation, therefore we must be prescriptive on what data are saved • We will provide data selection algorithms • Save analysis time as selected data are presented to the scientist • The resulting data will still be massive, more than any one scientist can look at • Interaction, query, and highlighting methods drive scientists to view key data Principal Investigator: James Ahrens et al., LANL Sept. 25, 2013

Automatic Data Selection for In Situ Analysis

Automatic Data Selection for In Situ Analysis

Presentation Transcript

Chameleon Automatic Selection of Collections

Attacking Automatic Wireless Network Selection

Feature Selection for Automatic Taxonomy Induction

Automatic Data Selection for In Situ Analysis

DAPPER: An OPENDAP Server for In-Situ Data

In situ MET performance in early data

In situ

Automatic Index Selection for Shifting Workloads

In Situ Data Collection Currents

Chameleon Automatic Selection of Collections

MEsoSCale dynamical Analysis through combined model, satellite and in situ data

Tools for accessing distributed in-situ data collections

EPIC Tools for in-situ data collections

Automatic Data

Plant Genetic Resource Gap Analysis: targeting CWR for in situ and ex situ conservation

Feature Selection Stability Analysis for Classification Using Microarray Data

Image Analysis for Automatic Phenotyping

Automatic In Situ Identification of Plankton

Automatic Indexing (Term Selection)

EPIC Tools for in-situ data collections

Structural Proteomics Automatic Target Selection

In Situ Data, ncBrowse and OPeNDAP