Extreme Scale Analytics on Spatio-Temporal Datasets
Joel Saltz
Center for Comprehensive Informatics & Biomedical Informatics Department
Emory University
Morphometric Image Analysis Pipeline
• Preprocessing: normalization, tiling, etc.
• Segmentation: identify nuclei as objects
• Feature Extraction: compute morphometric features
• Classification: unsupervised learning (k-means) after patient-level aggregation and analysis
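The four pipeline stages can be sketched end to end on toy data. This is a minimal illustration in pure Python, not the pipeline used in the talk: the threshold-based segmentation, the two features (area and mean intensity), the deterministic k-means seeding, and every function name here are assumptions made for the example.

```python
from collections import deque

def normalize(image):
    """Scale pixel intensities to [0, 1] (a stand-in for real normalization)."""
    peak = max(max(row) for row in image) or 1
    return [[px / peak for px in row] for row in image]

def tile(image, size):
    """Split a 2D image (list of rows) into tiles of at most size x size."""
    return [[row[c:c + size] for row in image[r:r + size]]
            for r in range(0, len(image), size)
            for c in range(0, len(image[0]), size)]

def segment(tile_px, threshold):
    """Group 4-connected pixels above threshold into objects (toy nuclei)."""
    rows, cols = len(tile_px), len(tile_px[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or tile_px[r][c] <= threshold:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and tile_px[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(pixels)
    return objects

def features(pixels, tile_px):
    """Two toy morphometric features per object: area and mean intensity."""
    area = len(pixels)
    return (float(area), sum(tile_px[y][x] for y, x in pixels) / area)

def patient_profile(image, tile_size=4, threshold=0.5):
    """Patient-level aggregation: mean feature vector over all objects."""
    feats = [features(obj, t)
             for t in tile(normalize(image), tile_size)
             for obj in segment(t, threshold)]
    return tuple(sum(v) / len(feats) for v in zip(*feats))

def kmeans(points, k, iterations=20):
    """Lloyd's k-means, seeded with the first k points so runs are repeatable."""
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sum((a - b) ** 2
                      for a, b in zip(p, centers[j]))) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers

def make_image(nucleus_size):
    """Synthetic 4x8 image with one square 'nucleus' in each 4x4 tile."""
    img = [[0] * 8 for _ in range(4)]
    for c0 in (0, 4):
        for r in range(nucleus_size):
            for c in range(nucleus_size):
                img[r][c0 + c] = 200
    return img

# Four hypothetical patients: two with small nuclei, two with large ones.
profiles = [patient_profile(make_image(s)) for s in (1, 3, 1, 3)]
labels, centers = kmeans(profiles, k=2)
print(labels)
```

Note that tiling before segmentation splits any object that straddles a tile boundary; the synthetic images avoid that case, but a real pipeline would have to handle it.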
Subsurface Reservoir Management
• Numerical models of porous media
• Fluids flow from one region of the reservoir to another
• Rock and sediment properties change over time
• Simulate multiple realizations of multiple models and management strategies
• Evaluate geologic uncertainty and management strategies simultaneously
• Enable on-demand exploration and comparison of multiple scenarios
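The idea of simulating multiple realizations across models and strategies can be sketched as a simple ensemble sweep. Everything below is hypothetical: `simulate` is a toy stand-in for a reservoir simulator, and the model and strategy parameters (`mean_permeability`, `spread`, `injection_rate`) are invented for illustration only.

```python
import itertools
import random
import statistics

def simulate(model, strategy, seed):
    """Hypothetical stand-in for one simulator run; returns a recovery score."""
    rng = random.Random(seed)  # the seed selects one geologic realization
    permeability = rng.gauss(model["mean_permeability"], model["spread"])
    return max(permeability, 0.0) * strategy["injection_rate"]

def run_ensemble(models, strategies, n_realizations):
    """Run every (model, strategy) pair over the same set of realizations.

    The mean compares management strategies; the standard deviation across
    realizations is a crude measure of geologic uncertainty.
    """
    results = {}
    for (m_name, m), (s_name, s) in itertools.product(models.items(),
                                                      strategies.items()):
        scores = [simulate(m, s, seed) for seed in range(n_realizations)]
        results[(m_name, s_name)] = (statistics.mean(scores),
                                     statistics.stdev(scores))
    return results

# Invented example inputs.
models = {
    "sandstone": {"mean_permeability": 100.0, "spread": 20.0},
    "carbonate": {"mean_permeability": 40.0, "spread": 30.0},
}
strategies = {
    "low_injection": {"injection_rate": 0.5},
    "high_injection": {"injection_rate": 1.5},
}

results = run_ensemble(models, strategies, n_realizations=5)
for scenario, (mean, spread) in sorted(results.items()):
    print(scenario, round(mean, 1), round(spread, 1))
```

Because every scenario is keyed by its (model, strategy) pair, the result table can be queried on demand to compare any two scenarios without rerunning the sweep.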
Challenges
• Spatial-temporal disk-resident, on-the-fly, dynamically updated datasets
• Access and manipulate multiple datasets generated and stored on multiple, distributed systems
• Analysis of raw data can generate millions to trillions of features (e.g., millions of cells and nuclei in high resolution tissue images) to be mined and compared
• Take advantage of hardware platforms for analysis
  • Clusters containing hybrid CPU-GPU nodes
  • Extreme scale machines consisting of hundreds of thousands of CPU cores
  • Systems with deep memory and storage hierarchies
  • Cloud computing platforms
Data Structures: Region Templates
• Describe 2D/3D static and temporal regions.
• Provide a container for points, arrays, regions, and object sets within a spatial and temporal bounding box.
• A region template can represent collections of spatial areas and objects that vary from one another in size and shape, e.g., regions generated by segmenting cells in microscopy images, or man-made structures or hurricanes in satellite imagery.
• Primary datasets are defined as point data elements and arrays, and derived datasets as sets of regions and objects.
• Region templates may be related to one another in a defined manner.
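A region template as described above could be modeled roughly as follows. This is a minimal sketch, not the interface of the actual middleware: the class names, fields, and the `add_object` and `query` methods are all assumptions chosen to mirror the bullet points (bounding box in space and time, primary versus derived data, named relations between templates).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in 2D or 3D plus a closed time interval."""
    lo: Tuple[float, ...]
    hi: Tuple[float, ...]
    t_start: float = 0.0
    t_end: float = 0.0

    def intersects(self, other: "BoundingBox") -> bool:
        spatial = all(a_lo <= b_hi and b_lo <= a_hi
                      for a_lo, a_hi, b_lo, b_hi
                      in zip(self.lo, self.hi, other.lo, other.hi))
        return (spatial and self.t_start <= other.t_end
                and other.t_start <= self.t_end)

    def contains(self, other: "BoundingBox") -> bool:
        spatial = all(a_lo <= b_lo and b_hi <= a_hi
                      for a_lo, a_hi, b_lo, b_hi
                      in zip(self.lo, self.hi, other.lo, other.hi))
        return (spatial and self.t_start <= other.t_start
                and other.t_end <= self.t_end)

@dataclass
class RegionObject:
    """A derived object (e.g. a segmented nucleus) with its own extent."""
    label: str
    bounds: BoundingBox
    payload: Any = None

@dataclass
class RegionTemplate:
    name: str
    bounds: BoundingBox
    # Primary datasets: point data elements and arrays, keyed by name.
    primary: Dict[str, Any] = field(default_factory=dict)
    # Derived datasets: regions and objects of varying size and shape.
    derived: List[RegionObject] = field(default_factory=list)
    # Named relations to other templates (e.g. the next time step).
    related: Dict[str, "RegionTemplate"] = field(default_factory=dict)

    def add_object(self, label, bounds, payload=None):
        """Store a derived object; it must lie inside the template's box."""
        if not self.bounds.contains(bounds):
            raise ValueError(f"{label} lies outside template {self.name}")
        self.derived.append(RegionObject(label, bounds, payload))

    def query(self, window: BoundingBox) -> List[RegionObject]:
        """Return the derived objects whose boxes intersect the window."""
        return [obj for obj in self.derived if obj.bounds.intersects(window)]

# Hypothetical usage: one image tile holding two segmented nuclei.
tile = RegionTemplate("tile-0", BoundingBox((0, 0), (100, 100)))
tile.primary["rgb"] = [[0] * 4 for _ in range(4)]  # placeholder pixel array
tile.add_object("nucleus-1", BoundingBox((10, 10), (20, 20)))
tile.add_object("nucleus-2", BoundingBox((60, 60), (80, 90)))

hits = tile.query(BoundingBox((0, 0), (50, 50)))
print([obj.label for obj in hits])
```

A linear scan is enough for a sketch; at the scales discussed in these slides a spatial index would be needed behind `query`.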
Programming Abstractions and Runtime Middleware Services
• Programming abstractions
  • Multi-level dataflow pipelines
  • MapReduce style programs
  • Spatial query capabilities
• I/O and Storage Services
  • Indexing and metadata management for ensembles of datasets
  • I/O support for retrieving data from multiple storage systems and for streaming data
  • Query capabilities
• Memory Management
  • Careful management and staging of large data structures across memory hierarchies. Masking data movement costs with computation.
• Execution Services
  • Distributing and rearranging computations and data to minimize data movement
  • Coordinated scheduling and mapping of analysis operations to heterogeneous and hybrid (CPU cores and GPUs) systems to increase overall application throughput
  • Quality of service/data requirements
  • Function variants
• Provenance Tracking, Fault-detection and tolerance
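Function variants and scheduling onto hybrid CPU-GPU nodes can be illustrated with a small sketch. It assumes each operation carries one implementation per device kind plus an estimated run time, and it uses a greedy earliest-finish-time heuristic. That heuristic, the `Operation` class, the device names, and the timings are assumptions for illustration, not the scheduler described in the talk; the "GPU" variant is ordinary Python standing in for a kernel.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

def cpu_scale(xs):
    """CPU variant of a toy operation: double every value."""
    return [2 * x for x in xs]

def gpu_scale(xs):
    """Stand-in for a GPU variant; same result, different implementation."""
    return list(map(lambda x: 2 * x, xs))

@dataclass
class Operation:
    name: str
    variants: Dict[str, Callable]   # device kind -> implementation
    est_seconds: Dict[str, float]   # device kind -> estimated run time

def schedule(ops: List[Operation],
             devices: Dict[str, str]) -> Tuple[List[Tuple[str, str]], float]:
    """Greedily map independent operations to devices.

    Each operation goes to the device on which it would finish earliest,
    considering only devices for which it has a variant. Dependencies
    between operations are ignored in this sketch.
    """
    free_at = {dev: 0.0 for dev in devices}
    plan = []
    for op in ops:
        candidates = [d for d, kind in devices.items() if kind in op.variants]
        best = min(candidates,
                   key=lambda d: free_at[d] + op.est_seconds[devices[d]])
        free_at[best] += op.est_seconds[devices[best]]
        plan.append((op.name, best))
    return plan, max(free_at.values())

# Hypothetical node with one CPU core and one GPU, and invented estimates.
devices = {"cpu0": "cpu", "gpu0": "gpu"}
both = {"cpu": cpu_scale, "gpu": gpu_scale}
ops = [
    Operation("segment", both, {"cpu": 8.0, "gpu": 1.0}),
    Operation("features", both, {"cpu": 4.0, "gpu": 2.0}),
    Operation("normalize", both, {"cpu": 1.0, "gpu": 1.0}),
    Operation("classify", {"cpu": cpu_scale}, {"cpu": 2.0}),  # CPU-only
]

plan, makespan = schedule(ops, devices)
by_name = {op.name: op for op in ops}
for op_name, dev in plan:
    variant = by_name[op_name].variants[devices[dev]]
    print(op_name, "->", dev, variant([1, 2, 3]))
print("estimated makespan:", makespan)
```

With these estimates the two operations that benefit most from the GPU land on it, while the CPU picks up the work that gains little, so both devices stay busy instead of queuing everything on the faster one.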