Scientific Data Management Center (ISIC)
http://sdmcenter.lbl.gov contains an extensive publication list
Scientific Data Management Center: Participating Institutions
• Center PI: Arie Shoshani (LBNL)
• DOE Laboratory co-PIs:
  • Bill Gropp, Rob Ross (ANL)
  • Arie Shoshani, Doron Rotem (LBNL)
  • Terence Critchlow, Chandrika Kamath (LLNL)
  • Nagiza Samatova, Andy White (ORNL)
• University co-PIs:
  • Mladen Vouk (North Carolina State)
  • Alok Choudhary (Northwestern)
  • Reagan Moore, Bertram Ludaescher (UC San Diego, SDSC)
  • Calton Pu (Georgia Tech)
  • Steve Parker (U of Utah, future)
Phases of Scientific Exploration: Data Generation
• From large-scale simulations or experiments
• Fast data growth with computational power; examples:
  • HENP: 100 teraops and 10 petabytes by 2006
  • Climate: spatial resolution T42 (280 km) -> T85 (140 km) -> T170 (70 km); T42 is about 1 TB per 100-year run => a factor of ~10-20 growth
• Problems:
  • Can't dump the data to storage fast enough – waste of compute resources
  • Can't move terabytes of data over the WAN robustly – waste of the scientist's time
  • Can't steer the simulation – waste of time and resources
  • Need to reorganize and transform data – large data-intensive tasks slowing progress
Phases of Scientific Exploration: Data Analysis
• Analysis of large data volumes
  • Can't fit all the data in memory
• Problems:
  • Finding the relevant data – need efficient indexing
  • Cluster analysis – need linear scaling
  • Feature selection – need efficient high-dimensional analysis
  • Data heterogeneity – need to combine data from diverse sources
  • Streamlining analysis steps – the output of one step must match the input of the next
[Diagram] Example data flow in the TSI Logistical Network (courtesy: John Blondin)
Goal: Reduce the Data Management Overhead
• Efficiency
  • Example: parallel I/O, indexing, matching storage structures to the application
• Effectiveness
  • Example: access data by attributes, not files; facilitate massive data movement
• New algorithms
  • Example: specialized PCA techniques to separate signals or to achieve better spatial data compression
• Enabling ad-hoc exploration of data
  • Example: an exploratory "run and render" capability to analyze and visualize simulation output while the code is running
Approach: SDM Framework
• Use an integrated framework that:
  • Provides a scientific workflow capability
  • Supports data mining and analysis tools
  • Accelerates storage of and access to data
• Simplify data management tasks for the scientist:
  • Hide details of the underlying parallel and indexing technology
  • Permit assembly of modules using a simple graphical workflow description tool
[Diagram] Three-layer framework connecting the scientific application to scientific understanding: Scientific Process Automation Layer, Data Mining & Analysis Layer, and Storage Efficient Access Layer
Accomplishments: Storage Efficient Access (SEA)
• Developed Parallel netCDF
  • Enables high-performance parallel I/O to netCDF datasets
  • Achieves up to a 10-fold performance improvement over HDF5
  • (a minimal collective-I/O sketch follows this slide)
• Enhanced ROMIO
  • Provides MPI-IO access to PVFS
  • Advanced parallel file system interfaces for more efficient access
• Developed PVFS2 (Parallel Virtual File System: enhancements and deployment)
  • Adds Myrinet GM and InfiniBand support
  • Improved fault tolerance and asynchronous I/O
  • Offered by Dell and HP for clusters
• Deployed an HPSS Storage Resource Manager (SRM) with PVFS
  • Automatic access to HPSS files from PVFS through the MPI-IO library
  • SRM is a middleware component
[Diagram] Before: processes P0-P3 write through serial netCDF to a parallel file system; after: they write through Parallel netCDF. Chart: FLASH I/O benchmark performance (8x8x8 block sizes)
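To make the SEA pattern concrete, here is a minimal sketch of the collective parallel I/O that PnetCDF and ROMIO provide, written with mpi4py rather than the center's C libraries; the file name and array size are illustrative assumptions, not the center's code.

```python
# Minimal sketch of collective parallel I/O in the style PnetCDF/ROMIO
# provide (assumption: mpi4py stand-in, not the center's actual code).
# Each rank writes its contiguous slice of a global array in a single
# collective call, letting the MPI-IO layer merge requests.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local_n = 1024                              # elements per rank (illustrative)
data = np.full(local_n, rank, dtype=np.float64)

fh = MPI.File.Open(comm, "sea_demo.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
offset = rank * local_n * data.itemsize     # byte offset of this rank's slice
fh.Write_at_all(offset, data)               # collective write across all ranks
fh.Close()
```

Run under mpiexec, every rank participates in one collective write, which is what allows the I/O layer to aggregate many small requests into large, efficient operations.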
Robust Multi-file Replication
• Problem: moving thousands of files robustly
  • Takes many hours
  • Needs error recovery (mass storage system failures, network failures)
  • Solution: use Storage Resource Managers (SRMs)
• Problem: too slow
  • Use parallel streams
  • Use concurrent transfers
  • Use large FTP windows
  • Pre-stage files from MSS
  • (a concurrency-plus-retry sketch follows this slide)
[Diagram] DataMover replication from NCAR MSS to LBNL: SRM-COPY requests thousands of files; SRM-GET fetches one file at a time; an SRM performs reads at the source and writes at the target; GridFTP GET (pull mode) moves data between disk caches; files are staged from and archived to MSS.
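As a rough illustration of the SRM ideas above (concurrent transfers plus per-file error recovery), here is a hedged Python sketch; transfer_one is a hypothetical placeholder for a GridFTP or HPSS staging call, not the DataMover's actual interface.

```python
# Sketch of the SRM idea: move thousands of files with bounded
# concurrency and per-file retry. transfer_one() is a hypothetical
# stand-in for a GridFTP / MSS staging call (assumption, not SRM code).
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def transfer_one(path, retries=3, backoff=5.0):
    for attempt in range(1, retries + 1):
        try:
            # ... invoke GridFTP / stage from MSS here (placeholder) ...
            return path
        except OSError:
            if attempt == retries:
                raise                      # give up after the final attempt
            time.sleep(backoff * attempt)  # wait out transient failures

def replicate(files, concurrency=4):
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(transfer_one, f): f for f in files}
        for fut in as_completed(futures):
            (done if fut.exception() is None else failed).append(futures[fut])
    return done, failed
```

The retry-with-backoff loop is what lets a long replication ride out the transient MSS and network failures mentioned above instead of aborting the whole job.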
[Chart] File tracking helps to identify bottlenecks; here it shows that archiving is the bottleneck.
[Chart] File tracking shows recovery from transient failures (total: 45 GB).
Accomplishments: Data Mining and Analysis (DMA)
• Developed Parallel VTK
  • Efficient 2D/3D parallel scientific visualization for netCDF and HDF files
  • Built on top of PnetCDF
• Developed a "region tracking" tool
  • For exploring 2D/3D scientific databases
  • Uses bitmap technology to identify regions based on multi-attribute conditions
• Implemented an Independent Component Analysis (ICA) module
  • Used for accurate signal separation
  • Used for discovering key parameters that correlate with observed data
• Developed highly effective data reduction
  • Achieves a 15-fold reduction with a high level of accuracy
  • Uses parallel Principal Component Analysis (PCA) technology (a serial sketch of the idea follows this slide)
• Developed ASPECT
  • A framework that supports a rich set of pluggable data analysis tools, including all the tools above
  • A rich suite of statistical tools based on the R package
[Figures] Combustion region tracking; El Nino signal (red) and estimation (blue) closely match.
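The data-reduction bullet can be illustrated with a small, serial numpy sketch of PCA-style reduction; this is an assumption-laden stand-in for the center's parallel PCA, and the synthetic matrix has nothing to do with the reported 15-fold result.

```python
# Sketch of PCA-based data reduction (serial numpy stand-in for the
# center's parallel PCA; the matrix here is synthetic).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 150))      # samples x variables (synthetic)

Xc = X - X.mean(axis=0)                    # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 10                                     # keep k principal components
scores = U[:, :k] * s[:k]                  # reduced representation
X_hat = scores @ Vt[:k] + X.mean(axis=0)   # reconstruction from k components

ratio = X.size / (scores.size + Vt[:k].size)   # rough compression factor
err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f"compression ~{ratio:.1f}x, relative error {err:.3f}")
```

Storing only the k component scores and loadings instead of the full matrix is the source of the compression; the accuracy trade-off is governed by how much variance the kept components capture.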
[Diagram] ASPECT Analysis Environment dataflow: select data (temp, pressure) from astro-data where (step=101) and (entropy>1000); take a sample; run an R analysis; run a pVTK filter; visualize the scatter plot in QT. The Data Mining & Analysis Layer hosts the R analysis and pVTK tools, exchanging data through read/write buffers (buffer-name). The Storage Efficient Access Layer serves variable requests (var-names, ranges) via bitmap index selection (use bitmap on a condition) over PVFS and Parallel NetCDF, on top of the hardware, OS, and MSS (HPSS).
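The bitmap index selection in the diagram, e.g., the condition (step=101)(entropy>1000), can be sketched with plain boolean bitmaps; real bitmap indexes of the kind the center built also compress these vectors, which this toy example omits.

```python
# Toy sketch of bitmap-index selection for a multi-attribute condition
# like "step = 101 and entropy > 1000" (uncompressed bitmaps; the
# center's indexes use compressed bitmaps, not shown here).
import numpy as np

rng = np.random.default_rng(1)
step = rng.integers(100, 103, size=1_000_000)       # synthetic attribute
entropy = rng.uniform(0, 2000, size=1_000_000)      # synthetic attribute

# Build one bitmap (boolean vector) per predicate, then AND them.
bm_step = (step == 101)
bm_entropy = (entropy > 1000.0)
hits = bm_step & bm_entropy            # combined bitmap for the condition

row_ids = np.flatnonzero(hits)         # qualifying records to fetch
print(f"{row_ids.size} of {step.size} records qualify")
```

The appeal of the approach is that multi-attribute conditions reduce to cheap bitwise operations, so only the qualifying records ever need to be read from storage.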
Accomplishments: Scientific Process Automation (SPA)
• Unique requirements of scientific workflows
  • Moving large volumes between modules – tightly-coupled, efficient data movement
  • Specification of granularity-based iteration, e.g., in spatio-temporal simulations a time step is a "granule"
  • Support for data transformation of complex data types (including file formats, e.g., netCDF, HDF)
  • Dynamic steering of the workflow by the user
  • Dynamic user examination of results
• Developed a working scientific workflow system (a minimal dataflow sketch follows this slide)
  • Automates microarray analysis
  • Uses web-wrapping tools developed by the center
  • Uses the Kepler workflow engine (an adaptation of Ptolemy, the UC Berkeley tool)
[Screenshots] Workflow steps defined graphically; workflow results presented to the user.
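A minimal sketch of the dataflow idea behind such a workflow system follows: each step's output becomes the next step's input. The three steps are hypothetical placeholders, not the center's microarray modules or Kepler's API.

```python
# Minimal sketch of a linear dataflow workflow: the output of one step
# is the input of the next. All three steps are hypothetical
# placeholders standing in for real workflow modules.
def fetch_microarray(sample_id):
    return {"sample": sample_id, "raw": [0.1, 0.9, 0.4]}   # placeholder data

def normalize(record):
    lo, hi = min(record["raw"]), max(record["raw"])
    record["norm"] = [(v - lo) / (hi - lo) for v in record["raw"]]
    return record

def report(record):
    return f"sample {record['sample']}: {record['norm']}"

def run_workflow(steps, seed):
    out = seed
    for step in steps:          # output of one step feeds the next
        out = step(out)
    return out

print(run_workflow([fetch_microarray, normalize, report], "GSM-001"))
```

A real engine such as Kepler adds what this sketch leaves out: graphical composition, typed ports between modules, iteration over granules, and user steering while the workflow runs.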
Re-applying Technology
SDM technology, developed for one application, can be effectively targeted at many other applications (each entry lists the initial application, then the new applications):
• Parallel NetCDF: Astrophysics -> Climate
• Parallel VTK: Astrophysics -> Climate
• Compressed bitmaps: HENP -> Combustion, Astrophysics
• Storage Resource Managers: HENP -> Astrophysics
• Feature Selection: Climate -> Fusion
• Scientific Workflow: Biology -> Astrophysics (planned)
Broad Impact of the SDM Center
• Astrophysics: high-speed storage technology, Parallel NetCDF, Parallel VTK, and ASPECT integration software used for the Terascale Supernova Initiative (TSI) and FLASH simulations
  Tony Mezzacappa (ORNL), John Blondin (NCSU), Mike Zingale (U of Chicago), Mike Papka (ANL)
• Climate: high-speed storage technology, Parallel NetCDF, and ICA technology used for climate modeling projects
  Ben Santer (LLNL), John Drake (ORNL), John Michalakes (NCAR)
• Combustion: compressed bitmap indexing used for fast generation of flame regions and tracking their progress over time
  Wendy Koegler, Jacqueline Chen (Sandia Lab)
[Figures] ASCI FLASH – parallel NetCDF; dimensionality reduction; region growing.
Broad Impact (cont.)
• Biology: the Kepler workflow system and web-wrapping technology used for executing complex, highly repetitive workflow tasks for processing microarray data
  Matt Coleman (LLNL)
• High Energy Physics: compressed bitmap indexing and Storage Resource Managers used for locating desired subsets of data (events) and automatically retrieving data from HPSS
  Doug Olson (LBNL), Eric Hjort (LBNL), Jerome Lauret (BNL)
• Fusion: a combination of PCA and ICA technology used to identify the key parameters that are relevant to the presence of edge harmonic oscillations in a tokamak
  Keith Burrell (General Atomics)
[Figures] Building a scientific workflow; dynamic monitoring of HPSS file transfers; identifying key parameters for the DIII-D Tokamak.
Goals for Years 4-5
• Fully develop the integrated SDM framework
  • Implement the three-layer framework on the SDM center facility
  • Provide a way to select only the components needed
  • Develop self-guiding web pages on the use of SDM components, using existing successful examples as guides
• Generalize components for reuse
  • Develop general interfaces between components in the layers (a hypothetical interface sketch follows this slide)
  • Support loosely-coupled WSDL interfaces
  • Support tightly-coupled components for efficient dataflow
• Integrate the operation of components in the framework
  • Hide details from the user – automate parallel access and indexing
  • Develop a reusable library of components that can be selected for use in the workflow system
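As one hypothetical way the "general interfaces between components" goal could look, the sketch below gives every component the same small contract so a workflow engine can assemble them freely; all names here are invented for illustration and are not the center's design.

```python
# Hypothetical sketch of a uniform component interface of the kind the
# framework goals describe: each component exposes the same contract,
# so components from any layer can be chained. Names are invented.
from typing import Any, Dict, Protocol

class Component(Protocol):
    name: str
    def run(self, inputs: Dict[str, Any]) -> Dict[str, Any]: ...

class ThresholdSelect:
    """Toy stand-in for an indexing/selection component."""
    name = "threshold-select"
    def run(self, inputs):
        # placeholder: a real component would consult a bitmap index
        return {"row_ids": [i for i, v in enumerate(inputs["values"])
                            if v > inputs["threshold"]]}

def assemble(components, inputs):
    for c in components:                 # loosely-coupled chaining
        inputs.update(c.run(inputs))
    return inputs

print(assemble([ThresholdSelect()], {"values": [1, 5, 3], "threshold": 2}))
```

A shared contract like this is what would let the planned reusable component library plug into the workflow system without per-pair glue code.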