Scientific Data Management Center (SDM-ISIC) Arie Shoshani Computing Sciences Directorate Lawrence Berkeley National Laboratory http://sdm.lbl.gov/sdmcenter
Participants Center Director: Arie Shoshani DOE Laboratories: ANL: Bill Gropp <gropp@mcs.anl.gov> (coordinating PI) Rob Ross <rross@mcs.anl.gov> LBNL: Ekow Otoo <ejotoo@lbl.gov> Arie Shoshani <shoshani@lbl.gov> (coordinating PI) LLNL: Terence Critchlow <critchlow@llnl.gov> (coordinating PI) ORNL: Randy Burris <burrisrd@ornl.gov> Thomas Potok <potokte@ornl.gov> (coordinating PI) Universities: Georgia Institute of Technology Ling Liu <lingliu@cc.gatech.edu> Calton Pu <calton.pu@cc.gatech.edu> (coordinating PI) North Carolina State University Mladen Vouk <vouk@csc.ncsu.edu> (coordinating PI) Northwestern University Alok Choudhary <choudhar@ece.nwu.edu> (coordinating PI) Wei-Keng Liao <wkliao@ece.nwu.edu> UC San Diego (Supercomputer Center): Amarnath Gupta <gupta@sdsc.edu> Reagan Moore <moore@sdsc.edu> (coordinating PI)
Original Goals and Framework • Coordinated framework for the unification, development, deployment, and reuse of scientific data management software • Framework • 4 areas • Very large databases • Distributed databases • Heterogeneous databases • Data mining • (+ agent technology) • 4 tier levels • Storage level • File level • Dataset level • Federated data level
Master Diagram
Areas: 1) Storage and retrieval of very large datasets 2) Access optimization of distributed data 3) Data mining and discovery of access patterns 4) Distributed, heterogeneous data access
d) Dataset Federation Level • Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech) • Knowledge-based federation of heterogeneous databases (SDSC)
c) Dataset Level • Analysis of application-level query patterns (LLNL, NWU) • Optimizing shared access to tertiary storage (LBNL, ORNL) • High-dimensional indexing techniques (LBNL) • Multi-agent high-dimensional cluster analysis (ORNL)
b) File Level • MPI I/O: implementation based on file-level hints (ANL, NWU) • Low level API for grid I/O (ANL) • Dimension reduction and sampling (LLNL, LBNL)
a) Storage Level • Parallel I/O: improving parallel access from clusters (ANL, NWU) • Adaptive file caching in a distributed system (LBNL) • [Grid Enabling Technology] • Optimization of low-level data storage, retrieval and transport (ORNL)
5) Agent technology • Enabling communication among tools and data (ORNL, NCSU)
Scientific Data Management ISIC
[Figure: scientific simulations and experiments producing terabytes to petabytes of data held on tapes and disks]
• DOE Labs: ANL, LBNL, LLNL, ORNL • Universities: GTech, NCSU, NWU, SDSC
Scientific Simulations & Experiments • Climate Modeling • Astrophysics • Genomics and Proteomics • High Energy Physics
SDM-ISIC Technology • Optimizing shared access from mass storage systems • Metadata and knowledge-based federations • API for Grid I/O • High-dimensional cluster analysis • High-dimensional indexing • Adaptive file caching • Agents
Current • Data Manipulation: ~80% time • Getting files from tape archive • Extracting subset of data from files • Reformatting data • Getting data from heterogeneous, distributed systems • Moving data over the network • Scientific Analysis & Discovery: ~20% time
Goal • Data Manipulation: ~20% time • Using SDM-ISIC technology • Scientific Analysis & Discovery: ~80% time
Goals • Optimize and simplify: • access to very large datasets • access to distributed data • access of heterogeneous data • data mining of very large datasets
Benefits to Applications • Efficiency • Example: by removing I/O bottlenecks – matching storage structures to the application • Effectiveness • Example: by making access to data from tertiary storage or various sites on the data grid “transparent”, more effective data exploration is possible • New algorithms • Example: by developing a more effective high-dimensional clustering technique for large datasets, discovery of new correlations is possible • Enabling ad-hoc exploration of data • Example: by enabling a “run and render” capability to visualize simulation output while the code is running, it is possible to monitor and steer a long-running simulation
Current Projects 1) High-Dimensional Clustering • Target applications: Astrophysics, Climate Modeling • LLNL, ORNL • Scientific problem targeted: To understand the mechanism(s) behind core-collapse supernovae it is crucial to explore and quantify: • The correlations between the neutrino flux and stellar core convection • The correlations between convection and spatial dimensionality • The correlations between convection and rotation • Contact: Anthony Mezzacappa, ORNL • Scientific problem targeted: Separating volcano and ENSO (El Niño-Southern Oscillation) signals from the rest of the climate data to study variability in temperature • Contact: Ben Santer, PCMDI, LLNL
Current Projects 2) Efficient Parallel I/O to Disk Storage • Target application: Astrophysics • ANL, NWU, LLNL • Scientific problem targeted: Astrophysics simulation code (FLASH): early production runs spent as much as half of their time writing checkpoint and visualization data • Contact: Mike Zingale, U of Chicago • Scientific problem targeted: Improving parallel I/O efficiency for tiled displays, a popular medium for collaborative viewing of high-resolution astrophysics visualization data • Contact: Mike Papka, ANL • Scientific problem targeted: Query pattern analysis for astrophysics star data, used to devise a disk layout that reduces overall data access time across multiple applications and users • Contact: LLNL
Current Projects 3) Providing transparent access to grid data • Target application: High Energy Physics • LBNL, ORNL • Scientific problem targeted: given a logical request (expressed on event attributes), get relevant data from grid sites and tertiary storage to application code without human intervention • Contact: Doug Olson, LBNL • Contact: Stephen Gowdy, SLAC • Contact: Jackie Chan, Sandia Livermore (combustion)
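The idea of turning a logical request on event attributes into a set of files to fetch can be sketched as follows. This is only an illustration under stated assumptions, not the project's software: the event index, the replica catalog, the site names, the attribute names (`energy`, `n_tracks`) and the "prefer disk over tape" rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str    # hypothetical grid site name
    medium: str  # "disk" or "tape" (tertiary storage)


# Hypothetical event index: event id -> (event attributes, file holding the event)
EVENT_INDEX = {
    1: ({"energy": 12.5, "n_tracks": 3}, "run01.dat"),
    2: ({"energy": 48.0, "n_tracks": 7}, "run01.dat"),
    3: ({"energy": 51.2, "n_tracks": 9}, "run02.dat"),
    4: ({"energy": 75.9, "n_tracks": 2}, "run03.dat"),
}

# Hypothetical replica catalog: file -> copies available on the grid
REPLICA_CATALOG = {
    "run01.dat": [Replica("site_a", "tape"), Replica("site_b", "disk")],
    "run02.dat": [Replica("site_a", "tape")],
    "run03.dat": [Replica("site_c", "disk")],
}


def plan_request(predicate):
    """Map a predicate on event attributes to one replica per needed file."""
    # Step 1: find the files that hold at least one qualifying event.
    files = sorted({f for attrs, f in EVENT_INDEX.values() if predicate(attrs)})
    plan = {}
    for f in files:
        # Step 2: pick a replica, preferring disk over tape
        # (False sorts before True, so disk replicas win).
        plan[f] = min(REPLICA_CATALOG[f], key=lambda r: r.medium != "disk")
    return plan


# The application states only what it wants, not where the data lives.
plan = plan_request(lambda a: a["energy"] > 40 and a["n_tracks"] >= 5)
for name, replica in plan.items():
    print(name, "<-", replica.site, replica.medium)
```

The application code never names a site or a storage medium; a real system would also have to schedule tape staging and network transfers, which this sketch leaves out.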
Current Projects 4) Heterogeneous Data Federation • Target application: Biology • LLNL, SDSC, GTech, NCSU, ORNL • Scientific problem targeted: Developing our infrastructure in support of cancer researchers at LLNL, who expect to use it to help identify genes that respond to low doses of radiation. This problem is difficult because the information required by the scientists is spread across many independent, web-based data sources, each using its own interface and data format • Contact: Matt Coleman, LLNL
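One common way to approach sources that each have their own interface and format is a wrapper per source that maps records into a shared schema, followed by a merge. The sketch below assumes that approach; it is not a description of the center's infrastructure. The two payloads, the gene names, the field names and the values are invented placeholders, not real data.

```python
import csv
import io
import json

# Hypothetical payloads standing in for two independent web sources.
# Names and numbers are placeholders made up for this example.
SOURCE_A_CSV = "gene,fold_change\nGENE_A,2.4\nGENE_B,3.1\nGENE_C,1.0\n"
SOURCE_B_JSON = (
    '[{"symbol": "GENE_B", "pathway": "cell cycle"},'
    ' {"symbol": "GENE_A", "pathway": "apoptosis"}]'
)


def wrap_source_a(text):
    """Wrapper for the CSV source: gene name -> {'fold_change': float}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["gene"]: {"fold_change": float(row["fold_change"])} for row in reader}


def wrap_source_b(text):
    """Wrapper for the JSON source: gene name -> {'pathway': str}."""
    return {rec["symbol"]: {"pathway": rec["pathway"]} for rec in json.loads(text)}


def federate(*wrapped_sources):
    """Merge wrapped sources into one record per gene, keyed by gene name."""
    merged = {}
    for source in wrapped_sources:
        for gene, fields in source.items():
            merged.setdefault(gene, {}).update(fields)
    return merged


def genes_above(merged, min_fold_change):
    """Single query over the federated view, independent of source formats."""
    return sorted(
        gene
        for gene, rec in merged.items()
        if rec.get("fold_change", 0.0) >= min_fold_change
    )


merged = federate(wrap_source_a(SOURCE_A_CSV), wrap_source_b(SOURCE_B_JSON))
responsive = genes_above(merged, 2.0)
print(responsive)
```

Adding a new source in this scheme means writing one more wrapper; the query code stays unchanged.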