
Scientific Data Management Center (SDM-ISIC) Arie Shoshani Computing Sciences Directorate



Presentation Transcript


  1. Scientific Data Management Center (SDM-ISIC)
     Arie Shoshani
     Computing Sciences Directorate, Lawrence Berkeley National Laboratory
     http://sdm.lbl.gov/sdmcenter

  2. Participants
     Center Director: Arie Shoshani
     DOE Laboratories:
     • ANL: Bill Gropp <gropp@mcs.anl.gov> (coordinating PI); Rob Ross <rross@mcs.anl.gov>
     • LBNL: Ekow Otoo <ejotoo@lbl.gov>; Arie Shoshani <shoshani@lbl.gov> (coordinating PI)
     • LLNL: Terence Critchlow <critchlow@llnl.gov> (coordinating PI)
     • ORNL: Randy Burris <burrisrd@ornl.gov>; Thomas Potok <potokte@ornl.gov> (coordinating PI)
     Universities:
     • Georgia Institute of Technology: Ling Liu <lingliu@cc.gatech.edu>; Calton Pu <calton.pu@cc.gatech.edu> (coordinating PI)
     • North Carolina State University: Mladen Vouk <vouk@csc.ncsu.edu> (coordinating PI)
     • Northwestern University: Alok Choudhary <choudhar@ece.nwu.edu> (coordinating PI); Wei-Keng Liao <wkliao@ece.nwu.edu>
     • UC San Diego (Supercomputer Center): Amarnath Gupta <gupta@sdsc.edu>; Reagan Moore <moore@sdsc.edu> (coordinating PI)

  3. Original Goals and Framework
     • Coordinated framework for the unification, development, deployment, and reuse of scientific data management software
     • Framework
       • 4 areas: very large databases, distributed databases, heterogeneous databases, data mining (+ agent technology)
       • 4 tier levels: storage level, file level, dataset level, federated data level

  4. Master Diagram
     Areas:
     1) Storage and retrieval of very large datasets
     2) Access optimization of distributed data
     3) Data mining and discovery of access patterns
     4) Distributed, heterogeneous data access
     5) Agent technology
     Levels:
     a) Storage Level
     b) File Level
     c) Dataset Level
     d) Dataset Federation Level
     Projects shown in the diagram:
     • Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech)
     • Knowledge-based federation of heterogeneous databases (SDSC)
     • Analysis of application-level query patterns (LLNL, NWU)
     • Optimizing shared access to tertiary storage (LBNL, ORNL)
     • High-dimensional indexing techniques (LBNL)
     • Multi-agent high-dimensional cluster analysis (ORNL)
     • MPI I/O: implementation based on file-level hints (ANL, NWU)
     • Low level API for grid I/O (ANL)
     • Dimension reduction and sampling (LLNL, LBNL)
     • Parallel I/O: improving parallel access from clusters (ANL, NWU)
     • Adaptive file caching in a distributed system (LBNL)
     • [Grid Enabling Technology]
     • Optimization of low-level data storage, retrieval and transport (ORNL)
     • Enabling communication among tools and data (ORNL, NCSU)

  5. Scientific Data Management ISIC
     • DOE Labs: ANL, LBNL, LLNL, ORNL
     • Universities: GTech, NCSU, NWU, SDSC
     Scientific simulations & experiments (petabytes and terabytes of data on tapes and disks):
     • Climate Modeling
     • Astrophysics
     • Genomics and Proteomics
     • High Energy Physics
     SDM-ISIC Technology:
     • Optimizing shared access from mass storage systems
     • Metadata and knowledge-based federations
     • API for Grid I/O
     • High-dimensional cluster analysis
     • High-dimensional indexing
     • Adaptive file caching
     • Agents
     Current: Data Manipulation ~80% time, Scientific Analysis & Discovery ~20% time
     • Getting files from Tape archive
     • Extracting subset of data from files
     • Reformatting data
     • Getting data from heterogeneous, distributed systems
     • Moving data over the network
     Goal: Data Manipulation ~20% time (using SDM-ISIC technology), Scientific Analysis & Discovery ~80% time
     Goals
     • Optimize and simplify:
       • access to very large datasets
       • access to distributed data
       • access of heterogeneous data
       • data mining of very large datasets

  6. Benefits to Applications
     • Efficiency
       • Example: by removing I/O bottlenecks, matching storage structures to the application
     • Effectiveness
       • Example: by making access to data from tertiary storage or various sites on the data grid “transparent”, more effective data exploration is possible
     • New algorithms
       • Example: by developing a more effective high-dimensional clustering technique for large datasets, discovery of new correlations is possible
     • Enabling ad-hoc exploration of data
       • Example: by enabling a “run and render” capability to visualize simulation output while the code is running, it is possible to monitor and steer a long-running simulation

  7. Current Projects
     1) High-Dimensional Clustering
     • Target applications: Astrophysics, Climate Modeling
     • LLNL, ORNL
     • Scientific problem targeted: To understand the mechanism(s) behind core-collapse supernovae it is crucial to explore and quantify:
       • The correlations between the neutrino flux and stellar core convection
       • The correlations between convection and spatial dimensionality
       • The correlations between convection and rotation
       • Contact: Anthony Mezzacappa, ORNL
     • Scientific problem targeted: Separating volcano and ENSO (El Niño Southern Oscillation) signals from the rest of the climate data to study variability in temperature
       • Contact: Ben Santer, PCMDI, LLNL

  8. Current Projects
     2) Efficient Parallel I/O to Disk Storage
     • Target application: Astrophysics
     • ANL, NWU, LLNL
     • Scientific problem targeted: Astrophysics simulation code (FLASH): early production runs spent as much as half of the time writing checkpoint and visualization data
       • Contact: Mike Zingale, U of Chicago
     • Scientific problem targeted: Improving parallel I/O efficiency for tiled displays, a popular medium for collaborative viewing of high-resolution visualizations of astrophysics data
       • Contact: Mike Papka, ANL
     • Scientific problem targeted: Query pattern analysis for astrophysics star data, devising a disk layout for the data such that overall data access time across multiple applications and users is reduced
       • Contact: LLNL

  9. Current Projects
     3) Providing transparent access to grid data
     • Target application: High Energy Physics
     • LBNL, ORNL
     • Scientific problem targeted: Given a logical request (expressed on event attributes), get relevant data from grid sites and tertiary storage to application code without human intervention
       • Contact: Doug Olson, LBNL
       • Contact: Stephen Gowdy, SLAC
       • Contact: Jackie Chan, Sandia Livermore (combustion)

  10. Current Projects
      4) Heterogeneous Data Federation
      • Target application: Biology
      • LLNL, SDSC, GTU, NCSU, ORNL
      • Scientific problem targeted: Developing our infrastructure in support of cancer researchers at LLNL, who expect to use it to help identify genes that respond to low doses of radiation. This problem is difficult because the information required by the scientists is spread across many independent, web-based data sources, each using its own interfaces and data formats
        • Contact: Matt Coleman, LLNL
