
Scientific Data Management Center (Integrated Software Infrastructure Center – ISIC) Arie Shoshani




  1. Scientific Data Management Center (Integrated Software Infrastructure Center – ISIC)
     Arie Shoshani
     All Hands Meeting, March 26-27, 2002
     http://sdm.lbl.gov/sdmcenter (http://sdmcenter.lbl.gov)

  2. Original Goals and Framework
     • Coordinated framework for the unification, development, deployment, and reuse of scientific data management software
     • Framework
       • 4 areas (+ “glue”): very large, distributed, heterogeneous, data mining (+ agent technology)
       • 4 tier levels: storage, file, dataset, federated data

  3. Task Diagram
     Areas
     1) Storage and retrieval of very large datasets
     2) Access optimization of distributed data
     3) Data mining and discovery of access patterns
     4) Distributed, heterogeneous data access
     5) Agent technology
     Levels
     a) Storage Level
     b) File Level
     c) Dataset Level
     d) Dataset Federation Level
     Tasks
     • Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech)
     • Knowledge-based federation of heterogeneous databases (SDSC)
     • Analysis of application-level query patterns (LLNL, NWU)
     • Optimizing shared access to tertiary storage (LBNL, ORNL)
     • High-dimensional indexing techniques (LBNL)
     • Multi-agent high-dimensional cluster analysis (ORNL)
     • MPI I/O: implementation based on file-level hints (ANL, NWU)
     • Low level API for grid I/O (ANL)
     • Dimension reduction and sampling (LLNL, LBNL)
     • Parallel I/O: improving parallel access from clusters (ANL, NWU)
     • Adaptive file caching in a distributed system (LBNL) [Grid Enabling Technology]
     • Optimization of low-level data storage, retrieval and transport (ORNL)
     • Enabling communication among tools and data (ORNL, NCSU)

  4. Scientific Data Management ISIC
     [Diagram: scientific simulations and experiments; tapes and disks holding terabytes to petabytes of data]
     Participants
     • DOE Labs: ANL, LBNL, LLNL, ORNL
     • Universities: GTech, NCSU, NWU, SDSC
     Application areas
     • Climate Modeling
     • Astrophysics
     • Genomics and Proteomics
     • High Energy Physics
     SDM-ISIC Technology
     • Optimizing shared access from mass storage systems
     • Metadata and knowledge-based federations
     • API for Grid I/O
     • High-dimensional cluster analysis
     • High-dimensional indexing
     • Adaptive file caching
     • Agents …
     Current
     • Data Manipulation: ~80% time
       • Getting files from tape archive
       • Extracting subset of data from files
       • Reformatting data
       • Getting data from heterogeneous, distributed systems
       • Moving data over the network
     • Scientific Analysis & Discovery: ~20% time
     Goal
     • Data Manipulation: ~20% time
       • Using SDM-ISIC technology
     • Scientific Analysis & Discovery: ~80% time
     Goals
     • Optimize and simplify:
       • access to very large datasets
       • access to distributed data
       • access of heterogeneous data
       • data mining of very large datasets

  5. Benefits to Applications
     • Efficiency
       • Example: by removing I/O bottlenecks – matching storage structures to the application
     • Effectiveness
       • Example: by making access to data from tertiary storage or various sites on the data grid “transparent”, more effective data exploration is possible
     • New algorithms
       • Example: by developing a more effective high-dimensional clustering technique for large datasets, discovery of new correlations is possible
     • Enabling ad-hoc exploration of data
       • Example: by enabling a “run and render” capability to visualize simulation output while the code is running, it is possible to monitor and steer a long-running simulation

  6. How to execute plan?
     • Executive Committee
       • Made of area leaders
     • Organize into projects
       • Led by area leaders
       • Common theme: multiple tasks combine into common goal
       • All tasks covered (some in more than one project)
       • Initially focus on one primary application area (more is better)
       • Focus on one (or more) application scientist contacts
       • Focus on specific scenarios that represent real needs
     • Conference calls
       • Every Monday
       • Cycle on Projects P1-P4
       • Open to all
       • (Arie & Ekow attend all)
     • Quarterly reports
     • Half-yearly all-hands

  7. Organization of Projects: P1, P2, P3, P4
     [Diagram: the task diagram of slide 3 (areas 1-5, levels a-d, and the same task list), repeated under this title]

  8. SDM center Projects and Primary Application Areas
     • Organized ourselves into 4 projects
       • (P1) Heterogeneous Data Integration (biology)
         • LLNL, SDSC, GATECH, NCSU, ORNL
       • (P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
         • LLNL, ORNL, LBNL
       • (P3) Efficient Access from Large Datasets (HENP, Combustion)
         • LBNL, ORNL
       • (P4) Parallel Disk Access & Grid-IO (Astrophysics, Climate)
         • ANL, NWU, LLNL

  9. SDM center Projects and Primary Application Areas
     • Organized ourselves into 4 projects
       • (P1) Heterogeneous Data Integration (biology)
         • LLNL – Terence
         • SDSC – Amarnath, Bertram, Ilkay
         • GATECH – Ling, Calton + students
         • NCSU – Mladen + students
         • ORNL – Tom
       • (P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
         • LLNL – Chandrika, Ghaleb, Imola
         • ORNL – Nagiza, George, Tom
         • LBNL – Ekow

  10. SDM center Projects and Primary Application Areas
     • Organized ourselves into 4 projects
       • (P3) Efficient Access from Large Datasets (HENP, Combustion)
         • LBNL – John, Ekow, Arie + postdoc
         • ORNL – Randy, Dan
       • (P4) Parallel Disk Access & Grid-IO (Astrophysics, Climate)
         • ANL – Bill, Rob, Rajiv
         • NWU – Alok, Wei-keng + students
         • LLNL – Ghaleb
       • Area leader at large
         • Tom

  11. SDM center Focus on real needs
     • Selected specific short term goals & scenarios
       • (P1) Heterogeneous Data Integration (biology)
         • Microarray analysis workflow scenario
       • (P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
         • “Run and Render” scenario for Astrophysics
         • Dimensionality reduction for Climate model
       • (P3) Efficient Access from Large Datasets (HENP)
         • STAR analysis framework
       • (P4) Parallel Disk Access & Grid-IO (Astrophysics, Climate)
         • FLASH codes for Astrophysics
         • NetCDF using MPI-IO for Climate Modeling & Fusion

  12. SDM center Application Scientists Contacts
     • Close collaboration with individuals
       • Matt Coleman – LLNL (Biology)
       • Tony Mezzacappa – ORNL (Astrophysics)
       • Ben Santer – LLNL, John Drake – ORNL (Climate)
       • Doug Olson – LBNL, Wei-Ming Zhang – Kent (HENP)
       • Wendy Koegler – Sandia L. (Combustion)
       • Mike Papka – ANL (Astrophysics Vis)
       • Mike Zingale – U of Chicago (Astrophysics)
       • John Michalakes – NCAR (Climate)

  13. Organization of Meeting
     • First day
       • Applications perspective on data management needs
         • Explain why the need
         • Say what hurts the most
       • Technical details of current work and existing software
         • By project
         • Talks led by Area Leaders
     • Second day
       • Discuss and develop plans – 4 breakout sessions
         • Specific technical goals in next half year
         • SDM-ISIC people involved
         • Application people involved
         • Estimated schedule
         • Longer term projections (2-3 years)
         • Identify potential new applications – future focus
       • Planning
         • Conference calls – reporting
         • Intellectual property
         • CVS repositories
         • Future all-hands, September

  14. Agenda - Morning, Day 1, March 26
      8:00          Introduction and opening remarks    Arie Shoshani
      8:15          Comments by DOE Program Manager     John Van Rosendale
      8:30          Astrophysics Perspective            Tony Mezzacappa, ORNL
      9:15          Climate Perspective                 John Drake, ORNL
     10:00 – 10:15  Break
     10:15          HEP Perspective                     Doug Olson, LBNL
     11:00          Biology Perspective                 Dave Nelson, LLNL
     11:45          Putting software into production    Randy Burris, ORNL
     12:00          Lunch

  15. Agenda – Afternoon
     • 1:00 PM (P1) Heterogeneous Data Access
       Area Leader: Terence Critchlow
       - Supporting Heterogeneous Data Access in Genomics
         Presenter: Terence Critchlow
       - Context-sensitive Service Composition for Support of Scientific Workflows
         Presenter: Mladen A. Vouk
       - XWRAPComposer: A Wrapper Generation System for Integrating Bioinformatics Data Sources
         Presenter: Ling Liu
       - Constructing Workflows by Integrating Interactive Information Sources
         Presenters: Amarnath Gupta & Ilkay Altintas
     • 2:00 PM (P2) Data Mining and Access Pattern Discovery
       Area Leader: Nagiza Samatova
       - ASPECT: Adaptable Simulation Product Exploration and Control Toolkit
         Presenter: Nagiza Samatova
       - Dimension Reduction and Sampling
         Presenter: Imola Fodor
       - Discovery of Access Patterns to Scientific Simulation Data
         Presenter: Ghaleb Abdulla
     • 3:30 PM (P3) Efficient Access from Large Datasets
       Area Leader: Arie Shoshani
       - Supporting Ad-hoc Data Exploration for Large Scientific Databases
         Presenter: Arie Shoshani
       - Efficient Bitmap Indexing Techniques for Very Large Datasets
         Presenter: John Wu
       - Shared Disk File Caching Taking into Account Delays in Space Reservations, Transfer, and Processing
         Presenter: Ekow Otoo
       - Optimizing Shared Access to Tertiary Storage
         Presenter: Randy Burris
     • 4:30 PM (P4) Parallel Disk Access & Grid-IO
       Area Leaders: Bill Gropp and Alok Choudhary
       - Parallel and Grid I/O Infrastructure
         Presenter: Rob Ross
       - Enabling High Performance Application I/O
         Presenter: Wei-keng Liao
     • 5:30 PM Comments from application people (1 hour, free-form discussion)

  16. Agenda – Day 2
     • 8:00 Welcome and logistics
     • 8:30 Recap and planning
     • 9:30 Project breakout meetings (2 hours)
       • Specific technical goals in next half year
       • SDM-ISIC people involved
       • Application people involved
       • Estimated schedule
       • Longer term projections (2-3 years)
       • Identify potential new applications – future focus
     • Lunch
     • 1:00 Project breakout meetings (2 hours)
     • 3:00 Summary of meetings (2 hours)
       • (30 min per project)
     • 5:00 Conclusion and planning
