Scientific Data Management Center
(Integrated Software Infrastructure Center – ISIC)
Arie Shoshani
All Hands Meeting, March 26-27, 2002
http://sdm.lbl.gov/sdmcenter (http://sdmcenter.lbl.gov)
Original Goals and Framework
• A coordinated framework for the unification, development, deployment, and reuse of scientific data management software
• Framework
  • 4 areas (+ “glue”): very large, distributed, heterogeneous, data mining (+ agent technology)
  • 4 tier levels: storage, file, dataset, federated data
Task Diagram
Tasks are organized along four tier levels (a–d), cross-cut by five technology thrusts:
1) Storage and retrieval of very large datasets
2) Access optimization of distributed data
3) Data mining and discovery of access patterns
4) Distributed, heterogeneous data access
5) Agent technology

a) Storage Level
  • Parallel I/O: improving parallel access from clusters (ANL, NWU)
  • Adaptive file caching in a distributed system (LBNL)
  • Optimization of low-level data storage, retrieval and transport (ORNL)
b) File Level
  • MPI I/O: implementation based on file-level hints (ANL, NWU) – see the sketch below
  • Low-level API for grid I/O (ANL) [Grid Enabling Technology]
c) Dataset Level
  • Optimizing shared access to tertiary storage (LBNL, ORNL)
  • High-dimensional indexing techniques (LBNL)
  • Multi-agent high-dimensional cluster analysis (ORNL)
  • Dimension reduction and sampling (LLNL, LBNL)
  • Analysis of application-level query patterns (LLNL, NWU)
d) Dataset Federation Level
  • Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech)
  • Knowledge-based federation of heterogeneous databases (SDSC)
Agent technology (thrust 5) spans all levels:
  • Enabling communication among tools and data (ORNL, NCSU)
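One of the file-level tasks above is MPI I/O driven by file-level hints. As a minimal sketch of that idea (not the center's implementation), the C fragment below attaches striping and collective-buffering hints to an MPI_Info object, opens a shared file, and performs a collective write; the hint values are illustrative, and whether they are honored depends on the MPI library (e.g. ROMIO) and the underlying file system.

```c
/* Minimal sketch: collective MPI-IO write with file-level hints.
 * Hint values are illustrative; support depends on the MPI library
 * and file system. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* File-level hints travel in an MPI_Info object. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "4");      /* stripe over 4 I/O servers */
    MPI_Info_set(info, "cb_buffer_size", "1048576"); /* 1 MB collective buffer */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank writes a contiguous block at its own offset. */
    enum { N = 1024 };
    double buf[N];
    for (int i = 0; i < N; i++)
        buf[i] = rank + i * 1e-3;

    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```

The point of the hints interface is that tuning knowledge (striping, buffering) is passed once at file-open time, so application code stays portable while the I/O layer adapts to the storage system.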
Scientific Data Management ISIC – Overview
• Scientific simulations & experiments produce terabytes to petabytes of data, stored on disks and tapes:
  • Climate Modeling
  • Astrophysics
  • Genomics and Proteomics
  • High Energy Physics
• Participants – DOE Labs: ANL, LBNL, LLNL, ORNL; Universities: GTech, NCSU, NWU, SDSC
• SDM-ISIC Technology:
  • Optimizing shared access from mass storage systems
  • Metadata and knowledge-based federations
  • API for Grid I/O
  • High-dimensional cluster analysis
  • High-dimensional indexing
  • Adaptive file caching
  • Agents …
• Today, data manipulation consumes ~80% of scientists' time – getting files from the tape archive, extracting subsets of data from files, reformatting data, getting data from heterogeneous distributed systems, and moving data over the network – leaving only ~20% for scientific analysis and discovery. Using SDM-ISIC technology, the goal is to invert that ratio: ~20% on data manipulation, ~80% on analysis and discovery.
• Goals – optimize and simplify:
  • access to very large datasets
  • access to distributed data
  • access of heterogeneous data
  • data mining of very large datasets
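The high-dimensional indexing work at LBNL listed above is built on bitmap indexes. The following is a minimal sketch of the equality-encoded bitmap idea, not the center's code: one bitmap per attribute value, with range queries answered by bitwise OR. The dataset, cardinality, and query are illustrative.

```c
/* Minimal sketch of equality-encoded bitmap indexing: one bitmap per
 * distinct value of a low-cardinality attribute; a range query is
 * answered by OR-ing the bitmaps of qualifying values. Illustrative
 * only -- real systems compress the bitmaps. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NROWS       64   /* rows in the (tiny) dataset  */
#define CARDINALITY  8   /* attribute takes values 0..7 */
#define NWORDS ((NROWS + 63) / 64)

static uint64_t bitmaps[CARDINALITY][NWORDS];

static void build_index(const int *column)
{
    memset(bitmaps, 0, sizeof bitmaps);
    for (int row = 0; row < NROWS; row++)
        bitmaps[column[row]][row / 64] |= (uint64_t)1 << (row % 64);
}

/* Rows with lo <= value <= hi: OR the bitmaps of qualifying values. */
static void range_query(int lo, int hi, uint64_t *result)
{
    memset(result, 0, NWORDS * sizeof *result);
    for (int v = lo; v <= hi; v++)
        for (int w = 0; w < NWORDS; w++)
            result[w] |= bitmaps[v][w];
}

int main(void)
{
    int column[NROWS];
    for (int row = 0; row < NROWS; row++)
        column[row] = (row * 5) % CARDINALITY;   /* synthetic data */

    build_index(column);

    uint64_t hits[NWORDS];
    range_query(2, 4, hits);                     /* value in [2,4] */

    for (int row = 0; row < NROWS; row++)
        if (hits[row / 64] >> (row % 64) & 1)
            printf("row %d matches (value %d)\n", row, column[row]);
    return 0;
}
```

OR-ing precomputed bitmaps replaces a scan of the column with a handful of word-wide bitwise operations, which is why the technique suits large, append-only scientific datasets.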
Benefits to Applications
• Efficiency
  • Example: by removing I/O bottlenecks, e.g. matching storage structures to the application
• Effectiveness
  • Example: by making access to data from tertiary storage or various sites on the data grid “transparent”, more effective data exploration is possible
• New algorithms
  • Example: by developing a more effective high-dimensional clustering technique for large datasets, discovery of new correlations is possible
• Enabling ad-hoc exploration of data
  • Example: by enabling a “run and render” capability to visualize simulation output while the code is running, it is possible to monitor and steer a long-running simulation
How to execute the plan?
• Executive Committee
  • Made up of the area leaders
• Organize into projects
  • Led by area leaders
  • Common theme: multiple tasks combine toward a common goal
  • All tasks covered (some in more than one project)
  • Initially focus on one primary application area per project (more is better)
  • Focus on one (or more) application scientist contacts
  • Focus on specific scenarios that represent real needs
• Conference calls
  • Every Monday
  • Cycle through projects P1-P4
  • Open to all (Arie & Ekow attend all)
• Quarterly reports
• Half-yearly all-hands meetings
Organization of Projects: P1, P2, P3, P4
(The tasks in the task diagram above are grouped into four projects, detailed on the following slides.)
SDM center Projects and Primary Application Areas
• Organized ourselves into 4 projects
  • (P1) Heterogeneous Data Integration (Biology)
    • LLNL, SDSC, GATECH, NCSU, ORNL
  • (P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
    • LLNL, ORNL, LBNL
  • (P3) Efficient Access from Large Datasets (HENP, Combustion)
    • LBNL, ORNL
  • (P4) Parallel Disk Access & Grid-IO (Astrophysics, Climate)
    • ANL, NWU, LLNL
SDM center Projects and Participants
• (P1) Heterogeneous Data Integration (Biology)
  • LLNL – Terence
  • SDSC – Amarnath, Bertram, Ilkay
  • GATECH – Ling, Calton + students
  • NCSU – Mladen + students
  • ORNL – Tom
• (P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
  • LLNL – Chandrika, Ghaleb, Imola
  • ORNL – Nagiza, George, Tom
  • LBNL – Ekow
• (P3) Efficient Access from Large Datasets (HENP, Combustion)
  • LBNL – John, Ekow, Arie + postdoc
  • ORNL – Randy, Dan
• (P4) Parallel Disk Access & Grid-IO (Astrophysics, Climate)
  • ANL – Bill, Rob, Rajiv
  • NWU – Alok, Wei-keng + students
  • LLNL – Ghaleb
• Area Leader at Large: Tom
SDM center Focus on Real Needs
• Selected specific short-term goals & scenarios
  • (P1) Heterogeneous Data Integration (Biology)
    • Microarray analysis workflow scenario
  • (P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
    • “Run and Render” scenario for Astrophysics
    • Dimensionality reduction for the Climate model
  • (P3) Efficient Access from Large Datasets (HENP)
    • STAR analysis framework
  • (P4) Parallel Disk Access & Grid-IO (Astrophysics, Climate)
    • FLASH codes for Astrophysics
    • NetCDF using MPI-IO for Climate Modeling & Fusion (see the sketch below)
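For the last scenario, here is a minimal sketch of writing a distributed array through a parallel netCDF interface layered on MPI-IO. The API shown follows PnetCDF, the ANL/NWU parallel netCDF library that grew out of this line of work; treat the exact interface as an assumption, and the file, dimension, and variable names as illustrative.

```c
/* Minimal sketch: each rank writes its slice of a 1-D variable through
 * a parallel netCDF interface built on MPI-IO (PnetCDF-style API).
 * File, dimension, and variable names are illustrative. */
#include <mpi.h>
#include <pnetcdf.h>

#define NLOCAL 100   /* values written by each rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Collectively create the file and define its shape. */
    int ncid, dimid, varid;
    ncmpi_create(MPI_COMM_WORLD, "climate.nc", NC_CLOBBER,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs * NLOCAL, &dimid);
    ncmpi_def_var(ncid, "temperature", NC_DOUBLE, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* Each rank fills and collectively writes its contiguous slice. */
    double buf[NLOCAL];
    for (int i = 0; i < NLOCAL; i++)
        buf[i] = 273.15 + rank;              /* synthetic data */

    MPI_Offset start = (MPI_Offset)rank * NLOCAL;
    MPI_Offset count = NLOCAL;
    ncmpi_put_vara_double_all(ncid, varid, &start, &count, buf);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}
```

The appeal for climate codes is that the familiar netCDF data model is kept while all ranks write one self-describing file collectively, instead of funneling data through a single writer.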
SDM center Application Scientist Contacts
• Close collaboration with individuals
  • Matt Coleman – LLNL (Biology)
  • Tony Mezzacappa – ORNL (Astrophysics)
  • Ben Santer – LLNL, John Drake – ORNL (Climate)
  • Doug Olson – LBNL, Wei-Ming Zhang – Kent (HENP)
  • Wendy Koegler – Sandia Labs (Combustion)
  • Mike Papka – ANL (Astrophysics Vis)
  • Mike Zingale – U of Chicago (Astrophysics)
  • John Michalakes – NCAR (Climate)
Organization of Meeting
• First day
  • Applications perspective on data management needs
    • Explain why the need exists
    • Say what hurts the most
  • Technical details of current work and existing software
    • By project
    • Talks led by area leaders
• Second day
  • Discuss and develop plans – 4 breakout sessions
    • Specific technical goals for the next half year
    • SDM-ISIC people involved
    • Application people involved
    • Estimated schedule
    • Longer term projections (2-3 years)
    • Identify potential new applications – future focus
  • Planning
    • Conference calls – reporting
    • Intellectual property
    • CVS repositories
    • Future all-hands, September
Agenda – Morning, Day 1, March 26
8:00  Introduction and opening remarks – Arie Shoshani
8:15  Comments by DOE Program Manager – John Van Rosendale
8:30  Astrophysics Perspective – Tony Mezzacappa, ORNL
9:15  Climate Perspective – John Drake, ORNL
10:00–10:15  Break
10:15 HEP Perspective – Doug Olson, LBNL
11:00 Biology Perspective – Dave Nelson, LLNL
11:45 Putting software into production – Randy Burris, ORNL
12:00 Lunch
Agenda – Afternoon
1:00 PM (P1) Heterogeneous Data Access – Area Leader: Terence Critchlow
  • Supporting Heterogeneous Data Access in Genomics – Presenter: Terence Critchlow
  • Context-sensitive Service Composition for Support of Scientific Workflows – Presenter: Mladen A. Vouk
  • XWRAPComposer: A Wrapper Generation System for Integrating Bioinformatics Data Sources – Presenter: Ling Liu
  • Constructing Workflows by Integrating Interactive Information Sources – Presenters: Amarnath Gupta & Ilkay Altintas
2:00 PM (P2) Data Mining and Access Pattern Discovery – Area Leader: Nagiza Samatova
  • ASPECT: Adaptable Simulation Product Exploration and Control Toolkit – Presenter: Nagiza Samatova
  • Dimension Reduction and Sampling – Presenter: Imola Fodor
  • Discovery of Access Patterns to Scientific Simulation Data – Presenter: Ghaleb Abdulla
3:30 PM (P3) Efficient Access from Large Datasets – Area Leader: Arie Shoshani
  • Supporting Ad-hoc Data Exploration for Large Scientific Databases – Presenter: Arie Shoshani
  • Efficient Bitmap Indexing Techniques for Very Large Datasets – Presenter: John Wu
  • Shared Disk File Caching Taking into Account Delays in Space Reservations, Transfer, and Processing – Presenter: Ekow Otoo
  • Optimizing Shared Access to Tertiary Storage – Presenter: Randy Burris
4:30 PM (P4) Parallel Disk Access & Grid-IO – Area Leaders: Bill Gropp and Alok Choudhary
  • Parallel and Grid I/O Infrastructure – Presenter: Rob Ross
  • Enabling High Performance Application I/O – Presenter: Wei-keng Liao
5:30 PM Comments from application people (1 hour, free-form discussion)
Agenda – Day 2
8:00  Welcome and logistics
8:30  Recap and planning
9:30  Project breakout meetings (2 hours)
  • Specific technical goals for the next half year
  • SDM-ISIC people involved
  • Application people involved
  • Estimated schedule
  • Longer term projections (2-3 years)
  • Identify potential new applications – future focus
Lunch
1:00  Project breakout meetings (2 hours)
3:00  Summary of meetings (2 hours, 30 min per project)
5:00  Conclusion and planning