Scientific Data Management Center All Hands Meeting March 2-3, 2005 Salt Lake City
Scientific Data Management Center Participating Institutions • Center PI: • Arie Shoshani LBNL • DOE Laboratories co-PIs: • Bill Gropp, Rob Ross* ANL • Arie Shoshani, Doron Rotem LBNL • Terence Critchlow*, Chandrika Kamath LLNL • Nagiza Samatova* ORNL • Universities co-PIs : • Mladen Vouk North Carolina State • Alok Choudhary Northwestern • Reagan Moore, Bertram Ludaescher UC San Diego (SDSC) and UC Davis • Steve Parker U of Utah • * Area Leaders
Agenda
Session 1 (morning of first day): status reports: SEA, DMA, SPA
8:15 – 10:00
10:00 Break
10:30 – 12:00
12:00 Lunch
Session 2 (afternoon of first day): Application talks
Topics: talks by application people working with us, success stories, needs, bottlenecks, imagined new uses of SDM technologies
1:30 – 3:00 Eric Myra – Astrophysics; Jackie Chen – Combustion; Scott Klasky – Fusion
3:00 – 3:30 Break
3:30 – 5:00 Jerome Lauret – High Energy Physics; Elliot Peele – Astrophysics; Wes Bethel – Visualization
Session 3 (morning of second day): Panel with applications people
Moderator: Doron Rotem
8:30 – 10:00 Panel, part 1: End-to-end use cases vs. technologies
10:00 – 10:30 Break
10:30 – 12:00 Panel, part 2: Engaging the sciences
Session 4 (afternoon of second day)
1:30 – 4:00 Future planning
Topics: discussion of future plans, including integration with other ISICs, considerations for new technology areas, the role of universities, and planning for the proposal
(Official end of meeting)
4:00 – 4:30 Break
4:30 – 6:00 Informal meetings/discussions
A Typical SDM Scenario (layered diagram)
• Control Flow Layer: Task A: generate time steps (flow tier); Task B: move time steps to the work tier; Task C: analyze time steps; Task D: visualize time steps
• Applications & Software Tools Layer: simulation program, Data Mover, post-processing, Parallel R, terascale browser
• I/O System Layer: HDF5 libraries, Parallel NetCDF, PVFS, Sabul, SRM
• Storage & Network Resources Layer
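As a minimal Python sketch of this scenario, the control-flow layer below drives tasks A through D for each time step. The executable names (./simulate, ./analyze, ./render) and file layout are hypothetical placeholders, not the center's actual tools; a real deployment would use the workflow engine, Data Mover/SRM, and visualization components listed in the layers above.

```python
# Hypothetical sketch of the control-flow layer driving tasks A-D per time step.
# Executable names and paths are illustrative placeholders only.
import shutil
import subprocess
from pathlib import Path

def generate_time_step(step: int, flow_dir: Path) -> Path:
    """Task A: the simulation writes one time-step file on the flow tier."""
    ts_file = flow_dir / f"timestep_{step:05d}.h5"
    subprocess.run(["./simulate", "--step", str(step), "--out", str(ts_file)], check=True)
    return ts_file

def move_time_step(ts_file: Path, work_dir: Path) -> Path:
    """Task B: move the time step to the work tier (stand-in for the Data Mover/SRM)."""
    dest = work_dir / ts_file.name
    shutil.copy2(ts_file, dest)
    return dest

def analyze_time_step(ts_file: Path) -> Path:
    """Task C: post-process the time step (stand-in for Parallel R / analysis tools)."""
    summary = ts_file.with_suffix(".json")
    subprocess.run(["./analyze", str(ts_file), "--summary", str(summary)], check=True)
    return summary

def visualize_time_step(ts_file: Path) -> None:
    """Task D: render the time step (stand-in for the terascale browser / pVTK)."""
    subprocess.run(["./render", str(ts_file)], check=True)

def run_scenario(n_steps: int, flow_dir: Path, work_dir: Path) -> None:
    for step in range(n_steps):
        ts = generate_time_step(step, flow_dir)
        moved = move_time_step(ts, work_dir)
        analyze_time_step(moved)
        visualize_time_step(moved)
```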
Technology Details by Layer
• Scientific Process Automation (SPA) Layer: workflow management tools; web wrapping tools
• Data Mining & Analysis (DMA) Layer: efficient indexing (Bitmap Index); data analysis tools (PCA, ICA); ASPECT data integration framework; Parallel R statistical analysis; efficient parallel visualization (pVTK)
• Storage Efficient Access (SEA) Layer: Storage Resource Manager (to HPSS); Parallel Virtual File System; Parallel NetCDF software; ROMIO MPI-IO
• Hardware, OS, and MSS (HPSS)
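To make the SEA layer concrete, here is a toy example of the kind of coordinated parallel I/O that ROMIO/MPI-IO provides (and that Parallel NetCDF builds on), written with mpi4py. The file name, sizes, and layout are made up for illustration.

```python
# Toy collective parallel write: each MPI rank writes its own contiguous block.
# Illustrates the MPI-IO idea the SEA layer builds on; not a production code.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_local = 1000                               # made-up per-rank element count
data = np.full(n_local, rank, dtype="i")     # each rank writes its rank id

fh = MPI.File.Open(comm, "output.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)
offset = rank * n_local * data.itemsize      # byte offset of this rank's block
fh.Write_at_all(offset, data)                # collective write (all ranks together)
fh.Close()
```

Run with, for example, `mpiexec -n 4 python write_blocks.py` (script name assumed).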
Applications Panel, part 1: end-to-end use cases vs. SDM technologies • Scientific Exploration Phases • Data Generation • Post-processing / summarization • Data Analysis • End-to-end use cases • For each phase • A combination of phases • What SDM technologies are needed/applicable • Workflow and dataflow • Efficient I/O from/to disk and tertiary storage • Searching and indexing • General analysis and visualization tools • Large-scale data movement • Metadata management • Missing topic?
Phases of Scientific Exploration • Data Generation • From large-scale simulations or experiments • Fast data growth with computational power • Examples • HENP: 100 Teraops and 10 Petabytes by 2006 • Climate: spatial resolution T42 (280 km) -> T85 (140 km) -> T170 (70 km); T42 is about 1 TB per 100-year run => a factor of ~10-20 more data • Problems • Can’t dump the data to storage fast enough – waste of compute resources • Can’t move terabytes of data over WAN robustly – waste of scientist’s time • Can’t steer the simulation – waste of time and resources
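As a concrete illustration of the data-generation phase, below is a minimal sketch of writing one simulation time step with HDF5 via h5py (one of the I/O libraries in the SDM stack, alongside Parallel NetCDF and MPI-IO). The dataset name, grid size, and chunking choices are hypothetical.

```python
# Illustrative only: write one simulation time step to HDF5 with h5py.
# A production code would use parallel HDF5 or Parallel NetCDF over MPI-IO
# so that writing keeps up with the simulation.
import numpy as np
import h5py

nx, ny, nz = 128, 128, 128              # made-up grid size
density = np.random.rand(nx, ny, nz)    # stand-in for simulation output

with h5py.File("timestep_00042.h5", "w") as f:
    dset = f.create_dataset(
        "density",
        data=density,
        chunks=(64, 64, 64),            # chunked layout helps later partial reads
        compression="gzip",             # trades CPU for bandwidth to storage
    )
    dset.attrs["step"] = 42             # minimal metadata for later searching
    dset.attrs["time"] = 0.42
```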
Phases of Scientific Exploration • Post-processing/summarization • Process raw data from experiments/simulations • May generate as much data as the original raw data • e.g. HENP: process detectors’ raw data to produce “tracks”, “vertices”, etc. • e.g. Climate: generate vertical organization of data, from: time-step -> space points -> all variables, to: variable -> time point -> all space, or: variable -> space points -> all times • Summarization • Produce high-level summaries for preliminary analysis and/or efficient search • e.g. HENP: “total_energy”, “number_of_particles” per “event” • Produce summarization over space/time for coarse analysis • e.g. Climate: generate “monthly means” • Problems • Large-volume “read” -> large-volume “write” • Summarization requires good metadata • Need to produce indexes to search over large data • Need to reorganize and transform data – large data-intensive tasks
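A hedged NumPy sketch of the two summarization/reorganization ideas above: computing climate "monthly means" and reorganizing time-major data so each grid point's full time series is contiguous. Array shapes and the 30-day months are simplifications for illustration.

```python
# Didactic sketch of summarization and data reorganization; shapes are made up.
import numpy as np

# daily_data[time, lat, lon]: one year of daily values for one climate variable
daily_data = np.random.rand(360, 64, 128)

# Summarization: 12 months of 30 days, averaged over the days of each month
monthly_means = daily_data.reshape(12, 30, 64, 128).mean(axis=1)   # (12, 64, 128)

# Reorganization: variable -> space points -> all times
# (move the time axis last so each grid point's time series is contiguous)
per_point_series = np.moveaxis(daily_data, 0, -1)                  # (64, 128, 360)
```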
Phases of Scientific Exploration • Data Analysis • Analysis of large data volume • Can’t fit all data in memory • Problems • Find the relevant data – need efficient indexing • Cluster analysis – need linear scaling • Feature selection – efficient high-dimensional analysis • Data heterogeneity – combine data from diverse sources • Streamline analysis steps – output of one step needs to match input of next • Read data fast enough from disk storage • Pre-stage data from tertiary storage
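The "efficient indexing" need above is what the center's Bitmap Index work addresses. Below is a toy bitmap index in NumPy showing the idea of answering a range query over a large attribute (e.g. selecting HENP events by total_energy) without repeatedly scanning everything; this is a didactic sketch, not the actual Bitmap Index implementation.

```python
# Toy binned bitmap index; illustrative only.
import numpy as np

values = np.random.randint(0, 100, size=1_000_000)   # made-up event attribute

# Build one bitmap (boolean array) per value bin: 10 equal-width bins over [0, 100)
bins = np.arange(0, 101, 10)
bin_ids = np.digitize(values, bins) - 1
bitmaps = [bin_ids == b for b in range(len(bins) - 1)]

# Range query: which events have 30 <= value < 60?
# OR the bitmaps of the touched bins, then verify candidates against raw values.
candidates = bitmaps[3] | bitmaps[4] | bitmaps[5]
hits = np.flatnonzero(candidates & (values >= 30) & (values < 60))
```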
Vision: facilitating end-to-end data management • Support entire scenarios • e.g. Data generation, post-processing, analysis • Be willing to apply any technology necessary • SDM center technology • Adapt technology as necessary • Package technology as components • Integration of technologies • Make SDM technology components callable from workflow • Facilitate the use of scientific workflow tools • Manage launching of tasks • Manage data movement • Permit dynamic interaction with workflow • Application scientists must be involved in incorporating the technology into existing frameworks and infrastructures • Need to work closely with app scientists • Identify end-to-end use cases (scenarios) • App scientist should be funded, too • App scientists are the “messengers of good news”
Lessons learned – technology (1) • Scientific workflow is an important paradigm • Coordination of tasks AND management of data flow • Managing repetitive steps • Tracking, estimation • Efficient I/O is often the bottleneck • Technology essential for efficient computation • Mass storage needs to be seamlessly managed • Opportunities to interact with Math packages • Searching and indexing • Searching over billions of objects • Searching in space/time • Searching in multi-dimensional space
Lessons learned – technology (2) • General analysis tools are useful • Statistical analysis, cluster analysis • Feature selection and extraction • Parallelization is key to scaling • Visualization is an integral part of analysis • Data movement is complex • Network infrastructure is not enough – can be unreliable • Need robust software to manage failures • Need to manage space allocation • Managing format mismatch is part of data flow • Metadata emerging as an important need • Description of experiments/simulation • Provenance • Use of hints / access patterns
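A minimal sketch of the "robust data movement" point: check destination space before transferring and retry transient failures with backoff. The scp call is only a placeholder for a real transfer service (e.g. an SRM-based data mover), and the function name is hypothetical.

```python
# Hedged sketch: space check plus retry-with-backoff around a placeholder transfer.
import shutil
import subprocess
import time
from pathlib import Path

def robust_copy(src: Path, dest_dir: Path, retries: int = 3, backoff: float = 30.0) -> Path:
    # Manage space allocation: refuse to start if the destination is too full.
    if src.stat().st_size > shutil.disk_usage(dest_dir).free:
        raise RuntimeError(f"not enough space in {dest_dir} for {src}")

    dest = dest_dir / src.name
    for attempt in range(1, retries + 1):
        try:
            # Placeholder transfer command; a real mover would also verify checksums.
            subprocess.run(["scp", str(src), str(dest)], check=True)
            return dest
        except subprocess.CalledProcessError:
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)   # simple linear backoff before retrying
```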
Application phase diagrams (Data Generation, Post-processing/Summarization, Data Analysis):
• Fusion – Klasky
• Combustion – Chen
• HENP – Lauret
• Astrophysics – Peele
• Astrophysics – Myra
Applications Panel, part 2: engaging the sciences • Topics • What percentage of your time do you spend on data-management-related tasks? • What are these tasks? • Suppose these tasks are taken care of; what other technology would you like the SDM center to support? • How do you expect your software (simulation, analysis) to interoperate with SDM center software? • How do you see the role of the SDM center in your application domain? • Providing support for your SDM needs • Applying technology that enables new science • Jointly developing new technology
Applications Panel, part 2: engaging the sciences • Topics • Close collaborative projects • We believe that it is necessary to work with application scientists jointly in order to apply SDM technology • How do we achieve joint activities: joint funding, good will, advertising, tutorials? • Do you expect some SDM technology to be packaged and used directly from downloads? • Outreach • We believe that if we solve end-to-end use cases, the technology can be spread to the science communities by example • Do you agree? • Other ideas?
SDM center Panel: Plans and Opportunities
Plans (next proposal) • Organizational issues • Plan: same participants • Collaborations: take time to build teams • ISICs encouraged to re-apply • Need to think about “steering the boat” • What does each participant want to work on (not to discuss now) • Funding levels – same? • Technical issues (next slides) • Is the concept of end-to-end attractive? • Is the concept of close collaboration, where scientists are “messengers of good news”, attractive? • What technologies • How do we work with Apps • How do we evolve – labs and universities • Do labs and universities have different roles?
Lessons learned from success stories • What do we consider a success? • Successful use of SDM technology by application scientists • Productivity of scientists • Enabling new science that could not be done previously • What does it take to apply SDM technology? • Stages of activities • Development – of basic technology • Adaptation – to a specific application domain • Integration – into application framework • Deployment – get scientists to use the technology • Application interaction with all these stages • Problems are complex – requires close collaboration • Requires a time commitment from an application scientist • Grid Collector example: funding ½ of an application scientist’s time paid off
Engaging application science communities • Close collaboration is essential for success • Technology adaptation to an application domain is the key to its use • End-to-end solutions should be developed where appropriate • Embed SDM solutions in other packages/frameworks • Math (e.g. parallel I/O for AMR in APDEC) • Application analysis (e.g. ROOT in HEP) • Experiment frameworks (e.g. STAR project)
Engaging application science communities • Need to fund application scientists for joint projects specialized for that application domain • Funds can be made available on an as-needed basis • Controlled by the SDM center through an advisory board • Funded application scientists can spread the word in their communities (joint effort with the SDM center) • Be part of application proposals, so that their office funds the application scientist working with the SDM center, or even some center activities • (e.g. current attempt with PPPL)
SDM center and the role of CS basic research • Stages in technology development • Research, Prototype, Product, Infrastructure • Role of CS basic research • Research-to-prototype • Can afford risky projects since application scientists are not waiting for results • Longer term payoff (2-3 years) • Very important – research technology funneled into SciDAC • Role of SDM center in SciDAC • Prototype-to-Product • Low risk, apply technology that has been prototyped • Shorter term payoff (1-2 years) • Role of SDM center in Application Offices • Product-to-infrastructure • Software moves from "product" to "infrastructure" when sites start installing it by default • Adoption of the software by key groups • Requires jointly-funded collaborations (including App people)
SDM center and other ISICs • Math ISICs • Identify joint activities • e.g. parallel I/O in APDEC • CS ISICs • CCA technology for wrapping components to be used in workflows • PERC technology to identify I/O bottlenecks
Summary • SDM technology can successfully be applied across multiple scientific applications • Close collaboration to establish end-to-end solutions is needed • Application scientists must be involved in incorporating the technology into existing frameworks and infrastructures (messengers of good news) • We recommend a flexible funding structure to support application scientists on collaborative projects • Such funding should be in addition to the SDM center’s funding (about 20% of the center’s funding level) • The funding structure should be managed by the SDM center using an advisory board • We recommend that the Center join application-side proposals when possible • Ability to use funds where technology is needed • Need help from OASCR to make such connections