Explore the challenges and opportunities in retooling the data lifecycle, improving data infrastructure, and making multi-source data available for AI analytics in scientific user facilities.
Data Collection, Reduction, Analysis and Imaging for Scientific User Facilities
Crosscut: Data Infrastructure and Lifecycle
Katie Knight, ORNL, Co-Lead
Brad Settlemyer, LANL, Co-Lead
Arjun Shankar, ORNL, Co-Lead
Katie Jones, ORNL, Science Writer
August 21, 2019
Breakout for ORNL AI for Science Town Hall
Participation and process
• 43 on the list; about 30 participants
• A set of self-identified domain scientists – about 7 (but vocal)
• Process
  • Spent the first 45 minutes reviewing homework on ~50 submissions and yesterday's domain read-outs, then categorizing and collating input from the breakout
  • Split into three main sub-breakout topics
  • Distilled to 1–2 slides per sub-breakout and reviewed as a group
• Three sub-breakouts
  • Re-tooling the Data Lifecycle to Facilitate AI
  • Data Infrastructure for AI
  • Make Multi-Source and Disparate/Distributed Data Available to AI Analytics
Discussion Converged on Sub-Breakout Topics
• Re-tooling the Data Lifecycle to Facilitate AI
• Data Infrastructure for AI
• Make Multi-Source and Disparate/Distributed Data Available to AI Analytics
1. Retooling the Data Lifecycle to Facilitate AI
[Figure omitted. Image credit: Suhas Somnath]
Retooling the Data Lifecycle to Facilitate AI (contd.)
• Challenges
  • Metadata standardization and collection
  • Data policies (incentivizing sharing, best practices, migration, etc.)
  • Data stewardship (where data is stored, how long it is kept, who will pay for it)
• Opportunities
  • Can AI be leveraged to assist with metadata cleaning and standardization (classifiers)? Data weeding?
  • AI-enabled metadata microservices, preparation and cleaning tools, and data workflows for domain-specific data management needs
  • AI-enabled management of "ever-evolving schemas", curation, etc.
• How should DOE organize?
  • Establish and encourage data best practices
  • Incentivize those who share data of value – a cost model; establish or use a "d-index"
  • Identify a data-aware group to serve as an exemplar for an AI pilot
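The "metadata standardization (classifiers)" opportunity above can be illustrated with a minimal sketch: a character-bigram nearest-neighbour matcher that maps inconsistent instrument metadata keys onto a canonical schema. All field names and the tiny training set here are hypothetical, not a real facility schema; a production tool would train a proper classifier on a facility's actual metadata corpus.

```python
# Sketch (illustrative only): map raw, inconsistent metadata keys to a
# hypothetical canonical schema via character-bigram similarity.

def bigrams(s: str) -> set:
    """Character bigrams of a key, normalized for case and separators."""
    s = s.lower().replace("_", " ").replace("(", " ").replace(")", " ")
    return {s[i:i + 2] for i in range(len(s) - 1)}

# Toy examples of raw keys seen in instrument files -> canonical field.
TRAINING = {
    "Temp_K": "temperature",
    "temperature(K)": "temperature",
    "beam_energy_keV": "beam_energy",
    "BeamEnergy": "beam_energy",
    "exposure_s": "exposure_time",
    "ExposureSec": "exposure_time",
}

def standardize(raw_key: str) -> str:
    """Return the canonical field whose training key is most similar."""
    q = bigrams(raw_key)

    def jaccard(known_key: str) -> float:
        b = bigrams(known_key)
        return len(q & b) / len(q | b)

    return TRAINING[max(TRAINING, key=jaccard)]

print(standardize("sample_temperature_K"))
print(standardize("exp_time_seconds"))
```

The same interface could sit behind one of the "AI-enabled metadata microservices" mentioned above, with the similarity function swapped for a learned model once enough labeled metadata exists.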
2. Data Infrastructure for AI
Data Discovery and Linking – Enhance infrastructure to support data discovery and to exchange/infer links between independent, related data sets
• Grand challenge impacts: multi-messenger physics (Astronomy, Fusion, Fundamental Physics), measurement silos (Climate, Transportation & Mobility), link discovery (Materials)
Incentivizing Curated Data – Infrastructure that recognizes the value of curated data, supports annotation, incentivizes sharing, and reduces duplicated effort in data processing
• Grand challenge impacts: edge sensors (Transportation & Mobility, Climate, AM), logs (Facilities)
Data Infrastructure for AI (contd.)
Converged AI and HPC Data Infrastructure – Converged data infrastructure that seamlessly supports modeling-and-simulation data workloads alongside emerging AI workloads; enable efficient access for fundamental AI operations (e.g., training, re-training, inference) as first-class workloads
• Grand challenge impacts: support for massive AI parameter spaces (Fusion, Climate), combination of simulation and AI (all)
Infrastructure for Policy Enforcement/Sharing – Infrastructure supporting the sharing/access protocols emerging for new AI data sets (protecting data at rest and controlling data motion)
• Grand challenge impacts: data regulations (Health Care), control data (Energy Generation, Fusion, AM)
Distributed Workflows – Infrastructure supporting workflows across multiple facilities (experimental, simulation, observational/edge, long-term data storage) and enabling short-timescale control (including real-time control)
• Grand challenge impacts: experiment control (Fusion, Materials, Energy Generation, Manufacturing), multi-messenger physics (Astronomy, Climate)
3. Make Multi-Source Disparate/Distributed Data Available to AI
Enable a domain-independent data layer feeding domain-dependent analysis
• Sub-challenges
  • Multi-modal data linking embedded in analytics for domain-dependent AI [Health/Biology, Transportation, Nuclear Structure]
  • Creating data exchanges and markets that federate facilities for IoT AI [Transportation, Energy Grid, ..]
  • Policies to balance availability/shareability and privacy; AI is hindered by lack of data [Health, Nuclear]
• Impact
  • Enables scale-first thinking and makes data persistent, fluid, and reusable
  • Helps us ascribe value to the data and value to the analysis; curate using AI
  • Creates pipelines to share and transform scientific data across shapes/structures/formats
  • Unlocks data through markets and exchanges
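One way to picture the "domain-independent data layer" idea is a common envelope that wraps heterogeneous payloads with shared metadata, so records from disparate sources can be linked before domain-dependent analytics run. This is a minimal sketch under assumed conventions: the source names, fields, and the `sample_id` join key are all hypothetical.

```python
# Sketch (illustrative only): a domain-independent envelope around
# domain-specific payloads, linked across sources by a shared sample ID.
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Envelope:
    source: str                  # originating facility/instrument (hypothetical)
    sample_id: str               # cross-facility linking key (assumed to exist)
    payload: dict[str, Any]      # domain-specific content, left untouched
    provenance: list[str] = field(default_factory=list)

def link_by_sample(records: list[Envelope]) -> dict[str, list[Envelope]]:
    """Group envelopes from any number of sources by their shared sample ID."""
    linked = defaultdict(list)
    for r in records:
        linked[r.sample_id].append(r)
    return dict(linked)

records = [
    Envelope("neutron_source", "S-001", {"spectrum": [0.1, 0.4]}),
    Envelope("microscopy", "S-001", {"image_uri": "img/001.tif"}),
    Envelope("neutron_source", "S-002", {"spectrum": [0.2]}),
]
linked = link_by_sample(records)
print(sorted(linked))        # samples that have any linked data
print(len(linked["S-001"]))  # two modalities linked for S-001
```

In practice the hard part is the sub-challenge named above: the linking key rarely exists across facilities, so link inference (rather than an exact join) is where AI would enter.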
Agenda
• Review charge – 5 mins
• Context and relevant submissions – 15 mins
• Input from domain speakers and response to the straw-man topic areas – 15 mins
• Divide into topic areas
  • Identify a lead responsible for each sub-breakout slide and a note taker
• Reconvene with lunch at 12:00 pm
• Groups provide their main issue as the top line on one slide
  • Challenge problem; why this is a challenge; what is the impact; how should DOE organize to tackle this
• Reassemble and plan to put up three to four slides for report back
Slide link: https://bit.ly/30oiAWz (no longer current after the breakout)
• Infrastructure – Brad
• Lifecycle – Katie
• Analytics + Markets – Ketan (Arjun nudging toward coherence and our deadline)
Pick up lunch at noon; plan to converge on one slide each by 12:15
Discussion/Input at the start of the breakout
• Data Infrastructure
  • Data transfer
  • Security, access control
  • Streaming
  • Delivery mechanism (HPC style or cloud style?); declarative mechanisms?
  • Timeliness (how does data age?)
  • Tool chain – should be sponsor-policy aware
  • AI for infrastructure
  • Reach into all labs/data
  • Cost models for data (what technology are we using?)
  • Search (also lifecycle)
  • Edge–fog–cloud
• Data Lifecycle – things about the data
  • Data representation; metadata; provenance; quality metrics (domain specific) – capture uncertainty; need an assertion from the source; large time spent in data cleaning; allow caveats to be captured retroactively
  • Dissemination/publication
  • Integrity of the data
  • What do we want to keep? Domain-aware
  • Distinct phases: create, store, use, update, publish, delete; policies – what does it mean to use data for AI? Depends on the goal (e.g., the Astro/LSST example has a clear policy and protocol)
  • Persistent identifiers for data and algorithms
  • Versioning
  • Value metrics to help infrastructure, and policies to help cost models
  • Curation (human-driven, AI-driven?)
• Data Analytics – functionality focused
  • Fusion of data, multimodal
  • Linked data (e.g., extract data from all publications)
  • Data + algorithms/methods – for repeatable science
  • Trading off re-computing vs. storing the data (approximations, reductions) – connects to representation
  • Particular analytics for quick turnaround (e.g., manufacturing) – so we need the right delivery mechanisms/tools
  • Synthetic data creation and use
• Data markets and reachability (includes norms, ethics, law, etc.), e.g., create agreements for availability (program them?); data has value – crosscuts all of the above (infrastructure that facilitates, cross-domain lifecycle, etc.)
  • Complex access models
  • Need foundational thinking: secure enclaves, etc.
  • Make DOE aware of the policies in an interagency context
  • Data access while remaining secure <- grand challenge?
Data Lifecycle (contd.)
How should DOE organize to tackle this?
Metadata Cleaning / Curation
• Can AI be leveraged to assist with metadata cleaning and standardization (classifiers)? Data weeding?
• AI-enabled metadata microservices, cleaning tools, and data workflows for domain-specific data management needs
• Metadata lake (to train classifiers)
• Synthetic metadata (to train classifiers)
• Enable generation of "ever-evolving schemas"
Data Policies
• Incentivize those who share data of value – a cost model; establish a "d-index"
• Use AI to enable social change
• There needs to be institutional (national?) infrastructure to support AI
• Who will pay for data storage?
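The "Synthetic metadata (to train classifiers)" bullet can be sketched concretely: starting from a few canonical fields, enumerate plausible raw-key variants (casing, separators, unit suffixes) to bootstrap training data for a metadata-standardization classifier. The canonical fields, units, and variant rules below are purely illustrative assumptions.

```python
# Sketch (illustrative only): generate synthetic raw metadata keys from
# hypothetical canonical fields, as labeled training data for classifiers.
import itertools

CANONICAL = {
    "temperature": ["K", "C"],
    "beam_energy": ["keV", "eV"],
}

def synth_variants(field_name: str, units: list) -> list:
    """Enumerate synthetic raw-key spellings for one canonical field."""
    stems = [
        field_name,                                  # snake_case
        field_name.replace("_", ""),                 # squashed
        field_name.title().replace("_", ""),         # CamelCase
    ]
    variants = []
    for stem, unit in itertools.product(stems, units):
        variants.append(f"{stem}_{unit}")            # suffix style
        variants.append(f"{stem}({unit})")           # parenthesized style
    return variants

# Labeled (raw_key, canonical_field) pairs for classifier training.
training = [(v, name) for name, units in CANONICAL.items()
            for v in synth_variants(name, units)]
print(len(training))   # 2 fields x 3 stems x 2 units x 2 styles = 24
```

A real pipeline would combine such synthetic pairs with keys harvested from an actual "metadata lake", since hand-written variant rules never cover everything instruments emit.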
Data Lifecycle (contd.)
How should DOE organize to tackle this?
Data Storage and Sharing
• AI that quantifies that you saved $X because the data already exists somewhere
• Recommender agents (for data quality and experiment design, anomaly detection)
• Contingent on descriptive, interoperability, administrative, semantic, preservation, and provenance metadata (see cleaning and curation)
Metadata for Data Types
• Static vs. streaming data
• DOIs for streaming data?
• Challenges of heterogeneous data in streams
• Provenance / administrative metadata to inform AI classifiers (context)
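The "recommender agents ... anomaly detection" idea above can be made concrete with a minimal sketch: a rolling z-score monitor over a stream of readings, the kind of check an end-station agent might use to flag suspect acquisitions. The window size, threshold, and readings are illustrative, not tuned for any real instrument.

```python
# Sketch (illustrative only): rolling z-score anomaly flagging over a
# stream of detector readings, as a building block for an end-station
# recommender agent.
from collections import deque
from statistics import mean, stdev

class StreamMonitor:
    def __init__(self, window: int = 10, threshold: float = 3.0):
        self.history = deque(maxlen=window)   # recent readings only
        self.threshold = threshold            # z-score cutoff

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous vs. the recent window."""
        anomalous = False
        if len(self.history) >= 3:            # need a few points for stats
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        self.history.append(value)
        return anomalous

monitor = StreamMonitor(window=5, threshold=3.0)
stream = [10.0, 10.2, 9.9, 10.1, 10.0, 55.0, 10.1]
flags = [monitor.observe(v) for v in stream]
print(flags)   # only the 55.0 spike is flagged
```

A production agent would layer the other capabilities from the slide on top of such a detector: consulting prior data and articles for context, and recommending adjustments to the acquisition itself.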
What is the challenge problem? Why is this a challenge? What is the impact? How should DOE organize to tackle this?
• Gathering good-quality data for AI:
  • Intents, error bars, biases, etc.
  • All relevant metadata
  • What do we do with data that someone doesn't care about storing? A last-resort archival place – who will pay for all this?
• Social (and policy) challenge – (institutional) carrots:
  • This is MUCH harder than the technical challenges
  • Forces sharing, comprehensive metadata, and discoverability
  • $: incentivize those who share data of value – a cost model; establish a "d-index"
  • Institutions / user facilities (e.g., SNS) take ownership of data even though funding agencies say otherwise
• DOIs for streaming data
  • Compression may be applied because of bandwidth limitations; this should be recorded so the network can take it into account
  • AI for streaming data – how would this differ from a constant-sized dataset?
• AI that quantifies that you saved $X because you already have the data
  • Federated data repository
• AI can be used to create an ever-evolving metadata schema; AI for extracting only necessary metadata
• Recommender agent at the experiment end-station for improving the quality of data at acquisition/generation
  • Context provider – prior data and articles
  • Describing data – metadata
  • Feature extraction and anomaly detection
  • "Dumpster diving"
• Agent that works along the lifecycle
  • Policies, provenance, and licenses
  • Who can use it?
  • Can you trust it (provenance)? Integrity?
  • Is it FAIR?
• Creating a scientific version of a data lake
  • Not preemptively indexing the world
  • Learn based on synthetic data/metadata