Enhancing Data Infrastructure for AI in Scientific User Facilities

Explore the challenges and opportunities in retooling the data lifecycle, improving data infrastructure, and making multi-source data available for AI analytics in scientific user facilities.

Presentation Transcript

  1. Data Collection, Reduction, Analysis and Imaging for Scientific User Facilities Crosscut: Data Infrastructure and Lifecycle Katie Knight, ORNL, Co-Lead Brad Settlemyer, LANL, Co-Lead Arjun Shankar, ORNL, Co-Lead and Katie Jones, ORNL, Science Writer August 21, 2019 Breakout for ORNL AI for Science Town Hall

  2. Participation and process • 43 on the list • About 30 participants • A set of self-identified domain scientists – about 7 (but vocal) • Process • Spent the first 45 minutes reviewing homework on ~50 submissions and yesterday’s domain read-outs, and categorizing and collating input from the breakout • Split into three main sub-breakout topics • Distilled to 1-2 slides per sub-breakout, and reviewed as a group • Three sub-breakouts • Re-tooling the Data Lifecycle to Facilitate AI • Data Infrastructure for AI • Make Multi-Source and Disparate/Distributed Data Available to AI Analytics

  3. Signed-up Participants (present: Y)

  4. Discussion Converged on Sub-Breakout Topics • Re-tooling the Data Lifecycle to Facilitate AI • Data Infrastructure for AI • Make Multi-Source and Disparate/Distributed Data Available to AI Analytics

  5. 1. Retooling the Data Lifecycle to Facilitate AI Image credit: Suhas Somnath

  6. Retooling the Data Lifecycle to Facilitate AI (contd.) • Challenges • Metadata Standardization and Collection • Data Policies (incentivizing sharing, best practices, migration, etc.) • Data Stewardship (where stored, how long is it kept, who will pay for it) • Opportunities • Can AI be leveraged to assist with metadata cleaning and standardization (classifiers)? Data weeding? • AI-enabled metadata microservices / preparation and cleaning tools / data workflows for domain-specific data management needs • AI-enabled management of “ever-evolving schemas”, curation, etc. • How Should DOE Organize? • Establishing and encouraging data best practices • Incentivize those who share data of value – cost model. Establish or use a “d-index” • Identify / make exemplar of data-aware group for AI pilot
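The slides name a “d-index” as a way to reward those who share valuable data but do not define it. One possible reading, by analogy with the h-index for publications, is sketched below. Both the definition and the `d_index` function name are assumptions made for illustration, not something the breakout specified.

```python
from typing import Iterable


def d_index(reuse_counts: Iterable[int]) -> int:
    """Hypothetical d-index, modeled on the h-index.

    Returns the largest d such that the contributor has shared d datasets
    that were each reused (cited, downloaded, etc.) at least d times.
    What counts as a "reuse" is an assumption left to the caller.
    """
    # Rank datasets from most reused to least reused.
    ranked = sorted(reuse_counts, reverse=True)
    d = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            d = rank  # the top `rank` datasets all have at least `rank` reuses
        else:
            break
    return d
```

For reuse counts of 10, 8, 5, 4, and 3, this reading gives a d-index of 4. A cost model, as the slide suggests, would still need to decide how reuse is measured and by whom.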

  7. 2. Data Infrastructure for AI Data Discovery & Linking - Enhance infrastructure to support data discovery and exchange/infer links between independent, related data sets • Grand Challenge Impacts: Multi-messenger Physics (Astronomy, Fusion, Fundamental Physics), Measurement Silos (Climate, Transport & Mobility), Link discovery (Materials) Incentivizing Curated Data - Infrastructure that recognizes the value of curated data, supports annotation, incentivizes sharing, and reduces duplicated efforts in data processing • Grand Challenge Impacts: Edge sensors (Transport & Mobility, Climate, AM), Logs (Facilities)

  8. Data Infrastructure for AI (contd.) Converged AI and HPC Data Infrastructure - Converged data infrastructure to support modsim data workloads and emerging AI workloads seamlessly. Enable efficient access for fundamental AI algorithms (e.g. training, re-training, inferencing) as first class workloads • Grand Challenge Impacts: Support for massive AI parameter spaces (Fusion, Climate), Combination of simulation and AI (all) Infrastructure for Policy Enforcement/Sharing - Infrastructure for supporting the sharing/access protocols emerging for new AI data sets (protecting data at rest and controlling data motion) • Grand Challenge Impacts: data regulations (Health Care), control data (Energy Generation, Fusion, AM) Distributed Workflows - Infrastructure supporting workflows across multiple facilities (experimental, simulation, observational/edge, long-term data storage) enabling short time-scale control (including real-time control) • Grand Challenge Impacts: Experiment control (Fusion, Materials, Energy Generation, Manufacturing), Multi-messenger Physics (Astronomy, Climate)

  9. 3. Make Multi-Source Disparate/Distributed Data Available to AI • Enable domain-independent data layer to domain-dependent analysis • Sub-challenges • Multi-modal data linking embedded in analytics for domain-dependent AI [Health/Biology, Transportation, Nuclear Structure] • Creating data-exchanges and markets federating facilities for IoT AI [Transportation, Energy Grid, ..] • Policies to balance availability/shareability and privacy. AI hindered by lack of data. [Health, Nuclear] • Impact • Enables scale-first thinking and making data persistent, fluid, and reusable • Helps us ascribe value to the data, value from the analysis; curate using AI • Creates pipelines to share and transform scientific data across shapes/structures/formats • Unlock data through markets and exchanges

  10. Working Slides During Breakout

  11. Data Collection, Reduction, Analysis and Imaging for Scientific User Facilities Crosscut: Data Infrastructure and Life Cycle Katie Knight, ORNL, Co-Lead Brad Settlemyer, LANL, Co-Lead Arjun Shankar, ORNL, Co-Lead and Katie Jones, ORNL, Science Writer August 21, 2019 Breakout for ORNL AI for Science Town Hall

  12. Agenda • Review charge. 5 mins. • Context and relevant submissions. 15 mins. • Input from domain speakers and response on topic areas straw man. 15 mins. • Divide into topic areas. • Identify lead responsible for the sub-breakout slide and note taker. • Reconvene with lunch at 12:00 pm • Groups to provide main issue as top line on one slide. • Challenge problem; Why this is a challenge; What is the impact; How should DOE organize to tackle this • Reassemble and plan to put up three to four slides or so for report back

  13. Slide link: https://bit.ly/30oiAWz - no longer current after breakout. Infrastructure - Brad • Lifecycle - Katie • Analytics + Markets – Ketan (Arjun nudging towards coherence and our deadline...) Pick up lunch at noon, plan to converge on 1 slide each by 12:15

  14. Discussion/Input at the start of the breakout • Data Infrastructure • Data transfer • Security, Access control • Stream • Delivery mechanism (HPC style or Cloud style?); declarative mechanisms? • Timeliness (how does data age?) • Tool chain – should be sponsor-policy aware • AI for infrastructure • Reach into all labs/data • Cost models for data (what technology are we using?) • Search (also lifecycle) • Edge-fog-cloud • Data Lifecycle – things about the data • Data Representation, Metadata; Provenance; Quality metrics (domain specific) – capture uncertainty; Need assertion from source; Large time in data cleaning; allow caveats to be captured retroactively • Dissemination/Publication • Integrity of the data • What do we want to keep; domain-aware • Distinct phases: create, store, use, update, publish, delete; policies – what does it mean for using it for AI – depends on the goal (e.g., Astro/LSST example – has clear policy and protocol)? • Persistent identifiers for data and algorithms • Versioning • Value metrics to help infrastructure and policies to help cost models • Curation (human-driven, AI-driven?) • Data Analytics – functionality focused • Fusion of Data, multimodal • Linked data (e.g., Extract data from all publications) • Data + Algorithms/methods – for repeatable science • Trading off re-computing vs. storing the data (approximations, reductions) – connects to representation • Particular analytics for quick turnaround (e.g., manufacturing) – so need the right delivery mechanism/tools • Synthetic data creation and use • Data markets and reachability (includes norms, ethics, law, etc.), e.g., create agreements for availability (program them?); Data has value – crosscuts all of the above (infrastructure that facilitates, cross domain lifecycle, etc.) • Complex access models • Need foundational thinking: secure enclaves, etc. • Make DOE aware of the policies in interagency context • Data access while being secure <- grand challenge?
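Several of the lifecycle items above (distinct phases, persistent identifiers, versioning, provenance, captured uncertainty, and caveats added retroactively) could live together in a single record. The following is a minimal sketch of such a record, not a schema proposed by the breakout. The class name, field names, and the example identifier are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Phase(Enum):
    """Lifecycle phases as listed on the slide."""
    CREATE = "create"
    STORE = "store"
    USE = "use"
    UPDATE = "update"
    PUBLISH = "publish"
    DELETE = "delete"


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; all field names are illustrative."""
    persistent_id: str                    # any persistent identifier string
    source: str                           # who asserts the data (assertion from source)
    version: int = 1
    phase: Phase = Phase.CREATE
    uncertainty: Optional[float] = None   # domain-specific quality metric, if known
    provenance: List[str] = field(default_factory=list)
    caveats: List[str] = field(default_factory=list)

    def add_caveat(self, note: str) -> None:
        """Capture a caveat after the fact and log it in the provenance trail."""
        self.caveats.append(note)
        self.provenance.append(f"caveat added at v{self.version}: {note}")

    def update(self, description: str) -> None:
        """Record a new version of the data and move to the update phase."""
        self.version += 1
        self.phase = Phase.UPDATE
        self.provenance.append(f"v{self.version}: {description}")


# Usage: a caveat is attached retroactively, then the data is revised.
record = DatasetRecord(persistent_id="example:0001", source="beamline-A")
record.add_caveat("detector drift suspected")
record.update("recalibrated against reference sample")
```

Keeping caveats and version changes in the same provenance trail is one design choice among several; a real system would likely also need timestamps and the identity of whoever made each change.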

  15. Background, homework, and additional slides..

  16. Data Lifecycle (contd.) How should DOE organize to tackle this? Metadata Cleaning / Curation • Can AI be leveraged to assist with metadata cleaning and standardization (classifiers)? Data weeding? • AI-enabled Metadata microservices / cleaning tools / data workflows for domain-specific data management needs • Metadata Lake (to train classifiers) • Synthetic Metadata (to train classifiers) • Enable generation of “ever-evolving schemas” Data Policies • Incentivize those who share data of value – cost model. Establish a “d-index” • Use AI to enable social change • Needs to be an institutional (national?) infrastructure to support AI • Who will pay for data storage?

  17. Data Lifecycle (contd.) How should DOE organize to tackle this? Data Storage and Sharing • AI that quantifies that you saved $x because data already exists somewhere • Recommender agents (for data quality and experiment design, anomaly detection) • Contingent on descriptive, interoperability, administrative, semantic, preservation, and provenance metadata (see cleaning and curation) Metadata for Data Types • Static vs. Streaming Data • DOI for streaming data? • Challenges of heterogeneous data in streams • Provenance / administrative metadata to inform AI classifiers (context)
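The slide above makes recommender agents and cost-saving estimates contingent on six kinds of metadata being present. A minimal sketch of a completeness check over those six categories follows. It assumes a dataset's metadata is held as a plain dictionary keyed by category name, which is an illustrative choice and not something the slides specify.

```python
from typing import Any, Dict, List

# The six metadata categories named on the slide.
REQUIRED_CATEGORIES = (
    "descriptive",
    "interoperability",
    "administrative",
    "semantic",
    "preservation",
    "provenance",
)


def missing_metadata(metadata: Dict[str, Any]) -> List[str]:
    """Return the categories that are absent or empty in `metadata`.

    `metadata` is a hypothetical mapping from category name to whatever
    fields that category holds; an empty value counts as missing.
    """
    return [name for name in REQUIRED_CATEGORIES if not metadata.get(name)]


# Usage: only descriptive metadata is filled in, so five categories are flagged.
example = {"descriptive": {"title": "neutron scattering run"}, "provenance": {}}
gaps = missing_metadata(example)
```

A check like this only tests presence, not quality; judging whether the metadata is good enough for an AI classifier is the harder problem the slide points at.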

  18. What is the challenge problem? Why is this a challenge? What is the impact? How should DOE organize to tackle this? • Gathering good quality data for AI: • Intents, error bars, biases, etc. • All relevant metadata • What do we do with data that someone doesn’t care about storing? – Last resort archival place, who will pay for all this? • Social (& policy) challenge - (Institutional) carrots: • This is MUCH harder than the technical challenges • forces sharing, • comprehensive metadata, • discoverability. • $ incentivize those who share data of value – cost model. Establish a “d-index” • Institutions / user facilities (e.g., SNS) take ownership of data even though funding agencies say otherwise • DOI for streaming data • Compression may be applied because of bandwidth limitations. This should be recorded so the network can take this into account. • AI for streaming data. How would this be different from a constant-sized dataset? • AI that quantifies that you saved $X because you already have data • Federated data repository • AI can be used to create an ever-evolving metadata schema. AI for extracting only necessary metadata • Recommender agent at the experiment end-station for improving the quality of data @ acquisition / generation • Context provider – prior data and articles • Describing data – metadata • Feature extraction and anomaly detection • Dumpster diving • Agent that works along the lifecycle • Policies, provenance & licenses • Who can use? • Can you trust this (provenance)? • Integrity • Is it FAIR? • Creating a scientific version of a data lake • Not preemptively indexing the world • Learn based on synthetic data / metadata

