Breakout Plan
• 9:00 – 9:30 am  Intro to the Breakout
  • Quick introductions
  • Review of ANL findings
  • Plan for subgroup division and charge
  • Reporting template discussions
• 9:30 – 10:45 am  Break into subgroups; fill in breakthrough slide (slide 1)
• 10:45 am  Begin slide on science impact (slide 2)
• 11:45 am  Give completed slides to breakout leads
• 12:00 – 12:30 pm  Get lunch and come back
• 12:30 – 1:00 pm  Finalize slide presentation
• 1:00 – 2:30 pm  Co-leads present report-back slides
Facilities Integration and AI Ecosystem
Co-lead: Michael E. Papka
Co-lead: Inder Monga
Co-lead: James J. Hack
Technical Writer: Scott Jones
List of breakout participants
Kalyan Perumalla, Carlos Soto, Suzy Tichenor, Torre Wenaus, Julia White, Sean Wilkinson, Da Yan, Junqi Yin, Inder Monga, Bobby Sumpter, Natalia Vasileva, Jay Bardhan, Arthur Bland, Jim Brandt, Guojing Cong, George Fann, James Hack, Sean Hearne, Scott Jones, Andrew Kail, Doug Kothe, Ralph Kube, Michael Matheson, Veronica Melesse Vergara, Bronson Messer, Michael Papka
Multi-facility integration and streaming data
• Specific capabilities that need development:
  • Automated flows of data in and out of the HPC environment (e.g., streaming, batch processing, …)
  • Integrated instruments and experimental facilities need to be impedance-matched to the computing environment
  • The role of edge processing needs to be considered in overall multi-facility workflows
  • On-demand computing is needed for real-time feedback and control and for characterization of instruments (e.g., computational steering)
  • Multi-facility scheduling across all resources that allows real-time processing of streaming data
  • Federated identity/instrument capabilities are an essential part of the solution
• How much of the HPC system do you need?
  • Optimization means minimizing time to solution
  • e.g., synthesis of materials needs fast turnaround; characterization of materials can tolerate a longer turnaround
• Use AI methodologies to decide where generation and manipulation of data should happen (at the edge, in the HPC environment, in the cloud, …)
  • This is a multi-parameter optimization problem that includes real costs; see the sketch after this list
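As a toy illustration of the placement decision above, the following sketch scores candidate processing locations by a weighted combination of estimated turnaround time and monetary cost. The site names and all numeric parameters are hypothetical placeholders, not measurements from any facility.

```python
# Hypothetical sketch: choosing where to process a data stream
# (edge, HPC center, or cloud) by minimizing a weighted cost.
# All parameters below are illustrative placeholders.

from dataclasses import dataclass

@dataclass
class Site:
    name: str
    transfer_gbps: float   # achievable network bandwidth to the site
    compute_rate: float    # data processed per second once it arrives (GB/s)
    dollars_per_gb: float  # rough monetary cost of processing there

def turnaround_s(site: Site, data_gb: float) -> float:
    """Transfer time plus processing time for one batch of data."""
    transfer = data_gb * 8 / site.transfer_gbps   # GB -> Gb, divide by Gbps
    compute = data_gb / site.compute_rate
    return transfer + compute

def score(site: Site, data_gb: float, time_weight: float = 1.0,
          cost_weight: float = 0.1) -> float:
    """Single scalar combining time-to-solution and real cost."""
    return (time_weight * turnaround_s(site, data_gb)
            + cost_weight * site.dollars_per_gb * data_gb)

sites = [
    Site("edge", transfer_gbps=10.0, compute_rate=0.5, dollars_per_gb=0.00),
    Site("hpc", transfer_gbps=100.0, compute_rate=50.0, dollars_per_gb=0.02),
    Site("cloud", transfer_gbps=25.0, compute_rate=10.0, dollars_per_gb=0.10),
]

batch_gb = 200.0
best = min(sites, key=lambda s: score(s, batch_gb))
print(f"route batch to: {best.name} "
      f"(turnaround ~{turnaround_s(best, batch_gb):.1f} s)")
```

A learned model could replace the hand-set weights and rate estimates here; the point is that the routing choice reduces to scoring each placement against time and cost objectives.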
Cross-facility optimization of applications
• AI as an improvement to, and adjunct for, simulation
• Grand challenge: identifying appropriate data from science investigations across facilities that explore similar phenomenology, methods, and computational workflows, to train and optimize using AI techniques
  • e.g., there is insufficient data to optimize things like AMR techniques; capturing data on AMR behavior across multiple simulations, across labs, etc., may have value
• Leverage other investments in the development of trained AI models (see model management)
  • e.g., with appropriate metadata, leverage community work
• Finding and sharing datasets for training (e.g., metadata challenges, access-time challenges, …)
  • e.g., if I need to optimize a DFT simulation, can I get data from a broad range of DFT simulations, with the appropriate metadata, to collectively and thoroughly characterize simulation capabilities? (a sketch of such a metadata query follows below)
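To make the dataset-discovery idea concrete, here is a minimal sketch of filtering a shared catalog by metadata to assemble a cross-facility training corpus. The catalog entries, field names, and dataset IDs are assumptions for illustration, not an existing standard.

```python
# Minimal sketch: selecting cross-facility training data by metadata.
# The catalog entries and field names below are hypothetical; a real
# system would need agreed-upon community metadata standards.

catalog = [
    {"id": "ds-001", "method": "DFT", "code": "VASP",
     "facility": "lab-A", "functional": "PBE", "n_records": 12000},
    {"id": "ds-002", "method": "DFT", "code": "QuantumESPRESSO",
     "facility": "lab-B", "functional": "PBE", "n_records": 8000},
    {"id": "ds-003", "method": "MD", "code": "LAMMPS",
     "facility": "lab-A", "functional": None, "n_records": 50000},
]

def find_datasets(catalog, **criteria):
    """Return catalog entries whose metadata match all given criteria."""
    return [d for d in catalog
            if all(d.get(k) == v for k, v in criteria.items())]

# Assemble all DFT/PBE data regardless of which facility produced it.
training_sets = find_datasets(catalog, method="DFT", functional="PBE")
total = sum(d["n_records"] for d in training_sets)
print(f"{len(training_sets)} datasets, {total} records for training")
```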
Model management: community use and sharing of large AI models
• How can the community exploit models developed by others?
• Proprietary data may introduce other challenges (e.g., in the policy space)
• Training models for the community: provenance, what data was used to train, etc.
• Sharing of models: what are the mechanisms, and what is the responsibility of facilities for supporting and maintaining these capabilities?
• Metadata standards relevant to the modeling framework need development (a provenance-record sketch follows below)
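One way to carry the provenance questions above into practice is a structured record published alongside a shared model. The fields below are a hypothetical starting point for discussion, not a proposed standard.

```python
# Hypothetical provenance record for a shared community model.
# Field names are illustrative; real metadata standards would need
# community agreement.

from dataclasses import dataclass, field, asdict
import json

@dataclass
class ModelRecord:
    name: str
    version: str
    framework: str                 # e.g., "pytorch", "tensorflow"
    training_datasets: list        # IDs of the datasets used to train
    data_restrictions: str         # e.g., "open", "proprietary"
    maintaining_facility: str      # who is responsible for upkeep
    metrics: dict = field(default_factory=dict)

record = ModelRecord(
    name="surrogate-dft",
    version="1.2.0",
    framework="pytorch",
    training_datasets=["ds-001", "ds-002"],
    data_restrictions="open",
    maintaining_facility="lab-A",
    metrics={"val_mae_eV": 0.031},
)

# Serialized records like this could be published with model weights.
print(json.dumps(asdict(record), indent=2))
```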
AI environment (software and libraries)
• Need for a scalable environment
  • Scales up seamlessly from a local machine to an HPC machine
  • Benchmarks/tests are needed to ensure that the scaled-up version is correct (a sketch of one such check follows below)
• An SC-wide maintained repository of AI software packages, with metadata on architecture, scale, etc.
• Proper abstractions so the user is protected from software-stack variability, other packages used, and scalability changes of libraries
  • The abstraction layer also helps the computing facilities vet and protect the underlying software infrastructure, so the user is not running unverified software packages (see cybersecurity)
• Use of testbeds for new software and hardware
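As one example of the kind of scale-up correctness test mentioned above, the sketch below checks that averaging per-shard gradients in a data-parallel layout reproduces the serial full-batch gradient. It is a minimal numerical illustration with synthetic data, not a facility benchmark suite.

```python
# Minimal sketch of a scale-up correctness check: the gradient of a
# least-squares loss computed on the full batch should equal the
# average of gradients computed independently on equal-size data
# shards, as in a data-parallel run. Sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 16))   # features
y = rng.normal(size=1024)         # targets
w = rng.normal(size=16)           # current model weights

def grad(Xb, yb, w):
    """Gradient of the mean squared error on one batch."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# "Serial" reference: one gradient over the whole batch.
g_serial = grad(X, y, w)

# "Scaled-up" version: 8 shards, gradients averaged as a data-parallel
# all-reduce would do (equal shard sizes keep the average exact).
shards = np.array_split(np.arange(len(y)), 8)
g_parallel = np.mean([grad(X[s], y[s], w) for s in shards], axis=0)

assert np.allclose(g_serial, g_parallel, atol=1e-12), "scale-up mismatch"
print("data-parallel gradient matches serial reference")
```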
Optimize the operation of facilities
• Using AI for facility operations optimization
  • Managing and organizing the monitoring data coming from the facility, with appropriate metadata; are we gathering the right data?
  • Characterizing applications and understanding their 'fingerprint' to optimize how an application is configured for a particular facility (e.g., architectural awareness); a clustering sketch follows below
  • Development of APIs that enable feedback to systems and users
  • Data from multiple facilities to optimize end-to-end scientific workflows that may use multiple resources across administrative domains
    • Policy, access, data-sharing formats, etc.
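To illustrate application "fingerprinting" from monitoring data, the sketch below clusters jobs by simple telemetry features. The features and values are hypothetical placeholders; a production system would draw on real facility monitoring streams.

```python
# Hypothetical sketch: clustering job telemetry to find application
# "fingerprints". Features and values are illustrative placeholders.

import numpy as np
from sklearn.cluster import KMeans

# Each row is one job: [avg CPU util, avg memory GB, I/O MB/s, net MB/s]
rng = np.random.default_rng(1)
compute_bound = rng.normal([0.9, 40, 5, 10], [0.05, 5, 2, 3], size=(50, 4))
io_bound = rng.normal([0.3, 10, 400, 20], [0.05, 3, 50, 5], size=(50, 4))
jobs = np.vstack([compute_bound, io_bound])

# Group jobs into behavioral fingerprints; the cluster a new job falls
# into could inform how it is configured or scheduled on the system.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(jobs)
print("cluster sizes:", np.bincount(model.labels_))
print("cluster centers (CPU, mem, I/O, net):")
print(model.cluster_centers_.round(1))
```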
Cybersecurity
• Identifying bad actors on the machine: anomaly detection on captured/sensed data using AI techniques (a sketch follows below)
• Real-time action/response to identified problems
• Identifying malicious code
• Policies that allow data to be used across facilities in a way that preserves agreements and restrictions
• Appropriate cybersecurity controls that still allow easy, high-performance streaming of data in and out of the facility
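As a minimal illustration of AI-based anomaly detection on facility telemetry, the sketch below fits an isolation forest to synthetic session features. The features and all data are invented for illustration and stand in for real captured/sensed facility data.

```python
# Minimal sketch: flagging anomalous activity on a system with an
# isolation forest. All data below is synthetic and illustrative.

import numpy as np
from sklearn.ensemble import IsolationForest

# Each row is one session: [login hour, failed attempts, bytes out (MB)]
rng = np.random.default_rng(2)
normal = np.column_stack([
    rng.normal(13, 3, 500),      # daytime logins
    rng.poisson(0.2, 500),       # rare failed attempts
    rng.normal(50, 20, 500),     # modest data egress
])
suspicious = np.array([[3.0, 12, 5000.0]])  # 3 am, failures, huge egress

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# predict() returns -1 for samples that look anomalous relative to
# the training data, and 1 for samples that look typical.
print("suspicious session:", detector.predict(suspicious))
print("typical session:   ", detector.predict([[12.0, 0, 45.0]]))
```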