Explore the challenges of using industry AI software for DOE's unique hardware and software ecosystems. Address data management, reliability, productivity, and usability concerns. Discuss opportunities for AI in software development.
Software Environment and Software Research
Eric Cyr, SNL
Judy Hill, ORNL
Robert Patton, ORNL
DOE needs cannot be entirely met by the industry AI software stack (e.g., PyTorch, TensorFlow)
• Our data sets may be larger or structured differently than those industry typically uses (e.g., large data but a small number of samples, multi-modal data)
• The industry AI software stack may not be suited to our unique hardware/software ecosystems
• Real-time requirements may require architectures like FPGAs for the trained networks
• Future leadership systems may also have unique hardware
• Opportunity for physics-informed AI, and a desire to have insight into the decisions that are made
• We have security needs – integrity of the source code, system-software-level security, networks, incoming data, etc. Can we verify that the AI has been trained correctly and without unintentional bias or malice?
• Many domain scientists need an AI “recipe book” to guide the application of AI technologies.
• We need more education and co-design – more training, and then ML/AI experts embedded with the domain experts to answer DOE questions and to understand the ML/AI research needs
Software for Data Management
• Crosscut Breakout
• Some aspects of AI-based workflows require more capable data management software
• Self-describing file formats – extending data formats like ADIOS and HDF to include provenance (more AI-friendly file formats); how do we share models – formatting (Data mgmt.)
• Diversity of data sets, modalities, sizes, etc.
• Indexable and archivable data sets – how do we describe our data sets so they can be used for training?
• Mobile data broker – sharing data sets (policy, security is embedded)
Software challenges supporting DOE's unique hardware ecosystems
• Domains have a spectrum of needs, ranging from edge computing to user facility HPC to leadership platforms. Industry solutions likely don’t span the spectrum of DOE needs.
• Scalable solutions (from HPC down to an individual device) that take advantage of the full capabilities of each platform.
• FPGAs, specialized AI hardware (e.g., TPUs), and computing on the edge – the AI software stack has to address this full range of needs
• Programming models and software stack optimized for specialized hardware
• Neuromorphic computing software needs
Modularity and Composability
• How does AI software interact with the existing DOE software ecosystem?
• Automated integration of AI and ModSim capabilities
• API scanner (for scientific workflows) – looking at interoperability in software
• Interoperability (modularity) – AI building blocks, could also include mod/sim software
• How do we modify existing DOE software to interact with AI?
• Ability to plug DOE software into AI software (and the other way around)
• AI workflow management
Reliability of AI software
• DOE has unique problems that require a high level of certainty in the answer (certification of the stockpile, control systems in facilities).
• It is an open research question how to do verification and certification on a non-deterministic, data-driven system (to be done together with the mathematicians).
• What is an effective unit test for AI?
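The question of what a unit test for AI might look like remains open, but two common patterns can be sketched: a tolerance-based test on synthetic data with a known answer, repeated over several seeds, and a metamorphic test that checks a property the training procedure should preserve (here, invariance to the order of the training data). The toy model below (a line fitted by full-batch gradient descent), the function names, and the tolerances are assumptions chosen for illustration, not a recommended certification procedure.

```python
import random


def fit_line(xs, ys, lr=0.05, steps=500):
    """Full-batch gradient descent for y ~ w*x + b under mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def make_data(seed, n=200, true_w=3.0, true_b=-1.0, noise=0.1):
    """Synthetic data with a known ground truth and seeded noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + true_b + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def test_recovers_known_parameters():
    # Tolerance-based check: assert a statistical bound, not exact equality,
    # and repeat over several seeds.
    for seed in range(5):
        xs, ys = make_data(seed)
        w, b = fit_line(xs, ys)
        assert abs(w - 3.0) < 0.1
        assert abs(b + 1.0) < 0.1


def test_order_invariance():
    # Metamorphic check: shuffling the training set must not change the
    # result of full-batch training beyond floating-point round-off.
    xs, ys = make_data(seed=0)
    pairs = list(zip(xs, ys))
    random.Random(1).shuffle(pairs)
    shuffled_xs, shuffled_ys = zip(*pairs)
    w1, b1 = fit_line(xs, ys)
    w2, b2 = fit_line(list(shuffled_xs), list(shuffled_ys))
    assert abs(w1 - w2) < 1e-6
    assert abs(b1 - b2) < 1e-6
```

Tests of this kind show that the training code behaves as expected on problems with known answers. They do not certify a trained model on real data, which is the harder question the slide raises.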
Productivity and Usability
• Domain scientists don’t necessarily understand the full capabilities of AI and need more guidance. Similarly, traditional HPC software developer tools may not be suited for AI workflows.
• Domain scientists want an “AI Recipe Book” for science indicating appropriate techniques for particular classes of problems
• Software developers need data sets suitable for testing and development (including revision control). They also need debugging and performance tools suitable for AI workflows, which may have different characteristics than traditional HPC mod/sim. Software is doing hyperparameter search to find nearly optimal ML models (“autoML”).
• ML Ops, machine learning workflows
• “AI Suit” – domain scientists bringing their personal environment with them
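The hyperparameter search mentioned above ("autoML") can be illustrated with a minimal random-search sketch. The objective is a toy stand-in for a real training run (gradient descent on a one-dimensional quadratic), and the search ranges, function names, and trial counts are assumptions for illustration. Real autoML systems use far richer search spaces and strategies.

```python
import random


def validation_loss(lr, steps):
    """Toy stand-in for training: minimize (w - 2)^2 by gradient descent
    and return the final loss."""
    w = 0.0
    for _ in range(steps):
        w -= lr * 2.0 * (w - 2.0)
    return (w - 2.0) ** 2


def random_search(n_trials, seed=0):
    """Sample hyperparameters at random and keep the best configuration."""
    rng = random.Random(seed)  # seeded so the search itself is reproducible
    best = None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, 0)   # log-uniform learning rate in [1e-4, 1]
        steps = rng.randint(10, 200)    # number of training iterations
        loss = validation_loss(lr, steps)
        if best is None or loss < best["loss"]:
            best = {"lr": lr, "steps": steps, "loss": loss}
    return best


best = random_search(n_trials=50, seed=0)
print(best)
```

Seeding the search makes it repeatable, which ties into the revision-control and debugging needs listed on the same slide.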
Gap in Town Hall Discussion – Opportunity for AI in computer science
AI-Enhanced Software Development (Software 2.0)
• Software Environments and Software Research
• Recommender systems for software development based on AI – e.g., code generation, compiler technology, bug tracking systems, performance tools.
• Opportunity to use AI as part of the software development workflow.
• Automated testing
• Automated code reviews
• We wanted to spend more time in our session on this topic, but felt it was not as relevant as meeting the domain science needs.
Some Initial Ideas
• CS as an Application Domain: Higher Education for Compilers
• Software Ecosystem/Software Research
• Leveraging AI Industry Software Stack for Science (and Gaps)
  • AI is a hot area for industry and under continuous development, but does the industry software stack meet scientific needs?
  • What is the role of existing solver frameworks (e.g., PETSc, Trilinos) in being augmented with AI/ML capabilities?
  • What about scalability? Do these products scale to HPC, and do our apps require it?
• Real-Time and Edge Computing
  • Many applications expressed an interest in computing at the edge (i.e., next to the sensors). How does this impact the typical DOE software stack designed for enterprise computing?
  • Can we take ideas from industrial use and bring them into the DOE software stack? (What do we steal from them?)
• Verification and Validation of AI Software
  • AI is a “black box”. How do we verify and validate any software that might be developed?
• Education of application scientists
  • What is AI, and what can it do and not do? What data can it work on?
  • How do we educate domain scientists so they don’t use any developed software outside of the regime it is intended for?
• Productivity of AI developers
  • Do AI software developers have any special productivity needs that differ from those of HPC developers?
• Software development needed for data management
  • What specific software development needs does data management impose? Special software engineering needs? In-situ needs?
  • Are agent-based models relevant with disparate data sources?
• Digital twin (simulation paired with every experiment)
  • Are there specific software needs for digital twins?
• Immediate needs: We need exemplar problems from domain scientists so that we can make a cookbook in 5 years
• Future needs: How to store data, analyze data, spectrum
• Research needs: 5 to 10 years from now
• Scientific autoML – Modularity and composability / Data/model management
• Complete ecosystem for scientific ML (proxy apps, benchmark data sets, some ML methods, how to read files/manage files) – Trilinos/PETSc for scientific ML (All of the above)
• Hybrid HPC AI (hardware / software) – mixing (Hardware?)
• Programming models – mixed precision for AI (productivity, maintainability of the code – will TensorFlow be around?); need an “MPI” for AI
• API scanner (for scientific workflows) – looking at interoperability in software (Modularity)
• Self-describing file formats – extending data formats like ADIOS and HDF to include provenance (more AI-friendly file formats); how do we share models – formatting (Data mgmt.)
• Indexable and archivable data sets – how do we describe our data sets so they can be used for training? (Data mgmt.)
• Mobile data broker – sharing data sets (policy, security is embedded) (Data mgmt.)
• Software that supports the “AI suit” – support your AI for science at different locations (Modularity)
• Verifiable AI software – want to know that my intern didn’t write something wrong (V&V)
• Trustworthy AI (explainable, reasonable) – UQ for AI? (V&V)
• Interoperability (modularity) – AI building blocks, could also include mod/sim software (Modularity)
• AI-focused performance and debugging tools – what do scalability and performance look like for AI? (Modularity/V&V)
• AI-based code generators (SPECIAL TOPIC)
• Goal: AI as a usable tool for science
• Edge computing or federated computing (Hardware)
• How to record, save, and replay (reproducibility) the AI workflows. (V&V)
• Scalability of AI methods (Hardware)
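The "record, save, and replay" item in the list above can be sketched in a few lines: if every workflow step is logged with its name, arguments, and random seed, the log can be serialized and re-executed to reproduce the result. The step registry, step names, and log layout below are hypothetical, chosen only to illustrate the idea. A real workflow system would also have to capture software versions, hardware, and input data.

```python
import json
import random

# Hypothetical registry of workflow steps; each step is a pure function of
# the current state and its recorded arguments (including any seed).
STEPS = {
    "sample": lambda state, n, seed: state
    + [random.Random(seed).random() for _ in range(n)],
    "scale": lambda state, factor: [factor * v for v in state],
}


def run(log):
    """Execute a recorded workflow from an empty initial state."""
    state = []
    for entry in log:
        state = STEPS[entry["step"]](state, **entry["args"])
    return state


# Record: the log is plain data, so it can be saved next to the results.
log = [
    {"step": "sample", "args": {"n": 4, "seed": 42}},
    {"step": "scale", "args": {"factor": 0.5}},
]
first = run(log)

# Save and replay: round-trip the log through JSON and run it again.
saved = json.dumps(log)
replayed = run(json.loads(saved))
```

Because all randomness flows through recorded seeds, the replay matches the first run exactly on the same Python build. Bitwise reproducibility across parallel hardware is a much harder problem and is not addressed by this sketch.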