HPC USER FORUM ISV PANEL April 2010 Dearborn, MI


Panel Members • Moderators • Alex Akkerman, Ford Motor Company • Sharan Kalwani, KAUST • Participants • Steve Feldman, CD-adapco • Matt Dunbar, Simulia • Uwe Schramm, Altair Engineering • Li Zhang, Livermore Software Technology Corporation • Barbara Hutchings, ANSYS, Inc. • Martin McNamee, MSC Software

Panel Format • 4 Questions • Provided ahead of time • 2 minutes per question for each participant • Follow-up and Audience after each participant had a chance to comment

Q1. Applications Scalability….. • Please share with the audience, briefly, the issues surrounding applications scalability and how they are being addressed.

Q1. Applications Scalability….. • Our solvers scale reasonably well to 512 cores or more with very large problems. • Very few actually use this many on one analysis today • Primary solver bottlenecks: • Memory bandwidth • Unbalanced work loads • Untapped speed potentials • Parallel Meshing • Parallel post-processing • Parallel I/O • Different algorithms, re-evaluation of methods

Q1. Applications Scalability….. • Fundamental limitations on scalability remain scalar sections of code (Amdahl’s law) and load balance • Solution is still developer time and effort which is being invested • Looking at ways to improve developer efficiency through use of better programming models (primarily from Intel and Microsoft) • Past programming model changes were either too limited (OpenMP), too immature, or simply ineffective • Newer models driven by need to bring multi-core execution to commodity applications have more promise
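The Amdahl's law limit mentioned on this slide can be illustrated with a minimal sketch. The function name and the 5% serial fraction are assumptions chosen for illustration; they are not figures reported by any panelist.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` cores when `serial_fraction` of the
    single-core runtime cannot be parallelized (Amdahl's law)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    # Hypothetical code with 5% serial work: speedup flattens out
    # well below the core count as cores are added.
    for cores in (1, 8, 64, 512):
        print(cores, round(amdahl_speedup(0.05, cores), 2))
```

Under this model the speedup can never exceed 1 / serial_fraction, which is why reducing the serial sections matters more than adding cores once the core count is large.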

Q1. Applications Scalability….. • We’re talking finite element solver applications • Two classes of solvers • Iterative schemes • Matrix inversion schemes • Issues: Scalability, Quality, Repeatability, Data transfer, Hardware configurations, Hardware access • Addressed: Optimal domain decomposition, Computational methods that scale well, Solver architecture, Focus on certain hardware, Partnerships

Q1. Applications Scalability….. • Consistency • Data summation order for different MPI causes errors – LS-DYNA uses fixed order. • Modified/refined model decomposes in different way changing results – ‘Cut lines’ are preserved from 1st run. • Scalability • Scaling for 128 processors not always good. • Hybrid LS-DYNA runs SMP within processor and MPP between processors. • Results consistent with increased # SMP threads. • Simple command line to execute.
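The summation-order issue can be shown with a small sketch: floating-point addition is not associative, so combining per-rank partial sums in a different order can change the result. This is only an illustration of the general effect, not LS-DYNA's implementation; the function name and the sample values are hypothetical.

```python
def reduce_in_order(partials, order):
    """Add per-rank partial sums in the given rank order."""
    total = 0.0
    for rank in order:
        total += partials[rank]
    return total


# Hypothetical partial sums from four ranks, chosen so that rounding
# in double precision makes the effect visible.
partials = [1e16, 1.0, -1e16, 1.0]

# Fixed rank order: the same answer on every run.
fixed = reduce_in_order(partials, [0, 1, 2, 3])

# A different combination order, as might occur when partial sums are
# combined in whatever order they happen to arrive.
other = reduce_in_order(partials, [0, 2, 1, 3])

print(fixed, other)  # the two totals differ
```

Fixing the order does not make the result more accurate; it makes it repeatable, which is the consistency property the slide describes.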

Q1. Applications Scalability….. • Solver scaling continues to expand • CFD to 1000, Structures to 100 • Especially key to accelerating transients • Need to address the scalar bottlenecks • Across full simulation process (meshing, I/O, certain solver physics, visualization) • Hybrid parallel algorithms for multi-core/mixed-core • Distributed/shared memory, OpenMP vs. MPI • Support for latest communication technologies • QDR IB, iWARP for 10gigE, etc.

Q1. Applications Scalability….. • Hybrid parallel model • Shared memory parallel within a compute node (multi and many cores) • Distributed memory parallel across nodes in a cluster via MPI • Domain decomposition • Study and collaboration with industry and academia • Software componentization • High level modules and math functions • Low level math kernels

Q2. Licensing Model..... • As the hardware technology shift to multi-core processors continues and even accelerates, the licensing models of many ISV codes become a serious problem for your customers. Per-core licensing becomes unaffordable and limits the ability to improve, or even maintain, the performance levels of the recent past. Panel participants: how can you help your customers become more competitive given current technology trends?

Q2. Licensing Model..... • MSC Software licensing is undergoing a revision and is not yet at a point for public release • However, MSC Software recognizes the need for an affordable price point for parallel processing

Q2. Licensing Model..... • Modified ANSYS HPC licensing in 2009 • Tied to the value of HPC • New scalable licensing enables ‘extreme/unlimited parallel’ for high fidelity; minimizes the licensing “penalty” on higher core count processors • Enterprise access is key • Hardware located anywhere, users located anywhere • Owned, rented, IaaS • Interchangeable across physics • Buy once, deploy once

Q2. Licensing Model..... • One Code Strategy – LS-OPT, LS-PrePost, Dummies, barriers & head forms FEA models all available as part of LS-DYNA distribution with no additional license keys required. • Ultimate Value: Multi-physics & multi-stage capabilities in one scalable code. • Flexibility: 4 core license allows 4 one core jobs or one 4 core job. • Steeply decreasing licensing fees per core as the # processors increase. • Unlimited core site license.

Q2. Licensing Model..... • Licensing not based on number of cores • Per-use token licensing • Addressing through special license decay • Multi-run environments • Massive computation

Q2. Licensing Model..... • Have asked this question internally and am presenting collective responses • Two factors in license price • Parallel development and testing are more expensive than scalar • With SIMULIA the typical sale is an annual license, so, on the one hand, the sales force is motivated to maintain a good relationship with the customer, but on the other hand the sales force is fearful of “revenue erosion” from “free parallel” • SIMULIA sales team views the existing licensing model as one that rewards parallel execution with lower “per core hour” execution costs • Requires greater base token pool • Change? • Requires “revenue neutral” shift in licensing model • Great volume of sales (more customers)

Q2. Licensing Model..... • Our “Power” session licenses are independent of the number of cores used for a single analysis • Our “Cloud license” model is also independent of the number of cores and of the number of simultaneous analyses. You pay only for what you use and we do not care where you run. • We make our clients more competitive by adding value with each release: • Cut the total engineering time required for analysis. Engineering time is far more expensive than computer time. • Enlarge the universe of problems that our tools can be employed to analyze while working to make all our analysis more accurate.

Q3. New Technology Adoption….… We notice a considerable lag in adoption of new technologies (e.g. FPGA, GPGPU) in the Manufacturing CAE space. Please elaborate on the issues and your response.

Q3. New Technology Adoption….… • New technologies come with lots of hype and little infrastructure. It takes time for languages, compilers, debuggers,… to mature and standardize. We are not in a position to rewrite 1M lines in assembly language every time some new device appears. • New technologies are not always applicable to our particular needs. • When we see a technology with reasonable potential for return on investment, we partner with the technology providers, watch the literature, assign researchers… and it does not always pay off.

Q3. New Technology Adoption….… • Adoption of GPGPU, and to greater extent, FPGA has high programming cost for large, general purpose codes • Result is that GPGPU focus tends to be on acceleration of obvious bottlenecks, preferably, with low code line counts • Drawback for parallel codes is that often greater parallel gains are in same areas, so gains from GPGPU are considerably less for parallel codes than for scalar codes • Even where adoption is underway, keeping x86 and GPGPU code (CUDA/OpenCL) results in two code bases • SIMULIA is accelerating obvious code with GPGPU, and working internally and with partners to find better programming model

Q3. New Technology Adoption….… • Technology needs to be a fit for certain computational methods – memory, data transfer • We’re trying, but the gains do not justify the effort • Technology is not where it needs to be • Lack of standards

Q3. New Technology Adoption….… • LSTC is currently evaluating the impact of GPUs on the performance of implicit LS-DYNA. The GPU is applied to the innermost computational kernel of the sparse matrix factorization. • GPUs offer high performance for certain computational kernels. • Performance is subject to the overhead cost of transferring the data to the GPU and the results back from the GPU. • Performance will no longer degrade for REAL*8 arithmetic when the Nvidia Fermi GPUs become available. • LSTC hopes to have the GPU implementation in Implicit around mid-year.
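The effect of transfer overhead on GPU offload can be sketched with a simple timing model. The model, the function name, and all numbers below are assumptions for illustration; they are not LSTC measurements.

```python
def offload_speedup(cpu_seconds: float, kernel_speedup: float,
                    transfer_seconds: float) -> float:
    """Net speedup of one kernel when offloaded to an accelerator.

    cpu_seconds:      kernel time on the CPU
    kernel_speedup:   how much faster the accelerator runs the kernel itself
    transfer_seconds: time to copy inputs to the device and results back
    """
    offloaded_seconds = cpu_seconds / kernel_speedup + transfer_seconds
    return cpu_seconds / offloaded_seconds


if __name__ == "__main__":
    # Hypothetical kernel: 10 s on the CPU, 10x faster on the GPU.
    for transfer in (0.0, 1.0, 9.0):
        print(transfer, offload_speedup(10.0, 10.0, transfer))
```

In this model a 10x kernel speedup drops to 5x with one second of transfer time and disappears entirely at nine seconds, which is why the amount of data moved per unit of computation matters for kernels such as sparse factorization.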

Q3. New Technology Adoption….… • Establishing ROI is critical - and unclear • Moving technology target (CPUs vs GP-GPUs) • Substantial investment required • Only a subset of operations map to GPU without significant algorithm changes • Bottleneck associated with memory access to/from off-CPU boards; Not enough memory to offload “entire algorithm” • Lack of ‘off the shelf’ vendor libraries; multiple development environments (OpenCL / CUDA) • Some “low hanging fruit” (e.g., matrix factorization) • Available now (beta) on GP-GPUs

Q3. New Technology Adoption….… • Product (GPGPU/FPGA) robustness • Porting cost • Actual vs. theoretical speedup

Q4. Breakthrough Performance….. Could you please comment on how your products could potentially evolve near- or mid-term leading to substantially higher levels of performance for your customers?

Q4. Breakthrough Performance….. • Near term • potential speedup due to hardware interfaces to de-coupled numerical modules and math kernels • Very little porting cost • Mid term • Enable use of large shared memory systems • Implied assumption of resource availability • Relax legacy requirements to operate in small memory • Greater use of multi and many core

Q4. Breakthrough Performance….. • New, more scalable solvers • With promise to extend scaling to 1000’s+ core • Robustness is key (takes time) • Vector processing paradigms (multicore, GPU) • Parallel execution of multiple design points • Full automation of parametric updates • Human productivity and compute throughput

Q4. Breakthrough Performance….. • New features continuously implemented (Electromagnetics, Acoustics, Frequency response, Compressible/incompressible fluids, Isogeometric elements). • Multiscale capabilities under development to have initial release this year. • Hybrid MPI/OPENMP promises major scalability boost at high # processors for both explicit and implicit solutions – scaling to 1000’s of nodes for both explicit & implicit solvers. • Replace prototype testing by simulation: • Strict modeling guidelines for analysts • A single FE model for crash, NVH, durability, etc. • Advance in Constitutive models, Contact, FSI with SPH, ALE, Particle methods, Sensors and Control Systems, and complete compatibility with NASTRAN • Manufacturing simulations (in LS-DYNA, Moldflow, etc.) to provide initial conditions for crash simulations.

Q4. Breakthrough Performance….. • Solver runtime is not the only thing that matters; how solver use impacts design decisions matters too • New paradigms of designing products • Integration of design methods with solvers • Advancing use of multi-CPU • Advancing numerical techniques

Q4. Breakthrough Performance….. • Abaqus/Explicit is unlikely to see major breakthrough in near to mid term, but will show steady incremental improvement • Customer base execution of Abaqus/Standard for large jobs exceeded customer adoption about 3 years ago (large model scaling to 128 to 256 cores) • Takes time for customers to get credible performance data and to change hardware available in order to adapt to a shift in scalability • For implicit FEA hard to get away from “Nastran node” for several years • SIMULIA investigating next possible jumps in performance • For Abaqus/Standard working on “strong scaling” gains (i.e. deliver scalability throughout problem size range) • Beyond the “more cores” approach potential of GPGPU is great, but need a programming breakthrough

Q4. Breakthrough Performance….. • Our goal is to cut down the TOTAL simulation time. • Meshing/CAD interfacing. We have gone from weeks of preparation time to hours. We have made enormous breakthroughs in our ability to process “dirty” CAD. • Post-processing – recently cut the time to output a specific set of plots from 40 hours to 1 hour. • Strategies to deal with larger models and transients, including use of parallel I/O. • Customization and integration with the client’s own workflows and processes. • Solver efficiency alone is not the only important measure of performance.
