Considerations for Scalable CAE on the SGI ccNUMA Architecture
Stan Posey, Applications Market Development
Cheng Liao, Principal Scientist, FEA Applications
Christian Tanasescu, CAE Applications Manager
Topics of Discussion
• Historical Trends of CAE
• Current Status of Scalable CAE
• Future Directions in Applications
Motivation for CAE Technology
Economics:
• Physical prototyping costs continue increasing
• Engineer more expensive than simulation tools

MSC/NASTRAN simulation costs (Source: General Motors):
• 1960: $30,000 (mainframes)
• 1999: $0.02 (workstations and servers)

CAE engineer vs. system costs (Source: Detroit Big 3):
• Engineer: $36/hr
• System: $1.5/hr

[Chart: cost of CAE simulation, cost of CAE engineer, and cost of physical prototyping over the years 1960-2000]
Recent Technology Achievements
Rapid CAE Advancement from 1996 to 1999

Computer Hardware Advances:
• Processors: ability to "hide" system latency
• Architecture: ccNUMA crossbar switch replaces shared bus
Recent History of Parallel Computing

Late 1980s: Shared Memory Parallel
• Hardware: bus-based shared memory parallel (SMP)
• Parallel model: compiler-enabled loop level (SMP fine grain)
• Characteristics: low scalability (2p to 6p) but easy to program
• Limitations: expensive memory for vector architectures

Early 1990s: Distributed Memory Parallel
• Hardware: MPP and cluster distributed memory parallel (DMP)
• Parallel model: DMP coarse grain through explicit message passing
• Characteristics: high scalability (> 64p) but difficult to program
• Limitations: commercial CAE applications generally unavailable

Late 1990s: Distributed Shared Memory Parallel
• Hardware: physically DMP but logically SMP ccNUMA
• Parallel model: SMP fine grain, DMP and SMP coarse grain
• Characteristics: high scalability and easy to program
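As a concrete illustration of the fine-grain SMP model listed above, the following is a minimal sketch (not taken from the slides) written with OpenMP: the loop iterations are split across processors that share one address space, with no explicit message passing. The coarse-grain DMP model instead exchanges explicit messages, as sketched later for domain decomposition.

```c
/* Minimal sketch of compiler/directive-driven loop-level (SMP fine grain)
 * parallelism; array names and sizes are illustrative only. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];
    double scale = 2.0;

    /* Loop-level parallelism: the runtime splits the iterations across
     * threads, each working on a chunk of the shared arrays. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + scale * c[i];

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}
```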
Origin ccNUMA Architecture Basics
Features of ccNUMA Multi-purpose Architecture

[Diagram: detail of a two-node (with router) building block within a 32p topology. Each node holds two processors with caches, a local switch, main memory with directory, and I/O; the router connects the nodes into the global switch interconnect.]
Parallel Computing with ccNUMA
Features of ccNUMA Multi-purpose Architecture (Origin2000/256)
• Origin2000 ccNUMA available since 1996
• Non-blocking crossbar switch as interconnect fabric
• High levels of scalability over shared-bus SMP
• Physically DMP but logically SMP (synchronized cache memories)
• 2 to 512 MIPS R12000/400 MHz processors with 8 MB L2 cache
• High memory bandwidth (1.6 GB/s) and scalable I/O
• Distributed and shared memory (fine- and coarse-grain) parallel models
Recent Technology Achievements
Rapid CAE Advancement from 1996 to 1999

Computer Hardware Advances:
• Processors: ability to "hide" system latency
• Architecture: ccNUMA crossbar switch replaces shared bus

Application Software Advances:
• Implicit FEA: sparse solvers increase performance by 10-fold
• Explicit FEA: domain parallel increases performance by 10-fold
• CFD: scalability increases performance by 100-fold
• Meshing: automatic and robust "tetra" meshing
Characterization of CAE Applications

[Chart: degree of parallelism (low to high) vs. compute intensity in flops per word of memory traffic (0.1 to 1000, from memory-bandwidth-bound to cache-friendly). CFD codes (OVERFLOW, FLUENT, STAR-CD) sit at high parallelism; explicit FEA (LS-DYNA, PAM-CRASH, RADIOSS) and implicit FEA direct frequency response (MSC.Nastran SOL 108) are intermediate; implicit FEA statics (MARC, ANSYS, ADINA, MSC.Nastran SOL 101) and modal frequency response (ABAQUS, MSC.Nastran SOL 103 and 111) show the lowest parallelism.]
Characterization of CAE Applications

[Same parallelism vs. compute-intensity chart, with architecture regions overlaid: an MP SCALAR region covering the highly parallel, cache-friendly codes and a VECTOR region covering the low-parallelism, memory-bandwidth-bound implicit FEA codes.]
Topics of Discussion
• Historical Trends of CAE
• Current Status of Scalable CAE
• Future Directions in Applications
Scalability Emerging for all CAE
Scalable CAE: Domain Decomposition Parallel
• Implicit FEA: ABAQUS, ANSYS, MSC.Marc, MSC.Nastran
• Explicit FEA: LS-DYNA, PAM-CRASH, RADIOSS
• General CFD: CFX, FLUENT, STAR-CD

Domain parallel example (see the sketch below): compressible 2D flow over a wedge, partitioned as 4 domains for parallel execution on 4 processors.

[Diagram: the flow domain split into partitions 1-4, each assigned to one of CPU1-CPU4 on the system]
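A hypothetical sketch of the domain-parallel model described on this slide, written with MPI in C: the mesh is split into one partition per rank, and each iteration exchanges a layer of "halo" cells with the neighbouring partitions. The array name, sizes, and update are illustrative only, not taken from any of the CAE codes listed.

```c
/* Domain decomposition sketch: one 1D partition per MPI rank,
 * with halo exchange between neighbours each iteration. */
#include <mpi.h>
#include <stdlib.h>

#define NLOC 1024            /* interior cells owned by this rank */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[NLOC+1] are halo cells filled from the neighbours. */
    double *u = calloc(NLOC + 2, sizeof(double));
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int iter = 0; iter < 100; iter++) {
        /* Exchange boundary values with neighbouring partitions. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Local update on the interior cells (placeholder physics). */
        for (int i = 1; i <= NLOC; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }

    free(u);
    MPI_Finalize();
    return 0;
}
```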
Parallel Scalability in CAE

[Chart: usable vs. peak parallelism (number of CPUs, 1 to 512) by application class. CFD codes scale furthest, crash codes follow, and MSC.Nastran scalability depends on solution sequence (SOL 108 beyond SOL 101/103), with SMP parallelism in v70.5 and DMP parallelism added in v70.7.]
Considerations for Scalable CAE
Sources that Inhibit Efficient Parallelism (source → solution)
• Computational load imbalance → nearly equal-sized partitions
• Communication overhead between neighboring partitions → minimize communication between adjacent cells on different CPUs
• Data and process placement → enforce memory-process affinity
• Message passing performance → latency and bandwidth awareness: MPICH latency of ~31 µs limits scaling to 16p only, while SGI MPI 3.1 latency of ~12 µs allows scaling to 64p
Considerations for Scalable CAE
Processor-Memory Affinity (Data Placement)
• Theory: the system places data and execution threads together properly, and migrates the data to follow the executing process
• Real life: on a 32p Origin 2000, the process migrates while its data stays behind

[Diagram: 32p Origin 2000 node/router topology, contrasting a process that has migrated away from the node holding its data with a process placed together with its data]
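The portable half of the remedy, first-touch data placement, can be sketched as below. This assumes the common ccNUMA policy (as on IRIX) of allocating a memory page on the node whose processor first writes it; SGI also provided placement tools such as dplace to pin processes, which is not shown here. The arrays and loop are illustrative only.

```c
/* First-touch placement sketch: the arrays are initialised in parallel by
 * the same threads that later compute on them, so their pages are placed
 * in memory local to those threads instead of on one node. */
#include <stdlib.h>
#include <omp.h>

#define N (1 << 24)

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));

    /* Parallel initialisation: each thread first-touches its own chunk. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
    }

    /* Same static schedule in the compute loop: each thread reuses the
     * pages it placed above, avoiding remote-memory traffic. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] += 2.0 * b[i];

    free(a);
    free(b);
    return 0;
}
```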
FLUENT Scalability on ccNUMA
FLUENT Scalability Study of SSI vs. Cluster
Software: FLUENT 5.1.1
CFD model: external aerodynamics, 3D, k-epsilon, segregated incompressible, isothermal, 29M cells

Time per iteration in seconds (speedup):

CPUs      10          30          60          120          240
SSI       381 (1.0)   99 (3.9)    67 (5.7)    29 (13.1)    18 (21.2)
4 x 64    424 (1.0)   139 (3.0)   72 (5.9)    39 (10.9)    49 (8.7)

Largest FLUENT automotive case achieved near-ideal scaling on SGI 2800/256.
SSI Advantage for CFD with MPI

Single System Image (SSI) latency on a 256-CPU SSI:

CPUs                   8        16       32       64       128      256
Shared memory (ns)     528      641      710      796      903      1200
MPI (ns)               19,000   23,000   26,000   29,000   34,000   44,000

4 x 64 cluster configuration latency (HIPPI OS bypass): 139,000 ns
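For reference, one-way MPI latencies like those in the table above are typically measured with a simple ping-pong test between two ranks; the sketch below is a generic illustration of that technique, not the benchmark SGI used, and the timings will of course differ by system.

```c
/* Ping-pong latency sketch: ranks 0 and 1 bounce a 1-byte message and
 * the round-trip time is halved to estimate one-way latency.
 * Run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char byte = 0;
    const int reps = 10000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.1f us\n",
               (t1 - t0) / (2.0 * reps) * 1.0e6);

    MPI_Finalize();
    return 0;
}
```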
Grand Scale HPC: NASA and Boeing
Boeing Commercial Aircraft and NASA Ames Research Center
OVERFLOW complete Boeing 747 aerodynamics simulation
• Problem: 35M points, 160 zones
• Largest model in NASA history; achieved 60 GFLOP/s on SGI 2800/512 with linear scaling (Oct 99)

[Chart: OVERFLOW performance in GFLOP/s vs. number of CPUs (0 to 512), showing near-linear scaling to 60 GFLOP/s against the FY98 milestone and the C916/16 OVERFLOW limit]
Computational Requirements for MSC.Nastran

Compute task           Memory bandwidth    CPU cycles
Sparse direct solver   7%                  93%
Lanczos solver         60%                 40%
Iterative solver       83%                 17%
I/O activity           100%                0%
MSC.Nastran Scalability on ccNUMA

MSC/NASTRAN MPI-based scalability for SOL 103, 111:
• Typical scalability of 2x to 3x on 8p, less for SOL 111

MSC/NASTRAN MPI-based scalability for SOL 108:
• Independent frequency steps, naturally parallel
• File and memory space not shared
• Near-linear parallel scalability
• Improved accuracy over SOL 111 with increasing frequency
• Released on SGI with v70.7 (Oct 99)
MSC.Nastran Scalability on ccNUMA
Parallel schematics: parallel schemes for an excitation frequency of 200 Hz on a 4-CPU system

MSC/NASTRAN MPI-based scheme for SOL 111 (modes distributed over CPUs):
• 0-100 Hz: 150 modes on CPU 1
• 100-200 Hz: 350 modes on CPU 2
• 200-300 Hz: 300 modes on CPU 3
• 300-400 Hz: 200 modes on CPU 4

MSC/NASTRAN MPI-based scheme for SOL 108 (frequency steps distributed over CPUs):
• Steps 1-50 on CPU 1
• Steps 51-100 on CPU 2
• Steps 101-150 on CPU 3
• Steps 151-200 on CPU 4
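A hypothetical sketch of how the naturally parallel SOL 108 scheme above divides its independent frequency steps into contiguous blocks, one block per MPI rank (200 steps on 4 ranks gives steps 1-50, 51-100, 101-150, 151-200, as in the schematic). The solver call is a placeholder, not an MSC.Nastran interface.

```c
/* Block distribution of independent frequency steps across MPI ranks. */
#include <mpi.h>
#include <stdio.h>

static void solve_frequency_step(double freq_hz)
{
    /* Placeholder for one independent direct frequency-response solve. */
    (void)freq_hz;
}

int main(int argc, char **argv)
{
    int rank, size;
    const int nsteps = 200;          /* excitation steps: 1..200 Hz */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Contiguous block of steps owned by this rank. */
    int per_rank = (nsteps + size - 1) / size;
    int first = rank * per_rank + 1;
    int last  = (first + per_rank - 1 < nsteps) ? first + per_rank - 1
                                                : nsteps;

    if (first <= last) {
        for (int f = first; f <= last; f++)
            solve_frequency_step((double)f);
        printf("rank %d handled steps %d-%d Hz\n", rank, first, last);
    }

    MPI_Finalize();
    return 0;
}
```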
MSC.Nastran Scalability on ccNUMA
SOL 108 comparison with conventional NVH (SOL 111 on T90)

Cray T90 baseline results: SOL 111, 525K DOF, eigensolution of 2714 modes, 96 frequency steps, elapsed time 31,610 s

SOL 108 results:

CPUs    Elapsed time (s)    Parallel speedup
1       120,720             1.0
2       61,680              2.0
4       32,160              3.8
8       17,387              6.9
16      10,387              11.6 *

* measured on populated nodes
MSC.Nastran Scalability on ccNUMA
The Future of Automotive NVH Modeling
MSC.Nastran parallel scalability for direct frequency response (SOL 108)

Model description: BIW model, SOL 108, 536K DOF, 96 frequency steps
Run statistics (per MPI process): memory 340 MB, FFIO cache 128 MB, disk space 3.6 GB, 2 processes per node

CPUs    Elapsed time (h)    Parallel speedup
1       31.7                1.0
8       4.1                 7.8
16      2.2                 14.2
32      1.4                 22.6
Future Automotive NVH Modeling
Higher excitation frequencies of interest will increase DOF and modal density beyond the practical limits of SOL 103/111.

[Chart: elapsed time vs. frequency for modal frequency response (SOL 103/111) and direct frequency response (SOL 108), showing the crossover where SOL 108 becomes the faster approach as models move from 199X to 200X frequency ranges]
Topics of Discussion
• Historical Trends of CAE
• Current Status of Scalable CAE
• Future Directions in Applications
Economics of HPC Rapidly Changing
SGI Partnership with HPC Community on Technology Roadmap

[Roadmap diagram: functionality, capability features, and general availability migrating from UNICOS/Vector, to IRIX/MIPS SSI, to Linux/IA-64 clusters and SSI]
HPC Architecture Roadmap at SGI
SN-MIPS: Features of Next Generation ccNUMA
• Bandwidth improvement of 2x over Origin2000
• Latency decrease of 50% over Origin2000
• System support for IRIX/MIPS or Linux/IA-64
• Modular design allows subsystem upgrades without a forklift replacement

Next-generation IRIX features and improvements:
• Shared memory to 512 processors and beyond
• RAS enhancements: resiliency and hot swap
• Data center management: scheduling, accounting
• HPC clustering: GSN, CXFS shared file system
Characterization of CAE Applications

[Parallelism vs. compute-intensity chart repeated, with an "SN-MIPS Benefit" region marked over the highly parallel, cache-friendly codes (CFD and explicit FEA).]
Characterization of CAE Applications

[Chart repeated with both the "SN-MIPS Benefit" region over the highly parallel, cache-friendly codes and an "SN-IA Benefit" region near the explicit FEA codes.]
Architecture Mix for Automotive HPC
• 1997: 1.1 TFlops installed in automotive OEMs worldwide
• 1999: 2.9 TFlops installed in automotive OEMs worldwide
Current as of Sep 1999
Automotive Industry HPC Investments
GM and DaimlerChrysler each grew capacity more than 2x over the past year.
Future Directions in CAE Applications
Meta-Computing with Explicit FEA: non-deterministic methods for improved FEA simulation

• Los Alamos and DOE Applied Engineering Analysis: "Stochastic simulation of 18 CPU-years completed in 3 days on ASCI Blue Mountain." USDOE-supported research achieved the first-ever full-scale ABAQUS/Explicit simulation of nuclear weapon impact response on the Origin/6144 ASCI system (Feb 00).
• Ford Motor SRL and NASA Langley: optimization of a vehicle body for NVH and crash, completing 9 CPU-months of RADIOSS and MSC.Nastran work overnight with a response surface technique (Apr 00).
• BMW Body Engineering: 672 MIPS CPUs dedicated to stochastic crash simulation with PAM-CRASH (Jan 00).
Meta-Computing with Explicit FEA
Objective:
• Manage design uncertainty from variability
• Scatter in materials, loading, test conditions
Approach:
• Non-deterministic simulation of a vehicle "population"
• Meta-computing on SSI or a large cluster
Insight:
• Improved design space exploration
• Moving the design towards target parameters

[Chart: distribution of simulated vehicle responses, distinguishing the most likely performance from unlikely performance in the tails]
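A minimal sketch of the non-deterministic approach outlined above: scatter in a few inputs is sampled, and each sample becomes an independent analysis that can be farmed out across an SSI system or cluster. All function names, parameters, and distributions below are illustrative assumptions, not taken from the BMW, Ford, or DOE work.

```c
/* Stochastic sampling sketch: each sample perturbs nominal inputs and
 * would, in practice, launch one independent crash/FEA job. */
#include <stdio.h>
#include <stdlib.h>

/* Uniform scatter of +/- tol (fraction) around a nominal value. */
static double perturb(double nominal, double tol)
{
    double u = (double)rand() / RAND_MAX;          /* 0..1 */
    return nominal * (1.0 + tol * (2.0 * u - 1.0));
}

static double run_crash_job(double yield_stress, double sheet_thickness)
{
    /* Placeholder: a real job would run one explicit FEA analysis and
     * return a response such as peak intrusion or deceleration. */
    return yield_stress * 1e-3 + sheet_thickness;
}

int main(void)
{
    const int nsamples = 128;       /* one independent job per sample */
    srand(12345);

    for (int s = 0; s < nsamples; s++) {
        double sigma_y = perturb(250.0e6, 0.05);   /* +/-5% material scatter */
        double thick   = perturb(1.5e-3, 0.03);    /* +/-3% gauge scatter    */
        double resp    = run_crash_job(sigma_y, thick);
        printf("sample %3d: response %.4f\n", s, resp);
    }
    return 0;
}
```

The collected responses then form the "population" distribution sketched in the chart above, from which the most likely and unlikely performance can be read off.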
Grand Scale HPC: NASA and Ford
Ford Motor Scientific Research Labs and NASA Langley Research Center
NVH and crash optimization of a vehicle body overnight
• Ford body-in-prime (BIP) model of 390K DOF
• MSC.Nastran for NVH, 30 design variables
• RADIOSS for crash, 20 design variables
• 10 design variables in common
• Sensitivity-based Taylor approximation for NVH
• Polynomial response surface for crash
Achieved overnight BIP optimization on SGI 2800/256, with an equivalent yield of 9 months of CPU time.
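For reference, a polynomial response surface of the kind named above is commonly a quadratic model fitted by least squares over the sampled crash runs, so the optimizer evaluates the cheap surrogate instead of RADIOSS; the exact form Ford and NASA used is not given in the source. A generic version:

```latex
% Generic quadratic response surface in n design variables x:
\[
  \hat{f}(x) \;=\; \beta_0
  \;+\; \sum_{i=1}^{n}\beta_i\, x_i
  \;+\; \sum_{i=1}^{n}\sum_{j \ge i}^{n}\beta_{ij}\, x_i x_j ,
\]
% with the coefficients beta chosen by least squares over the m sampled runs:
\[
  \min_{\beta}\; \sum_{k=1}^{m}\bigl(f(x^{(k)}) - \hat{f}(x^{(k)})\bigr)^2 .
\]
```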
Historical Growth of CAE Application

[Chart: growth index from 1993 to 1999 across categories: capacity in GFlops, cost per CPU-hour, NVH model size, CFD model size, crash model size, number of engineers, crash turnaround time (SMP), and crash/CFD turnaround time (MPP), with growth factors ranging from x5 to x90+. 1999 reference points include 564 Gflops for the #1 installation, 450,000 crash elements, 2 million NVH DOF, and >10 million CFD cells.]
Source: survey of major automotive developers
Future Directions of Scalable CAE
• CAE to evolve into fully scalable, RISC-based technology; high-resolution models are standard in CFD today, with crash and FEA emerging
• Deterministic CAE giving way to probabilistic techniques; their deployment increases computational requirements 10-fold
• Visual interaction with models beyond 3M cells/DOF; high-resolution modeling will strain visualization technology
• Multi-discipline optimization (MDO) implementation in earnest: coupling of structures, fluids, acoustics, electromagnetics
Conclusions
• For small and medium-size problems, a cluster can be a viable solution in the range of 8-16 CPUs
• For large and extremely large problems, the SSI architecture provides better parallel performance due to the superior characteristics of its in-box interconnect
• To increase single-CPU performance, developers should consider how the data structures and algorithms they exploit map onto the specific memory hierarchy
• A ccNUMA system allows the coupling of various parallel programming paradigms, which can benefit the performance of multiphysics applications