


Presentation Transcript


  1. Outline • Code Overview • Parallel Model (PVM) for TRAC-M • Parallel Model (OpenMP) for PARCS • Conclusions

  2. Nuclear Power Plant

  3. TRAC-M Overview • “TRAC”: The U.S. Nuclear Regulatory Commission Transient Reactor Analysis Code • Advanced best-estimate predictions of postulated accidents in light-water reactors • Solves the two-fluid mass, energy, and momentum equations • A Consolidated Thermal-hydraulic Code for Nuclear Reactor Safety Analysis, including the capabilities of older U.S. NRC codes: • RELAP and TRAC-P for PWR • RAMONA and TRAC-B for BWR

  4. Code Features • Objectives: • Readability, Maintainability, Extensibility, and Portability • Fortran 90 code, over 150,000 lines • Modeling capability: • 3-D pressure vessel • Two-fluid nonequilibrium hydrodynamics model with a noncondensable gas field and solute tracking • Flow-regime-dependent constitutive-equation treatment • Consistent treatment of entire accident sequences • The stability-enhancing two-step numerical algorithm is adopted • Parallel option to extend code functionality and to improve execution speed: • 3-D neutron kinetics capabilities provided by coupling to an advanced 3-D kinetics code (PARCS)

  5. TRAC-M Multitask Model • Using multitask model to extend code functionality and improve execution speed:

  6. Exterior Communication Interface • Interprocess communication is implemented by the ECI (Exterior Communications Interface) • Table-driven data transfer • Step 2 needs an interprocess message-passing library – currently PVM

  7. Why Choose PVM? • For modularity we treat each module as a task and use a central task to control all the modules • The virtual machine concept makes heterogeneous distributed computing possible • Compared with MPI: • PVM provides more flexibility in process control • PVM provides dynamic resource control • PVM provides full fault-tolerance capability

  8. Parallel Applications of TRAC-M • Sample Problems: • 2-Pipe Test Problem • AP600 LBLOCA • Pressurized Thermal Shock (PTS) Problem • Performance Comparison on Various Platforms: • Xeon PII (dual CPU: 400MHz; Windows 2000); • Xeon PIII (4-CPU: 550MHz; Linux, RedHat); • Xeon PIII (dual CPU: 800MHz; Linux, RedHat); • DEC Alpha 8400 (Digital Unix);

  9. 1) Pipe Test Problem • 2-pipe model: [diagram: pipe 1 → Task A; pipe 2 → Task B]

  10. Load balance study • Total 600 cells; 1000 time steps running on a dual-CPU PC (800 MHz; Linux); serial runtime: 63.370 sec

  Case (A:B cells)   Task   Tcalt    Tcomm_t   Tidle    Tot_t     Speedup   Efficiency
  C (100:500)        1      12.040   4.070     46.37    16.110    1.014     50.7%
                     2      58.710   3.770     -        62.480
  B (200:400)        1      23.640   3.770     22.38    27.410    1.273     63.6%
                     2      46.100   3.690     -        49.790
  A (300:300)        1      34.450   4.250     -        38.700    1.650     82.5%
                     2      34.110   3.990     -        38.100

  11. Parallel Runtime Analytic Model • For Task i: Tpi = Tcalti + Tcomm_ti + Tidlei
  (1) Calculation time: Tcalti = fmytask * Ts, where fmytask is the fraction of the total work load and Ts is the serial running time
  (2) Communication time: Tcomm_ti = ts + tw * m for each message passed, where ts is the latency, tw is the per-word transfer time (inverse bandwidth), and m is the message size
  (3) Idle time: Tidlei = [max(fitask) - fmytask] * Ts

  12. Analysis of the Execution time • Serial runtime: 63.370 sec
  • Calculation time: Tcalt
  cells in Task-A   100      200      300      400      500
  fraction fA       1/6      2/6      3/6      4/6      5/6
  calt_time         10.562   21.123   31.685   42.247   52.808
  • Communication time: Tcomm_t
  • max(m) = 592 bytes, Teach = ts + tw*m ≈ 80 μs (latency dominated)
  • 16 synchronization points, two of which are used only once, for initialization
  • Messages pass twice at each synchronization point: once for data transfer, once for status check
  • Tcomm_t ≈ 2 * 2 * 14 * 1000 * 80 μs = 4.48E+6 μs = 4.48 sec

  13. Analysis of the Execution time (Cont.) • Serial runtime: 63.370 sec
  • Idle time: Tidle
  cells in Task-A      100      200      300      400      500
  fA                   1/6      2/6      3/6      4/6      5/6
  max(fitask) in Tasks 5/6      4/6      3/6      4/6      5/6
  Idle_time            42.247   21.123   0        0        0
  • Analytical result:
  cells in Task-A pipe   Tcalt    Tcomm_t   Tidle    Tot_t    Eff (Ts/(Tp*p))   Eff (test)
  100                    10.562   4.48      42.247   57.289   55.3%             50.7%
  200                    21.123   4.48      21.123   46.726   67.8%             63.6%
  300                    31.685   4.48      0        36.165   87.6%             82.5%

  14. 2) Practical Reactor Problem: AP600 • AP600: an advanced light water reactor designed by Westinghouse • LBLOCA: a Large Break Loss of Coolant Accident analysis - a significant computational burden in the overall assessment of reactor plant safety

  15. Domain Decompositions • Three domain decompositions: • Model A – two-task model: (1) reactor vessel (2) 1D loops • Model B – three-task model: (1) core vessel (2) downcomer vessel (3) 1D loops • Model C – two-task model: (1) core vessel (2) downcomer vessel + 1D loops

  16. Load Balance vs. Performance • Load Distributions and Performance Comparison for Three Domain Decompositions 200 time steps running on DEC Alpha 8400 serial runtime: 145.498 sec

  17. 3) Performance on Various Platforms • Performance Comparison on Various Platforms: • Xeon PII (dual CPU: 400MHz; Windows 2000); • Xeon PIII (4-CPU: 550MHz; Linux, RedHat); • Xeon PIII (dual CPU: 800MHz; Linux, RedHat); • DEC Alpha 8400 (Digital Unix). • Test Problems: • Pressurized Thermal Shock (PTS) with a 3D reactor vessel • Only considers downcomer (3D vessel) and 4 cold legs

  18. Message Passing Library: PVM • PVM Protocols Implemented in TRAC-M
  • psend & precv for bulk data transfer;
  • pack/send & recv/unpack for status checking.
  • Latency of PVM Protocols (μs):
  Protocols                 DEC Alpha   PC (400MHz, Win2K)   PC (550MHz, Linux)   PC (800MHz, Linux)
  psend & precv             32.157      343.750              51.000               74.875
  pack/send & recv/unpack   33.672      304.688              26.667               69.000
  • Bandwidth:

  19. Performance Summary • Performance vs. platform, which is determined by
  • Load balance;
  • Communication cost.
  Platform                    Serial    Task   Tcalt    Tcomm_t   Tidle_t   Tot_t (CPU)   Speedup   Efficiency
  Xeon PIII Dual-CPU Linux    80.300    1      46.830   0.270     -         47.100        1.705     85.2%
                                        2      38.940   0.330     7.83      39.270
  Xeon PIII Four-CPU Linux    116.79    1      67.350   0.440     -         67.790        1.723     86.1%
                                        2      55.850   0.530     11.41     56.380
  DEC Alpha UNIX              81.732    1      18.727   1.720     47.886    20.447        1.206     60.3%
                                        2      66.035   1.735     -         67.770
  Xeon PII Dual-CPU WIN2000   76.594    1      30.047   17.719    12.968    47.766        1.261     63.1%
                                        2      48.656   12.078    20.547    60.734

  20. Further Developments: OpenMP Parallel Model for Neutron Kinetics

  21. TRAC-M Functionality Extension • Another Objective of Parallel Option - Coupling with detailed modeling codes

  22. Spatial Coupling • Thermal-Hydraulics:
  • Computes new coolant/fuel properties
  • Sends moderator temp., vapor and liquid densities, void fraction, boron conc., and average, centerline, and surface fuel temp.
  • Uses neutronic power as heat source for conduction
  • Neutronics:
  • Uses coolant and fuel properties for local node conditions
  • Updates macroscopic cross sections based on local node conditions
  • Computes 3-D flux
  • Sends node-wise power distribution

  23. PARCS • “Purdue Advanced Reactor Core Simulator” • U.S. NRC Code for Nuclear Reactor Kinetics Analysis • A Multi-Dimensional Multi-Group Reactor Kinetics Code Based on Nonlinear Nodal Method • Equation Solved: • Time-Dependent Boltzmann Transport Equation

  24. PARCS Computational Modules • CMFD: Solves the “Global” Coarse Mesh Finite Difference Equation • NODAL: Solves “Local” Higher Order Differenced Equations • XSEC: Provides Temperature/Fluid Feedback through Cross Sections (Coefficients of Boltzmann Equation) • T/H: Solution of Temperature/Fluid Field Equations

  25. Parallelism in PARCS • NODAL and XSEC Modules: • Node-by-Node Calculation • Naturally Parallelizable • T/H Module: • Channel-by-Channel Calculation • Naturally Parallelizable • CMFD Module: • Domain Decomposition Preconditioning • Example: Split the Reactor into Two Halves • The Number of Iterations Depends on the Number of Domains

  26. Parallel Application of PARCS • Why Multi-Threaded Programming • Message Passing • Large Communication Overhead  • Multi-Threading • Shared Address Space • Negligible Communication Overhead  • Implementation • OpenMP • FORTRAN, C, C++ • Simple Implementation based on Directives • NEACRP Reactor Transient Benchmark • Control Rod Ejection From Hot Zero Power Condition • Full 3-Dimensional Transient • Platform • SGI ORIGIN 2000

  27. Parallel Performance (SGI) • Time (sec)
  Module   Serial   OpenMP 1*1)   2      Speedup   4      Speedup   8      Speedup
  CMFD     19.8     19.3          12.1   1.63      8.93   2.21      8.85   2.23
  Nodal    9.0      9.2           5.8    1.55      3.56   2.53      2.87   3.14
  T/H      26.6     25.3          12.3   2.17      8.92   2.99      7.14   3.73
  Xsec     4.8      4.4           2.4    2.01      1.37   3.53      1.11   4.35
  Total    60.2     58.1          32.6   1.85      22.8   2.64*2)   20.0   3.02*2)
  *1) Number of Threads  *2) Core is divided into 18 planes

  28. Cache Analysis • Typical Memory Access Cycles (SGI)
  Memory Access Type                          Cycles
  L1 cache hit                                2
  L1 cache miss satisfied by L2 cache hit     8
  L2 cache miss satisfied from memory         75
  [diagram: CPU → L1 Cache → L2 Cache → Memory]

  29. Cache Miss Ratio (SGI) • Cache Miss Ratio = (cache misses with N threads) / (cache misses in serial execution)
  Module        Cache   Serial   OpenMP 1*1)   2
  CMFD (BICG)   L1      1.00     1.00          1.85
                L2      1.00     0.95          1.66
  Nodal         L1      1.00     1.00          1.93
                L2      1.00     0.98          1.60
  T/H (TRTH)    L1      1.00     2.73          4.19
                L2      1.00     1.00          0.99
  XSEC          L1      1.00     1.08          2.09
                L2      1.00     0.99          1.71
  *1) Number of Threads

  30. Speedup Estimation Using Cache Misses
  • Speedup (2 threads) = Td1 / Td2
  where Td1 = total data access time for serial execution and Td2 = total data access time for 2-thread execution.
  • Data Access Time: Td = TL2 + TMem = NL1miss * tL2 + NL2miss * tMem
  where TL2 = total L2 cache access time, TMem = total memory access time, NL1miss = number of L1 data cache misses satisfied by L2 cache hit, NL2miss = number of L2 data cache misses satisfied from main memory, tL2 = L2 cache access time for 1 word, tMem = main memory access time for 1 word.

  31. Estimated 2-thread Speedup Based on Data Cache Misses for OpenMP on SGI
  Module        Speedup (Measured)   Speedup (Predicted)
  CMFD (BICG)   1.63                 1.78
  Nodal         1.55                 1.80
  T/H (TRTH)    2.17                 2.04
  XSEC          2.01                 1.86

  32. Conclusions • ECI provides a flexible and efficient parallel capability for TRAC-M • Parallel performance is platform dependent; it is most stable and efficient on Linux and Unix platforms • The prediction of speedup based on data cache misses agrees well with the measured speedup • The key to achieving good parallel performance: balance the load!

  33. THANK YOU !
