The Application of POSIX Threads and OpenMP to the U.S. NRC Neutron Kinetics Code PARCS
D.J. Lee and T.J. Downar
School of Nuclear Engineering, Purdue University
July 2001
Contents
• Introduction
• Parallelism in PARCS
• Parallel Performance of PARCS
• Cache Analysis
• Conclusions
PARCS
• "Purdue Advanced Reactor Core Simulator"
• U.S. NRC (Nuclear Regulatory Commission) Code for Nuclear Reactor Safety Analysis
• Developed at the School of Nuclear Engineering of Purdue University
• A Multi-Dimensional, Multi-Group Reactor Kinetics Code Based on the Nonlinear Nodal Method
[Figure: Nuclear Power Plant and Nuclear Reactor Core]
Equations Solved in PARCS
• Time-Dependent Boltzmann Transport Equation
• T/H Field Equations
  • Heat Conduction Equation
  • Heat Convection Equation
Spatial Coupling
• Thermal-Hydraulics:
  • Computes new coolant/fuel properties
  • Sends moderator temp., vapor and liquid densities, void fraction, boron conc., and average, centerline, and surface fuel temp.
  • Uses neutronic power as heat source for conduction
• Neutronics:
  • Uses coolant and fuel properties for local node conditions
  • Updates macroscopic cross sections based on local node conditions
  • Computes 3-D flux
  • Sends node-wise power distribution
The Necessity of HPC for PARCS
• Acceleration Techniques in PARCS
  • Nonlinear CMFD Method: Global (Low Order) + Local (High Order)
  • BILU3D-Preconditioned BiCGSTAB
  • Wielandt Shift Method
• Still, the Computational Burden of PARCS is Very Large
  • Typically, the Calculation Speed is More Than an Order of Magnitude Slower Than Real Time
• Examples
  • NEACRP Benchmark: Several Tens of Seconds for a 0.5 sec Simulation
  • PARCS/TRAC Coupled Run: 4 Hours for a 100 sec Simulation
PARCS Computational Modules
• CMFD: Solves the "Global" Coarse Mesh Finite Difference Equation
• NODAL: Solves the "Local" Higher-Order Differenced Equations
• XSEC: Provides Temperature/Fluid Feedback through Cross Sections (Coefficients of the Boltzmann Equation)
• T/H: Solves the Temperature/Fluid Field Equations
Parallelism in PARCS
• NODAL and XSEC Modules:
  • Node-by-Node Calculation
  • Naturally Parallelizable (see the sketch below)
• T/H Module:
  • Channel-by-Channel Calculation
  • Naturally Parallelizable
• CMFD Module:
  • Domain Decomposition Preconditioning
  • Example: Split the Reactor into Two Halves
  • The Number of Iterations Depends on the Number of Domains
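To make "naturally parallelizable" concrete, here is a minimal sketch in C with OpenMP of a node-by-node sweep split across threads. The names, the node count, and the per-node update are illustrative placeholders, not the PARCS source.

```c
#include <omp.h>

enum { NNODES = 4096 };                /* illustrative node count */

/* Stand-in for the real per-node work (nodal update, cross-section
 * evaluation, ...); each node depends only on its own data.        */
static double update_node(double flux, double xsec)
{
    return flux * (1.0 - xsec);
}

void node_sweep(double flux[], const double xsec[])
{
    /* OpenMP splits the loop across threads; iterations touch
     * disjoint elements, so no synchronization is needed.      */
    #pragma omp parallel for
    for (int i = 0; i < NNODES; i++)
        flux[i] = update_node(flux[i], xsec[i]);
}
```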
Why Multi-Threaded Programming?
• Coupling of Domains
  • The Information for One Plane at the Interface of Two Domains Must Be Exchanged Between Them
  • The Amount of Information to be Exchanged is NOT SMALL Compared with the Amount of Calculation for Each Domain
• Message Passing
  • Large Communication Overhead
• Multi-Threading
  • Shared Address Space
  • Negligible Communication Overhead
Multi-Threaded Programming
• OpenMP
  • FORTRAN, C, C++
  • Simple Implementation Based on Directives (see the sketch below)
• POSIX Threads
  • No Interface to FORTRAN
  • Developed a FORTRAN-to-C Wrapper
  • Much Caution Required to Avoid Race Conditions
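The difference in effort shows up even in a global sum, one of the operations the wrapper on the next slide provides (nuc_gsum). A hedged sketch in C: with OpenMP the reduction is a single directive, whereas with POSIX threads the fork, the per-thread partial sums, and the race-free combination must all be coded by hand.

```c
#include <omp.h>

/* Global array sum with OpenMP: one directive replaces the explicit
 * thread creation, partial sums, and locking that a POSIX-threads
 * version would need.                                              */
double global_sum(const double *a, int n)
{
    double sum = 0.0;
    /* Each thread accumulates a private copy of sum; the copies are
     * combined at the implicit barrier, so there is no race.       */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```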
POSIX Threads with FORTRAN: nuc_threads
• Mixed-Language Interface Accessible to Both the Fortran and C Sections of the Code
• Minimal Set of Threads Functions (a sketch of a possible nuc_bar implementation follows below):
  • nuc_init(*ncpu): initializes mutex and condition variables
  • nuc_frk(*func_name, *nuc_arg, *arg): creates the POSIX threads
  • nuc_bar(*iam): used for synchronization
  • nuc_gsum(*iam, *A, *globsum): used to get a global sum of an array updated by each thread
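The slide names the wrapper's entry points but not their internals, so the following is only a plausible C sketch of nuc_bar: a counting barrier built from the mutex and condition variable that nuc_init would initialize. The pointer argument and trailing underscore follow a common Fortran-to-C calling convention; the internals shown here are assumptions, not the PARCS source.

```c
#include <pthread.h>

static pthread_mutex_t bar_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  bar_cond = PTHREAD_COND_INITIALIZER;
static int nthreads = 1;  /* would be set from *ncpu in nuc_init     */
static int waiting  = 0;  /* threads currently parked at the barrier */
static int episode  = 0;  /* distinguishes successive barrier uses   */

void nuc_bar_(int *iam)   /* callable from Fortran as nuc_bar(iam) */
{
    (void)iam;            /* thread id not needed in this sketch */
    pthread_mutex_lock(&bar_lock);
    int my_episode = episode;
    if (++waiting == nthreads) {
        /* last thread to arrive releases the others */
        waiting = 0;
        episode++;
        pthread_cond_broadcast(&bar_cond);
    } else {
        /* re-check the episode counter to guard against spurious wakeups */
        while (my_episode == episode)
            pthread_cond_wait(&bar_cond, &bar_lock);
    }
    pthread_mutex_unlock(&bar_lock);
}
```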
Implementation of OpenMP and Pthreads
[Diagram: fork/join structure of the two models. With OpenMP, threads are forked at the start of each parallel section and joined at its end, the threads sitting idle between sections. With Pthreads, the threads are forked once at the beginning, meet at synchronization points between sections, and are joined only at the end.]
Applications
• Matrix-Vector Multiplication (see the sketch below)
  • Subroutine "MatVec" of PARCS
  • Size of Matrix Is the Same As in the NEACRP Benchmark
• NEACRP Reactor Transient Benchmark
  • Control Rod Ejection From Hot Zero Power Conditions
  • Full 3-Dimensional Transient
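For reference, a row-partitioned matrix-vector product in C with OpenMP, in the spirit of the benchmarked kernel. This is not the PARCS MatVec itself: the real routine operates on the sparse CMFD matrix, and a dense layout is used here only to keep the sketch short.

```c
#include <stddef.h>
#include <omp.h>

/* y = A * x for a dense n-by-n matrix stored row-major.  Rows are
 * independent, so each thread computes its own block of y; threads
 * only read A and x, so no synchronization is required.            */
void matvec(int n, const double *A, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A[(size_t)i * n + j] * x[j];
        y[i] = s;
    }
}
```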
Specification of Machines

SUN ULTRA-80:
• Number of CPUs: 2
• CPU Type: ULTRA SPARC II, 450 MHz
• L1 Cache: 16 KB D-cache, 16 KB I-cache (cache line size: 32 bytes)
• L2 Cache: 4 MB
• Main Memory: 1 GB
• Compiler: SUN Workshop 6, FORTRAN 90 6.1

SGI ORIGIN 2000:
• Number of CPUs: 32
• CPU Type: MIPS R10000, 250 MHz, 4-way superscalar
• L1 Cache: 32 KB D-cache, 32 KB I-cache (cache line size: 32 bytes)
• L2 Cache: 4 MB per CPU (cache line size: 128 bytes)
• Main Memory: 16 GB
• Compiler: MIPSpro Compiler 7.2.1, FORTRAN 90
Specification of Machines (cont.)

LINUX Machine:
• Number of CPUs: 4
• CPU Type: Intel Pentium-III, 550 MHz (Slot 2 technology, 100 MHz bus, non-blocking cache; see ftp://download.intel.com/design/PentiumIII/xeon/datashts/24509402.pdf)
• L1 Cache: 16 KB D-cache, 16 KB I-cache (cache line size: ? bytes)
• L2 Cache: 512 KB
• Main Memory: 1 GB
• Compiler: NAGWare FORTRAN 90 Version 4.2
Matrix-Vector Multiplication (MatVec Subroutine of PARCS)

Entries are run times in seconds, with speedup relative to serial in parentheses.

SUN (Serial: 3.76 s; 4- and 8-thread cases not run)
• OpenMP: 1 thread 23.43 (0.16); 2 threads 13.26 (0.28)
• Pthreads: 1 thread 3.71 (1.02); 2 threads 1.93 (1.95)

SGI (Serial: 1.73 s)
• OpenMP: 1 thread 1.73 (1.00); 2 threads 0.92 (1.89); 4 threads 0.52 (3.30)*; 8 threads 0.37 (4.72)
• Pthreads: 1 thread 1.72 (1.01); 2 threads 1.80 (0.96); 4 threads 1.91 (0.91); 8 threads 1.96 (0.88)

*) Core is divided into 18 planes
[Chart: Matrix-Vector Multiplication (MatVec Subroutine of PARCS). Serial run time: SUN 3.76 s; SGI 1.73 s.]
[Chart: NEACRP Benchmark (Simulation with Multiple Threads), results vs. number of threads.]
Parallel Performance (SUN)

Time (sec), with 2-thread speedup relative to serial:
• CMFD: Serial 36.7; Pthreads 1 thread 32.1; 2 threads 20.8 (Speedup 1.77)
• Nodal: Serial 11.5; 1 thread 11.3; 2 threads 6.4 (Speedup 1.78)
• T/H: Serial 29.6; 1 thread 27.9; 2 threads 14.5 (Speedup 2.04)
• Xsec: Serial 7.6; 1 thread 7.1; 2 threads 3.7 (Speedup 2.04)
• Total: Serial 85.4; 1 thread 78.5; 2 threads 45.5 (Speedup 1.88)

# of Updates:
• CMFD: Serial 445; 1 thread 445; 2 threads 456
• Nodal: Serial 31; 1 thread 31; 2 threads 33
• T/H: Serial 216; 1 thread 216; 2 threads 216
• Xsec: Serial 225; 1 thread 225; 2 threads 226
Parallel Performance (SGI)

Time (sec), with speedup relative to serial in parentheses:
• CMFD: Serial 19.8; OpenMP 1 thread 19.3; 2 threads 12.1 (1.63); 4 threads 8.93 (2.21); 8 threads 8.85 (2.23)
• Nodal: Serial 9.0; 1 thread 9.2; 2 threads 5.8 (1.55); 4 threads 3.56 (2.53); 8 threads 2.87 (3.14)
• T/H: Serial 26.6; 1 thread 25.3; 2 threads 12.3 (2.17); 4 threads 8.92 (2.99); 8 threads 7.14 (3.73)
• Xsec: Serial 4.8; 1 thread 4.4; 2 threads 2.4 (2.01); 4 threads 1.37 (3.53); 8 threads 1.11 (4.35)
• Total: Serial 60.2; 1 thread 58.1; 2 threads 32.6 (1.85); 4 threads 22.8 (2.64*); 8 threads 20.0 (3.02*)

# of Updates:
• CMFD: Serial 445; 1 thread 445; 2 threads 456; 4 threads 497; 8 threads 565
• Nodal: Serial 31; 1 thread 31; 2 threads 33; 4 threads 38; 8 threads 39
• T/H: Serial 216; 1 thread 216; 2 threads 216; 4 threads 216; 8 threads 217
• Xsec: Serial 225; 1 thread 225; 2 threads 226; 4 threads 228; 8 threads 227

*) Core is divided into 18 planes
Memory Access Time

Typical Memory Access Cycles (SGI):
• L1 cache hit: 2 cycles
• L1 cache miss satisfied by L2 cache hit: 8 cycles
• L2 cache miss satisfied from memory: 75 cycles

[Diagram: memory hierarchy, CPU → L1 Cache → L2 Cache → Memory]
Cache Miss Measurements (SGI)

Entries are cache miss counts; columns are Serial and OpenMP with 1, 2, 4, and 8 threads.
• CMFD (BICG), L1: 477,691; 479,474; 258,027; 156,461; 105,733
• CMFD (BICG), L2: 28,242; 29,650; 17,007; 11,751; 9,309
• Nodal, L1: 857,744; 853,866; 444,849; 249,507; 160,699
• Nodal, L2: 54,163; 55,534; 33,846; 19,016; 12,848
• T/H (TRTH), L1: 165,133; 60,587; 39,419; 25,850; 19,816
• T/H (TRTH), L2: 9,551; 9,512; 9,673; 6,451; 4,620
• XSEC, L1: 62,324; 57,462; 29,845; 17,715; 11,344
• XSEC, L2: 9,456; 9,518; 5,517; 3,737; 2,578
Cache Miss Ratio (SGI)

Cache Miss Ratio = (# of cache misses in the serial run) / (# of cache misses with N threads)

Columns are OpenMP with 1, 2, 4, and 8 threads (the serial ratio is 1.00 by definition).
• CMFD (BICG), L1: 1.00; 1.85; 3.05; 4.52
• CMFD (BICG), L2: 0.95; 1.66; 2.40; 3.03
• Nodal, L1: 1.00; 1.93; 3.44; 5.34
• Nodal, L2: 0.98; 1.60; 2.85; 4.22
• T/H (TRTH), L1: 2.73; 4.19; 6.39; 8.33
• T/H (TRTH), L2: 1.00; 0.99; 1.48; 2.07
• XSEC, L1: 1.08; 2.09; 3.52; 5.49
• XSEC, L2: 0.99; 1.71; 2.53; 3.67
Speedup Estimation Using Cache Misses

• Speedup:

  Speedup = T_1 / T_2

  where
  T_1 = total data access time for serial execution
  T_2 = total data access time for 2-thread execution

• Data Access Time:

  T = T_L2 + T_mem = N_L1 · t_L2 + N_L2 · t_mem

  where
  T_L2 = total L2 cache access time
  T_mem = total memory access time
  N_L1 = number of L1 data cache misses satisfied by L2 cache hit
  N_L2 = number of L2 data cache misses satisfied from main memory
  t_L2 = L2 cache access time for 1 word
  t_mem = main memory access time for 1 word
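As a worked check (assuming the tabulated cache misses are the per-thread counts the model uses), plugging the SGI access cycles (t_L2 = 8, t_mem = 75) and the CMFD serial and 2-thread measurements into the model gives

T_1 = 477,691 × 8 + 28,242 × 75 ≈ 5.94 × 10^6 cycles
T_2 = 258,027 × 8 + 17,007 × 75 ≈ 3.34 × 10^6 cycles
Speedup ≈ 5.94 / 3.34 ≈ 1.78

which reproduces the predicted CMFD value on the next slide.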
Estimated 2-Thread Speedup Based on Data Cache Misses for OpenMP on SGI
• CMFD (BICG): Measured 1.63, Predicted 1.78
• Nodal: Measured 1.55, Predicted 1.80
• T/H (TRTH): Measured 2.17, Predicted 2.04
• XSEC: Measured 2.01, Predicted 1.86
Conclusions
• Comparison of OpenMP and POSIX Threads
  • OpenMP is Comparable to POSIX Threads in Parallel Performance
  • OpenMP is Much Easier to Implement than POSIX Threads Because of its Directive-Based Nature
• Cache Analysis
  • The Speedup Predicted from Data Cache Misses Agrees Well with the Measured Speedup
Continuing Work
• Algorithmic
  • 3-D Domain Decomposition
• Software
  • SUN Compiler
  • Pthreads Scheduling on SGI
  • Alternate Platforms