Efficient Dynamic Derived Field Generation on Many-Core Architectures Using Python Cyrus Harrison, Lawrence Livermore National Laboratory Paul Navrátil, Texas Advanced Computing Center, Univ of Texas at Austin Maysam Moussalem, Department of Computer Science, Univ of Texas at Austin Ming Jiang, Lawrence Livermore National Laboratory Hank Childs, Lawrence Berkeley National Laboratory PyHPC 2012 Workshop Friday Nov 16, 2012
Outline • Motivation • System Architecture • Framework Components • Execution Strategies • Evaluation Methodology • Evaluation Results
Motivation Our Story… is a Python-fueled HPC research success story. • Our goal: • Begin to address uncertainty around future HPC hardware architectures and programming models. • This work: • Explores moving a key visualization and analysis capability to many-core architectures. • Why Python? • Productivity + powerful tools (PLY, NumPy, PyOpenCL) The Python ecosystem provided a productive and flexible foundation for this research.
Motivation What is “Derived Field Generation”? • Creating new fields from existing fields in simulation data. • A critical component of scientific visualization and analysis tool suites. • Example Expressions:
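To make "creating new fields from existing fields" concrete, here is a minimal NumPy sketch of a derived field: given velocity components stored as ndarrays, a velocity-magnitude field is derived element-wise. The tiny input arrays are illustrative, not simulation data.

```python
import numpy as np

# Illustrative inputs: velocity components of a mesh, as NumPy ndarrays.
u = np.array([1.0, 0.0, 3.0])
v = np.array([2.0, 0.0, 4.0])
w = np.array([2.0, 1.0, 0.0])

# Derived field: per-cell velocity magnitude, built from existing fields.
v_mag = np.sqrt(u * u + v * v + w * w)
print(v_mag)  # [3. 1. 5.]
```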
Motivation Derived Field Generation Frameworks • Are present in many post-processing tools: • ParaView, VisIt, etc. • Include three key components: • A set of primitives that can be used to create derived quantities. • An interface that allows users to compose these primitives. • A mechanism that transforms and executes the composed primitives. • Ongoing issues: • Lack of flexibility to exploit many-core architectures • Inefficiency in executing composed primitives Success on future many-core architectures will require us to re-think our existing derived field generation frameworks.
Motivation We have developed a new Python-based derived field generation framework. Unique Contributions: • First-ever implementation targeting many-core architectures. • A flexible Python infrastructure that enables the design and testing of a wide range of execution strategies. • An evaluation exploring the tradeoffs between runtime performance and memory constraints. In this presentation we present the details of our framework and the results of our evaluation studies.
System Architecture System Architecture Framework Components • Host Application Interface • Our framework is designed to work in-situ for codes with a NumPy interface to mesh data fields. • NumPy ndarrays are used as the input/output data interface. • PLY-based front-end Parser • Transforms user expressions into a dataflow specification. • Dataflow Network Module • Coordinates OpenCL execution using PyOpenCL. • Designed to support multiple execution strategies. The Python Dataflow module is the core of our framework.
System Architecture System Architecture Diagram [Diagram: user expressions flow from the host application into the PLY-based expression parser, which builds a Python dataflow network; the network's execution strategies move data to the OpenCL target device(s) via PyOpenCL.]
System Architecture Python Dataflow Module Overview Basic Features • Simple “create and connect” API for network definition. • The API used by the parser front-end is usable by humans. • Execution is decoupled from network definition and traversal: • A topological sort is used to ensure precedence. • Results are managed by a reference-counting registry. • A straightforward filter API is used to implement derived field primitives. • Network structure can be visualized using graphviz.
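The features above can be sketched in a few lines of plain Python. This is not the authors' actual API, only an illustration of a "create and connect" network whose execution uses a depth-first topological order for precedence and a reference-counting registry to free intermediate results.

```python
# Hypothetical sketch of a dataflow network with topological-sort execution
# and reference-counted intermediate results (names are our own).
class Network:
    def __init__(self):
        self.filters = {}    # node name -> (function, input names)
        self.refcounts = {}  # node name -> number of downstream consumers

    def add(self, name, func, inputs=()):
        self.filters[name] = (func, list(inputs))
        self.refcounts.setdefault(name, 0)
        for i in inputs:
            self.refcounts[i] = self.refcounts.get(i, 0) + 1

    def execute(self, sink, sources):
        # Depth-first traversal yields a topological order: every filter
        # runs only after all of its inputs have been computed.
        order, seen = [], set()
        def visit(n):
            if n in seen or n in sources:
                return
            seen.add(n)
            for i in self.filters[n][1]:
                visit(i)
            order.append(n)
        visit(sink)
        registry = dict(sources)       # result registry
        refs = dict(self.refcounts)
        for n in order:
            func, ins = self.filters[n]
            registry[n] = func(*[registry[i] for i in ins])
            for i in ins:              # drop intermediates no one else needs
                refs[i] -= 1
                if refs[i] == 0 and i != sink and i in registry:
                    del registry[i]
        return registry[sink]

net = Network()
net.add("f1", lambda x: x * x, ["x"])
net.add("f2", lambda a, b: a + b, ["f1", "x"])
print(net.execute("f2", {"x": 3}))  # 3*3 + 3 = 12
```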
System Architecture Python Dataflow Module Overview OpenCL Environment • Built using PyOpenCL • Records and categorizes OpenCL timing events: • OpenCL host-to-device transfers (inputs) • OpenCL kernel executions • OpenCL device-to-host transfers (results) • Manages OpenCL device buffers: • Tracks allocated device buffers, available global device memory, and the global memory high-water mark. • Enables reuse of allocated buffers.
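The buffer bookkeeping described above can be sketched without any OpenCL at all. The class below is our own illustration (not the framework's code): it tracks in-use memory against a device capacity, records the high-water mark, and reuses freed buffers of a matching size rather than allocating new ones.

```python
# Illustrative device-buffer pool: tracks allocations, reuses freed buffers,
# and records a global-memory high-water mark. All names are hypothetical.
class BufferPool:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.in_use = 0          # total bytes currently allocated on device
        self.high_water = 0      # peak of in_use over the pool's lifetime
        self.free = {}           # size -> count of reusable buffers

    def allocate(self, nbytes):
        if self.free.get(nbytes, 0) > 0:
            self.free[nbytes] -= 1     # reuse an already-allocated buffer
            return nbytes
        if self.in_use + nbytes > self.capacity:
            raise MemoryError("device global memory exhausted")
        self.in_use += nbytes
        self.high_water = max(self.high_water, self.in_use)
        return nbytes

    def release(self, nbytes):
        # Keep the buffer allocated but mark it reusable.
        self.free[nbytes] = self.free.get(nbytes, 0) + 1

pool = BufferPool(capacity_bytes=3 * 2**30)  # e.g. a 3 GB GPU
a = pool.allocate(2**20)
pool.release(a)
b = pool.allocate(2**20)   # reuses the freed 1 MB buffer; no new allocation
print(pool.high_water)     # 1048576
```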
System Architecture Python Dataflow Module Overview Execution Strategies • Control data movement and how the OpenCL kernels of each primitive are composed to compute the final result. • Implementations leverage the features of our dataflow network module: • Precedence from the dataflow graph • Reference counting for intermediate results • OpenCL kernels for the primitives are written once and used by all strategies.
System Architecture For this work we implemented and studied three execution strategies. Roundtrip: • Dispatches a single kernel for each primitive. • Transfers each intermediate result from the OpenCL target device back to the host environment. Staged: • Dispatches a single kernel for each primitive. • Stores intermediate results in the global memory of the OpenCL target device. Fusion: • Employs kernel fusion to construct and execute a single OpenCL kernel that composes all selected primitives.
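The Fusion strategy's core idea, composing the primitives' expressions into a single kernel source string instead of one launch per primitive, can be sketched as string assembly. The primitive templates and kernel layout below are our own simplified stand-ins, not the framework's actual code generator.

```python
# Hypothetical kernel-fusion sketch: compose per-primitive expression
# templates into one OpenCL C kernel source string.
PRIMITIVES = {
    "mult": "{out} = {a} * {b};",
    "add":  "{out} = {a} + {b};",
    "sqrt": "{out} = sqrt({a});",
}

def fuse(steps, inputs, result):
    body = []
    for out, op, args in steps:
        names = dict(zip(("a", "b"), args), out=out)
        body.append("    float " + PRIMITIVES[op].format(**names))
    params = ", ".join("__global const float *%s_in" % n for n in inputs)
    lines = ["__kernel void fused(%s, __global float *out)" % params, "{",
             "    int i = get_global_id(0);"]
    lines += ["    float %s = %s_in[i];" % (n, n) for n in inputs]
    lines += body
    lines += ["    out[i] = %s;" % result, "}"]
    return "\n".join(lines)

# The example expression mag = sqrt(x*x + y*y + z*z) as fused steps:
steps = [("f1", "mult", ("x", "x")), ("f2", "mult", ("y", "y")),
         ("f3", "mult", ("z", "z")), ("f4", "add", ("f1", "f2")),
         ("f5", "add", ("f4", "f3")), ("f6", "sqrt", ("f5",))]
src = fuse(steps, ["x", "y", "z"], "f6")
print(src)
```

A single `fused` kernel like this touches global memory only for the three inputs and the one output; all intermediates live in registers, which is why Fusion minimizes data movement.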
System Architecture To demonstrate our execution strategies we use a simple example expression. Example Expression: mag = sqrt(x*x + y*y + z*z) [Diagram: the corresponding dataflow network — three mult nodes (x*x, y*y, z*z) feed two add nodes, and the sum feeds a sqrt node that produces mag.]
System Architecture Roundtrip Strategy Execution Example: mag = sqrt(x*x+y*y+z*z) • One kernel dispatch per primitive: f1 = mult(x,x), f2 = mult(y,y), f3 = mult(z,z), f4 = add(f1,f2), f5 = add(f4,f3), f6 = sqrt(f5) → result. [Diagram: for each step, the inputs are transferred from the OpenCL host to the target device and the intermediate result is transferred back to the host.]
System Architecture Staged Strategy Execution Example: mag = sqrt(x*x+y*y+z*z) • One kernel dispatch per primitive: f1 = mult(x,x), f2 = mult(y,y), f3 = mult(z,z), f4 = add(f1,f2), f5 = add(f4,f3), f6 = sqrt(f5) → result. [Diagram: intermediate results stay in the global memory of the OpenCL target device; only the final result is transferred back to the host.]
System Architecture Fusion Strategy Execution Example: mag = sqrt(x*x+y*y+z*z) • A single fused OpenCL kernel computes f1 = mult(x,x), f2 = mult(y,y), f3 = mult(z,z), f4 = add(f1,f2), f5 = add(f4,f3), f6 = sqrt(f5) in one dispatch. [Diagram: only the inputs x, y, z are transferred to the target device, and only the final result is transferred back to the host.]
System Architecture These execution strategies have varying memory constraints. [Diagram: peak device-memory footprint of the Roundtrip, Staged, and Fusion strategies for the example expression, shown as multiples of a single field buffer.] The flexibility to explore the tradeoffs of several strategies is important to success on many-core architectures.
Evaluation Methodology Evaluation Overview • Evaluation Expressions: • Detection of vortical structures in a turbulent mixing simulation. • Host Application: VisIt • Three Studies: • Single Device Performance • Single Device Memory Usage • Distributed-Memory Parallel • Test Environment: LLNL’s Edge HPC Cluster • Provides OpenCL access to both NVIDIA Tesla M2050s and Intel Xeon processors.
Evaluation Methodology Evaluation Expressions • We selected three expressions used for vortex detection and analysis. Vector Magnitude: v_mag = sqrt(u*u + v*v + w*w) Vorticity Magnitude: du = grad3d(u,dims,x,y,z) dv = grad3d(v,dims,x,y,z) dw = grad3d(w,dims,x,y,z) w_x = dw[1] - dv[2] w_y = du[2] - dw[0] w_z = dv[0] - du[1] w_mag = sqrt(w_x*w_x + w_y*w_y + w_z*w_z) Q-criterion: (shown on the next slide) These expressions vary in complexity and memory usage.
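The vorticity-magnitude expression above can be checked in plain NumPy. Here `np.gradient` stands in for the framework's `grad3d` primitive (our assumption; the real primitive runs as an OpenCL kernel over mesh coordinates), and the synthetic rigid-rotation field is illustrative only.

```python
import numpy as np

def vorticity_magnitude(u, v, w, spacing=1.0):
    # np.gradient stands in for grad3d: returns [d/dx, d/dy, d/dz].
    du = np.gradient(u, spacing)
    dv = np.gradient(v, spacing)
    dw = np.gradient(w, spacing)
    w_x = dw[1] - dv[2]   # same component formulas as the slide
    w_y = du[2] - dw[0]
    w_z = dv[0] - du[1]
    return np.sqrt(w_x * w_x + w_y * w_y + w_z * w_z)

# Tiny synthetic velocity field: rigid rotation about the z axis.
shape = (8, 8, 8)
x, y, z = np.meshgrid(*[np.arange(n, dtype=float) for n in shape],
                      indexing="ij")
u, v, w = -y, x, np.zeros(shape)
w_mag = vorticity_magnitude(u, v, w)
print(w_mag.mean())   # curl of (-y, x, 0) is (0, 0, 2), so magnitude 2
```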
Evaluation Methodology Evaluation Expressions: Q-Criterion du = grad3d(u,dims,x,y,z) dv = grad3d(v,dims,x,y,z) dw = grad3d(w,dims,x,y,z) s_1 = 0.5 * (du[1] + dv[0]) s_2 = 0.5 * (du[2] + dw[0]) s_3 = 0.5 * (dv[0] + du[1]) s_5 = 0.5 * (dv[2] + dw[1]) s_6 = 0.5 * (dw[0] + du[2]) s_7 = 0.5 * (dw[1] + dv[2]) w_1 = 0.5 * (du[1] - dv[0]) w_2 = 0.5 * (du[2] - dw[0]) w_3 = 0.5 * (dv[0] - du[1]) w_5 = 0.5 * (dv[2] - dw[1]) w_6 = 0.5 * (dw[0] - du[2]) w_7 = 0.5 * (dw[1] - dv[2]) s_norm = du[0]*du[0] + s_1*s_1 + s_2*s_2 + s_3*s_3 + dv[1]*dv[1] + s_5*s_5 + s_6*s_6 + s_7*s_7 + dw[2]*dw[2] w_norm = w_1*w_1 + w_2*w_2 + w_3*w_3 + w_5*w_5 + w_6*w_6 + w_7*w_7 q_crit = 0.5 * (w_norm - s_norm) The primitives add, mult, sqrt, grad3d, and vector decomposition are sufficient to build this complex expression.
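As a sanity check that the Q-criterion expression above is built from those few primitives, here is the same computation in NumPy, again with `np.gradient` standing in for `grad3d` (our assumption) and a synthetic pure-rotation field where the strain tensor vanishes.

```python
import numpy as np

def q_criterion(u, v, w, spacing=1.0):
    du = np.gradient(u, spacing)   # stand-in for grad3d
    dv = np.gradient(v, spacing)
    dw = np.gradient(w, spacing)
    # Symmetric (strain-rate) off-diagonal terms, as on the slide.
    s_1 = 0.5 * (du[1] + dv[0]); s_2 = 0.5 * (du[2] + dw[0])
    s_3 = 0.5 * (dv[0] + du[1]); s_5 = 0.5 * (dv[2] + dw[1])
    s_6 = 0.5 * (dw[0] + du[2]); s_7 = 0.5 * (dw[1] + dv[2])
    # Antisymmetric (rotation-rate) terms.
    w_1 = 0.5 * (du[1] - dv[0]); w_2 = 0.5 * (du[2] - dw[0])
    w_3 = 0.5 * (dv[0] - du[1]); w_5 = 0.5 * (dv[2] - dw[1])
    w_6 = 0.5 * (dw[0] - du[2]); w_7 = 0.5 * (dw[1] - dv[2])
    s_norm = (du[0]*du[0] + s_1*s_1 + s_2*s_2 + s_3*s_3 + dv[1]*dv[1]
              + s_5*s_5 + s_6*s_6 + s_7*s_7 + dw[2]*dw[2])
    w_norm = (w_1*w_1 + w_2*w_2 + w_3*w_3
              + w_5*w_5 + w_6*w_6 + w_7*w_7)
    return 0.5 * (w_norm - s_norm)

shape = (8, 8, 8)
x, y, z = np.meshgrid(*[np.arange(n, dtype=float) for n in shape],
                      indexing="ij")
u, v, w = -y, x, np.zeros(shape)   # pure rotation: strain is zero
q = q_criterion(u, v, w)
print(q.mean())   # w_norm = 2, s_norm = 0, so Q = 1 everywhere
```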
Evaluation Methodology Evaluation Data: Turbulent Mix Simulation • A 3072³ timestep of a Rayleigh–Taylor instability simulation. • DNS simulation with intricate embedded vortical features. • 3072 sub-grids (each 192x192x256 cells), 27 billion cells total. • Data courtesy of Bill Cabot and Andy Cook, LLNL
Evaluation Methodology Evaluation Data: Single Device Test Grids • 12 sub-grids varying from 9.3 to 113.2 million cells. • Fields: mesh coordinates (x,y,z) and the velocity vector field (u,v,w). [Image: sub-grids for single device evaluation, colored by velocity magnitude.] • Data courtesy of Bill Cabot and Andy Cook, LLNL
Evaluation Methodology We evaluated our framework in-situ using VisIt’s Python Expression Filter Runtime. [Diagram: VisIt’s Python interfaces — local GUI, CLI, and Python clients use the Python Client Interface (state control) over a network connection to the viewer (state manager), which drives an MPI-parallel compute engine on the cluster; the Python Filter Runtime manipulates mesh data directly inside the compute engine.] Using VisIt allowed us to evaluate our framework in both single-node and distributed-memory parallel contexts.
Evaluation Methodology Evaluation Studies Single Device Evaluation • Recorded runtime performance and memory usage • Two OpenCL target devices: • GPU: Tesla M2050 (3 GB RAM) • CPU: Intel Xeons (96 GB RAM, shared with host environment) • 144 test cases per device: • Three test expressions • Our three strategies and a reference kernel • Data: 12 RT3D sub-grids • Sizes range from 9.6 million to 113 million cells.
Evaluation Methodology Evaluation Studies Distributed Memory Parallel Test • A “Smoke” test • Q-criterion using the Fusion strategy • 128 nodes using two M2050 Teslas per node • Data: Full mesh from a single RT3D timestep • 3072 sub-grids each with 192x192x256 cells • 27 billion total cells + ghost data • Each of the 256 Teslas streams 12 sub-grids.
Evaluation Results Single Device Runtime Performance Velocity Magnitude
Evaluation Results Single Device Runtime Performance Vorticity Magnitude
Evaluation Results Single Device Runtime Performance Q-criterion
Evaluation Results Single Device Memory Usage Results Velocity Magnitude
Evaluation Results Single Device Memory Usage Results Vorticity Magnitude
Evaluation Results Single Device Memory Usage Results Q-criterion
Evaluation Results Our framework recorded the number of OpenCL target device events (host-to-device transfers, kernel executions, and device-to-host transfers) for each test case.
Evaluation Results The distributed-memory parallel “Smoke” test was successful. Q-criterion of the 27 billion cell mesh. Each of the 256 Teslas successfully processed 12 sub-grids.
Evaluation Results Discussion Strategy Comparison: • Roundtrip: Slowest, and the least constrained by target device memory. • Staged: Faster than Roundtrip, but the most constrained by target device memory. • Fusion: Fastest, with the least data movement. Device Comparison: • GPU: Best runtime performance for test cases that fit into the 3 GB of global device memory. • CPU: Successfully completed all test cases. Our evaluation shows the benefits of supporting multiple execution strategies on multiple types of target devices.
Conclusion • Our framework provides a flexible path forward for exploring strategies for efficient derived field generation on future many-core architectures. • The Python ecosystem made this research possible. • Future work: • Distributed-memory parallel performance • Strategies for streaming and using multiple devices on-node • Thanks PyHPC 2012! Contact Info: Cyrus Harrison <cyrush@llnl.gov>