TESTING AND EXPOSING WEAK GPU MEMORY MODELS

MS Thesis Defense by Tyler Sorensen. Advisor: Ganesh Gopalakrishnan. May 30, 2014

Joint work with: Jade Alglave (University College London), Daniel Poetzl (University of Oxford), Luc Maranget (Inria), Alastair Donaldson and John Wickerson (Imperial College London), Mark Batty (University of Cambridge)

Roadmap • Background and Approach • Prior Work • Testing Framework • Results • CUDA Spin Locks • Bulk Testing • Future Work and Conclusion

GPU Background • A GPU is a highly parallel co-processor • Currently found in devices from tablets to top supercomputers (Titan) • Not just used for visualization anymore! Images from Wikipedia [16,17,18]

GPU Programming • Explicit hierarchical concurrency model • Thread hierarchy: thread, warp, CTA (Cooperative Thread Array), kernel (GPU program) • Memory hierarchy: shared memory, global memory


GPU Programming • GPUs are SIMT (Single Instruction, Multiple Thread) • NVIDIA GPUs may be programmed using CUDA or OpenCL


Weak Memory Models • Consider the test known as Store Buffering (SB) • Initial state: x and y are memory locations • Thread IDs • Program: one instruction sequence for each thread ID • Assertion: a question about the final state of the registers • Can this assertion be satisfied?

The assertion cannot be satisfied by any interleaving of the two threads' instructions • A model that allows only such interleavings is known as sequential consistency (SC) [1]
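The claim about interleavings can be checked by brute force. The sketch below is illustrative only and is not part of the testing tool: it assumes the standard shape of the SB test (T0 stores 1 to x, then loads y into r0; T1 stores 1 to y, then loads x into r1; the assertion asks whether r0 = 0 and r1 = 0), and the helper names `interleavings` and `run_sc` are hypothetical.

```python
from itertools import combinations

# Assumed standard form of the SB litmus test.
# Each instruction is (kind, location, value_or_register).
T0 = [("store", "x", 1), ("load", "y", "r0")]
T1 = [("store", "y", 1), ("load", "x", "r1")]

def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run_sc(schedule):
    """Execute one interleaving against a single shared memory (x = y = 0 initially)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, loc, arg in schedule:
        if kind == "store":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r0"], regs["r1"]

# Collect the (r0, r1) outcome of all six interleavings.
sc_outcomes = {run_sc(s) for s in interleavings(T0, T1)}
print(sorted(sc_outcomes))  # [(0, 1), (1, 0), (1, 1)]
```

The outcome (0, 0) never appears: whichever load runs first under SC, at least one store has already reached memory before the second load executes.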

Weak Memory Models • Can we assume the assertion will never pass? No!

Weak Memory Models • Executing this test with the Litmus tool [2] on an Intel i7 x86 processor for 1000000 iterations, we get the following histogram of results:

Weak Memory Models • What happened? • Architectures implement weak memory models, in which the hardware is allowed to reorder certain memory instructions • On x86 architectures, the hardware is allowed to reorder a write with a read that comes later in program order [3]
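One common way to picture write-to-read reordering is a per-thread FIFO store buffer: a store sits in its thread's buffer for a while before reaching memory, so a later load by another thread can still see the old value. The sketch below is a toy model under that assumption, again using the assumed standard form of SB. It is neither the formal x86 model of [3] nor a model of any GPU, and `tso_outcomes` is a hypothetical name.

```python
# Assumed standard form of the SB litmus test.
T0 = (("store", "x", 1), ("load", "y", "r0"))
T1 = (("store", "y", 1), ("load", "x", "r1"))
PROGRAM = (T0, T1)

def tso_outcomes(program):
    """Exhaustively explore a toy machine with one FIFO store buffer per thread."""
    n = len(program)
    # State: (program counters, store buffers, memory, registers), all hashable.
    init = ((0,) * n, ((),) * n, (("x", 0), ("y", 0)), ())
    stack, seen, outcomes = [init], {init}, set()
    while stack:
        pcs, bufs, mem, regs = stack.pop()
        successors = []
        for t, thread in enumerate(program):
            # Option 1: thread t executes its next instruction.
            if pcs[t] < len(thread):
                kind, loc, arg = thread[pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if kind == "store":
                    # Stores go into the thread's own buffer, not to memory.
                    new_buf = bufs[t] + ((loc, arg),)
                    new_bufs = bufs[:t] + (new_buf,) + bufs[t + 1:]
                    successors.append((new_pcs, new_bufs, mem, regs))
                else:
                    # Loads read the newest matching entry in the thread's
                    # own buffer, falling back to memory.
                    pending = [v for l, v in bufs[t] if l == loc]
                    value = pending[-1] if pending else dict(mem)[loc]
                    successors.append((new_pcs, bufs, mem, regs + ((arg, value),)))
            # Option 2: the oldest buffered store of thread t drains to memory.
            if bufs[t]:
                (loc, val), rest = bufs[t][0], bufs[t][1:]
                new_mem = tuple(sorted({**dict(mem), loc: val}.items()))
                new_bufs = bufs[:t] + (rest,) + bufs[t + 1:]
                successors.append((pcs, new_bufs, new_mem, regs))
        if not successors:
            # All instructions executed and all buffers drained: a final state.
            final = dict(regs)
            outcomes.add((final["r0"], final["r1"]))
        for s in successors:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return outcomes

print(sorted(tso_outcomes(PROGRAM)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

With buffering, (0, 0) becomes reachable: both stores are buffered, both loads read the initial values from memory, and only then do the buffers drain.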

GPU Memory Models • What type of memory model do current GPUs implement? • Documentation is sparse • CUDA has 1 page + 1 example [4] • PTX has 1 page + 0 examples [5] • No specifics about which instructions are allowed to be reordered • We need to know this if we are to write correct GPU programs!

Our Approach • Empirically explore the memory model implemented on deployed NVIDIA GPUs • Achieved by developing a memory model testing tool for NVIDIA GPUs with specialized heuristics • We analyze classic memory model properties and CUDA applications in this framework with unexpected results • We test large families of tests on GPUs as a basis for modeling and bug hunting

Our Approach • Disclaimer: Testing is not guaranteed to reveal all behaviors

Prior Work • Testing memory models: • Pioneered by Bill Collier in ARCHTEST in 1992 [6] • TSOTool in 2004 [7] • Litmus in 2011 [2] (we extend this tool)

Prior Work (GPU Memory Models) • June 2013: • Hower et al. proposed an SC-for-race-free memory model for GPUs [8] • Sorensen et al. proposed an operational weak GPU memory model based on available documentation [9] • 2014: • Hower et al. proposed two SC-for-race-free memory models for GPUs, HRF-direct and HRF-indirect [10] • It remains unclear what memory model deployed GPUs implement

Testing Framework • GPU litmus test • PTX instructions • What memory region (shared or global) are x and y in? • Are T0 and T1 in the same CTA? Or different CTAs?

Testing Framework • We consider three different GPU configurations for tests: • D-warp:S-cta-Shared: Different warp, Same CTA, targeting shared memory • D-warp:S-cta-Global: Different warp, Same CTA, targeting global memory • D-cta:S-ker-Global: Different CTA, Same kernel, targeting global memory
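The three configurations can be read as constraints on where the two test threads sit in the thread hierarchy. The sketch below encodes them as data and derives one possible placement. It assumes a warp size of 32 (the warp width on NVIDIA GPUs), and the placement rule and the names `Config`, `placement`, and `warp_of` are hypothetical illustrations, not the tool's actual thread assignment.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # warp width on NVIDIA GPUs

@dataclass(frozen=True)
class Config:
    name: str
    same_cta: bool  # are T0 and T1 in the same CTA?
    memory: str     # "shared" or "global"

CONFIGS = [
    Config("D-warp:S-cta-Shared", same_cta=True, memory="shared"),
    Config("D-warp:S-cta-Global", same_cta=True, memory="global"),
    Config("D-cta:S-ker-Global", same_cta=False, memory="global"),
]

def warp_of(thread_index):
    """Warp number of a thread within its CTA."""
    return thread_index // WARP_SIZE

def placement(cfg):
    """Return (cta_index, thread_index_in_cta) for T0 and T1 (hypothetical layout)."""
    t0 = (0, 0)
    # Same CTA: put T1 at the start of the next warp. Otherwise: the next CTA.
    t1 = (0, WARP_SIZE) if cfg.same_cta else (1, 0)
    return t0, t1

for cfg in CONFIGS:
    t0, t1 = placement(cfg)
    print(cfg.name, "T0 at", t0, "T1 at", t1, "memory:", cfg.memory)
```

There is no different-CTA configuration targeting shared memory because shared memory is private to a CTA, so threads in different CTAs can only communicate through global memory.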

Testing Framework • Given a GPU litmus test, produce an executable in: • CUDA or • OpenCL

Testing Framework • Host (CPU) generated code

Testing Framework • Kernel generated code
