TESTING AND EXPOSING WEAK GPU MEMORY MODELS • MS Thesis Defense by Tyler Sorensen • Advisor: Ganesh Gopalakrishnan • May 30, 2014
Joint Work with: Jade Alglave (University College London), Daniel Poetzl (University of Oxford), Luc Maranget (Inria), Alastair Donaldson and John Wickerson (Imperial College London), Mark Batty (University of Cambridge)
Roadmap • Background and Approach • Prior Work • Testing Framework • Results • CUDA Spin Locks • Bulk Testing • Future Work and Conclusion
GPU Background • A GPU is a highly parallel co-processor • Currently found in devices from tablets to top supercomputers (Titan) • Not just used for visualization anymore! Images from Wikipedia [16,17,18]
GPU Programming • Explicit hierarchical concurrency model • Thread Hierarchy: • Thread • Warp • CTA (Cooperative Thread Array) • Kernel (GPU program) • Memory Hierarchy: • Shared Memory • Global Memory
GPU Programming • GPUs are SIMT (Single Instruction, Multiple Thread) • NVIDIA GPUs may be programmed using CUDA or OpenCL
Weak Memory Models • Consider the test known as Store Buffering (SB) • Initial State: x and y are memory locations • Thread IDs • Program: for each thread ID • Assertion: question about the final state of registers • Can this assertion be satisfied?
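Weak Memory Models • The assertion cannot be satisfied by any interleaving of the two threads' instructions • Executions that correspond to such an interleaving are known as sequentially consistent (SC) [1]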
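The interleaving argument can be checked by brute force. The sketch below is a minimal illustration, assuming the standard form of SB: x and y start at 0, each thread stores 1 to one location and then loads the other, and the assertion asks whether both loads can return 0. The register names r0 and r1 and the tuple encoding of instructions are illustrative choices, not taken from the slides.

```python
# Store Buffering (SB) litmus test as per-thread instruction lists (assumed encoding).
# ("st", loc, value) writes memory; ("ld", reg, loc) reads memory into a register.
SB = {
    "T0": [("st", "x", 1), ("ld", "r0", "y")],
    "T1": [("st", "y", 1), ("ld", "r1", "x")],
}


def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run_sc(schedule):
    """Execute one interleaving against a single shared memory (SC semantics)."""
    mem = {"x": 0, "y": 0}  # initial state
    regs = {}
    for op, dst, src in schedule:
        if op == "st":
            mem[dst] = src
        else:
            regs[dst] = mem[src]
    return (regs["r0"], regs["r1"])


def sc_outcomes(test):
    """Collect the final (r0, r1) values over all interleavings."""
    return {run_sc(s) for s in interleavings(test["T0"], test["T1"])}


outcomes = sc_outcomes(SB)
print(sorted(outcomes))    # [(0, 1), (1, 0), (1, 1)]
print((0, 0) in outcomes)  # False: r0 = 0 and r1 = 0 is not SC-reachable
```

All six interleavings are covered, and in each one at least one store precedes the load of the other thread, so at least one register ends up holding 1.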
Weak Memory Models • Can we assume the assertion will never pass? No!
Weak Memory Models • Executing this test with the Litmus tool [2] on an Intel i7 x86 processor for 1,000,000 iterations, we get the following histogram of results:
Weak Memory Models • What Happened? • Architectures implement weak memory models where the hardware is allowed to re-order certain memory instructions. • On x86 architectures, the hardware is allowed to re-order write instructions with program-order later read instructions [3]
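One common way to picture this write-to-read reordering is a per-thread FIFO store buffer: a store first goes into the issuing thread's buffer and reaches shared memory only later, while a load reads the thread's own newest buffered value for that location, or memory otherwise. The sketch below explores every execution of SB under such a model. It is a simplified illustration in the style of x86-TSO, not the Litmus tool and not a description of real hardware; the test encoding and register names are assumptions.

```python
# SB litmus test (assumed encoding): ("st", loc, value) and ("ld", reg, loc).
SB = {
    "T0": [("st", "x", 1), ("ld", "r0", "y")],
    "T1": [("st", "y", 1), ("ld", "r1", "x")],
}


def tso_outcomes(test):
    """Explore all executions under a simple store-buffer model."""
    threads = sorted(test)
    init = (
        tuple(0 for _ in threads),   # program counter per thread
        tuple(() for _ in threads),  # FIFO store buffer per thread
        (("x", 0), ("y", 0)),        # shared memory as sorted (loc, value) pairs
        (),                          # registers as sorted (reg, value) pairs
    )
    seen, stack, finals = {init}, [init], set()
    while stack:
        pcs, bufs, mem, regs = stack.pop()
        successors = []
        for i, t in enumerate(threads):
            # Option 1: thread i executes its next instruction.
            if pcs[i] < len(test[t]):
                op, dst, src = test[t][pcs[i]]
                new_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
                if op == "st":
                    # The store is buffered; memory is not updated yet.
                    new_buf = bufs[i] + ((dst, src),)
                    new_bufs = bufs[:i] + (new_buf,) + bufs[i + 1:]
                    successors.append((new_pcs, new_bufs, mem, regs))
                else:
                    # The load sees its own newest buffered store, else memory.
                    pending = [v for loc, v in bufs[i] if loc == src]
                    value = pending[-1] if pending else dict(mem)[src]
                    new_regs = tuple(sorted(dict(regs, **{dst: value}).items()))
                    successors.append((new_pcs, bufs, mem, new_regs))
            # Option 2: the oldest buffered store of thread i reaches memory.
            if bufs[i]:
                loc, v = bufs[i][0]
                new_mem = tuple(sorted(dict(mem, **{loc: v}).items()))
                new_bufs = bufs[:i] + (bufs[i][1:],) + bufs[i + 1:]
                successors.append((pcs, new_bufs, new_mem, regs))
        if not successors:
            # All instructions executed and all buffers drained: a final state.
            r = dict(regs)
            finals.add((r["r0"], r["r1"]))
        for s in successors:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return finals


print(sorted(tso_outcomes(SB)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The extra outcome (0, 0) appears when both stores are still sitting in their buffers while both loads read the initial values from memory, which is exactly the behavior that no interleaving can produce.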
GPU Memory Models • What type of memory model do current GPUs implement? • Documentation is sparse • CUDA has 1 page + 1 example [4] • PTX has 1 page + 0 examples [5] • No specifics about which instructions are allowed to be re-ordered • We need to know if we are to write correct GPU programs!
Our Approach • Empirically explore the memory model implemented on deployed NVIDIA GPUs • Achieved by developing a memory model testing tool for NVIDIA GPUs with specialized heuristics • We analyze classic memory model properties and CUDA applications in this framework with unexpected results • We test large families of tests on GPUs as a basis for modeling and bug hunting
Our Approach • Disclaimer: Testing is not guaranteed to reveal all behaviors
Roadmap • Background and Approach • Prior Work • Testing Framework • Results • CUDA Spin Locks • Bulk Testing • Future Work and Conclusion
Prior Work • Testing Memory Models: • Pioneered by Bill Collier in ARCHTEST in 1992 [6] • TSOTool in 2004 [7] • Litmus in 2011 [2] • We extend the Litmus tool
Prior Work (GPU Memory Models) • June 2013: • Hower et al. proposed an SC-for-race-free memory model for GPUs [8] • Sorensen et al. proposed an operational weak GPU memory model based on available documentation [9] • 2014: • Hower et al. proposed two SC-for-race-free memory models for GPUs, HRF-direct and HRF-indirect [10] • It remains unclear what memory model deployed GPUs implement
Roadmap • Background and Approach • Prior Work • Testing Framework • Results • CUDA Spin Locks • Bulk Testing • Future Work and Conclusion
Testing Framework • GPU litmus test • PTX instructions • What memory region (shared or global) are x and y in? • Are T0 and T1 in the same CTA? Or different CTAs?
Testing Framework • We consider three different GPU configurations for tests: • D-warp:S-cta-Shared: Different warp, Same CTA, targeting shared memory • D-warp:S-cta-Global: Different warp, Same CTA, targeting global memory • D-cta:S-ker-Global: Different CTA, Same kernel, targeting global memory
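The three configurations can be read as a memory region plus a placement of T0 and T1 in the thread hierarchy. The sketch below is a hypothetical encoding for illustration only: the `Placement` class, the `CONFIGS` table, and the concrete thread indices are assumptions, not the tool's actual data structures. It relies on the fact that NVIDIA warps group 32 consecutive threads of a CTA, so threads 0 and 32 of the same CTA fall in different warps.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


@dataclass(frozen=True)
class Placement:
    """Hypothetical position of a test thread in the GPU thread hierarchy."""
    cta: int     # CTA index within the kernel
    thread: int  # thread index within its CTA

    @property
    def warp(self):
        # Warps are formed from consecutive thread indices within a CTA.
        return self.thread // WARP_SIZE


# Configuration name -> (memory region, placement of T0, placement of T1).
# The concrete indices are illustrative choices, not taken from the slides.
CONFIGS = {
    "D-warp:S-cta-Shared": ("shared", Placement(0, 0), Placement(0, WARP_SIZE)),
    "D-warp:S-cta-Global": ("global", Placement(0, 0), Placement(0, WARP_SIZE)),
    "D-cta:S-ker-Global": ("global", Placement(0, 0), Placement(1, 0)),
}


def describe(name):
    """Summarize how T0 and T1 relate under the named configuration."""
    region, t0, t1 = CONFIGS[name]
    same_cta = t0.cta == t1.cta
    return {
        "region": region,
        "same_cta": same_cta,
        "same_warp": same_cta and t0.warp == t1.warp,
    }


for name in CONFIGS:
    print(name, describe(name))
```

There is no different-CTA configuration targeting shared memory because shared memory is private to a CTA, so threads in different CTAs can only communicate through global memory.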
Testing Framework • Given a GPU litmus test, produce an executable in: • CUDA or • OpenCL
Testing Framework • Host (CPU) generated code
Testing Framework • Kernel generated code