NVIDIA’s Experience with Open64

Mike Murphy, NVIDIA

Outline
• Why Open64
• How we use Open64
• What we did to Open64
• Future work in Open64

Compiling CUDA for GPUs
• C/C++ CUDA application -> NVCC -> GPU code + CPU code -> executable

Why Open64
• We had a low-level code generator for graphics code, but CUDA needed high-level optimization of C/C++ code.
• Options considered:
  • Write our own: would take too long
  • gcc: good long-term support
  • Open64: best performance (kudos to PathScale)

NVCC processing of GPU code
• cudafe -> C code for GPU -> nvopencc (Open64) -> ptx -> OCG -> object code

Changes: Rehosting Open64
• Our compiler has to run on 32- and 64-bit Linux, 32- and 64-bit Windows, and Mac OS.
• The main Open64 source tree is only for Linux.
• This is an area where sharing our changes can help grow the user base by making it easier to port Open64.
• For Windows we build using Cygwin’s MinGW.

Changes: Memory and registers
• We don’t have a stack or fast memory.
• Therefore we want to keep data in registers.
• Inline everything and optimize as much as possible.
• Try to keep small structs in registers by expanding struct copies into field copies (versus taking the address and generating a loop to do a byte copy).
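The struct-copy point can be pictured at source level. The following is only a sketch of the two shapes of code, not the actual Open64 transformation, which works on the compiler's intermediate representation; `Vec3`, `copy_bytes`, and `copy_fields` are hypothetical names made up for this illustration.

```cpp
#include <cstddef>

// Hypothetical small struct, chosen only for illustration.
struct Vec3 {
    float x, y, z;
};

// Byte-wise copy: walks both objects through their addresses, so the
// structs must live in addressable memory while the loop runs.
inline void copy_bytes(Vec3& dst, const Vec3& src) {
    const unsigned char* s = reinterpret_cast<const unsigned char*>(&src);
    unsigned char* d = reinterpret_cast<unsigned char*>(&dst);
    for (std::size_t i = 0; i < sizeof(Vec3); ++i) {
        d[i] = s[i];
    }
}

// Field-wise copy: three independent scalar moves. Once inlined, each
// field can be treated as its own value and kept in a register.
inline void copy_fields(Vec3& dst, const Vec3& src) {
    dst.x = src.x;
    dst.y = src.y;
    dst.z = src.z;
}
```

Both functions leave `dst` with the same contents; the difference is only in what the optimizer can do afterwards, since the field-wise form never needs the address of the whole struct.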

Changes: Vector loads and stores
• Coalesce adjacent loads and stores for performance.
• Do this in CG:
  • Iterate through ops, trying to add them to vectors.
  • Check for intervening kills.
  • Change alignment and use dummy registers for padding if that helps to create a wider vector (e.g. may use a 4-word vector for a 3-word struct).
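The coalescing steps above can be sketched on a toy instruction list. This is a minimal model under stated assumptions, not the CG pass itself: the `Op` record, `coalesce_loads`, and `padded_width` are hypothetical, only loads are grouped, any store is treated as a kill, destination registers and alignment are not modeled, and the supported vector widths are assumed to be 1, 2, and 4 words.

```cpp
#include <cstddef>
#include <vector>

enum class Kind { Load, Store, Other };

// Hypothetical, simplified op: a one-word access at base + offset.
struct Op {
    Kind kind;
    int base;    // id of the base address register
    int offset;  // word offset from the base
};

// Returns groups of op indices. A group with more than one entry is a
// candidate for a single vector load.
std::vector<std::vector<std::size_t>> coalesce_loads(
        const std::vector<Op>& ops, std::size_t max_width = 4) {
    std::vector<std::vector<std::size_t>> groups;
    std::vector<bool> used(ops.size(), false);
    for (std::size_t i = 0; i < ops.size(); ++i) {
        if (used[i] || ops[i].kind != Kind::Load) continue;
        std::vector<std::size_t> group{i};
        used[i] = true;
        int next_offset = ops[i].offset + 1;
        for (std::size_t j = i + 1;
             j < ops.size() && group.size() < max_width; ++j) {
            const Op& op = ops[j];
            // Conservative kill check: any store might overwrite the
            // memory we want to load early, so stop extending here.
            if (op.kind == Kind::Store) break;
            if (!used[j] && op.kind == Kind::Load &&
                op.base == ops[i].base && op.offset == next_offset) {
                group.push_back(j);
                used[j] = true;
                ++next_offset;
            }
        }
        groups.push_back(group);
    }
    return groups;
}

// Round a group size up to the next assumed vector width (1, 2 or 4);
// the extra lanes would be filled with dummy registers.
std::size_t padded_width(std::size_t n) {
    return n <= 1 ? 1 : (n <= 2 ? 2 : 4);
}
```

For a 3-word struct, `padded_width(3)` gives 4, matching the slide's example of using a 4-word vector with one dummy register. A real pass would also have to confirm that the wider access is suitably aligned.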

Changes: 16-bit optimization
• It is cheaper to use 16-bit registers and operations.
• But C converts shorts to int.
• So add a pass in CG that converts back to 16-bit:
  • Mark 16-bit loads, stores, and converts.
  • Propagate 16-bit-ness forwards and backwards.
  • Unmark 16-bit-ness where a value cannot be 16-bit.
  • Change the remaining registers and instructions to be 16-bit.
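The mark, propagate, and unmark steps above can be sketched as a small fixed-point analysis. This is a toy model under assumptions, not the actual CG pass: the `Op` record, the five op kinds, and `find_16bit_regs` are hypothetical, addresses are not modeled, and an arithmetic op is assumed to need the same width for all of its registers. `Wide` stands for any op that requires 32-bit operands.

```cpp
#include <cstddef>
#include <vector>

enum class Kind { Load16, Store16, Cvt16, Arith, Wide };

// Hypothetical, simplified op.
struct Op {
    Kind kind;
    int dst;                // defined register, or -1 if none
    std::vector<int> srcs;  // used registers
};

// Collect every register an op touches.
static std::vector<int> regs_of(const Op& op) {
    std::vector<int> regs = op.srcs;
    if (op.dst >= 0) regs.push_back(op.dst);
    return regs;
}

// Spread a flag across Arith ops, both from sources to destination and
// back, until nothing changes.
static void spread(const std::vector<Op>& ops, std::vector<bool>& flag) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (const Op& op : ops) {
            if (op.kind != Kind::Arith) continue;
            const std::vector<int> regs = regs_of(op);
            bool any = false;
            for (int r : regs) any = any || flag[r];
            if (!any) continue;
            for (int r : regs) {
                if (!flag[r]) { flag[r] = true; changed = true; }
            }
        }
    }
}

// Returns, per register, whether it may be narrowed to 16 bits.
std::vector<bool> find_16bit_regs(const std::vector<Op>& ops,
                                  std::size_t num_regs) {
    std::vector<bool> is16(num_regs, false);
    std::vector<bool> wide(num_regs, false);
    // Mark: seed from 16-bit loads, stores and converts; record registers
    // that some op needs at full width.
    for (const Op& op : ops) {
        switch (op.kind) {
            case Kind::Load16:
            case Kind::Cvt16:
                is16[op.dst] = true;
                break;
            case Kind::Store16:
                for (int r : op.srcs) is16[r] = true;
                break;
            case Kind::Wide:
                for (int r : regs_of(op)) wide[r] = true;
                break;
            case Kind::Arith:
                break;
        }
    }
    spread(ops, is16);  // Propagate 16-bit-ness forwards and backwards.
    spread(ops, wide);  // Propagate the "must stay wide" constraint too.
    // Unmark: anything that must stay wide cannot be 16-bit.
    for (std::size_t r = 0; r < num_regs; ++r) {
        is16[r] = is16[r] && !wide[r];
    }
    return is16;
}
```

The registers still marked in the result are the ones a final rewriting step would change to 16-bit. The sketch is deliberately conservative: a single wide use unmarks every register connected to it through arithmetic, whereas a production pass could instead insert a conversion at that use.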

Future work
• 1 person -> 4 people working with Open64
• New application TBA
• Merging changes into trunk
  • Thanks to Sun Chan and Shin!
• Investigating register pressure in WOPT
  • Want better control of register pressure during optimization
• Investigating using other features (LNO, IPA, etc.)

Questions?
mmurphy@nvidia.com
http://www.nvidia.com/CUDA
