NVIDIA’s Experience with Open64
Mike Murphy, NVIDIA
Outline
• Why Open64
• How we use Open64
• What we did to Open64
• Future work in Open64
Compiling CUDA for GPUs
[Diagram: C/C++ CUDA Application → NVCC → GPU Code and CPU Code → executable]
Why Open64
• We had a low-level code generator for graphics code, but for CUDA we needed high-level optimization of C/C++ code.
• Options considered:
  • Writing our own: would take too long
  • gcc: good long-term support
  • Open64: best performance (kudos to PathScale)
NVCC processing of GPU code
[Diagram: cudafe → C code for GPU → nvopencc (Open64) → ptx → OCG → object code]
Changes: Rehosting Open64
• Our compiler has to run on 32- and 64-bit Linux, 32- and 64-bit Windows, and Mac OS.
• The main Open64 source tree is Linux-only.
• This is an area where sharing our changes can help grow the user base by making Open64 easier to port.
• For Windows we build using Cygwin’s MinGW.
Changes: Memory and registers
• We don’t have a stack or fast memory.
• Therefore we want to keep data in registers.
• Inline everything and optimize as much as possible.
• Try to keep small structs in registers by expanding struct copies into field copies (versus taking the address and generating a loop to do a byte copy).
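To make the last bullet concrete, here is a minimal sketch of the two ways a small struct copy can be lowered. The struct `Vec3` and both function names are hypothetical and chosen only for illustration; this is not the code Open64 or nvopencc generates, just the source-level shape of each strategy.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical small struct, used only for illustration.
struct Vec3 {
    float x;
    float y;
    float z;
};

// Byte-wise copy: takes the address of both structs and loops over bytes.
// Once an object's address is taken, it usually has to live in memory.
inline void copy_bytes(Vec3 *dst, const Vec3 *src) {
    assert(dst != nullptr && src != nullptr);
    unsigned char *d = reinterpret_cast<unsigned char *>(dst);
    const unsigned char *s = reinterpret_cast<const unsigned char *>(src);
    for (std::size_t i = 0; i < sizeof(Vec3); ++i) {
        d[i] = s[i];
    }
}

// Field-wise copy: three scalar moves. After inlining, each field can be
// assigned its own register and no memory traffic is needed.
inline Vec3 copy_fields(const Vec3 &src) {
    Vec3 dst;
    dst.x = src.x;
    dst.y = src.y;
    dst.z = src.z;
    return dst;
}
```

Both functions produce the same result; the difference is only in how friendly each form is to register allocation on a machine without a stack.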
Changes: Vector loads and stores
• Coalesce adjacent loads and stores for performance.
• Do this in CG (Open64’s code generator):
  • Iterate through ops, trying to add each one to a vector.
  • Check for intervening kills.
  • Change alignment and use dummy registers for padding if that helps create a wider vector (e.g. a 4-word vector may be used for a 3-word struct).
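The first two steps above can be sketched on a toy instruction list. Everything here is an assumption made for illustration: the `Op` record, the `Kind` enum, and `coalesce_loads` are hypothetical and do not correspond to Open64's CG data structures. The sketch handles only loads, treats every store as a kill, and leaves out the alignment and padding step entirely.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy IR; real CG ops carry far more information.
enum class Kind { Load, Store };

struct Op {
    Kind kind;
    int base;    // id of the base-address register
    int offset;  // word offset from the base
    int width;   // number of words accessed (1 for a scalar)
};

// Merge runs of adjacent loads into wider loads of at most max_width words.
// A store is treated conservatively as a kill that ends the current run.
std::vector<Op> coalesce_loads(const std::vector<Op> &ops, int max_width = 4) {
    assert(max_width > 0);
    std::vector<Op> out;
    int open = -1;  // index in `out` of the load that can still grow
    for (const Op &op : ops) {
        if (op.kind == Kind::Store) {
            open = -1;  // intervening kill: stop growing the vector
            out.push_back(op);
            continue;
        }
        if (open >= 0) {
            Op &vec = out[open];
            bool adjacent = op.base == vec.base &&
                            op.offset == vec.offset + vec.width;
            if (adjacent && vec.width + op.width <= max_width) {
                vec.width += op.width;  // absorb this load into the vector
                continue;
            }
        }
        out.push_back(op);
        open = static_cast<int>(out.size()) - 1;
    }
    return out;
}
```

For example, two adjacent loads, then a store, then two more adjacent loads come out as a 2-word load, the store, and a second 2-word load, because the store prevents the four loads from merging into one.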
Changes: 16-bit optimization
• It is cheaper to use 16-bit registers and operations.
• But C promotes short operands to int.
• So we added a pass in CG that converts back to 16-bit:
  • Mark 16-bit loads, stores, and converts.
  • Propagate 16-bit-ness forwards and backwards.
  • Unmark anything that cannot be 16-bit.
  • Change the remaining marked registers and instructions to 16-bit.
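A small sketch of why such a pass is needed and why it can be legal. It does not show the marking and propagation pass itself; it only illustrates, with hypothetical function names, that the language computes `short + short` as `int`, and that when the result is stored back into 16 bits a wraparound 16-bit add gives the same answer. It assumes a platform where `int` is wider than 16 bits and out-of-range conversions to `int16_t` wrap (guaranteed from C++20, and the behavior of mainstream compilers before that).

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// The language promotes both 16-bit operands to int before adding.
static_assert(
    std::is_same<decltype(std::int16_t{} + std::int16_t{}), int>::value,
    "short + short is computed as int");

// What the source code means: a 32-bit add, truncated on the store.
inline std::int16_t add_promoted(std::int16_t a, std::int16_t b) {
    int wide = a + b;
    return static_cast<std::int16_t>(wide);
}

// What a 16-bit pass could emit instead: a wraparound 16-bit add.
inline std::int16_t add_narrow(std::int16_t a, std::int16_t b) {
    std::uint16_t sum = static_cast<std::uint16_t>(
        static_cast<std::uint16_t>(a) + static_cast<std::uint16_t>(b));
    return static_cast<std::int16_t>(sum);
}

// Spot-check that both versions agree, including on overflowing inputs.
inline void verify_sample() {
    const std::int16_t samples[] = {0, 1, -1, 30000, -30000, 32767, -32768};
    for (std::int16_t a : samples) {
        for (std::int16_t b : samples) {
            assert(add_promoted(a, b) == add_narrow(a, b));
        }
    }
}
```

The equivalence holds because only the low 16 bits of the sum survive the store. A value that is later used at full width, for instance compared against an `int`, is the kind of case the "unmark" step has to catch.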
Future work
• Growing from 1 person to 4 people working with Open64
• New application TBA
• Merging changes into trunk
  • Thanks to Sun Chan and Shin!
• Investigating register pressure in WOPT
  • Want better control of register pressure during optimization
• Investigating using other features (LNO, IPA, etc.)
Questions? mmurphy@nvidia.com http://www.nvidia.com/CUDA