Agenda
• AVX overview
• Proposed AVX ABI changes
  • For IA-32
  • For x86-64
• AVX and vectorizer infrastructure
• Ongoing projects by the Intel GCC team:
  • Stack alignment
  • IA-32 relative performance numbers for SPEC CPU 2K and 2006
  • DWARF2 change for stack alignment
• Status of the GCC AVX branch
  • Limit AVX vectorizer to 128-bit
Intel® Advanced Vector Extensions (Intel® AVX)
A 256-bit vector extension to SSE: 2x vector width
[Figure: XMM0 (128 bits, 1999) widened to YMM0 (256 bits, 2010)]
• Intel® AVX extends all 16 XMM registers to 256 bits
• Intel® AVX works on either
  • the whole 256 bits, or
  • the lower 128 bits (like existing SSE instructions)
• A drop-in replacement for all existing scalar/128-bit SSE instructions
  • The upper part of the register is zeroed out
• No alignment fault on load-op arithmetic operations
Intel® Advanced Vector Extensions (Intel® AVX) – New Encoding System
• Nearly all SSE FP instructions are "promoted" to 256 bits
  • VADDPS YMM1, YMM2, [m256]
• Nearly all(*) SSE instructions are encodable in the new format
  • VADDPS XMM1, XMM2, [m128]
  • VMULSS XMM1, XMM2, [m32]
  • VPUNPCKHQDQ XMM1, XMM2, [m128]
• 128-bit and scalar promoted instructions have full interoperability with 256-bit operations
(*) Instructions referencing MMX registers are NOT promoted to Intel AVX
Key Intel® Advanced Vector Extensions (Intel® AVX) Features

KEY FEATURES | BENEFITS
• Wider vectors: increased from 128 bits to 256 bits | Up to 2x peak FLOPs (floating-point operations per second) output with good power efficiency
• Enhanced data rearrangement: use the new 256-bit primitives to broadcast, mask loads and permute data | Organize, access and pull only the necessary data more quickly and efficiently
• Three- and four-operand, non-destructive syntax, designed for efficiency and future extensibility | Fewer register copies, better register use for both vector and scalar code; more opportunities to fuse load and compute operations; flexible unaligned memory access support; code size reduction; extensible new opcode (VEX)

Intel® AVX is a general-purpose architecture, expected to supplant SSE in all applications used today
AVX-related changes
• Assembler support in binutils
  • Under -msse2avx, SSE instructions map to AVX instructions
• New data type __m256
  • Natural alignment is 32 bytes
  • Requires stack alignment greater than guaranteed by the IA-32 and x86-64 ABIs
• Intrinsics for AVX instructions
  • Under -mavx, SSE intrinsics are mapped to AVX instructions
• Automatic code generation
  • Vectorizer work ongoing
Proposed AVX ABI changes
• __m256 *p; a reference to *p generates a 32-byte aligned, 32-byte load
• In particular, __m256 variables on the stack also need to be 32-byte aligned
• Has implications for stack alignment (talked about yesterday) as well as parameter passing (today)
Parameter passing ABIs (Linux-32)
• Current
  • Almost everything on the stack
  • __m128 variables in xmm0-2, all caller-save
  • __m128 after the first 3 parameters passed on the stack
  • gcc assumes the stack is aligned at 16 bytes
  • Inserts padding as needed
• Proposed change
  • __m128/__m256 variables in xmm0-2/ymm0-2, all caller-save
    • Note the overlap between xmm and the lower halves of ymm
  • After the first 3 parameters, __m128/__m256 passed on the stack
  • Requires stack aligned to 32 bytes (Joey Ye's talk yesterday)
  • Insert padding as needed
  • Varargs dealt with later
Parameter passing ABIs (Linux-64)
• Current
  • __m128 variables in xmm0-7, all caller-save
  • __m128 after the first 8 parameters passed on the stack
  • ABI guarantees the stack is aligned to 16 bytes
  • Inserts padding as needed
• Proposed change
  • __m128/__m256 variables in xmm0-7/ymm0-7, all caller-save
    • Note the overlap between xmm and the lower halves of ymm
  • After the first 8 parameters, __m128/__m256 passed on the stack
  • Requires stack aligned to 32 bytes (Joey Ye's talk yesterday)
  • Insert padding as needed
  • Varargs dealt with later
Varargs (Linux-32)
• Current
  • Everything (named, unnamed, including __m128s) passed on the stack
  • gcc assumes the stack is aligned at 16 bytes
  • Inserts padding as needed
• Proposed change
  • Everything (named, unnamed, including __m128/__m256) passed on the stack
  • Stack aligned to 32 bytes if a __m256 is passed on the stack
  • Insert padding as needed
Varargs (Linux-64)
• Current
  • No change from the non-varargs case, except
    • Register al contains the number of xmm registers used for parameters
      • ABI page 50 (rax) and footnote 14 (al) may or may not be in contradiction
    • The callee has a register save area whose layout is defined by the ABI
      • rdi, rsi, rdx, rcx, r8, r9 (integer parameter registers) followed by xmm0-15
      • Why xmm0-15 instead of xmm0-7, when only xmm0-7 can hold parameters?
• Proposed change options
  • Align the stack to 32 bytes if __m256 parameters are passed on the stack
  • Register al contains the number of xmm/ymm registers used for parameters
  • For __m256 parameters
    • Option 1: all __m256 parameters (named, unnamed) on the stack
    • Option 2: only unnamed __m256 parameters on the stack
Unprototyped functions (Linux-64)
• Current
  • Same as prototyped + al defined
  • Works whether or not the function is actually varargs
• Proposed change (assume __m256 as parameter)
  • Same as prototyped + al defined(?)
  • Option 1 (all __m256 on stack): does not work if the function is a varargs function
  • Option 2 (unnamed __m256 on stack): does not work if a __m256 parameter is among the unnamed
  • For unprototyped functions, the caller must treat them as prototyped, with al defined, for performance
  • If we want unprototyped functions to work when they are really varargs functions, we must extend the register save area
    • Unknown performance penalty for all varargs functions
  • On Linux-32 (whose ABI for __m128 is similar to option 1), unprototyped functions do not work when they are really varargs functions
AVX and the Vectorizer
• AVX
  • 128-bit INT and 256-bit FP vector operations
  • Can use 256-bit FP vector AND, ANDN, OR, XOR to emulate 256-bit INT operations
  • 256-bit vector <-> 256-bit vector conversions
  • 128-bit vector <-> 256-bit vector conversions
• Vectorizer
  • Doesn't support vector conversion between different vector sizes
  • Doesn't support choosing different vector sizes based on the operations
AVX Branch Status
• Implemented:
  • AVX code generation: -mavx generates pure AVX instructions without legacy SSE instructions
  • AVX intrinsics
  • AVX vectorizer, limited to 128-bit
• To do:
  • Variable arguments
  • Verify unwind and debug info
  • AVX-specific tests
  • Runtime intrinsic tests
  • Unwind with 256-bit vectors
  • 256-bit vectorizer support
Stack Alignment Branch
• Collect stack alignment info in the middle end
• Use DW_OP operations to describe the call frame with stack alignment
  • Need to handle the DRAP properly without changing the CFA
• Implemented x86 target hooks for stack alignment
• Added ~70 C/C++ runtime testcases for stack alignment
• On a 45nm Core 2 Duo in 32-bit mode, compared against gcc 4.4 revision 133082 at -O2, stack alignment introduced 0% regression on SPEC CPU 2006 INT/FP, 0.3% regression on SPEC CPU 2K INT and 0.6% regression on SPEC CPU 2K FP
• Updated the gdb prologue analyzer to recognize x86 prologues with stack alignment
Float128 on x86
• GCC uses SSE/SSE2 to implement __float128
• GCC only supports __float128 on x86-64
• Update the IA-32 psABI:
  • Alignment: 16 bytes
  • Parameter (varargs) passing: on the stack, aligned at 16 bytes
  • Check TARGET_SSE/TARGET_SSE2/TARGET_SSE_MATH instead of TARGET_64BIT
• There is no run-time support for __float128:
  • I/O
    • String-to-__float128 function
  • Math library
    • IEEE 754R
    • Existing float128 API implementations
  • Implement float128 with DFP in a separate run-time library