


  1. Agenda
  • AVX overview
  • Proposed AVX ABI changes
    • For IA-32
    • For x86-64
  • AVX and vectorizer infrastructure
  • Ongoing projects by the Intel gcc team:
    • Stack alignment
    • IA32 relative performance numbers for SPEC CPU 2K and 2006
    • DWARF2 change for stack alignment
  • Status of the gcc AVX branch
    • Limit AVX vectorizer to 128 bits

  2. Intel® Advanced Vector Extensions (Intel® AVX): a 256-bit vector extension to SSE
  (Figure: the 128-bit XMM0 register (1999) extended to the 256-bit YMM0 register (2010), 2x the vector width)
  • Intel® AVX extends all 16 XMM registers to 256 bits
  • Intel® AVX works on either
    • the whole 256 bits, or
    • the lower 128 bits (like existing SSE instructions)
  • A drop-in replacement for all existing scalar/128-bit SSE instructions
    • The upper part of the register is zeroed out
  • No alignment fault on load-op arithmetic operations
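The zeroing rule above can be modeled in plain C. This is a scalar sketch of the semantics, not real intrinsics; the `ymm_t` type and `vaddps_xmm` function are illustrative names, not part of any AVX header.

```c
#include <string.h>

/* Scalar model of a 256-bit YMM register: 8 single-precision lanes. */
typedef struct { float lane[8]; } ymm_t;

/* Model of a VEX-encoded 128-bit VADDPS on XMM registers: the low
 * 128 bits (lanes 0-3) receive the sum, and the upper 128 bits are
 * zeroed out -- unlike a legacy SSE ADDPS, which would leave lanes
 * 4-7 of the underlying YMM register untouched. */
ymm_t vaddps_xmm(ymm_t a, ymm_t b) {
    ymm_t r;
    memset(&r, 0, sizeof r);              /* upper half zeroed */
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```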

  3. Intel® Advanced Vector Extensions (Intel® AVX): New Encoding System
  • Nearly all SSE FP instructions "promoted" to 256 bits
    • VADDPS YMM1, YMM2, [m256]
  • Nearly all(*) SSE instructions encodable in the new format
    • VADDPS XMM1, XMM2, [m128]
    • VMULSS XMM1, XMM2, [m32]
    • VPUNPCKHQDQ XMM1, XMM2, [m128]
  • 128-bit and scalar promoted instructions have full interoperability with 256-bit operations
  (*) Instructions referencing MMX registers are NOT promoted to Intel AVX
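The practical difference between the two encodings can be sketched in scalar C: the legacy SSE form is destructive (destination doubles as a source), while the VEX three-operand form leaves both sources intact. The function names here are hypothetical; real code would use intrinsics or assembly.

```c
/* Legacy SSE two-operand form, ADDPS xmm1, xmm2: the destination is
 * also a source, so preserving an input costs an extra register copy. */
void addps_sse(float dst[4], const float src[4]) {
    for (int i = 0; i < 4; i++)
        dst[i] += src[i];                 /* dst is destroyed */
}

/* AVX three-operand form, VADDPS xmm1, xmm2, xmm3: the destination is
 * written while both sources stay intact -- no copy needed. */
void vaddps_avx(float dst[4], const float a[4], const float b[4]) {
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];             /* a and b unchanged */
}
```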

  4. Key Intel® Advanced Vector Extensions (Intel® AVX) Features and Benefits
  • KEY FEATURES
    • Wider vectors: increased from 128 bits to 256 bits
    • Enhanced data rearrangement: use the new 256-bit primitives to broadcast, mask loads and permute data
    • Three- and four-operand, non-destructive syntax: designed for efficiency and future extensibility
    • Extensible new opcode (VEX)
  • BENEFITS
    • Up to 2x peak FLOPs (floating-point operations per second) output with good power efficiency
    • Organize, access and pull only necessary data more quickly and efficiently
    • Fewer register copies, better register use for both vector and scalar code
    • More opportunities to fuse load and compute operations
    • Flexible unaligned memory access support
    • Code size reduction
  Intel® AVX is a general-purpose architecture, expected to supplant SSE in all applications used today

  5. AVX-related changes
  • Assembler support in binutils
    • Under -msse2avx, SSE instructions map to AVX instructions
  • New data type __m256
    • Natural alignment is 32 bytes
    • Requires stack alignment greater than guaranteed by the IA-32 and x86-64 ABIs
  • Intrinsics for AVX instructions
    • Under -mavx, SSE intrinsics are mapped to AVX instructions
  • Automatic code generation
    • Vectorizer work ongoing
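A minimal sketch of the alignment point, assuming GCC's `aligned` attribute; `m256_model` is a stand-in for the real `__m256`, which comes from `<immintrin.h>` under -mavx.

```c
#include <stdint.h>

/* Stand-in for __m256: eight floats with the 32-byte natural
 * alignment the slide describes.  The IA-32 and x86-64 psABIs only
 * guarantee 16-byte stack alignment, hence the proposed ABI changes. */
typedef struct {
    float f[8];
} __attribute__((aligned(32))) m256_model;

/* 1 if p satisfies __m256's 32-byte alignment requirement. */
int is_32b_aligned(const void *p) {
    return ((uintptr_t)p & 31) == 0;
}
```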

  6. Proposed AVX ABI changes
  • For __m256 *p, a reference to *p will generate a 32-byte aligned, 32-byte load
  • In particular, __m256 variables on the stack also need to be 32-byte aligned
  • Has implications for stack alignment (discussed yesterday) as well as parameter passing (today)
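Until the stack is guaranteed to be 32-byte aligned, the usual workaround is to over-allocate and round the pointer up. A minimal sketch of that rounding (the caller must reserve 31 extra bytes so the rounded pointer stays in bounds):

```c
#include <stdint.h>

/* Round p up to the next 32-byte boundary -- the classic way to place
 * a 32-byte-aligned __m256 inside memory whose alignment is only
 * guaranteed to 16 bytes (or less). */
void *align_up_32(void *p) {
    return (void *)(((uintptr_t)p + 31) & ~(uintptr_t)31);
}
```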

  7. Parameter-passing ABIs (Linux-32)
  • Current
    • Almost everything on the stack
    • __m128 variables in xmm0-2, all caller-save
    • __m128 after the first 3 parameters passed on the stack
    • gcc assumes the stack is aligned at 16 bytes
    • Inserts padding as needed
  • Proposed change
    • __m128/__m256 variables in xmm0-2/ymm0-2, all caller-save
      • Note the overlap between xmm and the lower halves of ymm
    • After the first 3 parameters, __m128/__m256 passed on the stack
    • Requires the stack aligned to 32 bytes (Joey Ye's talk yesterday)
    • Insert padding as needed
    • Varargs dealt with later

  8. Parameter-passing ABIs (Linux-64)
  • Current
    • __m128 variables in xmm0-7, all caller-save
    • __m128 after the first 8 parameters passed on the stack
    • ABI guarantees the stack aligned to 16 bytes
    • Inserts padding as needed
  • Proposed change
    • __m128/__m256 variables in xmm0-7/ymm0-7, all caller-save
      • Note the overlap between xmm and the lower halves of ymm
    • After the first 8 parameters, __m128/__m256 passed on the stack
    • Requires the stack aligned to 32 bytes (Joey Ye's talk yesterday)
    • Insert padding as needed
    • Varargs dealt with later
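The two proposals above differ only in the number of vector registers available for parameters (3 on IA-32, 8 on x86-64). A toy classifier, purely illustrative and not part of any compiler, makes the rule concrete:

```c
/* Toy model of the proposed vector-argument classification: the first
 * `nregs` __m128/__m256 arguments (counting vector parameters left to
 * right from 0) go in xmm/ymm registers -- nregs is 3 on IA-32 and 8
 * on x86-64 -- and the rest are passed on the 32-byte aligned stack.
 * Returns the register number, or -1 for "passed on stack". */
int vector_arg_reg(int vec_index, int nregs) {
    return vec_index < nregs ? vec_index : -1;
}
```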

  9. Varargs (Linux-32)
  • Current
    • Everything (named, unnamed, including __m128s) passed on the stack
    • gcc assumes the stack is aligned at 16 bytes
    • Inserts padding as needed
  • Proposed change
    • Everything (named, unnamed, including __m128/__m256) passed on the stack
    • Stack aligned to 32 bytes if a __m256 is passed on the stack
    • Insert padding as needed

  10. Varargs (Linux-64)
  • Current
    • No change from the non-varargs case, except:
      • Register al contains the number of xmm registers used for parameters
        • ABI page 50 (rax) and footnote 14 (al) may or may not be in contradiction
      • The callee has a register save area whose layout is defined by the ABI
        • rdi, rsi, rdx, rcx, r8, r9 (integer parameter registers) followed by xmm0-15
        • Why xmm0-15 instead of xmm0-7, when only xmm0-7 can hold parameters?
  • Proposed change options
    • Align the stack to 32 bytes if __m256 parameters are passed on the stack
    • Register al contains the number of xmm/ymm registers used for parameters
    • For __m256 parameters:
      • Option 1: all __m256 parameters (named, unnamed) on the stack
      • Option 2: only unnamed __m256 parameters on the stack
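At the C level the %al convention is invisible; an ordinary varargs function relies on it through `va_start`. A sketch, assuming the x86-64 calling convention described above:

```c
#include <stdarg.h>

/* Sum n double arguments.  On x86-64/Linux the caller passes the
 * first 8 doubles in xmm0-7 and sets %al to the number of vector
 * registers used; the callee's va_start machinery then spills them
 * into the ABI-defined register save area described on the slide.
 * None of this is visible in the C source. */
double sum_doubles(int n, ...) {
    va_list ap;
    va_start(ap, n);
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += va_arg(ap, double);
    va_end(ap);
    return s;
}
```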

  11. Unprototyped functions (Linux-64)
  • Current
    • Same as prototyped, plus al defined
    • Works whether the function is varargs or not
  • Proposed change (assume __m256 as parameter)
    • Same as prototyped, plus al defined(?)
    • Option 1 (all __m256 on stack): does not work if the function is a varargs function
    • Option 2 (unnamed __m256 on stack): does not work if a __m256 parameter is among the unnamed
    • For unprototyped functions, the caller must treat them as prototyped, with al defined, for performance
    • If we want unprototyped functions to work when they are really varargs functions, the register save area must be extended
      • Unknown performance penalty for all varargs functions
    • On Linux-32 (whose __m128 ABI is similar to option 1), unprototyped functions do not work when they are really varargs functions

  12. AVX and the Vectorizer
  • AVX
    • 128-bit integer and 256-bit FP vector operations
    • Can use the 256-bit FP vector AND, ANDN, OR, XOR to emulate 256-bit integer operations
    • 256-bit vector <-> 256-bit vector conversions
    • 128-bit vector <-> 256-bit vector conversions
  • Vectorizer
    • Doesn't support conversion between vectors of different sizes
    • Doesn't support choosing different vector sizes based on the operations
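The bitwise-emulation point can be sketched in scalar C: VANDPS is a pure bit operation, so typing the 256 bits as eight 32-bit integers changes nothing about the result. The function name is illustrative:

```c
#include <stdint.h>

/* AVX has no 256-bit integer arithmetic, but VANDPS/VANDNPS/VORPS/
 * VXORPS operate on raw bits, so a 256-bit integer AND can run
 * through the FP pipeline.  Scalar model: eight 32-bit lanes ANDed
 * lane-wise, exactly what VANDPS does regardless of how the bits
 * would be interpreted as floats. */
void vandps_as_int(uint32_t r[8], const uint32_t a[8], const uint32_t b[8]) {
    for (int i = 0; i < 8; i++)
        r[i] = a[i] & b[i];
}
```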

  13. AVX Branch Status
  • Implemented:
    • AVX code generation: -mavx generates pure AVX instructions without legacy SSE instructions
    • AVX intrinsics
    • AVX vectorizer, limited to 128 bits
  • To do:
    • Variable arguments
    • Verify unwind and debug
      • Unwind with 256-bit vectors
    • AVX-specific tests
    • Runtime intrinsic tests
    • 256-bit vectorizer support

  14. Stack Alignment Branch
  • Collect stack alignment info in the middle end
  • Use DWARF expression (DW_OP_*) operations to describe call frames with stack alignment
    • Need to handle DRAP properly without changing the CFA
  • Implemented x86 target hooks for stack alignment
  • Added ~70 C/C++ runtime testcases for stack alignment
  • On a 45nm Core 2 Duo in 32-bit mode, compared against gcc 4.4 revision 133082 at -O2, stack alignment introduced 0% regression on SPEC CPU 2006 INT/FP, 0.3% regression on SPEC CPU 2K INT and 0.6% regression on SPEC CPU 2K FP
  • Updated the gdb prologue analyzer to recognize x86 prologues with stack alignment
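The realignment machinery can be observed from C with an over-aligned local, assuming GCC's `aligned` attribute. The compiler's prologue then rounds the stack pointer down (an `and $-32, %esp`/`%rsp` sequence) and, when incoming arguments must remain addressable, sets up the DRAP register mentioned above:

```c
#include <stdint.h>

/* A 32-byte aligned local forces the compiler to realign the stack
 * frame beyond the 16 bytes the ABI guarantees.  Returns 1 if the
 * local really received 32-byte alignment at runtime. */
int local_is_32b_aligned(void) {
    float buf[8] __attribute__((aligned(32)));
    buf[0] = 0.0f;                         /* keep buf live */
    return ((uintptr_t)buf & 31) == 0;
}
```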

  15. Float128 on x86
  • GCC uses SSE/SSE2 to implement __float128
  • GCC only supports __float128 on x86-64
  • Update the ia32 psABI:
    • Alignment: 16 bytes
    • Parameter (varargs) passing: on the stack, aligned at 16 bytes
    • Check TARGET_SSE/TARGET_SSE2/TARGET_SSE_MATH instead of TARGET_64BIT
  • There is no run-time support for __float128:
    • I/O
      • String to __float128 function
    • Math library
      • IEEE 754R
    • Existing float128 API implementations
    • Implement float128 with DFP in a separate run-time library
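The 16-byte size and alignment proposed above can be checked directly with GCC's `__float128` on x86-64. This sketch assumes a GCC-compatible compiler on that target, since `__float128` is a GNU extension:

```c
#include <stddef.h>

/* GCC implements __float128 (IEEE binary128) in software, using
 * SSE/SSE2 registers and libgcc helpers.  The psABI points on the
 * slide -- 16-byte size and 16-byte alignment -- are directly
 * observable properties of the type. */
size_t float128_size(void)  { return sizeof(__float128); }
size_t float128_align(void) { return __alignof__(__float128); }
```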

  16. Backup
