A. Sazegari AltiVec Technical Lead
Introduction
• AltiVec™ is an extension to the PowerPC Instruction Set Architecture
• Designed to extend Apple’s leadership position in multimedia processing
AltiVec is a trademark of Motorola, Inc.
What You’ll Learn
• About the AltiVec Architecture
• Its performance potential
• AltiVec programming
AltiVec Technology
• Vector/SIMD technology
  • Fixed-length vector operands (packed data)
  • Single Instruction Multiple Data
• RISC-style instruction set
• Optimized for digital signal processing
• Elevates multimedia to first-class data type
• Useful wherever data-parallelism exists
AltiVec Architecture
• New Vector Register File:
  • 32 new 128-bit wide registers
• New data-types:
  • Packed byte, halfword, and word integers
  • Packed IEEE single-precision floats
• Saturation Arithmetic capability
• 160 new PowerPC instructions
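PowerPC Architecture
[Block diagram: the instruction stream feeds the Branch Unit, the IU and the FPU. The IU works from the GRF, with a 32-bit path to memory; the FPU works from the FPRF, with a 64-bit path to memory.]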
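AltiVec Architecture
[Block diagram: the instruction stream feeds the Branch Unit, the IU, the FPU and the new Vector Unit. The IU works from the GRF (32-bit path to memory), the FPU from the FPRF (64-bit path), and the Vector Unit from the Vector Register File (128-bit path).]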
Programming Model
• Separate Vector Register File
  • More space for coefficients, variables, etc.
  • More names for scheduling
  • Wider for more parallelism
  • No interference with FP or integer
[Register diagram: Branch Registers (Cond, Count, Link), Time, Time, VRSave; General Reg. File GPR0 to GPR31 (32 bits) with XER; Floating-Point Register File FPR0 to FPR31 (64 bits) with FPSCR; Vector Register File VR0 to VR31 (128 bits) with VSCR. Each file has 32 registers.]
Vector Data Types
One vector (128 bits) holds:
• 16 signed or unsigned integer bytes
• 8 signed or unsigned integer halfwords
• 4 signed or unsigned integer words, or
• 4 IEEE single-precision floating-point numbers
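The four layouts on this slide can be pictured in portable C as one 128-bit union. This is only a sketch for illustration: the type name `vector128` and its field names are hypothetical, and the order of elements in memory follows the host machine rather than AltiVec's register numbering.

```c
#include <stdint.h>

/* Hypothetical portable picture of one 128-bit vector register.
   Every member overlays the same 16 bytes. */
typedef union {
    int8_t   s8[16];   /* 16 signed bytes */
    uint8_t  u8[16];   /* 16 unsigned bytes */
    int16_t  s16[8];   /* 8 signed halfwords */
    uint16_t u16[8];   /* 8 unsigned halfwords */
    int32_t  s32[4];   /* 4 signed words */
    uint32_t u32[4];   /* 4 unsigned words */
    float    f32[4];   /* 4 single-precision floats */
} vector128;
```
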
Simple SIMD Example
    T = vec_adds (A, B);   // vector signed short T, A, B
    vaddshs T, A, B
[Diagram: the eight halfword elements of VRA and VRB are added pairwise into VRT.]
• 8 halfword additions in one instruction
• Saturation arithmetic (clamp to max or min on overflow)
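What the single vec_adds() computes can be spelled out in scalar C. The sketch below is a model for reading, not AltiVec code; the function names `sat_add16` and `adds_s16x8` are hypothetical.

```c
#include <stdint.h>

/* One lane: add two signed halfwords and clamp to the 16-bit range. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* widen so the sum cannot wrap */
    if (sum > INT16_MAX) return INT16_MAX;   /* clamp on positive overflow */
    if (sum < INT16_MIN) return INT16_MIN;   /* clamp on negative overflow */
    return (int16_t)sum;
}

/* Scalar model of T = vec_adds(A, B) on vector signed short:
   the same saturating add applied to all eight halfword lanes. */
void adds_s16x8(int16_t t[8], const int16_t a[8], const int16_t b[8])
{
    for (int i = 0; i < 8; i++)
        t[i] = sat_add16(a[i], b[i]);
}
```

With ordinary modulo arithmetic, 32767 + 1 would wrap to -32768; here it stays at 32767, which is usually what audio and pixel code wants.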
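Vector Dot Product
[Diagram: vec_msum( ) multiplies the 16 element pairs of VRA1 and VRB1 and sums the products in four groups, together with VRC1, giving four partial sums in VRT1. That result (as VRA2) and VRB2 feed vec_sums( ), which reduces the four partial sums to a single sum in VRT2.]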
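The two steps of the dot product can be modeled in scalar C as below. This is a sketch that assumes the unsigned-byte form of vec_msum( ) (16 products, as the diagram suggests) and the signed, saturating behaviour of vec_sums( ); the function names are hypothetical.

```c
#include <stdint.h>

/* Step 1, modeled on vec_msum() for unsigned bytes: each of the four
   32-bit lanes receives c[lane] plus the sum of four byte products. */
void msum_u8(uint32_t t[4], const uint8_t a[16], const uint8_t b[16],
             const uint32_t c[4])
{
    for (int lane = 0; lane < 4; lane++) {
        uint32_t acc = c[lane];
        for (int k = 0; k < 4; k++)
            acc += (uint32_t)a[4 * lane + k] * b[4 * lane + k];
        t[lane] = acc;   /* wraps modulo 2^32, like unsigned arithmetic in C */
    }
}

/* Step 2, modeled on vec_sums(): add the four lanes of a to one extra
   word b and saturate the total to the signed 32-bit range. */
int32_t sums_s32(const int32_t a[4], int32_t b)
{
    int64_t total = b;
    for (int lane = 0; lane < 4; lane++)
        total += a[lane];
    if (total > INT32_MAX) return INT32_MAX;
    if (total < INT32_MIN) return INT32_MIN;
    return (int32_t)total;
}

/* Hypothetical wrapper: dot product of two 16-byte vectors. */
int32_t dot_u8x16(const uint8_t a[16], const uint8_t b[16])
{
    const uint32_t zero[4] = {0, 0, 0, 0};
    uint32_t partial[4];
    int32_t as_signed[4];

    msum_u8(partial, a, b, zero);
    for (int lane = 0; lane < 4; lane++)
        as_signed[lane] = (int32_t)partial[lane];  /* at most 4*255*255, fits */
    return sums_s32(as_signed, 0);
}
```

For longer arrays the partial sums would be fed back in as the accumulator of the next vec_msum( ), with vec_sums( ) applied once at the end.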
Arithmetic Operations
• Add, Subtract, Average
• Multiply, Multiply-add, Multiply-sum
• Logicals (and, andc, or, nor, xor)
• Rotates and shifts
• Compares
• Convert float <-> fixed (scaled)
• ÷ and √ via Newton-Raphson refinement of reciprocal estimate
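The last bullet can be illustrated for division. The sketch below shows one Newton-Raphson step for a reciprocal in scalar C; the initial estimate is passed in as a parameter because the hardware estimate instruction is not modeled here, and both function names are hypothetical. Square root follows the same pattern from a reciprocal square-root estimate and is not shown.

```c
/* One Newton-Raphson step for y ~= 1/x:  y1 = y0 + y0 * (1 - x * y0).
   Each step roughly squares the relative error of the estimate. */
float refine_reciprocal(float x, float y0)
{
    float residual = 1.0f - x * y0;   /* how far x*y0 is from 1 */
    return y0 + y0 * residual;
}

/* Hypothetical helper: a / x computed from an estimate of 1/x
   refined by the requested number of steps. */
float divide_by_estimate(float a, float x, float estimate, int steps)
{
    float y = estimate;
    for (int i = 0; i < steps; i++)
        y = refine_reciprocal(x, y);
    return a * y;
}
```

Starting from 0.333 as an estimate of 1/3, one step already gives about 0.333333; a coarser estimate simply needs more steps.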
Vector Permute
    T = vec_perm (A, B, C);
    VRC   17 18 D E F 1E 1 0 12 11 10 A 14 14 14 14
    VRA   0 1 2 3 4 5 6 7 8 9 A B C D E F
    VRB   10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
    VRT   each byte is taken from VRA:VRB at the position named by the matching byte of VRC
• Arbitrary bytewise data reorganization
• Small table-lookup
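A scalar C model makes the selection rule concrete. This is a sketch of what vec_perm( ) computes, not AltiVec code; the function name `perm_bytes` is hypothetical.

```c
#include <stdint.h>

/* Scalar model of T = vec_perm(A, B, C): result byte i is chosen from the
   32-byte concatenation A:B by the low 5 bits of control byte C[i]. */
void perm_bytes(uint8_t t[16], const uint8_t a[16], const uint8_t b[16],
                const uint8_t c[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned idx = c[i] & 0x1F;               /* 00-0F picks A, 10-1F picks B */
        t[i] = (idx < 16) ? a[idx] : b[idx - 16];
    }
}
```

Used as a table lookup, A and B hold a 32-entry byte table and C holds 16 indices, so one instruction performs 16 lookups.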
Compare and Select
Step 1: VRT1 = vec_cmpeq( VRA1, VRB1 )
    VRA1      C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1
    VRB1      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    VRT1/C2   00 FF FF FF 00 00 00 00 FF 00 FF FF 00 FF 00 00
Step 2: VRT2 = vec_sel( VRA2, VRB2, VRC2 )
    VRA1/A2   C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1
    VRB2      9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A
    VRT2      C1 9A 9A 9A 1A 1A C1 1A 9A C1 9A 9A 1A 9A 1A C1
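The same two steps in scalar C, as a sketch of the semantics rather than AltiVec code (the function names are hypothetical). The compare produces an all-ones or all-zeros mask per byte, and the select uses that mask to choose between two sources without a branch.

```c
#include <stdint.h>

/* Model of vec_cmpeq() on bytes: 0xFF where equal, 0x00 elsewhere. */
void cmpeq_u8(uint8_t mask[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        mask[i] = (a[i] == b[i]) ? 0xFF : 0x00;
}

/* Model of vec_sel(): bit by bit, take b where the mask bit is 1, else a. */
void sel_u8(uint8_t t[16], const uint8_t a[16], const uint8_t b[16],
            const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++)
        t[i] = (uint8_t)((a[i] & ~mask[i]) | (b[i] & mask[i]));
}
```

On the slide's data this replaces every zero byte of VRA1 with 9A and leaves the other bytes alone.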
Other AltiVec Instructions
• Load and Store (vector or scalar element)
• Pack, Unpack, and Merge elements
• Splat (element or literal replication)
• Bitwise vector shifts
• Double-vector bytewise shifts
Data Stream Prefetch
• Software directed prefetch into cache
• 4 simultaneous streams
• Independent and asynchronous
• Can be non-contiguous
[Diagram: a stream of blocks 1, 2, 3 ... N in memory. Block Size = 0-32 Vectors; 0-256 Blocks; Stride = ±32KBytes.]
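The three stream parameters describe a simple address pattern, sketched below in portable C. The struct and function names are hypothetical and are not the AltiVec prefetch interface; the code only shows which addresses a stream with a given block size, block count and stride would cover, assuming 16-byte vectors.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one prefetch stream, following the slide's
   three parameters. Field names are illustrative only. */
struct stream {
    uintptr_t base;         /* address of the first block */
    unsigned  block_size;   /* vectors (16 bytes each) per block */
    unsigned  block_count;  /* number of blocks in the stream */
    int32_t   stride;       /* signed byte distance between block starts */
};

/* Start address of block k (0-based). A negative stride walks downward. */
uintptr_t block_address(const struct stream *s, unsigned k)
{
    return s->base + (uintptr_t)((intptr_t)s->stride * (intptr_t)k);
}

/* Total number of bytes the stream asks to have brought into the cache. */
size_t stream_bytes(const struct stream *s)
{
    return (size_t)s->block_size * 16u * s->block_count;
}
```

A column of pixels in an image is a typical non-contiguous case: a small block per row, with the stride set to the row length in bytes.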
Typical Implementation
• ALL instructions fully-pipelined with single-cycle throughput
  • Simple ops: 1 cycle latency
  • Compound ops: 3–4 cycle latency
• Dual AltiVec instruction issue
  • One arithmetic, one “permute”
• No restriction on issue with scalar instructions
AltiVec vs. MMX
• Both SIMD, but AltiVec:
  • Does everything MMX does, plus
  • Twice the SIMD parallelism
  • 4x the register namespace
  • 8x the register storage space
  • No mode switch or use overhead
  • Permute
  • Richer set of DSP instructions
AltiVec Performance
• Peak Performance
• Multimedia “kernels”
• DSP benchmarks
• Performance based on cycle-accurate simulator with real memory effects included
• Performance stated relative to optimized PowerPC scalar code
Peak Performance
• Vector operations at 400MHz:
  • Integer
    • 12.8 billion arithmetic ops/sec
    • + 6.4 billion byte crossbar ops/sec
  • Floating-point
    • 3.2 gigaflops
    • + 1.6 billion FP crossbar ops/sec
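The slide does not show how these figures are derived. One reading that reproduces all four, offered as an assumption rather than the presenter's stated method: one arithmetic and one permute instruction issue per cycle at 400 MHz, a vector holds 16 bytes or 4 floats, and a multiply-add or multiply-sum is counted as two operations per element.

```c
#include <stdint.h>

/* ops/sec = elements per vector * ops counted per element * instructions/sec.
   The counting convention (2 ops for a fused multiply-add) is an assumption. */
uint64_t peak_ops(unsigned lanes, unsigned ops_per_lane, uint64_t clock_hz)
{
    return (uint64_t)lanes * ops_per_lane * clock_hz;
}
```

Under that reading, peak_ops(16, 2, 400000000) gives the 12.8 billion integer figure, peak_ops(16, 1, 400000000) the 6.4 billion byte crossbar figure, peak_ops(4, 2, 400000000) the 3.2 gigaflops, and peak_ops(4, 1, 400000000) the 1.6 billion FP crossbar figure.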
Multimedia Kernels
• Video and Audio
  • 11.4x Discrete Cosine Transform (DCT)
  • 16.1x* Motion estimation (* by ∑|A-B|)
  • 12.5x Quantization
  • 9.6x RGB -> YCbCr (CCIR601)
  • 3.6x Inverse FFT (FP)
  • 4.9x Windowing (FP)
Multimedia Kernels
• Image Processing
  • 6.2x Bilinear interpolation
  • 1.1cy/px Separable convolution
  • 2.2cy/px RGB to YUV
  • 1.3cy/px Median Filter (3x3)
Multimedia Kernels
• Graphics
  • 6.2x Vector-matrix multiply (FP)
  • 17.5x Buffer accumulation
  • 6.6x Line clipping
  • 6.3x Bezier curves
Communication Kernels
• Modems and Telephony
  • 2.5x CRC-32
  • 10.5x 64-QAM Demodulator
  • 7.6x Linear prediction
  • 9.3x Real 13-tap FIR
  • 30.7x Autocorrelation
  • 12.5x GSM Module 4.2.11
Miscellaneous DSP Kernels
• Miscellaneous
  • 2.5 to 20x Parallel table lookup
  • 10.0x Sorting
  • 5.8x Associative search
  • 16.0x Galois field multiply
  • 4.0x Gamma Correction
  • 12.0cy/block Haar Transform (wavelet)
DSP Benchmarks
• Results from an independent DSP benchmarking firm indicate AltiVec on integer DSP algorithms (FIR, FFT, etc.) is:
  • Twice as fast as the world’s fastest DSP (TMS320C6201) per clock, and four times faster including frequency
  • 2 to 5 times faster than Pentium™ II per clock (but µP would still be 35% smaller)
AltiVec Tools
• Programming Model and ABI
• Compilers and assemblers
  • Motorola’s MCC CodeWarrior plug-in
  • Apple’s MrC and PPCASM in MPW and MW
  • Metrowerks C/C++
• Emulator/Trace generator
• MacsBug
• Cycle-accurate simulator
• Performance profiler
Programming in C
• 11 new fundamental packed data types
• AltiVec operators
  • Parse like function calls
  • Specific operators -> assembly instructions
  • Generic operators type sensitive
• sizeof(), a=b, &a, *p, etc.
• Compiler does register allocation, inlining, code scheduling, etc.
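C Program Example
    /* Computes z = (x >> 11) + y, treating each vector as one 128-bit integer. */
    vector unsigned long x, y, z, carry, zero;   /* x and y are the inputs */
    vector unsigned char shiftFactor;

    zero = ( vector unsigned long ) ( 0 );   // or: zero = vec_xor ( zero, zero );
    shiftFactor = vec_splat_u8 ( 11 );       // 11 in every byte
    z = vec_sro ( x, shiftFactor );          // shift right by whole bytes (1 byte)
    z = vec_srl ( z, shiftFactor );          // shift right by remaining bits (3 bits)
    do {
        carry = vec_addc ( z, y );           // carry-out of each 32-bit word
        z = vec_add ( z, y );                // per-word sum, modulo 2^32
        y = vec_sld ( carry, zero, 4 );      // move each carry up one word
    } while ( !vec_all_eq ( y, zero ) );     // repeat until no carries remain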
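The do/while loop in this example performs a 128-bit addition using only 32-bit lane arithmetic. A scalar C model of that loop, as a sketch for following the carry propagation (the type `words128` and function `add128` are hypothetical names):

```c
#include <stdint.h>

/* A 128-bit value as four 32-bit words, w[0] most significant. */
typedef struct { uint32_t w[4]; } words128;

/* Model of the loop: add word by word, collect the carry-outs, move each
   carry one word toward the most significant end, repeat until none remain. */
words128 add128(words128 z, words128 y)
{
    for (;;) {
        words128 carry;
        int any = 0;
        for (int i = 0; i < 4; i++) {
            uint32_t sum = z.w[i] + y.w[i];
            carry.w[i] = (sum < z.w[i]);   /* like vec_addc: 1 on carry-out */
            z.w[i] = sum;                  /* like vec_add: modulo 2^32 */
        }
        /* like vec_sld(carry, zero, 4): shift left one word, zero enters w[3] */
        for (int i = 0; i < 3; i++)
            y.w[i] = carry.w[i + 1];
        y.w[3] = 0;
        for (int i = 0; i < 4; i++)
            any |= (y.w[i] != 0);
        if (!any)                          /* like vec_all_eq(y, zero) */
            return z;
    }
}
```

A carry out of the most significant word is shifted away and lost, so the result is the sum modulo 2^128. In the common case with no carries the loop runs once.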
Vector Shifts
The ‘shiftFactor’ vector is populated in 2 sections, one for “vector shift right by octet” (vsro) and one for “vector shift right” (vsr):
    bit #     ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 ||
    used by       || <------- vsro ------> || <---- vsr ----> ||
vsro is based on the permute crossbar and shifts by whole bytes; vsr is a 0 to 7 bit shift. Used sequentially, the combination of these instructions shifts a vector register right (or left) by 0 to 127 bits, as specified in bits 121:127 of ‘shiftFactor’. For a shift of 11 bits (one byte, then three bits):
    bit #             ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 ||
    shiftFactor =     ... ||  0  |  0  |  0  |  1  ||  0  |  1  |  1  ||
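The split of the shift factor into a byte count and a bit count can be modeled in scalar C. This is a sketch of the right-shift case only; the type `vec128` and the function names are hypothetical, and byte 0 is taken as the most significant byte.

```c
#include <stdint.h>

/* A 128-bit value as 16 bytes, b[0] most significant. */
typedef struct { uint8_t b[16]; } vec128;

/* The vsro part: shift right by whole bytes, filling with zeros. */
vec128 shift_right_octets(vec128 x, unsigned nbytes)
{
    vec128 t;
    for (int i = 0; i < 16; i++)
        t.b[i] = (i >= (int)nbytes) ? x.b[i - (int)nbytes] : 0;
    return t;
}

/* The vsr part: shift right by 0..7 bits across byte boundaries. */
vec128 shift_right_bits(vec128 x, unsigned nbits)
{
    vec128 t;
    for (int i = 0; i < 16; i++) {
        unsigned hi = (i > 0) ? x.b[i - 1] : 0;   /* next more significant byte */
        t.b[i] = (uint8_t)(((hi << 8) | x.b[i]) >> nbits);
    }
    return t;
}

/* Both parts in sequence, driven by a 7-bit shift factor (0..127). */
vec128 shift_right(vec128 x, unsigned shift_factor)
{
    unsigned nbytes = (shift_factor >> 3) & 0xF;  /* bits 121:124 */
    unsigned nbits  = shift_factor & 0x7;         /* bits 125:127 */
    return shift_right_bits(shift_right_octets(x, nbytes), nbits);
}
```

For a shift factor of 11 the byte count is 1 and the bit count is 3, matching the bit pattern 0001 011 shown on the slide.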
AltiVec at Apple
• Mac OS (blockmove, etc.)
• QuickDraw
• QTML (codecs, rasterizers…)
• Media source code library
• g4@apple.com
AltiVec Summary
• Major architectural extension will make future PowerPCs great media processors
• Early programming tools available now
• Development systems 2H98 (Now)
• AltiVec based systems in 1H99