Vector Unit Assembly bquintero@fullsail.com
Overview • Architecture Review • VU0 Macro Mode Instruction Set • Building a Vector Library
Review • The PlayStation 2 has two vector units that are similar but not identical • VU0 is the CPU's alternate processing unit • VU1 is the GS's alternate processing unit • Each unit has a direct pipeline to its respective processor • The vector units are designed for 4D x 32-bit vectors
Review • VU0 and VU1 each have access to 32 float registers and 16 integer registers • The float registers are not like PC registers; they are 128 bits wide (a PC register is 32 bits) • 128 bits hold four float values at once (a 4D vector) • The integer registers are typically used as loop counters and for address calculation
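Since each VU float register holds four packed floats, the C-side vector type is normally a 16-byte-aligned struct of four floats. The later examples use a Vec4T type that the slides never define; the sketch below is an assumption (the name matches the slides, but the exact layout and the GCC alignment attribute are illustrative):

/* Hypothetical definition of the Vec4T type used in the later examples.
   The 16-byte alignment is needed because lqc2/sqc2 move a full 128-bit
   quadword and require quadword-aligned addresses. */
typedef struct
{
    float x, y, z, w;
} Vec4T __attribute__((aligned(16)));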
Review • VU0 has two bus lines • One bus is dedicated to the CPU • The other bus is shared and used to communicate with all other devices • VU0 has 4 KB of instruction cache (I$) and 4 KB of data cache (D$) • [Diagram: the CPU core connects to VU0 over the dedicated bus; VU0 connects to system RAM over the shared bus; VU0 is shown with 4 KB I$ and 4 KB D$]
Vector Unit Processing Speed • The graph shows timings for several vector-math-intensive functions • 200K calls were made to each function
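For context, a measurement like the one behind the graph can be taken with a simple timing loop. The sketch below is only an assumption about methodology, not the original benchmark code: VectorMathFunc, TimeIt, and the use of clock() are hypothetical placeholders (the real benchmark may have used a hardware counter instead).

#include <stdio.h>
#include <time.h>

#define NUM_CALLS 200000

/* Stand-in (hypothetical) for whichever vector routine is being measured,
   e.g. one of the add / cross-product functions shown later in the deck. */
extern void VectorMathFunc(void);

void TimeIt(void)
{
    int i;
    clock_t start = clock();

    for (i = 0; i < NUM_CALLS; ++i)
        VectorMathFunc();

    printf("%d calls took %f seconds\n", NUM_CALLS,
           (double)(clock() - start) / CLOCKS_PER_SEC);
}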
Macro and Micro Modes • Vector Unit Zero (VU0) has two modes • Micro mode lets the vector processor act as an independent CPU • A mini program is uploaded and executed in parallel with the main CPU • Macro mode lets the CPU directly offload heavy vector computation with low overhead • Macro mode is by far the most popular method
Micro Mode • Once uploaded, the micro program executes independently of the CPU • This means execution must be timed so that the CPU fetches the result only after the vector unit has finished the program • Micro mode can cause serious stalls and timing issues, since execution speed is nearly impossible to determine
Macro Mode • Macro mode is a much easier way to execute fast math functionality • Assembly can be written as inline instructions, telling the compiler to offload the math to VU0 • Notes • Just because it's in assembly does not mean it will be faster • Switching CPU focus has its overheads
Assembly Structure • There is typically a specific pattern to writing assembly routines • Load the variable data/addresses into registers • Apply the vector computations to those registers • Store the result back to a variable address • The overhead of using assembly is in the loads and stores • Make sure the computation stage improves performance enough to offset the load/store overhead
Vector Unit MIPS Instructions • Coprocessor transfer instructions • Store / Load • Coprocessor branch instructions • Macro (primitive) calculation instructions • Add / Subtract / Multiply / Divide / etc. • Micro subroutine execution instructions (VU macro instructions)
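A few representative mnemonics for each category (added here for reference; the slide itself names only the categories):

    Transfer:        lqc2 / sqc2 (load/store a 128-bit quadword), qmtc2 / qmfc2 (move data between CPU and VU0 registers)
    Branch:          bc2t / bc2f (branch on the COP2 condition flag)
    Calculation:     vadd, vsub, vmul, vdiv, vopmula / vopmsub
    Micro execution: vcallms / vcallmsr (start a VU0 micro subroutine)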
EEVectorAdd • Adding two vectors using the EE Core (CPU)

void EEVectorAdd(Vec4T *v0, Vec4T *v1, Vec4T *v2)
{
    v2->x = v0->x + v1->x;
    v2->y = v0->y + v1->y;
    v2->z = v0->z + v1->z;
    v2->w = v0->w + v1->w;
}
VectorAdd • Adding two vectors using VU0

void VectorAdd(Vec4T *v0, Vec4T *v1, Vec4T *v2)
{
    asm __volatile__ ("
        lqc2      vf05, 0x0(%0)        # load *v0 into vf05
        lqc2      vf06, 0x0(%1)        # load *v1 into vf06
        vadd.xyzw vf07, vf05, vf06     # vf07 = vf05 + vf06
        sqc2      vf07, 0x0(%2)        # store vf07 to *v2
        " : : "r" (v0), "r" (v1), "r" (v2));
}
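A minimal usage sketch follows; the variable names and the printed output are illustrative assumptions, not part of the original slides. Both routines should produce the same result, with the second one doing the arithmetic on VU0:

#include <stdio.h>

int main(void)
{
    /* Operands must be 16-byte aligned for lqc2/sqc2 (see the Vec4T sketch earlier). */
    Vec4T a = {  1.0f,  2.0f,  3.0f,  4.0f };
    Vec4T b = { 10.0f, 20.0f, 30.0f, 40.0f };
    Vec4T sum_ee, sum_vu0;

    EEVectorAdd(&a, &b, &sum_ee);   /* plain CPU version      */
    VectorAdd(&a, &b, &sum_vu0);    /* VU0 macro-mode version */

    printf("EE : %f %f %f %f\n", sum_ee.x,  sum_ee.y,  sum_ee.z,  sum_ee.w);
    printf("VU0: %f %f %f %f\n", sum_vu0.x, sum_vu0.y, sum_vu0.z, sum_vu0.w);
    return 0;
}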
EECrossProduct • Notice how we must use a temp, in case cross points at v1 or v2

void EECrossProduct(Vec4T *v1, Vec4T *v2, Vec4T *cross)
{
    Vec4T temp;
    temp.x = v1->y * v2->z - v1->z * v2->y;
    temp.y = v1->z * v2->x - v1->x * v2->z;
    temp.z = v1->x * v2->y - v1->y * v2->x;
    VectorCopy(&temp, cross);
}
CrossProduct • Computing the cross product on VU0

void CrossProduct(Vec4T *v1, Vec4T *v2, Vec4T *cross)
{
    asm __volatile__("
        lqc2        vf05, 0x0(%0)      # load *v1 into vf05
        lqc2        vf06, 0x0(%1)      # load *v2 into vf06
        vopmula.xyz ACC,  vf05, vf06   # first outer-product half -> ACC
        vopmsub.xyz vf06, vf06, vf05   # ACC minus second half = v1 x v2
        vsub.w      vf06, vf00, vf00   # w = 0
        sqc2        vf06, 0x0(%2)      # store the result to *cross
        " : // No Output
          : "r"(v1), "r"(v2), "r"(cross));
}
Vector Outer Product • The vopmula instruction performs an outer product • The result is stored in the special-purpose ACC register • [Diagram: the X/Y/Z components of VF05 and VF06 are combined pairwise into the X/Y/Z components of ACC]
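Putting the two instructions together for the CrossProduct routine above, and using the standard VU0 definitions of OPMULA/OPMSUB (each multiplies rotated component pairs), the components work out as follows. This expansion is added here for clarity and is not part of the original slides; vf05 holds v1 and vf06 holds v2, and the right-hand sides use the original (pre-write) contents of vf06:

    vopmula.xyz ACC, vf05, vf06:
        ACC.x = vf05.y * vf06.z
        ACC.y = vf05.z * vf06.x
        ACC.z = vf05.x * vf06.y

    vopmsub.xyz vf06, vf06, vf05:
        vf06.x = ACC.x - vf06.y * vf05.z  =  v1.y*v2.z - v1.z*v2.y
        vf06.y = ACC.y - vf06.z * vf05.x  =  v1.z*v2.x - v1.x*v2.z
        vf06.z = ACC.z - vf06.x * vf05.y  =  v1.x*v2.y - v1.y*v2.x

This is exactly the component-by-component cross product v1 x v2 computed in EECrossProduct.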
For Next Time • Read Chapters 7.3.2 – 7.4.2 • Read Chapter 9.3