
Vector Unit Assembly



Presentation Transcript


  1. Vector Unit Assembly bquintero@fullsail.com

  2. Overview • Architecture Review • VU0 Macro Mode Instruction Set • Building a Vector Library

  3. Review • PlayStation 2 has two vector units that are similar but not identical • VU0 is the CPU's alternate processing unit • VU1 is the GS's alternate processing unit • Each unit has a direct pipeline to its respective processor • Vector units are designed for 4D x 32-bit vectors

  4. Review • VU0 and VU1 each have access to 32 float registers and 16 integer registers • Float registers are not like PC registers; they are 128 bits in size (PC is 32-bit) • 128 bits can hold 4 float values at once (a 4D vector) • Integer registers are typically used as loop counters and for address calculation
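To make the "four floats in 128 bits" point concrete, here is a minimal sketch of how the Vec4T type used on the later slides might be laid out. The field names match the slides, but the exact definition is an assumption; the slides never show it. The 16-byte alignment is included because quadword loads such as lqc2 expect an address on a 16-byte boundary.

```c
#include <stddef.h>

/* Assumed layout for Vec4T (not shown on the slides): four 32-bit floats
   that together fill one 128-bit quadword, aligned to 16 bytes. */
typedef struct Vec4T {
    _Alignas(16) float x;  /* C11 alignment specifier on the first member */
    float y;
    float z;
    float w;
} Vec4T;

/* Compile-time checks that the struct really is one aligned quadword. */
_Static_assert(sizeof(float) == 4, "this sketch assumes 32-bit floats");
_Static_assert(sizeof(Vec4T) == 16, "Vec4T should be exactly 128 bits");
_Static_assert(_Alignof(Vec4T) == 16, "Vec4T should be 16-byte aligned");
```

If either assumption fails on a given compiler, the build stops at the static assertions instead of producing misaligned loads at run time.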

  5. Review • VU0 has two bus lines • One bus is dedicated to the CPU • The other bus is shared and is used to communicate with all other devices • VU0 has 4KB of $ • [Diagram labels: dedicated bus, CPU core, VU0, shared bus, I$ 4KB, D$ 4KB, SYS RAM]

  6. Vector Unit Processing Speed • The graph shows some vector-math intensive function calls • 200K calls were made to each function

  7. Macro and Micro Modes • Vector Unit Zero (VU0) has two modes • Micro mode allows the vector processor to act as an independent CPU • A mini program is uploaded and executed in parallel with the main CPU • Macro mode allows the CPU to offload heavy vector computation directly, with low overhead • Macro mode is the most popular method, hands down

  8. Micro Mode • Once uploaded, the micro program executes independently of the CPU • This means we must time our execution so that the CPU fetches the result only after the Vector Unit has completed the program • Micro mode can cause serious stalls and timing issues, since execution time is nearly impossible to determine

  9. Macro Mode • Macro mode is a much easier way to execute fast math functionality • Assembly can be written as inline instructions, telling the compiler to offload the math to VU0 • Notes • Just because it's in assembly does not mean it will be faster • Switching CPU focus has its overhead

  10. Assembly Structure • Assembly routines typically follow a specific pattern • Load the variable data/addresses into registers • Apply vector computations to those registers • Store the result back to a variable address • The overhead of using assembly is in the load and store • Make sure the computation stage improves performance enough to offset the load/store overhead
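The load / compute / store pattern above can be sketched in portable C before touching any VU0 instructions. This is only a model: Reg128, load_q, add_xyzw, store_q and SketchVectorAdd are hypothetical names invented for illustration, and the Vec4T layout is assumed. On real hardware each helper would correspond to a single instruction (a quadword load, a vector add, a quadword store).

```c
/* Assumed layout of the slides' Vec4T type. */
typedef struct { float x, y, z, w; } Vec4T;

/* Hypothetical stand-in for one 128-bit VU0 float register. */
typedef struct { float f[4]; } Reg128;

/* Step 1: load a vector from memory into a "register". */
static Reg128 load_q(const Vec4T *p)
{
    Reg128 r = { { p->x, p->y, p->z, p->w } };
    return r;
}

/* Step 2: compute on registers only (all four fields at once). */
static Reg128 add_xyzw(Reg128 a, Reg128 b)
{
    Reg128 r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}

/* Step 3: store the register back to memory. */
static void store_q(Vec4T *p, Reg128 r)
{
    p->x = r.f[0];
    p->y = r.f[1];
    p->z = r.f[2];
    p->w = r.f[3];
}

/* Two loads and one store wrap a single add, so for work this small
   the load/store overhead dominates the computation. */
void SketchVectorAdd(const Vec4T *v0, const Vec4T *v1, Vec4T *v2)
{
    Reg128 r5 = load_q(v0);
    Reg128 r6 = load_q(v1);
    Reg128 r7 = add_xyzw(r5, r6);
    store_q(v2, r7);
}
```

The ratio is the point: three memory operations per arithmetic operation here. Routines that do more work per load, such as a matrix transform, have a better chance of paying back the overhead.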

  11. Vector Unit MIPS Instructions • Coprocessor transfer instructions • Store / Load • Coprocessor branch instructions • Macro (primitive) calculation instructions • Add / Subtract / Multiply / Divide / etc. • Micro subroutine execution instructions (VU Macro Instructions)

  12. EEVectorAdd • Adding two vectors using the EE Core (CPU)

      void EEVectorAdd(Vec4T *v0, Vec4T *v1, Vec4T *v2)
      {
          v2->x = v0->x + v1->x;
          v2->y = v0->y + v1->y;
          v2->z = v0->z + v1->z;
          v2->w = v0->w + v1->w;
      }

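  13. VectorAdd • Adding two vectors using VU0

      void VectorAdd(Vec4T *v0, Vec4T *v1, Vec4T *v2)
      {
          asm __volatile__ (
              "lqc2      vf05, 0x0(%0)    \n"   /* load *v0 into vf05 */
              "lqc2      vf06, 0x0(%1)    \n"   /* load *v1 into vf06 */
              "vadd.xyzw vf07, vf05, vf06 \n"   /* vf07 = vf05 + vf06 */
              "sqc2      vf07, 0x0(%2)    \n"   /* store vf07 to *v2  */
              : /* no outputs */
              : "r" (v0), "r" (v1), "r" (v2)
              : "memory");                      /* the asm writes through v2 */
      }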

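  14. EECrossProduct • Notice that we must use a temp: cross may point to the same vector as v1 or v2, and each component of the result depends on components the others would overwrite

      void EECrossProduct(Vec4T *v1, Vec4T *v2, Vec4T *cross)
      {
          Vec4T temp;
          temp.x = v1->y * v2->z - v1->z * v2->y;
          temp.y = v1->z * v2->x - v1->x * v2->z;
          temp.z = v1->x * v2->y - v1->y * v2->x;
          temp.w = 0.0f;   /* keep w defined; the VU0 version also sets w = 0 */
          VectorCopy(&temp, cross);
      }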
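The need for the temp shows up when the output pointer aliases an input. The sketch below compares a version that writes straight into the output with one that uses a temp. CrossNoTemp and CrossWithTemp are hypothetical names, the Vec4T layout is assumed, and a plain struct assignment stands in for the slides' VectorCopy.

```c
/* Assumed layout of the slides' Vec4T type. */
typedef struct { float x, y, z, w; } Vec4T;

/* Writes each component directly into *cross. If cross aliases v1 or v2,
   later lines read components that earlier lines already overwrote. */
void CrossNoTemp(const Vec4T *v1, const Vec4T *v2, Vec4T *cross)
{
    cross->x = v1->y * v2->z - v1->z * v2->y;
    cross->y = v1->z * v2->x - v1->x * v2->z;  /* v1->x may be clobbered */
    cross->z = v1->x * v2->y - v1->y * v2->x;  /* v1->x, v1->y may be clobbered */
    cross->w = 0.0f;
}

/* Computes into a local temp first, then copies, so aliasing is harmless. */
void CrossWithTemp(const Vec4T *v1, const Vec4T *v2, Vec4T *cross)
{
    Vec4T temp;
    temp.x = v1->y * v2->z - v1->z * v2->y;
    temp.y = v1->z * v2->x - v1->x * v2->z;
    temp.z = v1->x * v2->y - v1->y * v2->x;
    temp.w = 0.0f;
    *cross = temp;  /* stands in for VectorCopy(&temp, cross) */
}
```

With v1 = (1, 0, 0) and v2 = (0, 1, 0), storing the result back into v1 gives (0, 0, 1) with the temp, but (0, 0, 0) without it, because v1->x has been zeroed before the z component is computed.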

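  15. CrossProduct

      void CrossProduct(Vec4T *v1, Vec4T *v2, Vec4T *cross)
      {
          asm __volatile__ (
              "lqc2        vf05, 0x0(%0)    \n"   /* load *v1 into vf05 */
              "lqc2        vf06, 0x0(%1)    \n"   /* load *v2 into vf06 */
              "vopmula.xyz ACC,  vf05, vf06 \n"   /* first product term, into ACC */
              "vopmsub.xyz vf06, vf06, vf05 \n"   /* ACC minus second product term */
              "vsub.w      vf06, vf00, vf00 \n"   /* w = 0 */
              "sqc2        vf06, 0x0(%2)    \n"   /* store result to *cross */
              : /* no outputs */
              : "r" (v1), "r" (v2), "r" (cross)
              : "memory");                        /* the asm writes through cross */
      }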

  16. Vector Outer Product • The vopmula instruction performs an outer product • The result is stored into the special-purpose ACC register • [Diagram labels: VF05 (X, Y, Z), VF06 (X, Y, Z), ACC (X, Y, Z)]
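A plain C model can show why the vopmula / vopmsub pair on slide 15 yields a cross product. The model assumes the usual description of these instructions: vopmula writes fs.yzx * ft.zxy into ACC, and vopmsub subtracts fs.yzx * ft.zxy from ACC. The function names opmula, opmsub and CrossModel are hypothetical, as is the Vec4T layout.

```c
/* Assumed layout of the slides' Vec4T type. */
typedef struct { float x, y, z, w; } Vec4T;

/* Models "vopmula.xyz ACC, fs, ft": ACC.xyz = fs.yzx * ft.zxy.
   The w field is not touched by the .xyz form; it is zeroed here for clarity. */
static Vec4T opmula(Vec4T fs, Vec4T ft)
{
    Vec4T acc = { fs.y * ft.z, fs.z * ft.x, fs.x * ft.y, 0.0f };
    return acc;
}

/* Models "vopmsub.xyz fd, fs, ft": fd.xyz = ACC.xyz - fs.yzx * ft.zxy. */
static Vec4T opmsub(Vec4T acc, Vec4T fs, Vec4T ft)
{
    Vec4T fd = { acc.x - fs.y * ft.z,
                 acc.y - fs.z * ft.x,
                 acc.z - fs.x * ft.y,
                 0.0f };
    return fd;
}

/* Same operand order as slide 15: (v1, v2) for the first instruction,
   then (v2, v1) for the second, giving v1 x v2 with w = 0. */
Vec4T CrossModel(Vec4T v1, Vec4T v2)
{
    Vec4T acc = opmula(v1, v2);
    return opmsub(acc, v2, v1);
}
```

Expanding the x field gives v1.y * v2.z - v2.y * v1.z, which matches temp.x in EECrossProduct; the y and z fields follow the same rotation.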

  17. For Next Time • Read Chapters 7.3.2 – 7.4.2 • Read Chapter 9.3
