SIMD Floating Point with VU0

If you’re unfamiliar with the acronym SIMD, it stands for Single Instruction, Multiple Data. SIMD instructions are provided by x86 instruction set extensions such as SSE, SSE2, SSE3, SSE4, AVX, AVX2 and the failure that was AVX-512.

SIMD essentially allows you to do maths on multiple numbers at a time. This is faster for a few different reasons, one of which is less work for the instruction cache, since one instruction replaces several. If you’re curious about the optimizations related to these concepts, check out loop unrolling or vectorization.
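To make the loop unrolling idea a bit more concrete, here is a minimal sketch in plain C with no intrinsics. The names `scale_scalar` and `scale_unrolled4` are made up for this illustration. The second function processes four adjacent floats per iteration, which is the shape a 128-bit SIMD multiply operates on. Whether a compiler actually turns either loop into SIMD instructions depends on the target and flags.

```c
#include <stddef.h>

// Scalar version: one multiply per loop iteration.
void scale_scalar(float *data, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= factor;
}

// Manually unrolled by four (hypothetical helper for illustration).
// Each iteration touches a group of four adjacent floats.
void scale_unrolled4(float *data, size_t n, float factor)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        data[i + 0] *= factor;
        data[i + 1] *= factor;
        data[i + 2] *= factor;
        data[i + 3] *= factor;
    }
    // Handle the leftover elements when n is not a multiple of four.
    for (; i < n; i++)
        data[i] *= factor;
}
```

Both functions should produce the same values. The unrolled one just makes the groups of four explicit.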

Here are two ways to achieve a broadcast multiplication, one of which uses SIMD.

* Using GCC 13.2 with flags -O3 and -mno-sse for the unoptimized version.

  • #include <xmmintrin.h>

    // Multiply the 4 elements in the array by the first element in the array
    // This is called a broadcast multiply
    void unoptimized_array_bc_mult(float in[4])
    {
        // in[0] is squared last so the other elements see its original value
        in[3] *= in[0];
        in[2] *= in[0];
        in[1] *= in[0];
        in[0] *= in[0];
    }

    void optimized_array_bc_mult(float in[4])
    {
        // Load the array into a vector register
        // (in must be 16-byte aligned for _mm_load_ps)
        // arr = {in[0], in[1], in[2], in[3]}
        __m128 arr = _mm_load_ps(in);
        // Load the first element into all elements in another register
        // multiplier = {in[0], in[0], in[0], in[0]}
        __m128 multiplier = _mm_load1_ps(in);
        // Multiply the two vectors
        // arr = {in[0], in[1], in[2], in[3]}
        //             X
        // {in[0], in[0], in[0], in[0]}
        arr = _mm_mul_ps(arr, multiplier);
        // Store it back to the array
        _mm_store_ps(in, arr);
    }
  • unoptimized_array_bc_mult:
        fld     DWORD PTR [rdi]
        fld     DWORD PTR [rdi+12]
        fmul    st, st(1)
        fstp    DWORD PTR [rdi+12]
        fld     DWORD PTR [rdi+8]
        fmul    st, st(1)
        fstp    DWORD PTR [rdi+8]
        fld     DWORD PTR [rdi+4]
        fmul    st, st(1)
        fstp    DWORD PTR [rdi+4]
        fmul    st, st(0)
        fstp    DWORD PTR [rdi]
        ret

    optimized_array_bc_mult:
        movss   xmm0, DWORD PTR [rdi]
        shufps  xmm0, xmm0, 0
        mulps   xmm0, XMMWORD PTR [rdi]
        movaps  XMMWORD PTR [rdi], xmm0
        ret
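One thing the SIMD version quietly relies on is alignment: `_mm_load_ps`, `_mm_store_ps` and the `movaps` instruction all expect a 16-byte aligned address. Below is a small sketch of one way to get such storage in C11. The `vec4_aligned` type and `is_aligned16` helper are hypothetical names for this example, not part of any library.

```c
#include <stdalign.h>
#include <stdint.h>

// Hypothetical wrapper type: four floats forced onto a 16-byte boundary.
typedef struct {
    alignas(16) float v[4];
} vec4_aligned;

// Returns 1 when p sits on a 16-byte boundary, 0 otherwise.
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

Passing the `v` member of a `vec4_aligned` to the optimized function should satisfy the alignment requirement, while a plain `float[4]` on the stack carries no such guarantee.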

Anyways, enough of x86, where is the EE MIPS?


The PS2’s EE doesn’t directly handle floating point operations. These are handled by the COP1 FPU (Floating Point Unit). When developing, however, this is seamlessly handled by the compiler, so unless you’re manually writing assembly this fact doesn’t matter to you. An easy trick to spot an FPU instruction is to look for a 1 in the mnemonic (as in lwc1) or a .s suffix.

  • float cop0_add(float a, float b)
    {
    	return a + b;
    }
  • 00101138 <cop0_add>:
      03e00008        jr      ra
      # add.s is an FPU instruction!
      460d6000        add.s   $f0,$f12,$f13

    Sorry, no colour formatting :(

So, if I were to compile the previous unoptimized multiply function for the PS2, it would look like this.

00101138 <unoptimized_array_bc_mult>:
  101138:       c4800000        lwc1    $f0,0(a0)
  10113c:       c483000c        lwc1    $f3,12(a0)
  101140:       c4820008        lwc1    $f2,8(a0)
  101144:       c4810004        lwc1    $f1,4(a0)
  101148:       460018c2        mul.s   $f3,$f3,$f0
  10114c:       46001082        mul.s   $f2,$f2,$f0
  101150:       46000842        mul.s   $f1,$f1,$f0
  101154:       46000002        mul.s   $f0,$f0,$f0
  101158:       e483000c        swc1    $f3,12(a0)
  10115c:       e4820008        swc1    $f2,8(a0)
  101160:       e4810004        swc1    $f1,4(a0)
  101164:       03e00008        jr      ra
  101168:       e4800000        swc1    $f0,0(a0)
  10116c:       00000000        nop

The FPU’s Cousin, VU0

The PS2 actually has two more “FPUs”: Vector Units 0 and 1. VU0 and VU1 are fully programmable processors (when used in micro mode). VU1 is used more for the graphics pipeline (it has a direct connection to the GS) while VU0 doesn’t really have a fixed purpose.

VU0 is connected directly to the EE. Just like the FPU, the EE can directly issue instructions and manipulate VU0’s registers. This usage of VU0 is called macro mode. It allows us to use the power of VU0 without having to write an entire program for it. (Which is fun, I recommend it!)

Something to note, while the FPU is COP1, VU0 is COP2.

What makes the VUs special? Their floating point operations are purely SIMD. Each VU floating point register holds 4 floating point numbers (equivalent to an x86 XMM register), called x, y, z and w.
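As a mental model only, a VU floating point register can be pictured as a struct of four floats. The sketch below is plain C that runs on any host. The names `vu_reg` and `vu_model_mulx` are invented for this illustration and are not part of any SDK. It models a broadcast multiply where every field of one register is multiplied by the x field of another. It uses the host's float arithmetic, which may differ from real VU0 behaviour in edge cases.

```c
// Hypothetical model of one VU floating point register: four 32-bit floats.
typedef struct {
    float x, y, z, w;
} vu_reg;

// Models a broadcast multiply by the x field:
// every field of a is multiplied by b.x.
vu_reg vu_model_mulx(vu_reg a, vu_reg b)
{
    vu_reg out;
    out.x = a.x * b.x;
    out.y = a.y * b.x;
    out.z = a.z * b.x;
    out.w = a.w * b.x;
    return out;
}
```

Calling `vu_model_mulx(r, r)` gives the same result as the broadcast multiply from earlier: each element scaled by the first one.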

Putting it all together

Unfortunately there are no VU0 instruction intrinsics, so I have to write the VU0 instructions manually. Here is the optimized broadcast multiply code.

  • void optimized_array_bc_mult(float in[4])
    {
    	asm volatile(
    		// Move the array into the VU0 register
    		"lqc2 $vf1, %0\n"
    		// Multiply $vf1xyzw by $vf1x
    		"vmulx.xyzw $vf1, $vf1, $vf1\n"
    		// Move the VU0 register back into the array
    		"sqc2 $vf1, %0\n"
    		// Operand list reconstructed (assumption): in[0] is read and written
    		: "+m"(in[0])
    		:
    		// The memory barrier is required
    		// GCC does not know that elements 1,2,3 are modified
    		: "memory");
    }
  • 00101170 <optimized_array_bc_mult>:
      d8810000        lqc2          $vf1,0(a0)
      4be10858        vmulx.xyzw    $vf1xyzw,$vf1xyzw,$vf1x
      03e00008        jr            ra
      f8810000        sqc2          $vf1,0(a0)

The benchmark results, well, aren’t too exciting. Remember, we are only doing a broadcast multiply on a 4 element array. The larger the amount of data and the more complex the algorithm, the larger the performance gains.

With 10000 iterations, the unoptimized method took around 18 cycles, while the VU0 method took 12.

Macro mode of course won’t be possible while a VU0 micro program is executing. This is a major design consideration when developing high performance software for the PS2. If you’re not using VU0 for micro mode, however, then skipping macro mode means losing out on an essentially free optimization.

Thankfully the PS2SDK provides 3D vector functions that utilize VU0. The library is ‘math3d’ and the source can currently be found here. If this doesn’t fit your needs, get ready to write some assembly :)