SIMD Floating Point with VU0
If you’re unfamiliar with the acronym, SIMD stands for Single Instruction, Multiple Data. SIMD instructions are what you’ll find in x86 extensions such as SSE, SSE2, SSE3, SSE4, AVX, AVX2 and the failure that was AVX-512.
SIMD essentially allows you to do maths on multiple numbers at a time. This is faster for a few different reasons, one being less work for the instruction cache (fewer instructions for the same amount of work). If you’re curious about the optimizations built around these concepts, look up loop unrolling and vectorization.
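As a quick illustration of the kind of code these optimizations target, here is a simple scalar loop; with -O3 a vectorizing compiler such as GCC or Clang will typically turn it into SIMD instructions that process several elements per iteration. (This snippet is mine, purely illustrative, and separate from the broadcast example below.)
-
// A scalar loop that a vectorizing compiler can turn into SIMD:
// each iteration is independent, so several elements can be
// multiplied by a single instruction.
void scale_array(float *data, float factor, int count)
{
    for (int i = 0; i < count; i++)
        data[i] *= factor;
}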
Here is an example of two ways to achieve a broadcast multiplication, one of them using SIMD.
* Using GCC 13.2 with flags -O3 and -mno-sse for the unoptimized version.
-
#include <xmmintrin.h> // SSE intrinsics

// Multiply the 4 elements in the array by the first element in the array
// This is called a broadcast multiply
void unoptimized_array_bc_mult(float in[4])
{
    in[3] *= in[0];
    in[2] *= in[0];
    in[1] *= in[0];
    in[0] *= in[0];
}

void optimized_array_bc_mult(float in[4])
{
    // Load the array into a vector register
    // arr = {in[0], in[1], in[2], in[3]}
    __m128 arr = _mm_load_ps(in);

    // Load the first element into all elements of another register
    // multiplier = {in[0], in[0], in[0], in[0]}
    __m128 multiplier = _mm_load1_ps(in);

    // Multiply the two vectors
    //   {in[0], in[1], in[2], in[3]}
    // x {in[0], in[0], in[0], in[0]}
    arr = _mm_mul_ps(arr, multiplier);

    // Store the result back to the array
    _mm_store_ps(in, arr);
}
-
unoptimized_array_bc_mult:
        fld     DWORD PTR [rdi]
        fld     DWORD PTR [rdi+12]
        fmul    st, st(1)
        fstp    DWORD PTR [rdi+12]
        fld     DWORD PTR [rdi+8]
        fmul    st, st(1)
        fstp    DWORD PTR [rdi+8]
        fld     DWORD PTR [rdi+4]
        fmul    st, st(1)
        fstp    DWORD PTR [rdi+4]
        fmul    st, st(0)
        fstp    DWORD PTR [rdi]
        ret

optimized_array_bc_mult:
        movss   xmm0, DWORD PTR [rdi]
        shufps  xmm0, xmm0, 0
        mulps   xmm0, XMMWORD PTR [rdi]
        movaps  XMMWORD PTR [rdi], xmm0
        ret
Anyway, enough of x86. Where is the EE MIPS?
The EE FPU
The PS2’s EE doesn’t handle floating point operations directly; they are handled by the COP1 FPU (Floating Point Unit). When developing, however, this is handled seamlessly by the compiler, so unless you’re writing assembly by hand this fact doesn’t matter to you. An easy trick for spotting an FPU instruction is to look for a 1 or a .s suffix.
-
float cop1_add(float a, float b)
{
    return a + b;
}
-
00101138 <cop1_add>:
  03e00008  jr     ra
  # add.s is an FPU instruction!
  460d6000  add.s  $f0,$f12,$f13
Sorry, no colour formatting :(
So, if I were to compile the previous unoptimized multiply function for the PS2, it would look like this.
00101138 <unoptimized_array_bc_mult>:
101138: c4800000 lwc1 $f0,0(a0)
10113c: c483000c lwc1 $f3,12(a0)
101140: c4820008 lwc1 $f2,8(a0)
101144: c4810004 lwc1 $f1,4(a0)
101148: 460018c2 mul.s $f3,$f3,$f0
10114c: 46001082 mul.s $f2,$f2,$f0
101150: 46000842 mul.s $f1,$f1,$f0
101154: 46000002 mul.s $f0,$f0,$f0
101158: e483000c swc1 $f3,12(a0)
10115c: e4820008 swc1 $f2,8(a0)
101160: e4810004 swc1 $f1,4(a0)
101164: 03e00008 jr ra
101168: e4800000 swc1 $f0,0(a0)
10116c: 00000000 nop
The FPU's Cousin, VU0
The PS2 actually has two more “FPUs”: Vector Units 0 and 1. VU0 and VU1 are fully programmable processors (when used in micro mode). VU1 is mostly used for the graphics pipeline (it has a direct connection to the GS), while VU0 doesn’t really have a fixed purpose.
VU0 is connected directly to the EE. Just as with the FPU, the EE can issue instructions to VU0 and manipulate its registers directly. This usage of VU0 is called macro mode, and it lets us use the power of VU0 without having to write an entire program for it. (Which is fun, I recommend it!)
Something to note: while the FPU is COP1, VU0 is COP2.
What makes the VUs special? Their floating point operations are purely SIMD. Each VU floating point register holds 4 floating point numbers (equivalent to an x86 XMM register), called x, y, z and w.
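Before getting to the real code, here is a plain C sketch of what a broadcast instruction such as vmulx.xyzw computes, since the example below relies on it. The struct and function names are mine and purely illustrative; they don’t come from any PS2 header.
-
// Illustrative only: a scalar model of vmulx.xyzw.
// Each VU register holds four floats, addressed as x, y, z and w.
typedef struct { float x, y, z, w; } vu_reg;

// vmulx.xyzw vd, vs, vt: every field of vs is multiplied by the x
// field of vt. The "x" in the mnemonic is the broadcast field and
// "xyzw" is the destination write mask.
static vu_reg vmulx_xyzw(vu_reg vs, vu_reg vt)
{
    vu_reg vd;
    vd.x = vs.x * vt.x;
    vd.y = vs.y * vt.x;
    vd.z = vs.z * vt.x;
    vd.w = vs.w * vt.x;
    return vd;
}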
Putting it all together
Unfortunately there are no intrinsics for the VU0 instructions, so I’ll have to write them manually as inline assembly. Here is the optimized broadcast multiply code.
-
void optimized_array_bc_mult(float in[4])
{
    asm volatile (
        // Move the array into the VU0 register
        "lqc2 $vf1, %0\n"
        // Multiply $vf1xyzw by $vf1x
        "vmulx.xyzw $vf1, $vf1, $vf1\n"
        // Move the VU0 register back into the array
        "sqc2 $vf1, %0\n"
        // The memory clobber is required:
        // GCC does not know that elements 1, 2 and 3 are also modified
        : "=m"(in[0])
        : "m"(in[0])
        : "memory"
    );
}
-
00101170 <optimized_array_bc_mult>:
  d8810000  lqc2        $vf1,0(a0)
  4be10858  vmulx.xyzw  $vf1xyzw,$vf1xyzw,$vf1x
  03e00008  jr          ra
  f8810000  sqc2        $vf1,0(a0)
The benchmark results, well, aren’t too exciting. Remember, we are only doing a broadcast multiply on a 4-element array. With larger amounts of data and a more complex algorithm, the performance gains get larger.
With 10000 iterations, the unoptimized method took around 18 cycles, while the VU0 method took 12.
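If you want to take this kind of measurement yourself, one option on the EE is to read the COP0 Count register (COP0 register 9) before and after the code under test. Here is a minimal sketch, assuming GCC-style inline asm; the helper name is mine, not from any SDK.
-
// Read the EE's COP0 Count register (register 9). It increments at a
// fixed rate tied to the CPU clock, so treat the difference between
// two reads as a relative cycle count rather than an absolute one.
static inline unsigned int read_cycle_counter(void)
{
    unsigned int count;
    asm volatile("mfc0 %0, $9" : "=r"(count));
    return count;
}

// Usage sketch:
//   unsigned int start = read_cycle_counter();
//   for (int i = 0; i < 10000; i++)
//       optimized_array_bc_mult(values);
//   unsigned int cycles = read_cycle_counter() - start;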
Using VU0 in macro mode, of course, won’t be possible while a VU0 micro program is executing. This is a major design consideration when developing high performance software for the PS2. If you’re not using VU0 in micro mode, however, you’re essentially losing out on a free optimization.
Thankfully the PS2SDK provides 3D vector functions that utilize VU0. The library is ‘math3d’ and the source can currently be found here. If this doesn’t fit your needs, get ready to write some assembly :)
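For a rough idea of what the math3d route looks like, here is a sketch. I’m quoting vector_multiply and the VECTOR type from memory, so treat the exact names and signatures as assumptions and check them against the math3d headers.
-
#include <math3d.h>

// VECTOR in the PS2SDK is a 16-byte-aligned array of four floats,
// which maps directly onto a single VU0 register.
void math3d_example(void)
{
    VECTOR a = {1.0f, 2.0f, 3.0f, 1.0f};
    VECTOR b = {2.0f, 2.0f, 2.0f, 1.0f};
    VECTOR result;

    // Component-wise multiply; math3d implements its vector functions
    // with VU0 macro mode instructions under the hood.
    vector_multiply(result, a, b);
}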