
Detecting a PS2 Emulator: The VU0 Pipeline

Nowadays, detecting whether your PlayStation 2 software is running on an emulator isn’t a major concern. However, exploring the various methods we can use to achieve this can be an educational experience.

I intend to make this a short series. For each method I’ll rate the difficulty, explain how it works, and provide some code.


The VU0 Macro Mode (COP2) Pipeline

As on any other processor, VU0 instructions take time, usually measured in cycles. Instructions are not “blocking”, where the next instruction always waits for the previous one to finish before it starts. Instead, they execute asynchronously in what are known as pipelines.

Processor design is out of scope for this post, so I won’t get into the theory of pipelines. I will, however, list the pipelines available for VU0 Macro mode.

- General Instructions
- VDIV/VSQRT
- VRSQRT
- VCALLMS/VCALLMSR

VCALLMS/VCALLMSR and VRSQRT are not going to be used here and can be ignored. The important part is how the ‘General Instructions’ and ‘VDIV/VSQRT’ pipelines are used. ‘General Instructions’ mostly read from and write to the VF00 to VF31 registers.

VADD VF03, VF01, VF02 ; VF03 = VF01 + VF02
VMUL VF05, VF03, VF04 ; VF05 = VF03 * VF04

The VDIV and VSQRT instructions write to a special register called Q.

VDIV Q, VF01, VF02 ; Q = VF01 / VF02

Pipeline Stalls

Let’s investigate the example above. VADD has a throughput/latency of 1/4, which means the destination register will contain the result 4 cycles after the instruction is issued.

The following code, however, still works as expected.

VADD VF03, VF01, VF02 ; VF03 = VF01 + VF02
VMUL VF05, VF03, VF04 ; VF05 = VF03 * VF04

This is because the pipeline stalls until VF03 is ready (once VADD finishes). Although the two instructions look like they are 1 cycle apart in the assembly, once we account for the pipeline stall (due to the register dependency), VMUL can’t execute until the 5th cycle. This is a property of the ‘General Instructions’ pipeline.
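To make the stall concrete, here is a small C sketch of the idea that runs on a PC, not on a PS2. It is a toy model, not a cycle-accurate one: the `GeneralPipe` struct and the `issue` function are names I made up for illustration, and using a latency of 4 for VMUL is my assumption, since only VADD’s figure is given above.

```c
#include <assert.h>

#define NUM_VF 32

/* Hypothetical toy model: each VF register remembers the cycle on which
   its pending result becomes readable. */
typedef struct {
    int ready_at[NUM_VF]; /* cycle at which each VF register is valid */
    int cycle;            /* earliest cycle the next instruction may issue */
} GeneralPipe;

/* Issue an instruction that reads src_a and src_b and writes dst.
   Returns the cycle on which it actually issued, after any stall. */
static int issue(GeneralPipe *p, int dst, int src_a, int src_b, int latency)
{
    assert(dst >= 0 && dst < NUM_VF);
    assert(src_a >= 0 && src_a < NUM_VF);
    assert(src_b >= 0 && src_b < NUM_VF);

    int start = p->cycle;

    /* Register dependency: wait until both sources are ready. */
    if (p->ready_at[src_a] > start) start = p->ready_at[src_a];
    if (p->ready_at[src_b] > start) start = p->ready_at[src_b];

    p->ready_at[dst] = start + latency;
    p->cycle = start + 1; /* throughput of 1: next issue one cycle later */
    return start;
}
```

Starting from a zeroed `GeneralPipe`, `issue(&p, 3, 1, 2, 4)` (the VADD) returns 0 and `issue(&p, 5, 3, 4, 4)` (the VMUL) returns 4. Counting from zero, cycle 4 is the 5th cycle.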

The VDIV/VSQRT pipeline, however, is a little different.

VDIV has a throughput/latency of 7/7, which means the destination register (Q) will contain the result of the division after 7 cycles.

VDIV Q, VF01, VF02  ; Q = VF01 / VF02
VADDq VF03, VF00, Q ; VF03 = Q

What’s different is that there is no pipeline stall for Q here. VADDq executes immediately, regardless of the fact that VDIV is still executing, so it reads whatever Q held before the VDIV. The result of VADDq is therefore undefined in our example.
This isn’t a strange quirk or a footgun; there are uses for it, but that is off topic for now.

If you need the VDIV result in the very next instruction, there is a handy instruction available: VWAITQ, which waits for the VDIV to finish. This code is perfectly valid.

VDIV Q, VF01, VF02  ; Q = VF01 / VF02
VWAITQ
VADDq VF03, VF00, Q ; VF03 = Q
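The difference between the two snippets can be sketched in C on a PC. This is only a toy model under my own assumptions: `QPipe` and the `q_*` functions are hypothetical names, every modelled instruction is taken to cost one cycle, and I assume a VDIV issued while another is in flight waits for the first to finish. Division-by-zero flags are not modelled.

```c
#include <assert.h>

#define VDIV_LATENCY 7 /* cycles until the VDIV result appears in Q */

/* Hypothetical toy model of the Q register and the VDIV pipeline. */
typedef struct {
    float q;         /* value currently visible in Q */
    float pending;   /* result of the in-flight VDIV */
    int   remaining; /* cycles until pending lands in Q; 0 means idle */
} QPipe;

/* Advance one cycle; commit the pending result when the count hits 0. */
static void q_tick(QPipe *p)
{
    if (p->remaining > 0 && --p->remaining == 0)
        p->q = p->pending;
}

/* VWAITQ: stall until any in-flight division has written Q. */
static void q_waitq(QPipe *p)
{
    while (p->remaining > 0)
        q_tick(p);
}

/* VDIV: start a division. Q keeps its old value until it finishes. */
static void q_vdiv(QPipe *p, float num, float den)
{
    assert(den != 0.0f); /* the toy model does not cover divide-by-zero */
    q_waitq(p);          /* assumption: a busy divider stalls a new VDIV */
    p->pending = num / den;
    p->remaining = VDIV_LATENCY;
    q_tick(p);
}

/* VADDq-style read: no stall, returns whatever Q holds right now. */
static float q_read(QPipe *p)
{
    float value = p->q;
    q_tick(p);
    return value;
}
```

With Q starting at 1.0f, calling `q_vdiv(&p, 6.0f, 2.0f)` followed directly by `q_read(&p)` still yields the stale 1.0f. Calling `q_waitq(&p)` first makes the next read return 3.0f.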

What’s not emulated

The following information is correct as of the current PCSX2 version 1.7.5865.

So, I hear you asking, what’s the issue? The issue is that PCSX2 ignores the VDIV/VSQRT pipeline: VDIV and VSQRT write their result to Q immediately. At this point in time, tracking the pipeline would be somewhat detrimental to performance. For the games that do use this pipeline in arguably mean ways, it’s easily remedied by what we call a ‘COP2’ patch. In the easiest case, you swap the VDIV and whatever uses Q, and it works out fine. Because it’s such an easy fix, no one has opted to implement this pipeline, and you can argue that it works out to be a net positive.

Remember our undefined, bad example using VDIV? This works perfectly fine in PCSX2.

VDIV Q, VF01, VF02  ; Q = VF01 / VF02
VADDq VF03, VF00, Q ; VF03 = Q

And if you’re wondering, VWAITQ doesn’t do anything in PCSX2; it’s just a VNOP.

Finally, detecting the lack of VDIV pipeline

I think I have you primed for this now. The solution is incredibly simple despite its complex explanation. There are ways you can shave off one or two instructions; I’ll leave that as an exercise for the reader.

    ; Let the VF1x register hold 2.0f
    ; VF0x is always 0.0f

    ; Initialize the Q register to 1.0f
    VDIV Q, VF1x, VF1x ; Q = 2.0f / 2.0f = 1
    VWAITQ             ; Wait for VDIV to finish

    VDIV Q, VF0x, VF1x ; Q = 0.0f / 2.0f = 0
    ; Immediately use Q. If the pipeline is not emulated, Q is already 0
    ; If the pipeline is emulated, VDIV is still executing and Q is still 1
    VADDq VF1, VF0, Q  ; VF1 = 0 + Q

    ; If VF1 is 0, the pipeline is not emulated
    ; If VF1 is 1, the pipeline is emulated

And in C

int isVDIVPipelined()
{
    float num __aligned(16) = 2.0f;
    asm __volatile__(
        "QMTC2 %1, $vf1\n"        // Set VF1 to 2.0f
        // Initialize the Q register to 1.0f
        "VDIV $Q, $vf1x, $vf1x\n" // Q = 2.0f / 2.0f = 1
        "VWAITQ\n"                // Wait for VDIV to finish
        "VDIV $Q, $vf0x, $vf1x\n" // Q = 0.0f / 2.0f = 0
        // Immediately use Q. If the pipeline is not emulated, Q is already 0
        // If the pipeline is emulated, VDIV is still executing and Q is still 1
        "VADDQ $vf1, $vf0, $Q\n"  // VF1 = 0 + Q
        "SQC2 $vf1, %0\n"         // Store VF1 back to memory for the EE to read
        : "=m"(num)
        : "r"(num));
    return num == 1.0f;
}
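If you don’t have a PS2 to hand, the logic of the check can be tried out with a C sketch that runs on a PC. This is a toy model under my own assumptions, not how any emulator is actually written: `QModel`, the `model_*` functions, and `detects_pipeline` are hypothetical names, each modelled instruction costs one cycle, and a latency of 0 stands in for an emulator that writes Q immediately.

```c
#include <stdbool.h>

/* Hypothetical toy model of Q with a configurable VDIV latency. */
typedef struct {
    float q;         /* value currently visible in Q */
    float pending;   /* result of the in-flight VDIV */
    int   remaining; /* cycles until pending lands in Q; 0 means idle */
    int   latency;   /* 7 models the hardware, 0 models an immediate write */
} QModel;

static void model_tick(QModel *m)
{
    if (m->remaining > 0 && --m->remaining == 0)
        m->q = m->pending;
}

/* VWAITQ: stall until the in-flight division has written Q. */
static void model_waitq(QModel *m)
{
    while (m->remaining > 0)
        model_tick(m);
}

/* VDIV: with latency 0 the result is visible at once. */
static void model_vdiv(QModel *m, float num, float den)
{
    model_waitq(m);
    if (m->latency == 0) {
        m->q = num / den;
        return;
    }
    m->pending = num / den;
    m->remaining = m->latency;
    model_tick(m);
}

/* VADDq-style read: never stalls, returns the current Q. */
static float model_read_q(QModel *m)
{
    float value = m->q;
    model_tick(m);
    return value;
}

/* Same sequence as the assembly above, run against the model. */
static bool detects_pipeline(int latency)
{
    QModel m = {0.0f, 0.0f, 0, latency};
    model_vdiv(&m, 2.0f, 2.0f);      /* Q = 1 once finished */
    model_waitq(&m);                 /* make sure Q really is 1 */
    model_vdiv(&m, 0.0f, 2.0f);      /* Q becomes 0 after the latency */
    return model_read_q(&m) == 1.0f; /* stale 1 means Q is pipelined */
}
```

In this model `detects_pipeline(7)` returns true and `detects_pipeline(0)` returns false, which mirrors the VF1 = 1 and VF1 = 0 outcomes described above.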

Out of PCSX2, Play!, hpsx64, and DobieStation, the only emulator that currently has VDIV pipeline support is DobieStation.

I rate the difficulty of this one a 2/5.