PlayStation 2 Data Cache Mapping
This issue came from a PS2 demo that generated TLB misses (reads and writes to unmapped virtual memory) on PCSX2.
There are many reasons why this could happen, but what was suspicious was the target address: it was in the 0x70004xxx range. If you remember from my other posts, the ScratchPad RAM (SPR) is a 16 KB page of high-speed memory virtually mapped between 0x70000000 and 0x70003FFF. By default nothing is mapped above it, hence the TLB lookup misses.
The Red Herring
The original assumption was that the demo was mapping a new page and we were witnessing the results of lackluster TLB support in PCSX2. That was until I took a peek at what the demo was mapping and noticed that the physical address was the same as the virtual address. This made zero sense: there are no known I/O mappings around there, and there is certainly no RAM that high. It really looked like a developer mistake. How did this work on a console, though? There is a specific exception for accessing invalid physical memory called a bus error. The documentation for the exception is as follows:
The Bus Error is an external exception, and is caused by events such as bus time-out, external bus parity errors, invalid physical memory addresses or invalid access types…
Satisfying my curiosity, I cloned the exact TLB configuration the demo sets up on my console, and wrote to the new virtual address.
Fully expecting a BUS ERROR exception screen, I was in awe when the write went through with no exception. I hurriedly amended my homebrew and read from the same address. It returns 0 no matter what you write to it, provided you flush the cache or make the TLB entry uncached first. For some unknown reason, this physical address acts as an open bus, without the expected bus error exception.
It’s Always DNS the Cache
I posted my findings on the issue, still confused. While my head was down trying to figure out what I was missing, Goatman13 on GitHub mentioned the cache. I wish I had seen his message sooner, but we both ended up at the same conclusion.
A little explanation of the dcache is needed now. The EE core's data cache is made up of 64-byte lines × 64 sets × 2 ways, for a total of 8 kilobytes.
Unlike on most modern processors, the cache state and settings are available for the developer to read and configure. As I mentioned above, reading from the invalid physical address returned 0 after a cache flush, or when the TLB entry was marked uncached. But what happens before you invoke a cache flush, with a cacheable entry?
Reading from vspr
[0]->10
[1]->9
[2]->8
[3]->7
[4]->6
[5]->5
[6]->4
[7]->3
[8]->2
[9]->1
flushing cache
Reading from vspr
[0]->0
[1]->0
[2]->0
[3]->0
[4]->0
[5]->0
[6]->0
[7]->0
[8]->0
[9]->0
Our data is still accessible until we write back our cache!
If I’ve lost you at this point, here is another explanation.
When we write to memory, our data first lands in the cache and fills a cache line. When we read the first time, we aren't reading from the physical memory (which doesn't exist); we are simply reading from the cache. Voilà, our data is fetched.
Once we flush the cache, every line is written back to memory and invalidated. When we read the second time, the cache has no data for us. We end up fetching from the invalid physical memory, and our data is lost.
MIPS superiority
So that’s cool and all, but it’s not really useful in its current form. The EE cache is simple and deterministic, but crossing your fingers and hoping the cache doesn’t write back your line is silly. That’s where manual cache configuration comes in.
I’ll just say it straight. The demo locks half of the cache.
// L=1 Locked
// R=0
// V=1 Valid
// D=0 Not dirty (managed by the cache controller)
// PTagLo = 0x70004 (the upper 20 bits of the physical address)
const u32 desiredTagLo = 0x70004028;
for (int i = 0; i < 64; i++)
{
    u32 index = i << 6; // Selects the set (which line to write into)
    asm volatile(
        "mtc0 %0, $28\n"      // Move the desired tag into the TagLo COP0 register
        "sync.p\n"
        "sync.l\n"
        "cache 0x12, 0(%1)\n" // Write TagLo into the indexed line
        "sync.l\n"
        :: "r"(desiredTagLo), "r"(index)
    );
}
When you lock a cache way, it is guaranteed never to evict its data to physical memory, which is exactly what we want here. And because the cache is two-way set-associative, you still have cached access through the other way, just at half the capacity.
I suppose you could call this virtual scratchpad memory? It would be interesting to see what the performance gain is in practice. Given that the biggest constraint is that you can’t DMA it anywhere, I don’t know what you could be doing that is so memory-bound that this 4 KB mapping helps.
You can experience my EE cache changes at a blazing 2 fps since PCSX2 v1.7.5953, thanks to my fixes. All I had to do was properly manage cache way locking and add an unmapped physical address handler.