Fast clearing with the GS

A look into how developers leveraged the cursed GS memory design to efficiently clear memory.

I remember years ago looking at PSI-Rockin’s PS2Tek / GS Special Effects writeup and being curious how fast these clears perform in practice.

Spoilers: (This is when clearing a 640x448 32 bit framebuffer)

BigSprite: 79945 cycles
VIS Clear: 40567 cycles
Page Clear: 40496 cycles
Double Half Clear: 39199 cycles
Page Clear (double half clear): 20466 cycles
VIS Clear (interleaved): 18514 cycles
Interleaved Clear: 18399 cycles

Try it yourself or look at the code here on my GitHub: F0bes / clearbench

Clearly the winners are the interleaved clears, specifically the VIS Interleaved Clear though for reasons we’ll get into.

Interleaved Clear

To understand the VIS Interleaved Clear, we first have to understand what an interleaved clear is.

This clear relies on the fact that the memory in the GS is swizzled.
(TLDR: Memory is laid out in a way that is optimized for DRAM access.)
Because of this, the GS has memory access units called blocks, columns, and pages. For this purpose we only need to worry about pages and the arrangement of blocks inside of a page.

For these examples we will be using a CT32 and Z32 data format.
This leaves us with a page size of 64x32 pixels.
It’s important to note that Z testing / Z write is free on the GS. This means that there is no penalty for it being enabled.

Here is the block orientation in a page for a CT32 and Z32 data format*.

*CT formats are generally used for frame buffers and Z formats are generally used for the depth “Z” buffer.

A diagram showing where blocks are located in a PSMCT32 page. A diagram showing where blocks are located in a PSMZ32 page.

If you noticed, the blocks are mirrored horizontally and vertically between the Z and CT format. What’s really important here is the vertical mirroring.

Remembering that our page size is 64x32 pixels, what happens when we draw a 32x32 sprite from coords 0,0 to 31,31?
The red dots imply that the block has been written to.

Both diagrams of PSMCT32 and PSMZ32. Animated are red dots slowly marking blocks as written to.

(Don’t worry, I’m almost at the main point)

Both diagrams of PSMCT32 and PSMZ32. The left half of the CT32 block is coloured, the right half of the Z32 block is coloured.

This is the result (the last frame of the animation) of the draw.
Now, the magic happens when we set ZBP = FBP.
I’ve indicated what is written by the colour write and the depth write.

A merged diagram of Z32 and CT32 pages. The right hand side gets filled in by the depth pass. The left hand side gets filled in by the colour pass.

This results in the entire page (64x32 pixels, remember?) being modified, but we only needed to draw a 32x32 sprite!

Of course, our framebuffer isn’t 64x32 pixels. So a proper implementation of this would be something like the following.

const u32 page_count = 640 * 448 / 2048;

// fb_address doesn't have to be your frame, it can be any (page aligned) location in GS memory
// This clear can be used to clear out a depth buffer or a texture if you wanted to
GS_SET_FRAME(fb_address, ...)
// Ensure ZBP = FBP
GS_SET_ZBUF(fb_address, ...)

// We can only draw one page at a time. You have to at minimum kick two qw per page
// Set the clear colour here. This will be the colour of the first 32x32 pixels in each page
GS_SET_RGBA(0x00, 0x00, 0x00, 0x00)

for (int i = 0; i < 640; i += 64)
	for (int j = 0; j < 448; j += 32)
		GS_SET_XYZ(i << 4, j << 4, 0x011E0000)
		// Our Z write is going to clear the second half of the 32x32 pixels in each page
		// Set our Z value to the clear colour
		GS_SET_XYZ((i + 32) << 4, (j + 32) << 4, 0x011E0000)

This is probably good enough for most, but having to kick a sprite for each page gets memory expensive. Your GIF packet will be around 280QW :(.
What if I told you there is a way to get that down to ~16QW :).

The VIS Clear

If you were wondering, the VIS Clear name is attributed to the engine used by VIS Interactive / VIS Entertainment. Apparently this engine liked this clear.

Let’s look at how pages are arranged in a frame and zbuffer.

A 10x3 (with the implication that it continues downwards) table of blocks with the text Frame Buffer (FBP=0 and FBW=10)

By changing the FBW (Frame Buffer Width) size in the GS FRAME register, you can change how many pages make up the width of the frame. What happens if we set FBW to 1?

A 1x4 (with the implication that it continues downwards) table of numbered blocks starting from 0.

And this is exactly what the VIS clear does.
I just want to note, this suucckkss to emulate on a GPU.

Now that our framebuffer is a 64x4480 strip of memory*, we can draw very tall sprites, covering many pages.
Because the maximum height of a sprite is 2048 pixels, the framebuffer pointer needs to be moved after every sprite.
*In the case of a 640x448 framebuffer

The non-interleaved VIS clear would draw a few 64 width sprites going all the way down the framebuffer. But we can utilize our interleaved trick to only have to do 32 width sprites.

The Interleaved VIS Clear

At the moment I can’t seem to figure out a good way how to animate this process.
Here is my implementation of an interleaved VIS clear.

// Set the clear colour here
GS_SET_RGBA(0x00, 0x00, 0x00, 0x00)
// Set the ZBP and ZBP to the start of the memory you want to clear
// (in this case the framebuffer)
GS_SET_FRAME(fbptr, 1, GS_PSM_32)
GS_SET_XYZ(0, 0, 0x00000000)
GS_SET_XYZ((32 << 4), 2048 << 4, 0x00000000)
// Because a height of 2048 covers 64 pages (each page is 32 pixels high)
// Increment our FBP and ZBP by 64
GS_SET_FRAME(fbptr + 64, 1, GS_PSM_32)
GS_SET_ZBUF(fbptr + 64, GS_PSMZ_32)
GS_SET_XYZ(0, 0, 0x00000000)
GS_SET_XYZ((32 << 4), 2048 << 4, 0x00000000)

// Increment another 64 pages
GS_SET_FRAME(fbptr + 128, 1, GS_PSM_32)
GS_SET_ZBUF(fbptr + 128, GS_PSMZ_32)
GS_SET_XYZ(0, 0, 0x00000000)
// 384 comes from (640 * 448 / 64) % 2048
GS_SET_XYZ((32 << 4), 384 << 4, 0x00000000)

Including GIFTAGs, this is method is only around 16QW. That’s around 17 times less than the normal interleaved clear. This gives you good GS fill speed without as much bus contention from DMAC usage.

Final Thoughts

If you missed it above, you can view the code to the clears in my clearbench repository on GitHub: F0bes / clearbench

Hopefully you’ve gained some understanding about these clears. If anyone wants me to to cover any of the other clears mentioned above let me know. Do know that the ones covered here are easily the most complex.