November 19th, 2008


lessons learned writing a kernel memory managed driver

1. Whatever corner case you never expect could possibly happen - it will, 5 seconds after release (this is a generic software lesson).

2. X does rendering without the vtSema, including hw calls. So if you only invalidate the 3D state flags in EnterVT it's too late: X has already sent commands to the card without resending the state. So invalidate your state in LeaveVT as well.
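A minimal sketch of what that looks like in a DDX driver, with made-up structure and function names (the real radeon code is organised differently):

    #include <stdbool.h>

    /* Hypothetical accel state; the real driver tracks many more flags. */
    struct accel_state {
        bool state_emitted;   /* has the full 3D state been sent to the card? */
    };

    static void invalidate_3d_state(struct accel_state *accel)
    {
        /* Force the next accelerated op to re-emit the whole state block. */
        accel->state_emitted = false;
    }

    static void driver_enter_vt(struct accel_state *accel)
    {
        invalidate_3d_state(accel);
    }

    static void driver_leave_vt(struct accel_state *accel)
    {
        /* X can keep rendering without the vtSema, so EnterVT alone is
         * too late - invalidate on the way out as well. */
        invalidate_3d_state(accel);
    }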

3. Kernel memory management is a messy problem. The GPU has a finite amount of addressable memory it can use. On modern GPUs this is either a single GART (like Intel) or VRAM + GART (like everyone else). Userspace applications like X or 3D apps submit command buffers to the kernel, and along with each command buffer there is a list of referenced data buffers. These data buffers can be pixmap src/dst/mask, textures, VBOs, FBOs, whateva. Userspace gives the kernel this list along with acceptable placement parameters (the GEM API uses a set of read domains and a single write domain). If a buffer is to be written to it needs to end up in the write domain; for reading it might be acceptable in a few places. On my radeon driver, for example, it is acceptable to read a buffer from either GART or VRAM, but writes can only go to buffers in VRAM.

So when the kernel gets this list of buffers it tries to fit them in as well as it can. Now if the kernel can't fit these buffers in, we are in a bind. The naive person would just say fall back to software, and I spit on them. Building the command buffer happens over time, and the operations and codepaths that generated the command stream have all already run, so we can't just go back and redo the operations as software fallbacks. So we run into the problem that userspace cannot reference more buffers in a command stream than the kernel can relocate into memory at once.
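To make the shape of that interface concrete, here's a rough sketch of a submission with its referenced-buffer list; the names and layout are made up for illustration and don't match the real GEM/radeon ioctl structs:

    #include <stdint.h>

    /* Hypothetical placement domains. */
    #define DOMAIN_VRAM (1u << 0)
    #define DOMAIN_GART (1u << 1)

    struct buffer_ref {
        uint32_t handle;        /* kernel handle of the data buffer */
        uint32_t read_domains;  /* any of these is fine if only read */
        uint32_t write_domain;  /* must land here if it gets written */
    };

    struct command_submit {
        uint64_t           cmdbuf;        /* userspace pointer to the commands */
        uint32_t           cmdbuf_dwords;
        struct buffer_ref *refs;          /* pixmaps, textures, VBOs, FBOs, ... */
        uint32_t           num_refs;
    };

    /*
     * e.g. on radeon: a texture that may be read from GART or VRAM, and a
     * render target that must be written in VRAM:
     *
     *   refs[0] = (struct buffer_ref){ tex_handle, DOMAIN_GART | DOMAIN_VRAM, 0 };
     *   refs[1] = (struct buffer_ref){ dst_handle, 0, DOMAIN_VRAM };
     */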

Rule 1: The kernel cannot fail to complete the command stream under any circumstances.

The two ways around this are to insert places in the command stream where it's legal to break it up, or to have userspace flush the command stream when it hits a limit on the buffers.

For radeon I've done the latter. At startup the kernel tells X the amount of dynamic VRAM and GART it has to play with.

Then before each operation in the 2D driver, I sum up all the buffers it references for read and write, then compare the write total with the amount of VRAM and the read total with the amount of GART. If a single operation cannot fit in VRAM/GART at all, software fallback it. If the current op plus the total of the ops already in the list doesn't fit, flush before the current op and try again.
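Roughly, that per-operation check looks like this (a sketch with hypothetical names; vram_limit and gart_limit are the dynamic sizes the kernel reported at startup):

    #include <stdint.h>

    struct op_sizes {
        uint64_t write_bytes;  /* buffers this op writes - must fit in VRAM */
        uint64_t read_bytes;   /* buffers this op reads  - must fit in GART */
    };

    enum op_action { OP_ACCEL, OP_FLUSH_THEN_ACCEL, OP_FALLBACK };

    static enum op_action check_op(const struct op_sizes *op,
                                   uint64_t queued_write, uint64_t queued_read,
                                   uint64_t vram_limit, uint64_t gart_limit)
    {
        /* An op that can never fit on its own gets the software fallback. */
        if (op->write_bytes > vram_limit || op->read_bytes > gart_limit)
            return OP_FALLBACK;

        /* If it doesn't fit on top of what's already queued, flush the
         * command stream first and account for this op from scratch. */
        if (queued_write + op->write_bytes > vram_limit ||
            queued_read + op->read_bytes > gart_limit)
            return OP_FLUSH_THEN_ACCEL;

        return OP_ACCEL;
    }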

Now this works up until it falls over in a heap.

Why? Well, firstly, if a buffer was just written to in a previous cycle, it will be in VRAM. However X doesn't know or care where the buffer lives, so when it submits a read for it, it counts it against the GART read space, but when the kernel goes to fit everything in, it has left that buffer sitting in VRAM, taking away from the write space. To solve that I have the kernel do two passes: it validates all the writeable buffers first (which, if there isn't enough space, will kick out the readable buffers), then validates all the readable buffers (which pulls them back into GART).
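In kernel terms the two-pass validate looks something like this, reusing the made-up buffer_ref layout from the earlier sketch; place_buffer() stands in for whatever actually moves a buffer into a domain:

    #include <stdint.h>

    /* Hypothetical: move/pin a buffer so it lives in one of 'domains'. */
    int place_buffer(uint32_t handle, uint32_t domains);

    static int validate_buffer_list(struct buffer_ref *refs, uint32_t num_refs)
    {
        uint32_t i;
        int ret;

        /* Pass 1: buffers that will be written go into their write domain
         * (VRAM), evicting stale read-only residents if space is tight. */
        for (i = 0; i < num_refs; i++) {
            if (!refs[i].write_domain)
                continue;
            ret = place_buffer(refs[i].handle, refs[i].write_domain);
            if (ret)
                return ret;
        }

        /* Pass 2: read-only buffers; anything kicked out of VRAM in pass 1
         * gets pulled back into GART here. */
        for (i = 0; i < num_refs; i++) {
            if (refs[i].write_domain)
                continue;
            ret = place_buffer(refs[i].handle, refs[i].read_domains);
            if (ret)
                return ret;
        }
        return 0;
    }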

So this works great until it falls over in a heap.

So you submit the kernel a load of command streams over time, and VRAM gets a bit fragmented. Say you have 20MB of dynamic VRAM: 5MB of 1MB buffers, then 10MB free, then another 5MB of 1MB buffers. Now a command stream references 13MB in 3 buffers (1MB, 1MB, 11MB), and the two 1MB buffers happen to already live just before and just after the 10MB hole. So the command stream validates those two buffers in place, then goes to validate the 11MB buffer, goes all wtf?, and fails. See Rule 1.

So now I needz defragmentation. Simple defragmentation is to kick out all the dynamic buffers from VRAM and revalidate them in order so they all fit. It's messy, but it should mean I get a system that doesn't violate Rule 1.
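A sketch of that dumb-but-safe fallback, building on the hypothetical helpers above (evict_all_dynamic_vram() is made up too):

    #include <errno.h>
    #include <stdint.h>

    /* Hypothetical: kick every dynamic buffer out of VRAM. */
    void evict_all_dynamic_vram(void);

    static int validate_or_defrag(struct buffer_ref *refs, uint32_t num_refs)
    {
        int ret = validate_buffer_list(refs, num_refs);
        if (ret != -ENOSPC)
            return ret;

        /* VRAM is fragmented: empty it out completely ... */
        evict_all_dynamic_vram();

        /* ... then revalidate the whole list in order, so the buffers pack
         * from one end and the command stream can still complete (Rule 1). */
        return validate_buffer_list(refs, num_refs);
    }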

Now I'd probably like to think about inserting scheduling points in the stream, but I'm not sure how much that wins, and whether I'd still have to worry about the fragmentation issue somehow.

Of course per-context GART tables and page-table-addressable VRAM make all of this stuff a lot easier, or at least push the problem out to a much harder-to-hit boundary; however, see the first point I made.

So if you are ever in the enviable position of writing a kernel memory manager for a graphics card, allow me to buy you some spirits.