This system was engineered with some really scary scenarios in mind: it could share PCI decode space between two devices (something PCI can't really do), it could set up GPUs the OS hadn't assigned PCI resources to, and it could share the limited x86 I/O ports between multiple cards. It could also handle all the simple scenarios well.
Now back then maybe some of these things were necessary, but nowadays only two shared resources are left that we want X drivers to care about. We hope the OS can assign all the PCI I/O space and PCI memory space, so that X doesn't need to try to multiplex them.
This system also had some major disadvantages. Two X servers couldn't be used, as the locking was all done inside a single server, so a second server would just race with the first. Graphics cards could never reliably use interrupts either: if you disable a card's MMIO region and it receives an interrupt, generally bad things happen.
So fast forward to 2009, where OSes pretty much do things correctly and only two shared resources are left to care about: VGA I/O and VGA memory (0xa0000). The way VGA works on modern PCI systems is that there is a bit in each bridge denoting whether that bridge routes the VGA decodes to its bus, and on that bus only one device is allowed to decode VGA resources at any one time. The bridge forwarding is controlled by a bit in the bridge's PCI config; the VGA decoding on the device is controlled by the PCI COMMAND_IO and COMMAND_MEM bits. These bits, however, disable *all* I/O and memory accesses to the PCI device. Most modern GPUs can disable VGA decoding on their own, but this is GPU-specific and there is no generic PCI config space bit to do it (FAIL!).
So the OS kernel needs to provide some sort of arbitration logic between processes wanting to access the VGA decodes. Reasons for wanting the VGA bits enabled for a GPU are generally doing any int10 call (like card posting or VBE mode setting), or doing console save/restore when X is starting/stopping. Some GPUs still use the VGA I/Os for modesetting, but these are generally very old cards. Some nasty ACPI/BIOS implementations also seem to like having VGA routed correctly for suspend/resume.
So Ben Herrenschmidt, Tiago Vignatti and I brought the VGA arbiter into the world; the code has taken a couple of years to get the attention it needed to go upstream. The VGA arbiter exposes a /dev/vga_arbiter node that X can open and use to route the VGA I/O/memory decodes between any GPUs in the system. So how does this deal with the issues RAC had:
1. Multiple processes:
Since this is in the kernel, the lock is now shared across processes, so multiple X servers can in fact fight over the VGA.
2. Interrupt handling:
Since most of the IRQ handling for GPU devices is done in the DRM layer, the arbiter allows a callback to be registered for devices that cannot disable VGA decodes on-card, so they can disable their IRQs when their memory/I/O decodes are turned off.
3. Cards with local ability to disable VGA mem/io:
Cards can register that they can disable VGA memory/I/O decoding locally, which removes them from VGA arbitration completely. Generally a KMS or other proper kernel driver is preferred for this: if X decides a card doesn't need arbitration, another process might decide it does, so having a master kernel driver that owns the hardware makes this a lot easier.
4. Crazy ACPI cards.
The arbiter does nothing if only a single card is registered with it, so it will leave the VGA decodes enabled on all single-GPU systems. If a second GPU is hotplugged, the driver will get a callback saying it needs to disable VGA decoding on itself. Generally we won't see the crazy ACPI issue on multi-GPU systems.
However, there is one issue we are seeing with the VGA arbiter that RAC never really addressed either: interaction with kernel DRI. So I suspect we will disable DRI access for any GPU that requires VGA routing. The issues I've seen so far are all contention issues with the arbiter lock. They can probably be worked around on a case-by-case basis, but really, just get KMS already.
So we are hoping to upstream the kernel code for 2.6.32, and to push the libpciaccess and X.org server patches to their master repos around the time the patch is in the PCI queue.
I'm sure I've missed something as I've been working on this for 4 days now, though I lost a day going in a redesign loop :-)