So I (and other radeon developers) debug a lot of radeon problems, both locally and with people over irc/bugzilla. I am often quite slow to deal with bugs that I can't reproduce locally: remote debugging is usually a last resort, which is unfortunate for people who have hw bugs that we can't reproduce ourselves.
So what prompted this post?
KMS:RV515:X1400 Thinkpad T60 resume fails
So first up, my local Thinkpad T60p with an rv530 always resumed fine, and my local T42 with a 7500 mostly works okay as well, so there goes my local reproduction.
So Peng works for Red Hat in Beijing, and the week before kernel summit I sat down on irc for most of two days with him running various tests. In those two days we tracked it down to the fact that his video RAM wasn't getting set up properly on resume. The NMI on resume let me track down that when the GPU accessed the ring, it generated an error on the PCI bus, which led to checking the contents of the PCIE GART table (with a detour through kernel vmap page handling). The PCIE GART table lives in VRAM, and when we checked it on resume we noticed that the copy we wrote back into VRAM was getting mangled. So we could deduce VRAM was broken.
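As a rough illustration of that last check, here is a Python sketch (not the kernel code) of comparing a saved table against what reads back after the restore. The `first_mismatch` helper and the entry values are made up for the example; the real table format is not shown here.

```python
def first_mismatch(saved, restored):
    """Return (index, expected, actual) for the first differing entry, or None.

    `saved` is the copy kept in system memory before suspend and `restored`
    is what reads back from VRAM after resume (both hypothetical here).
    """
    if len(saved) != len(restored):
        raise ValueError("tables must be the same length")
    for index, (expected, actual) in enumerate(zip(saved, restored)):
        if expected != actual:
            return (index, expected, actual)
    return None


# Made-up entries: a page address with a "valid" bit set in bit 0.
saved_table = [0x1000 | 1, 0x2000 | 1, 0x3000 | 1, 0x4000 | 1]

# Simulate VRAM mangling one entry on the way back in.
restored_table = list(saved_table)
restored_table[2] = 0xFFFFFFFF

good = first_mismatch(saved_table, saved_table)
bad = first_mismatch(saved_table, restored_table)
```

A clean restore gives `None`; a mangled one tells you which entry went wrong and what came back instead, which is usually enough to decide whether the copy or the memory itself is at fault.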
I handed off to Jerome and he got traces of the BIOS posting using VBE and using ATOM (which we use in the kernel). Jerome ruled out different parts of the engines, and we got reports of it working for some people when powered down for a long time, or other randomness. We were going back and forth with ideas on what might be going wrong, and had started thinking that the problem was due to inconsistent hw state on resume and that we should power off various parts of the hw before suspend.
Peng happened to visit Brisbane for a Red Hat meeting this week and brought the T60, and this morning I swapped my laptop for his. First of all I tried plugging in a VGA monitor; this produced another bug where the LVDS would die when starting KMS, so I fixed that quickly first. So the VGA screen was also corrupted, and VRAM wasn't enabled at all. Next I tested that vbetool posting worked; suspend/resume to get corruption, unload radeon, vbetool post and load radeon also worked. Then I started testing with Jerome's userspace atom init tool: doing a s/r, unload radeon, atom post, load radeon also worked fine. This is where it started to make no sense, since Jerome's tool was doing the exact same thing as the kernel parser. I started by blaming the atom delay code in the kernel, but that proved a dead end after an hour or two. Next I enabled the kernel atom debugging and all of a sudden it resumed fine. So it was a timing issue somewhere in the atom parser running the init code.
So enabling debugging put enough of a delay between operations that something that wasn't working before now succeeded. I started bisecting the debug messages: I removed half the debugging at a time until, after about 3 hours, I got it down to one printk happening between two atombios commands. The surrounding code was reading and writing one of the memory controller setup registers on the GPU, so it pointed to some register write not getting fully into the hw before we read it back and wrote it again later. I changed the atom code to do a read back before writing regs for certain operations and voilà, it all resumed fine.
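The bisection itself is nothing clever. A minimal Python sketch of the idea, assuming exactly one debug message matters and that the result is deterministic (the site names, `bisect_printks` and `fake_resume` are all invented for the example; in reality each test was a kernel rebuild plus a suspend/resume cycle):

```python
def bisect_printks(sites, resumes_ok):
    """Narrow `sites` down to the one debug message whose delay fixes resume.

    `resumes_ok(enabled)` reports whether resume works with only the
    messages in `enabled` left in the code.
    """
    candidates = list(sites)
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        if resumes_ok(set(first_half)):
            # The message that matters is somewhere in the half we kept.
            candidates = first_half
        else:
            # Resume broke again, so it must be in the half we removed.
            candidates = candidates[middle:]
    return candidates[0]


# Simulated run: 64 debug messages, one of which hides the timing bug.
sites = [f"printk_{i}" for i in range(64)]
culprit = "printk_37"
attempts = []


def fake_resume(enabled):
    """Stand-in for a real suspend/resume test on the hardware."""
    attempts.append(len(enabled))
    return culprit in enabled


found = bisect_printks(sites, fake_resume)
```

With 64 candidates this takes 6 rounds, which is why it is only practical with the machine on your desk: each round over irc would be a patch, a rebuild and a reply from the other side of the world.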
So this took the best part of 8 hours. I reckon if I'd been doing the same over irc with Peng it would have taken at least a month of back and forth to figure it out. Having the hardware locally even for a day made it possible to track it down so much more quickly and efficiently. So the bad news for anyone with bugs we can't reproduce locally is that we will generally fix any bugs we can reproduce locally first, just from an efficiency point of view, since we can fix them so much faster.