airlied
what a difference having hardware makes....

So I (and other radeon developers) debug a lot of radeon problems, both locally and with people over irc/bugzilla, and I am often quite slow to deal with bugs that I can't reproduce locally. Remote debugging is usually a last resort, which is unfortunate for people who have hw bugs that we can't reproduce locally.

So what prompted this post?
KMS:RV515:X1400 Thinkpad T60 resume fails

So first up, my local Thinkpad T60p with an rv530 always resumed fine, and my local T42 with a 7500 mostly works okay as well, so there goes my local reproduction.

So Peng works for Red Hat in Beijing, and the week before kernel summit I sat down on irc for most of two days with him running various tests. In those two days we tracked it down to the fact that his video RAM wasn't getting set up properly on resume. The NMI on resume let me track down that when the gpu accessed the ring, it generated an error on the PCI bus. This led to checking the contents of the PCIE gart table (with a detour through kernel vmap page handling). The PCIE gart table is in VRAM, and on checking it after resume we noticed that when we copied it back into VRAM it was getting mangled. So we could deduce VRAM was broken.
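The gart table check described above amounts to writing a saved table back and reading it again to see whether it survived. A minimal sketch of that idea follows, assuming a plain memory buffer stands in for VRAM; `restore_and_verify` is a hypothetical helper for illustration, not a function from the radeon driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not radeon driver code: copy a saved table into
 * a destination buffer (standing in for VRAM), then read it back.
 * Returns the index of the first entry that does not match the saved
 * copy, or -1 if every entry round-trips intact.
 */
static long restore_and_verify(volatile uint32_t *dst,
                               const uint32_t *saved, size_t n)
{
    assert(dst != NULL && saved != NULL);

    /* Write the saved entries back. */
    for (size_t i = 0; i < n; i++)
        dst[i] = saved[i];

    /* Read each entry back; volatile forces a real load per entry. */
    for (size_t i = 0; i < n; i++)
        if (dst[i] != saved[i])
            return (long)i;

    return -1;
}
```

On working memory this returns -1. A mangled table like the one seen on the T60 would show up as a non-negative index pointing at the first bad entry.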

I handed off to Jerome and he got traces of the BIOS posting using VBE and using ATOM (which we use in the kernel). Jerome ruled out different parts of the engines, and we got reports of it working for some people when powered down for a long time, or other randomness. We were going back and forth with ideas on what might be going wrong, and had started thinking that the problem was due to inconsistent hw state on resume and that we should power off various parts of the hw before suspend.

Peng happened to visit Brisbane for a Red Hat meeting this week and brought the T60, and this morning I swapped my laptop for his. First of all I tried to play by plugging in a VGA monitor. This produced another bug where the LVDS would die when starting KMS, so I fixed that quickly first. The VGA screen was also corrupted, and VRAM wasn't enabled at all. Next I tested that vbetool posting worked: suspend/resume to get the corruption, unload radeon, vbetool post and load radeon worked. Then I started testing with Jerome's userspace atom init tool: doing a s/r, unload radeon, atom post, load radeon also worked fine. This is where it started to make no sense, since Jerome's tool was doing the exact same thing as the kernel parser. I started by blaming the atom delay code in the kernel, but that proved a dead end after an hour or two. Next I enabled the kernel atom debugging and all of a sudden it resumed fine. So it was a timing issue somewhere in the atom parser running the init code.

So enabling debugging put enough of a delay between operations that something that wasn't working before now succeeded. I started bisecting the debug messages, removing half the debugging at a time, until after about 3 hrs I got it down to one printk happening between two atombios commands. The surrounding code was reading and writing one of the memory controller setup registers on the GPU, so it pointed to some register write not getting fully into the hw before we read it back and write it again later. I changed the atom code to do a read back before writing regs for certain operations and viola all resumed fine.
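The read back before write can be sketched roughly as below. This is an illustration only, not the actual atom parser change: `reg_read` and `reg_write_flushed` are hypothetical names, and an ordinary array stands in for the memory-mapped register block. The idea rests on the general PCI behaviour that a read from the device should not complete until earlier posted writes to it have landed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessor: read one 32-bit register by index. */
static inline uint32_t reg_read(volatile uint32_t *regs, uint32_t idx)
{
    assert(regs != NULL);
    return regs[idx];
}

/*
 * Hypothetical accessor: read the register back before writing it.
 * On real MMIO the dummy read is meant to make sure any earlier
 * write has reached the hardware before the next write is issued.
 */
static inline void reg_write_flushed(volatile uint32_t *regs,
                                     uint32_t idx, uint32_t val)
{
    (void)reg_read(regs, idx);  /* dummy read back, value discarded */
    regs[idx] = val;
}
```

The dummy read does the job the stray printk was doing by accident, but deterministically rather than by adding an arbitrary delay.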

So this took the best part of 8 hours. I reckon if I'd been doing the same over irc with Peng it would have taken at least a month of back and forth to figure it out. Having the hardware locally even for a day made it possible to track it down and figure it out so much quicker and more efficiently. So the bad news for anyone with bugs we can't reproduce locally is that we generally will fix any bugs we can reproduce locally first, just from an efficiency point of view, since we can fix them so much quicker.


Nice story from the trenches, this should be added somewhere for users to see.

As a corollary, there is also a conclusion for us developers: anything we do to help users track all the relevant master repositories is a good thing. After all, tracking down remote bugs is a pain, but the sooner the bug reports come in after the offending change, the easier it usually gets.

Re: Consequences

Well, regressions are a different sort of problem. The issue with kms is it's a completely new driver codebase, so we've got a lot of "it works with userspace modesetting and vbe suspend/resume, but it fails with kms" reports, where there is nothing to bisect or point at. Hopefully as kms gets more mature we can get the testing coverage. Granted, I think suspend/resume will be a pain no matter what; AMD used to have a lot of engineers concentrated on making it work.

The thing with a GPU hang is it's not a generic event, every one is different. We are trying to add more info to the kernel so we can dump out the card state, and we'll hopefully be able to add hangup detection now that we have kernel mode setting. It's not usually debuggable with CPU side tools, since whatever crashed the gpu probably did it 10 or 15 command packets before we noticed.

hardware library

I wonder if it would be possible to implement a hardware library so devs could borrow a certain laptop or graphics card. hardware could be donated, bought second hand or whatever. if the big linux players (redhat, novell, canonical, lf etc) could collaborate maybe it could work.

Re: hardware library

It doesn't really work. Red Hat has a fairly comprehensive QE hardware lab and I've got 3 laptops and 8-9 desktops already, as do the other developers. However the time and cost of shipping and import duty is actually quite prohibitive, so we do most of our laptop/card exchanges via ppl travelling intra-office.

It's mainly only laptops as well, discrete cards are generally easier to source and also a lot easier to debug remotely. Laptops tend to have messed up bios hacks and acpi type things.


will this fix be backported to f11?

Re: wohoo!

Don't think any of this worked at all well in F11. I might get to f11 eventually, but any change to its kernel runs the risk of destabilising it for everyone else.

Should this help with RS482 freeze?

I have a Radeon Xpress 200M (RS482) based laptop and ever since upgrading to the latest stable radeon driver the machine freezes every time it comes back from suspend. I'm not using KMS yet. Should this fix also help with this hang or is it only KMS and only RV515?

Re: Should this help with RS482 freeze?

Only kms and rv515. You should try kms on the rs482, those things are the worst offenders for having crappy bugs.

So obviously if one wants a working laptop, grab the hardware the kernel hackers own personally. Perhaps a survey is in order.

glad it was a colleague's laptop

I've got an intermittent resume hang, probably kernel rather than video, with my aging AMD/VIA/ATI desktop. I doubt it'll ever be fixed unless I invite a kernel developer to live with me ;-)

"and viola all resumed fine"
The string quartet instrument that works wonders :-)