Re: 2.6.24-rc6-mm1
From: Jarek Poplawski <hidden>
Date: 2008-01-05 00:05:19
Also in:
lkml
On Fri, Jan 04, 2008 at 04:21:26PM +0100, Torsten Kaiser wrote:
On Jan 4, 2008 2:30 PM, Jarek Poplawski [off-list ref] wrote:
...
I'm open for any suggestions and will try to answer any questions.
I'm very glad, thanks!
The only thing that is sadly not practical is bisecting the borkenout mm-patches, as triggering this error is to unreliable / time-consuming.
Right, but it seems there are these 2 main suspects here...
quoted
- is it still vanilla -rc6-mm1; I've seen on kernel list you tried some fixes around raid?Yes, without these fixes I can't boot. But they should only be run during starting the arrays, so I doubt that this is that cause. (Also -rc3-mm2 did not need this fix)
You've written vanilla -rc6 is OK. Does it mean -rc6 with these fixes? I think it would be easier just to start with this working -rc6 and simply check if we have 'right' suspects, so: git-net.patch and git-nfsd.patch from -mm1-broken-out, as suggested by Herbert (I hope, can compile - otherwise you could try the other way: add the whole -mm and revert these two). Using current gits could complicate this "investigation".
My skbuff-double-free-detector is still in there, but was never triggered.quoted
- could you remind this lockdep warning; is it always and the same, always before crash, or no rules???? I see no lockdep warning before the crashes. I have seen a warning about the dst->__refcnt in dst_release and different warnings about list operations. I think I have always posted everything I have seen before the crashes. (captured via serial console)
So, you mean there are no more of these?: "looked into the log in question and the only other warning was a circular locking dependency that lockdep detected around 1.5 hour before this warning." ... "[ 7620.845168] INFO: lockdep is turned off."
(If you mean the lockdep-problem in -rc6: That is more or less a missing annotation during early bootup. The only problem with that is, that it will causes lockdep to be turned off and so it can not be used to find any real problem. A fix for that is in -mm so I do have lockdep on the mm-kernels)quoted
- I've seen you looked after double freeing, but this last debug list warning could suggest locking problems during list modification too.Yes, but Herbert mentioned double freeing a skb explicit and so I tried to catch this. I do not know enough about the network core to verify the locking of the involved lists.
Right, the list corruption could be because of use after freeing too.
quoted
- above git-nfsd and git-net tests should be probably repeated with -rc6-mm1 git versions: so vanilla rc6 plus both these -mm patches only, and if bug triggers, with one reversed; btw., since in previous message you mentioned that 50 packages could be not enough to trigger this, these 54 above could make too little margin yet.Yes, I think I really need to redo the git-nfsd-test. With IOMMU_DEBUG enabled rc6-mm1worked for 52 packages, only a secound run of kde-packages triggered it after only 5 packages. I don't know what this bug hates about kdeartwork-wallpaper (triggered it this time) or kdeartwork-styles.
I didn't read all this thread, so probably I miss many points, but are you sure there are no problems with filesystem corruption around these packets or where you compile(?) them (e.g. after these raid problems)?
Output from the crash with IOMMU_DEBUG (lockdep was enabled, but did not trigger): [15593.236374] Unable to handle kernel NULL pointer dereference<3>list_add corruption. prev->next should be next
Fine! I'll try to look at this. BTW, I guess/hope DEBUG_SLAB etc. are also on... Thanks, Jarek P.