Re: OOPS on MPC8548 board when writing RAID5 array
From: hank peng <hidden>
Date: 2009-11-13 02:45:25
Also in:
linux-raid
2009/11/13 Dan Williams [off-list ref]:
Hi Hank,

Thanks for testing.

On Tue, Nov 10, 2009 at 4:44 AM, hank peng [off-list ref] wrote:
CPU is MPC8548, kernel version is 2.6.31.5, CONFIG_FSL_DMA and CONFIG_ASYNC_TX_DMA options are all enabled.

#mdadm -C /dev/md0 --assume-clean -l5 -n3 /dev/sd{a,b,c}
#dd if=/dev/zero of=/dev/md0 bs=1M count=1000

Oops: Exception in kernel mode, sig: 5 [#1]
MPC85xx CDS
Modules linked in:
NIP: c01c45d8 LR: c01c4d48 CTR: 00000000
REGS: c2dd5c80 TRAP: 0700   Not tainted  (2.6.31.5)
MSR: 00029000 <EE,ME,CE>  CR: 22004028  XER: 00000000
TASK = e820a580[3804] 'md0_raid5' THREAD: c2dd4000
GPR00: 00000001 c2dd5d30 e820a580 c2fb1088 00000001 00000000 00000002 00001000
GPR08: 00000001 c0485a20 00000000 ef8092f8 22002024 55555555 c2d67870 c0282d2c
GPR16: 00001000 e8355c00 c2eff964 00000000 00000000 00000019 01000040 c2dd5e00
GPR24: c2dd5dfc 00000001 c2dd5dc0 c099c420 00000000 c2d67838 00000002 c2dd5d58
NIP [c01c45d8] async_tx_quiesce+0x28/0x74
[..]
I checked the kernel source code and found that this OOPS was caused by the following BUG_ON. It is in crypto/async_tx/async_tx.c:

void async_tx_quiesce(struct dma_async_tx_descriptor **tx)
{
        if (*tx) {
                /* if ack is already set then we cannot be sure
                 * we are referring to the correct operation
                 */
                BUG_ON(async_tx_test_ack(*tx));   /* OOPS occurred */

Yes, this looks like a manifestation of the issue I brought up in my review of the driver [1].  The talitos_prep_dma_xor routine is always acknowledging its descriptors, which it should not because that is the responsibility of the client of the api.  When the raid code tries to attach a memcpy that depends on the xor it sees that it needs to switch from talitos to fsldma (or software if fsldma is turned off).  Since talitos does not have the DMA_INTERRUPT capability to trigger the channel switch we need to perform a synchronous wait for the xor to complete before submitting the memcpy.  When the ack bit is already set the xor descriptor might be recycled by the dma device driver while we are waiting for it, hence the BUG_ON().
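To make the ownership rule above concrete, here is a minimal userspace sketch in C. It is not kernel code: struct mock_desc, MOCK_FLAG_ACK and every mock_* function are hypothetical stand-ins invented for illustration. The check at the top of mock_quiesce() mirrors the BUG_ON quoted above, except that it returns -1 instead of crashing. The steps after the check (wait, ack, clear the pointer) are an assumption about what a quiesce does next, since the quoted snippet stops at the BUG_ON.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a DMA descriptor; not the kernel structure. */
enum { MOCK_FLAG_ACK = 1u << 0 };

struct mock_desc {
    unsigned int flags; /* MOCK_FLAG_ACK: the client no longer needs this descriptor */
    bool done;          /* set when the mock "hardware" has finished the operation */
};

bool mock_test_ack(const struct mock_desc *d)
{
    return (d->flags & MOCK_FLAG_ACK) != 0;
}

void mock_ack(struct mock_desc *d)
{
    d->flags |= MOCK_FLAG_ACK;
}

/* In this model a driver may reuse a descriptor only once it is both
 * complete and acknowledged by the client. */
bool mock_driver_may_recycle(const struct mock_desc *d)
{
    return d->done && mock_test_ack(d);
}

/* Synchronously wait for *tx, then release it.
 * Returns 0 on success, or -1 where the quoted kernel code would BUG_ON:
 * if the ack is already set, the descriptor may have been recycled and
 * *tx may no longer refer to the operation we meant to wait for. */
int mock_quiesce(struct mock_desc **tx)
{
    assert(tx != NULL);
    if (*tx == NULL)
        return 0;
    if (mock_test_ack(*tx))
        return -1;
    (*tx)->done = true; /* stands in for the synchronous wait (assumption) */
    mock_ack(*tx);      /* the client, not the driver, acknowledges */
    *tx = NULL;
    return 0;
}
```

With a descriptor whose flags start at 0, mock_quiesce() succeeds and the descriptor only becomes recyclable afterwards. With a descriptor that was created already acknowledged, which is what talitos_prep_dma_xor is described as doing, the same call reports the error case.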
Thanks for the reply, Dan.

I forgot to say that when this OOPS happened, I had not applied the talitos XOR patch. I had only enabled the async_xx api and FSL_DMA, so I think XOR was done by the CPU and memcpy was done by DMA through the async_xx api.

Another interesting thing is that I have tried the latest stable kernel, 2.6.31.6, and the problem did not occur there. After I applied the talitos XOR patch, it was OK too. I checked the related source code and it seems there were no relevant changes, which leaves me very confused.

I have been testing the XOR patch on the latest series of kernels on an MPC8548 board, and I hope the Freescale guys can also help.
--
Dan

See the final comment:
[1]: http://marc.info/?l=linux-raid&m=125685641412112&w=2
--
The simplest is not all best but the best is surely the simplest!