Thread: 7 messages, 3 authors, 2015-05-13

Re: Possible RAID6 regression with ASYNC_TX_DMA enabled in 4.1

From: Markus Stockhausen <hidden>
Date: 2015-05-07 14:49:28
Also in: lkml

Hi Maxime,
From: linux-raid-owner@vger.kernel.org [linux-raid-owner@vger.kernel.org] on behalf of Maxime Ripard [maxime.ripard@free-electrons.com]
Sent: Thursday, 7 May 2015 14:57
To: Neil Brown; Shaohua Li
Cc: linux-raid@vger.kernel.org; linux-kernel@vger.kernel.org; Lior Amsalem; Thomas Petazzoni; Gregory Clement; Boris Brezillon
Subject: Possible RAID6 regression with ASYNC_TX_DMA enabled in 4.1

Hi,

I'm currently trying to add support for the PQ operations on the
Marvell XOR engine in dmaengine, in order to be able to use async_tx
to offload these operations.

I'm testing these patches with a RAID6 array with 4 disks.

However, since commit 59fc630b8b5f ("RAID5: batch adjacent full
stripe write"), every write to that array fails with the following
stacktrace.

http://code.bulix.org/eh8iew-88342?raw

I don't know if it might be related. I added support for RAID6 Read-Modify-Write
in software XOR with some patches. The following commit mangles some lines in 
async_pq.c:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=584acdd49cd2472ca0f5a06adbe979db82d0b4af

I introduced a new flag, ASYNC_TX_PQ_XOR_DST, that notifies the async layer
that we want to do an XOR syndrome operation instead of a full calculation.
This will force the software path, because I guessed that hardware does not
support that case. Without hardware to test on, I might have missed some
checks in the async layer.

In the upper layer ops_run_reconstruct6 will set the flag if we determined
that rmw is faster than rcw.
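
To illustrate what "XOR syndrome instead of a full calculation" means, here is a minimal, self-contained sketch. It is not taken from async_pq.c or lib/raid6. The names gf_mul2, gen_syndrome and rmw_matches_rcw, the stripe geometry (NDATA, BLK) and the xor_dst parameter are all hypothetical, and xor_dst only loosely mirrors the intent of ASYNC_TX_PQ_XOR_DST. The sketch assumes the usual RAID6 definition of Q over GF(2^8) with generator 2 and polynomial 0x11d, and folds the old and new data into a single delta for brevity.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define NDATA 4  /* hypothetical: data blocks per stripe */
#define BLK   8  /* hypothetical: bytes per block */

/* Multiply by 2 in GF(2^8) with the polynomial 0x11d. */
static uint8_t gf_mul2(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0x00));
}

/*
 * P = D0 ^ D1 ^ ...,  Q = D0 ^ 2*D1 ^ 4*D2 ^ ... (Horner's scheme).
 * xor_dst == 0: overwrite p/q with the result (full calculation).
 * xor_dst != 0: XOR the result into the existing p/q.
 */
static void gen_syndrome(uint8_t data[NDATA][BLK],
                         uint8_t *p, uint8_t *q, int xor_dst)
{
    for (size_t b = 0; b < BLK; b++) {
        uint8_t wp = 0, wq = 0;

        for (int d = NDATA - 1; d >= 0; d--) {
            wq = gf_mul2(wq) ^ data[d][b];
            wp ^= data[d][b];
        }
        if (xor_dst) {
            p[b] ^= wp;
            q[b] ^= wq;
        } else {
            p[b] = wp;
            q[b] = wq;
        }
    }
}

/* Returns 1 if an rmw-style update yields the same P/Q as a full recompute. */
int rmw_matches_rcw(void)
{
    uint8_t data[NDATA][BLK], delta[NDATA][BLK];
    uint8_t p[BLK], q[BLK], p_ref[BLK], q_ref[BLK];
    const int k = 2; /* the one block being rewritten */

    for (int d = 0; d < NDATA; d++)
        for (size_t b = 0; b < BLK; b++)
            data[d][b] = (uint8_t)(17 * d + 31 * b + 5);

    gen_syndrome(data, p, q, 0); /* parity of the old stripe */

    /* delta is zero everywhere except old ^ new in block k */
    memset(delta, 0, sizeof(delta));
    for (size_t b = 0; b < BLK; b++) {
        uint8_t new_byte = (uint8_t)(0xa5 ^ b);

        delta[k][b] = data[k][b] ^ new_byte;
        data[k][b] = new_byte;
    }

    gen_syndrome(delta, p, q, 1);        /* rmw: XOR into old P/Q */
    gen_syndrome(data, p_ref, q_ref, 0); /* rcw: full recompute */

    return memcmp(p, p_ref, BLK) == 0 && memcmp(q, q_ref, BLK) == 0;
}
```

Because the syndrome is linear over XOR, both paths end up with the same P and Q. The rmw path only needs the changed block, but it has to XOR into the existing destination buffers, which is the case I assumed DMA engines do not handle.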

Can you check whether rmw_level=0 fixes the issue? See:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d06f191f8ecaef4d524e765fdb455f96392fbd42

It seems to be generated by that warning here:

http://lxr.free-electrons.com/source/crypto/async_tx/async_tx.c#L173

And indeed, if we dump the status of depend_tx here, it's already been
acked.

That doesn't happen if ASYNC_TX_DMA is disabled, in which case the
software version is used instead of our XOR engine. It doesn't happen
on any commit prior to the one mentioned above either, with the exact
same changes applied. These changes are meant to be contributed, so I
can definitely push them somewhere if needed.

I don't really know where to look, though. The change that is
causing this is probably the one in ops_run_reconstruct6, but I'm
not sure that reverting only that part would work with regard to the
rest of the patch.

Maxime
Markus
