RE: [PATCH v7 1/8] Talitos: Support for async_tx XOR offload
From: Liu Qiang-B32616 <hidden>
Date: 2012-09-12 09:45:12
Also in:
linux-crypto, linuxppc-dev
> > > Will this engine be coordinating with another to handle memory
> > > copies?  The dma mapping code for async_tx/raid is broken when dma
> > > mapping requests overlap or cross dma device boundaries [1].
> > >
> > > [1]: http://marc.info/?l=linux-arm-kernel&m=129407269402930&w=2
> >
> > Yes, it needs fsl-dma to handle memcpy copies.
> > I read your link, the unmap address is stored in talitos hwdesc, the
> > address will be unmapped when async_tx ack this descriptor, I know
> > fsl-dma won't wait this ack flag in current kernel, so I fix it in
> > fsl-dma patch 5/8. Do you mean that?
>
> Unfortunately no.  I'm open to other suggestions. but as far as I can
> see it requires deeper changes to rip out the dma mapping that happens
> in async_tx and the automatic unmapping done by drivers.  It should
> all be pushed to the client (md).
>
> Currently async_tx hides hardware details from md such that it doesn't
> even care if the operation is offloaded to hardware at all, but that
> takes things too far.  In the worst case an copy->xor chain handled by
> multiple channels results in :
>
> 1/ dma_map(copy_chan...)
> 2/ dma_map(xor_chan...)
> 3/ <exec copy>
> 4/ dma_unmap(copy_chan...)
> 5/ <exec xor> <---initiated by the copy_chan
> 6/ dma_unmap(xor_chan...)
>
> Step 2 violates the dma api since the buffers belong to the xor_chan
> until unmap.  Step 5 also causes the random completion context of the
> copy channel to bleed into submission context of the xor channel which
> is problematic.  So the order needs to be:
>
> 1/ dma_map(copy_chan...)
> 2/ <exec copy>
> 3/ dma_unmap(copy_chan...)
> 4/ dma_map(xor_chan...)
> 5/ <exec xor> <--initiated by md in a static context
> 6/ dma_unmap(xor_chan...)
>
> Also, if xor_chan and copy_chan lie with the same dma mapping domain
> (iommu or parent device) then we can map the stripe once and skip the
> extra maintenance for the duration of the chain of operations.
>
> This dumps a lot of hardware details on md, but I think it is the only
> way to get consistent semantics when arbitrary offload devices are
> involved.
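
The ordering constraint in the two lists above can be illustrated with a
small user-space model. This is only a sketch under stated assumptions:
struct model_buf, model_map(), model_unmap() and the channel ids are
hypothetical names made up for illustration, not the kernel DMA API. The
model treats a buffer as owned by at most one channel between map and
unmap, and counts a violation whenever a second channel maps it early.

```c
#include <assert.h>

/* Hypothetical model, not kernel API: a buffer is owned by at most one
 * channel between map and unmap. */
enum model_chan { CHAN_NONE = 0, CHAN_COPY, CHAN_XOR };

struct model_buf {
    enum model_chan owner;  /* channel currently holding the mapping */
    int violations;         /* overlapping map attempts seen so far */
};

/* Model of a map call: count an overlap if another channel owns the buffer. */
void model_map(struct model_buf *b, enum model_chan c)
{
    assert(b);
    if (b->owner != CHAN_NONE && b->owner != c)
        b->violations++;
    else
        b->owner = c;
}

/* Model of an unmap call: only the owning channel releases the buffer. */
void model_unmap(struct model_buf *b, enum model_chan c)
{
    assert(b);
    if (b->owner == c)
        b->owner = CHAN_NONE;
}

/* Worst-case order: both maps happen before either unmap. */
int run_overlapping_order(void)
{
    struct model_buf b = { CHAN_NONE, 0 };

    model_map(&b, CHAN_COPY);    /* 1/ */
    model_map(&b, CHAN_XOR);     /* 2/ overlaps the copy mapping */
    model_unmap(&b, CHAN_COPY);  /* 4/ */
    model_unmap(&b, CHAN_XOR);   /* 6/ */
    return b.violations;
}

/* Proposed order: each channel maps, runs and unmaps before the next. */
int run_serialized_order(void)
{
    struct model_buf b = { CHAN_NONE, 0 };

    model_map(&b, CHAN_COPY);    /* 1/ */
    model_unmap(&b, CHAN_COPY);  /* 3/ */
    model_map(&b, CHAN_XOR);     /* 4/ */
    model_unmap(&b, CHAN_XOR);   /* 6/ */
    return b.violations;
}
```

In this model run_overlapping_order() reports one violation and
run_serialized_order() reports none. It says nothing about cache
maintenance or IOMMU behaviour on real hardware.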
Thanks for your answer and the links. I did some investigation over the
past few days.

First, PowerPC processors provide hardware-assured cache coherency, so
step 5 should be fine for the hardware (but I will still avoid mapping
the same address on different devices).

Second, I have a workaround that keeps dma_map/unmap in order when two
different devices are used for offload: the next descriptor is not
submitted until the current descriptor completes:
 	if (submit->flags & ASYNC_TX_ACK)
 		async_tx_ack(tx);

 	if (depend_tx)
 		async_tx_ack(depend_tx);
+	/* do more check to support 2 devices offload? */
+	if (dma_wait_for_async_tx(tx) == DMA_ERROR)
+		panic("%s: DMA_ERROR waiting for tx\n", __func__);
 }
 EXPORT_SYMBOL_GPL(async_tx_submit);
Using your example again, the order becomes:
1/ dma_map(copy_chan...)
2/ tx->submit(tx); async_tx_ack(tx);
3/ dma_unmap(copy_chan...)
4/ dma_map(xor_chan...)
5/ <exec xor> <-- initiated by tx->submit(tx);
6/ dma_unmap(xor_chan...)
With this approach dma_run_dependencies() is effectively unused, which
ensures that a copy and an xor on the same page are processed in order
and that only one descriptor per channel is served at a time. The
dma_unmap in the driver is controlled by the client (tx->flags).

What do you think, or do you have any other suggestions? I tested this on
our PowerPC platform; I don't know whether it works on other
architectures.
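
To make the "one descriptor at a time" effect concrete, here is a minimal
user-space sketch. It is not kernel code: struct model_engine,
model_submit(), model_wait() and model_run_chain() are hypothetical names,
and completion is modelled as instantaneous. It only counts how many
descriptors are outstanding when each submit is, or is not, followed by a
wait.

```c
#include <assert.h>

/* Hypothetical user-space model, not kernel code. */
struct model_engine {
    int in_flight;       /* submitted but not yet completed */
    int peak_in_flight;  /* highest in_flight value observed */
    int completed;       /* total descriptors completed */
};

/* Model of tx->submit(tx): one more descriptor is outstanding. */
void model_submit(struct model_engine *e)
{
    assert(e);
    e->in_flight++;
    if (e->in_flight > e->peak_in_flight)
        e->peak_in_flight = e->in_flight;
}

/* Model of waiting for completion: everything outstanding finishes. */
void model_wait(struct model_engine *e)
{
    assert(e);
    e->completed += e->in_flight;
    e->in_flight = 0;
}

/* Submit n_ops descriptors, optionally waiting after each one.
 * Returns the peak number of descriptors outstanding at once. */
int model_run_chain(int n_ops, int wait_after_each)
{
    struct model_engine e = { 0, 0, 0 };
    int i;

    for (i = 0; i < n_ops; i++) {
        model_submit(&e);
        if (wait_after_each)
            model_wait(&e);
    }
    model_wait(&e);  /* drain anything still outstanding */
    assert(e.completed == n_ops);
    return e.peak_in_flight;
}
```

With wait_after_each set, the peak stays at 1 regardless of chain length;
without it, a two-operation chain reaches 2. In this model the price of
strict ordering is that submission and execution never overlap.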
Thanks.
-- Dan