Thread (8 messages) 8 messages, 2 authors, 2019-07-01

Re: [PATCH] rt2x00: fix rx queue hang

From: Soeren Moch <hidden>
Date: 2019-07-01 10:50:16
Also in: linux-wireless, lkml, stable

Hello!

On 29.06.19 10:50, Stanislaw Gruszka wrote:
Hello

On Wed, Jun 26, 2019 at 03:28:00PM +0200, Soeren Moch wrote:
quoted
Hi Stanislaw,

the good news is: your patch below also solves the issue for me. But
removing the ENTRY_DATA_STATUS_PENDING check in
rt2x00usb_kick_rx_entry() alone does not help, while removing this check
in rt2x00usb_work_rxdone() alone does the trick.

So the real race seems to be that the flags set in the completion
handler are not yet visible on the cpu core running the workqueue. And
because the worker is not rescheduled when aborted, the entry can just
wait forever.
Do you think this could make sense?
Yes.
quoted
quoted
I'm somewhat reluctant to change the order, because TX processing
might relay on it (we first mark we wait for TX status and
then mark entry is no longer owned by hardware).
OK, maybe it's just good luck that changing the order solves the rx
problem. Or can memory barriers associated with the spinlock in
rt2x00lib_dmadone() be responsible for that?
(I'm testing on a armv7 system, Cortex-A9 quadcore.)
I'm not sure, rt2x00queue_index_inc() also disable/enable interrupts,
so maybe that make race not reproducible. 
I tested some more, the race is between setting ENTRY_DATA_IO_FAILED (if
needed) and enabling workqueue processing. This enabling was done via
ENTRY_DATA_STATUS_PENDING in my patch. So setting
ENTRY_DATA_STATUS_PENDING behind the spinlock in
rt2x00lib_dmadone()/rt2x00queue_index_inc() moved this very close to
setting of ENTRY_DATA_IO_FAILED (if needed). While still in the wrong
order, this made it very unlikely for the race to show up.
quoted
While looking at it, why we double-clear ENTRY_OWNER_DEVICE_DATA in
rt2x00usb_interrupt_rxdone() directly and in rt2x00lib_dmadone() again,
rt2x00lib_dmadone() is called also on other palaces (error paths)
when we have to clear flags.
Yes, but also clearing ENTRY_OWNER_DEVICE_DATA in
rt2x00usb_interrupt_rxdone() directly is not necessary and can lead to
the wrong processing order.
quoted
 while not doing the same for tx? 
If I remember correctly we have some races on rx (not happened on tx)
that was solved by using test_and_clear_bit(ENTRY_OWNER_DEVICE_DATA).
I searched in the history, it actually was the other way around. You
changed test_and_clear_bit() to test_bit() in the TX path. I think this
is also the right way to go in RX.
quoted
Would it make more sense to possibly
set ENTRY_DATA_IO_FAILED before clearing ENTRY_OWNER_DEVICE_DATA in
rt2x00usb_interrupt_rxdone() as for tx?
I don't think so, ENTRY_DATA_IO_FAILED should be only set on error
case.
Yes of course. But if the error occurs, it should be signalled before
enabling the workqueue processing, see the race described above.

After some more testing I'm convinced that this would be the right fix
for this problem. I will send a v2 of this patch accordingly.
quoted
quoted
 However on RX
side ENTRY_DATA_STATUS_PENDING bit make no sense as we do not
wait for status. We should remove ENTRY_DATA_STATUS_PENDING on
RX side and perhaps this also will solve issue you observe.
I agree that removing the unnecessary checks is a good idea in any case.
quoted
Could you please check below patch, if it fixes the problem as well?
At least I could not trigger the problem within transferring 10GB of
data. But maybe the probability for triggering this bug is just lower
because ENTRY_OWNER_DEVICE_DATA is cleared some time before
ENTRY_DATA_STATUS_PENDING is set?
Not sure. Anyway, could you post patch removing not needed checks
with proper description/changelog ?
OK, I will do so.

Soeren
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help