Thread (22 messages) 22 messages, 4 authors, 2023-12-02

Re: [PATCH v3] net/tg3: fix race condition in tg3_reset_task()

From: Thinh Tran <hidden>
Date: 2023-11-30 22:29:42

On 11/17/2023 12:31 PM, Michael Chan wrote:
On Fri, Nov 17, 2023 at 8:19 AM Thinh Tran [off-list ref] wrote:
quoted

On 11/16/2023 3:34 PM, Michael Chan wrote:
quoted
I think it will be better to add these 2 checks in tg3_reset_task().
We are already doing a similar check there for tp->pcierr_recovery so
it is better to centralize all the related checks in the same place.
I don't think tg3_dump_state() below will cause any problems.  We'll
probably read 0xffffffff for all registers and it will actually
confirm that the registers are not readable.
I'm concerned that race conditions could still occur during the handling
of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies
in the dependency on the lower-level FW reset procedure. When the driver
executes tg3_dump_state(), and then follows it with tg3_reset_task(),
the EEH driver possibility changes in the PCI devices' state, but the
MMIO and DMA reset procedures may not have completed yet. Leading to a
crash in tg3_reset_task().  This patch tries to prevent that scenario.
It seems fragile if you are relying on such timing.  TG3_TX_TIMEOUT is
5 seconds but the actual time tg3_tx_timeout() is called varies
depending on when the TX queue is stopped.  So tg3_tx_timeout() will
be called 5 seconds or more after EEH if there are uncompleted TX
packets but we don't know precisely when.
quoted
While tg3_dump_state() is helpful, it also poses issues as it takes a
considerable amount of time, approximately 1 or 2 seconds per device.
Given our 4-port adapter, this could extend to more than 10 seconds to
write to the dmesg buffer. With the default size, the dmesg buffer may
be over-written and not retain all useful information.
If tg3_dump_state() is not useful and fills the dmesg log with useless
data, we can add the same check in tg3_dump_state() and skip it.
tg3_dump_state() is also called by tg3_process_error() so we can avoid
dumping useless data if we hit tg3_process_error() during EEH or AER.

Thanks.
I implemented the fix as you suggested and passed the tests.
I will send the next version of patch shortly.

Thanks.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help