Re: [PATCH v3] net/tg3: fix race condition in tg3_reset_task()
From: Thinh Tran <hidden>
Date: 2023-11-30 22:29:42
On 11/17/2023 12:31 PM, Michael Chan wrote:
On Fri, Nov 17, 2023 at 8:19 AM Thinh Tran [off-list ref] wrote:quoted
On 11/16/2023 3:34 PM, Michael Chan wrote:quoted
I think it will be better to add these 2 checks in tg3_reset_task(). We are already doing a similar check there for tp->pcierr_recovery so it is better to centralize all the related checks in the same place. I don't think tg3_dump_state() below will cause any problems. We'll probably read 0xffffffff for all registers and it will actually confirm that the registers are not readable.I'm concerned that race conditions could still occur during the handling of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies in the dependency on the lower-level FW reset procedure. When the driver executes tg3_dump_state(), and then follows it with tg3_reset_task(), the EEH driver possibility changes in the PCI devices' state, but the MMIO and DMA reset procedures may not have completed yet. Leading to a crash in tg3_reset_task(). This patch tries to prevent that scenario.It seems fragile if you are relying on such timing. TG3_TX_TIMEOUT is 5 seconds but the actual time tg3_tx_timeout() is called varies depending on when the TX queue is stopped. So tg3_tx_timeout() will be called 5 seconds or more after EEH if there are uncompleted TX packets but we don't know precisely when.quoted
While tg3_dump_state() is helpful, it also poses issues as it takes a considerable amount of time, approximately 1 or 2 seconds per device. Given our 4-port adapter, this could extend to more than 10 seconds to write to the dmesg buffer. With the default size, the dmesg buffer may be over-written and not retain all useful information.If tg3_dump_state() is not useful and fills the dmesg log with useless data, we can add the same check in tg3_dump_state() and skip it. tg3_dump_state() is also called by tg3_process_error() so we can avoid dumping useless data if we hit tg3_process_error() during EEH or AER. Thanks.
I implemented the fix as you suggested and passed the tests. I will send the next version of patch shortly. Thanks.