Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter
From: Jakub Kicinski <kuba@kernel.org>
Date: 2025-07-19 00:47:39
Also in:
intel-wired-lan, linux-doc, linux-rdma, lkml
On Thu, 17 Jul 2025 19:07:17 +0300 Tariq Toukan wrote:
Currently, the devlink health reporter initiates the grace period immediately after recovering an error, which blocks further recovery attempts until the grace period concludes. Since additional errors are not generally expected during this short interval, any new error reported during the grace period is not only rejected but also causes the reporter to enter an error state that requires manual intervention. This approach poses a problem in scenarios where a single root cause triggers multiple related errors in quick succession - for example, a PCI issue affecting multiple hardware queues. Because these errors are closely related and occur rapidly, it is more effective to handle them together rather than handling only the first one reported and blocking any subsequent recovery attempts. Furthermore, setting the reporter to an error state in this context can be misleading, as these multiple errors are manifestations of a single underlying issue, making it unlike the general case where additional errors are not expected during the grace period. To resolve this, introduce a configurable grace period delay attribute to the devlink health reporter. This delay starts when the first error is recovered and lasts for a user-defined duration. Once this grace period delay expires, the actual grace period begins. After the grace period ends, a new reported error will start the same flow again. Timeline summary: ----|--------|------------------------------/----------------------/-- error is error is grace period delay grace period reported recovered (recoveries allowed) (recoveries blocked) With grace period delay, create a time window during which recovery attempts are permitted, allowing all reported errors to be handled sequentially before the grace period starts. Once the grace period begins, it prevents any further error recoveries until it ends.
We are rate limiting recoveries, the "networking solution" to the problem you're describing would be to introduce a burst size. Some kind of poor man's token bucket filter. Could you say more about what designs were considered and why this one was chosen?