Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter
From: Tariq Toukan <hidden>
Date: 2025-07-27 11:00:16
Also in:
intel-wired-lan, linux-doc, linux-rdma, lkml
On 25/07/2025 3:10, Jakub Kicinski wrote:
On Thu, 24 Jul 2025 13:46:08 +0300 Tariq Toukan wrote:quoted
Design alternatives considered: 1. Recover all queues upon any error: A brute-force approach that recovers all queues on any error. While simple, it is overly aggressive and disrupts unaffected queues unnecessarily. Also, because this is handled entirely within the driver, it leads to a driver-specific implementation rather than a generic one. 2. Per-queue reporter: This design would isolate recovery handling per SQ or RQ, effectively removing interdependencies between queues. While conceptually clean, it introduces significant scalability challenges as the number of queues grows, as well as synchronization challenges across multiple reporters. 3. Error aggregation with delayed handling: Errors arriving during the grace period are saved and processed after it ends. While addressing the issue of related errors whose recovery is aborted as grace period started, this adds complexity due to synchronization needs and contradicts the assumption that no errors should occur during a healthy system’s grace period. Also, this breaks the important role of grace period in preventing an infinite loop of immediate error detection following recovery. In such cases we want to stop. 4. Allowing a fixed burst of errors before starting grace period: Allows a set number of recoveries before the grace period begins. However, it also requires limiting the error reporting window. To keep the design simple, the burst threshold becomes redundant.We're talking about burst on order of 100s, right?
It can be, typically up to O(num_cpus).
The implementation is quite simple, store an array the size of burst in which you can save recovery timestamps (in a circular fashion). On error, count how many entries are in the past N msecs.
I get your suggestion. I agree that it's also pretty simple to implement, and that it tolerates bursts. However, I think it softens the grace period role too much. It has an important disadvantage, as it tolerates non-bursts as well. It lacks the "burstness" distinguishability. IMO current grace_period has multiple goals, among them: a. let the auto-recovery mechanism handle errors as long as they are followed by some long-enough "healthy" intervals. b. break infinite loop of auto-recoveries, when the "healthy" interval is not long enough. Raise a flag to mark the need for admin intervention. In your proposal, the above doesn't hold. It won't prevent the infinite auto-recovery loop for a buggy system that has a constant rate of up to X failures in N msecs. One can argue that this can be addressed by increasing the grace_period. i.e. a current system with grace_period=N is intuitively moved to burst_size=X and grace_period=X*N. But increasing the grace_period by such a large factor has over-enforcement and hurts legitimate auto-recoveries. Again, the main point is, it lacks the ability to properly distinguish between 1. a "burst" followed by a healthy interval, and 2. a buggy system with a rate of repeated errors.
It's a clear generalization of current scheme which can be thought of as having an array of size 1 (only one most recent recovery time is saved).
It is a simple generalization indeed. But I don't agree it's a better filter.
quoted
The grace period delay design was chosen for its simplicity and precision in addressing the problem at hand. It effectively captures the temporal correlation of related errors and aligns with the original intent of the grace period as a stabilization window where further errors are unexpected, and if they do occur, they indicate an abnormal system state.Admittedly part of what I find extremely confusing when thinking about this API is that the period when recovery is **not** allowed is called "grace period".
Absolutely. We discussed this exact same insight internally. The existing name is confusing, but we won't propose modifying it at this point.
Now we add something called "grace period delay" in some places in the code referred to as "reporter_delay".. It may be more palatable if we named the first period "error burst period" and, well, the later I suppose it's too late to rename..
It can be named after what it achieves (allows handling of more errors) or what it is (a shift of the grace_period). I'm fine with both, don't have strong preference. I'd call it grace_period in case we didn't have one already :) Please let me know what name you prefer.