Re: [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication
From: Jamal Hadi Salim <jhs@mojatatu.com>
Date: 2026-03-15 16:19:13
Also in:
stable
On Sun, Mar 15, 2026 at 12:06 PM Stephen Hemminger [off-list ref] wrote:
On Sat, 14 Mar 2026 19:29:10 +0000 William Liu [off-list ref] wrote:quoted
Looping in Jamal and Victor. On Friday, March 13th, 2026 at 9:17 PM, Stephen Hemminger [off-list ref] wrote:quoted
Add a per-CPU recursion depth counter to netem_enqueue(). When netem duplicates a packet, the clone is re-enqueued at the root qdisc. If the tree contains other netem instances, this can recurse without bound, causing soft lockups and OOM. This approach was previously considered but rejected on the grounds that netem_dequeue calling enqueue on a child netem could bypass the depth check. That concern does not apply: the child netem's netem_enqueue() increments the same per-CPU counter, so the total nesting depth across all netem instances in the call chain is tracked correctly.I'm assuming you are referring to [1] (and other relevant followup messages), but has this setup been tested against the original repro? I think there was a similar draft fix originally but it failed during testing because DOS still happened [2]. If I remember correctly, the issue is less so the recursive depth but more so being able to differentiate between packets that are previously involved in duplication or not.quoted
A depth limit of 4 is generous for any legitimate configuration. Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication") Link: https://bugzilla.kernel.org/show_bug.cgi?id=220774 Cc: stable@vger.kernel.org Reported-by: William Liu <redacted> Reported-by: Savino Dicanosa <redacted> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> --- net/sched/sch_netem.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+)diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c index 0ccf74a9cb82..085fa3ad6f83 100644 --- a/net/sched/sch_netem.c +++ b/net/sched/sch_netem.c@@ -21,6 +21,7 @@ #include <linux/rtnetlink.h> #include <linux/reciprocal_div.h> #include <linux/rbtree.h> +#include <linux/percpu.h> #include <net/gso.h> #include <net/netlink.h>@@ -29,6 +30,15 @@ #define VERSION "1.3" +/* + * Limit for recursion from duplication. + * Duplicated packets are re-enqueued at the root qdisc, which may + * reach this or another netem instance, causing nested calls to + * netem_enqueue(). This per-CPU counter limits the total depth. + */ +static DEFINE_PER_CPU(unsigned int, netem_enqueue_depth); +#define NETEM_RECURSION_LIMIT 4 + /* Network Emulation Queuing algorithm. ====================================@@ -460,6 +470,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch, /* Do not fool qdisc_drop_all() */ skb->prev = NULL; + /* Guard against recursion from duplication re-injection. */ + if (unlikely(this_cpu_inc_return(netem_enqueue_depth) > + NETEM_RECURSION_LIMIT)) { + this_cpu_dec(netem_enqueue_depth); + qdisc_drop(skb, sch, to_free); + return NET_XMIT_DROP; + } + /* Random duplication */ if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng)) ++count;@@ -474,6 +492,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch, if (count == 0) { qdisc_qstats_drop(sch); __qdisc_drop(skb, to_free); + this_cpu_dec(netem_enqueue_depth); return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS; }@@ -529,6 +548,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch, qdisc_drop_all(skb, sch, to_free); if (skb2) __qdisc_drop(skb2, to_free); + this_cpu_dec(netem_enqueue_depth); return NET_XMIT_DROP; }@@ -643,8 +663,10 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch, /* Parent qdiscs accounted for 1 skb of size @prev_len */ qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len)); } else if (!skb) { + this_cpu_dec(netem_enqueue_depth); return NET_XMIT_DROP; } + this_cpu_dec(netem_enqueue_depth); return NET_XMIT_SUCCESS; } --2.51.0What about the last suggestion for a robust fix from [3]? Best, Will [1] https://lore.kernel.org/netdev/DISZZlS5CdbUKITzkIyT3jki3inTWSMecT6FplNmkpYs9bJizbs0iwRbTGMrnqEXrL3-__IjOQxdULPdZwGdKFSXJ1DZYIj6xmWPBZxerdk=@willsroot.io/ (local) [2] https://lore.kernel.org/netdev/q7G0Z7oMR2x9TWwNHOiPNsZ8lHzAuXuVgrZgGmAgkH8lkIYyTgeqXwcDrelE_fdS9OdJ4TlfS96px6O9SvnmKigNKFkiaFlStvAGPIJ3b84=@willsroot.io/ (local) [3] https://lore.kernel.org/netdev/20260111163947.811248-6-jhs@mojatatu.com/ (local)Thanks, this is a tight corner here, and not all solutions work out. You're right that the per-CPU guard alone doesn't cover the dequeue-to-child-enqueue path you described. That's exactly why the series has two patches working together: Patch 02 adds the per-CPU recursion guard, which handles the direct enqueue recursion (rootq->enqueue duplicated packet hits another netem_enqueue in the same call chain). Patch 04 restructures netem_dequeue to eliminate the pump. The old code had "goto tfifo_dequeue" which looped back after each child enqueue, so packets the child duplicated back to root would immediately get picked up by the same dequeue iteration. The new code transfers all currently-ready packets from the tfifo to the child in one batch, then does a single dequeue from the child and returns. Packets that the child duplicates back to root land in the tfifo but won't be processed until the next dequeue call from the parent — breaking the loop you diagrammed. The original repro is: tc qdisc add dev lo root handle 1: netem delay 1ms duplicate 100% tc qdisc add dev lo parent 1:1 handle 2: netem delay 1ms duplicate 100% ping -f localhost This is covered by tdc test f2a3 (nested netem config acceptance) and test 7a07 (nested netem with duplication, traffic via scapy). More tests are added in the new version. Jamal's proposed change with skb ttl would also work but it was rejected because it required adding ttl field to skb and skb size is a performance critical. As Cong pointed out adding a couple of bits for ttl makes it increase. So considered the idea and decided against it.
It was not "rejected" - other than Cong making those claims (which i responded to). Last posting i had some feedback from Willem but i dropped the ball. And that variant i posted had issues which were not caught by the AI review - required human knowledge (it didnt consider the GRO code path). If you want to go this path - i am fine with it, I will just focus on the mirred loop. Also tell your AI to Cc the stakeholders next time it posts via get_maintainers (and not cc stable) - then i will be fine reviewing. Commit log (or cover letter) would help if you cite the "documented examples" you said were broken. cheers, jamal
Thanks.