Re: [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication

From: Jamal Hadi Salim <jhs@mojatatu.com>
Date: 2026-03-15 16:19:13
Also in: stable

On Sun, Mar 15, 2026 at 12:06 PM Stephen Hemminger
[off-list ref] wrote:

On Sat, 14 Mar 2026 19:29:10 +0000
William Liu [off-list ref] wrote:

quoted

Looping in Jamal and Victor.

On Friday, March 13th, 2026 at 9:17 PM, Stephen Hemminger [off-list ref] wrote:

quoted

Add a per-CPU recursion depth counter to netem_enqueue(). When netem
duplicates a packet, the clone is re-enqueued at the root qdisc. If
the tree contains other netem instances, this can recurse without
bound, causing soft lockups and OOM.

This approach was previously considered but rejected on the grounds
that netem_dequeue calling enqueue on a child netem could bypass the
depth check. That concern does not apply: the child netem's
netem_enqueue() increments the same per-CPU counter, so the total
nesting depth across all netem instances in the call chain is tracked
correctly.

I'm assuming you are referring to [1] (and other relevant followup messages), but has this setup been tested against the original repro? I think there was a similar draft fix originally but it failed during testing because DOS still happened [2].

If I remember correctly,  the issue is less so the recursive depth but more so being able to differentiate between packets that are previously involved in duplication or not.

quoted

A depth limit of 4 is generous for any legitimate configuration.

Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220774
Cc: stable@vger.kernel.org
Reported-by: William Liu <redacted>
Reported-by: Savino Dicanosa <redacted>

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 net/sched/sch_netem.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 0ccf74a9cb82..085fa3ad6f83 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c

@@ -21,6 +21,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/reciprocal_div.h>
 #include <linux/rbtree.h>
+#include <linux/percpu.h>

 #include <net/gso.h>
 #include <net/netlink.h>

@@ -29,6 +30,15 @@

 #define VERSION "1.3"

+/*
+ * Limit for recursion from duplication.
+ * Duplicated packets are re-enqueued at the root qdisc, which may
+ * reach this or another netem instance, causing nested calls to
+ * netem_enqueue(). This per-CPU counter limits the total depth.
+ */
+static DEFINE_PER_CPU(unsigned int, netem_enqueue_depth);
+#define NETEM_RECURSION_LIMIT      4
+
 /* Network Emulation Queuing algorithm.
    ====================================

@@ -460,6 +470,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
    /* Do not fool qdisc_drop_all() */
    skb->prev = NULL;

+   /* Guard against recursion from duplication re-injection. */
+   if (unlikely(this_cpu_inc_return(netem_enqueue_depth) >
+                NETEM_RECURSION_LIMIT)) {
+           this_cpu_dec(netem_enqueue_depth);
+           qdisc_drop(skb, sch, to_free);
+           return NET_XMIT_DROP;
+   }
+
    /* Random duplication */
    if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
            ++count;

@@ -474,6 +492,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
    if (count == 0) {
            qdisc_qstats_drop(sch);
            __qdisc_drop(skb, to_free);
+           this_cpu_dec(netem_enqueue_depth);
            return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
    }

@@ -529,6 +548,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
            qdisc_drop_all(skb, sch, to_free);
            if (skb2)
                    __qdisc_drop(skb2, to_free);
+           this_cpu_dec(netem_enqueue_depth);
            return NET_XMIT_DROP;
    }

@@ -643,8 +663,10 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
            /* Parent qdiscs accounted for 1 skb of size @prev_len */
            qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
    } else if (!skb) {
+           this_cpu_dec(netem_enqueue_depth);
            return NET_XMIT_DROP;
    }
+   this_cpu_dec(netem_enqueue_depth);
    return NET_XMIT_SUCCESS;
 }

--

2.51.0

What about the last suggestion for a robust fix from [3]?

Best,
Will

[1] https://lore.kernel.org/netdev/DISZZlS5CdbUKITzkIyT3jki3inTWSMecT6FplNmkpYs9bJizbs0iwRbTGMrnqEXrL3-__IjOQxdULPdZwGdKFSXJ1DZYIj6xmWPBZxerdk=@willsroot.io/ (local)
[2] https://lore.kernel.org/netdev/q7G0Z7oMR2x9TWwNHOiPNsZ8lHzAuXuVgrZgGmAgkH8lkIYyTgeqXwcDrelE_fdS9OdJ4TlfS96px6O9SvnmKigNKFkiaFlStvAGPIJ3b84=@willsroot.io/ (local)
[3] https://lore.kernel.org/netdev/20260111163947.811248-6-jhs@mojatatu.com/ (local)

Thanks, this is a tight corner here, and not all solutions work out.

You're right that the per-CPU guard alone doesn't cover the
dequeue-to-child-enqueue path you described. That's exactly why
the series has two patches working together:

Patch 02 adds the per-CPU recursion guard, which handles the
direct enqueue recursion (rootq->enqueue duplicated packet hits
another netem_enqueue in the same call chain).

Patch 04 restructures netem_dequeue to eliminate the pump. The
old code had "goto tfifo_dequeue" which looped back after each
child enqueue, so packets the child duplicated back to root would
immediately get picked up by the same dequeue iteration. The new
code transfers all currently-ready packets from the tfifo to the
child in one batch, then does a single dequeue from the child and
returns. Packets that the child duplicates back to root land in
the tfifo but won't be processed until the next dequeue call from
the parent — breaking the loop you diagrammed.

The original repro is:

  tc qdisc add dev lo root handle 1: netem delay 1ms duplicate 100%
  tc qdisc add dev lo parent 1:1 handle 2: netem delay 1ms duplicate 100%
  ping -f localhost

This is covered by tdc test f2a3 (nested netem config acceptance)
and test 7a07 (nested netem with duplication, traffic via scapy).
More tests are added in the new version.

Jamal's proposed change with skb ttl would also work but
it was rejected because it required adding ttl field to skb
and skb size is a performance critical. As Cong pointed out
adding a couple of bits for ttl makes it increase.
So considered the idea and decided against it.

It was not "rejected" - other than Cong making those claims (which i
responded to).
Last posting i had some feedback from Willem but i dropped the ball.
And that variant i posted had issues which were not caught by the AI
review - required human knowledge (it didnt consider the GRO code
path).
If you want to go this path - i am fine with it, I will just focus on
the mirred loop.
Also tell your AI to Cc the stakeholders next time it posts via
get_maintainers (and not cc stable) - then i will be fine reviewing.
Commit log (or cover letter) would help if you cite the "documented
examples" you said were broken.

cheers,
jamal

Thanks.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help