Thread (37 messages) 37 messages, 3 authors, 2025-06-22

Re: [BUG] net/sched: Soft Lockup/Task Hang and OOM Loop in netem_dequeue

From: Jamal Hadi Salim <jhs@mojatatu.com>
Date: 2025-05-23 11:01:22

On Thu, May 22, 2025 at 5:03 PM William Liu [off-list ref] wrote:
On Thursday, May 22nd, 2025 at 12:34 PM, Jamal Hadi Salim [off-list ref] wrote:
quoted
quoted
If we do a per cpu global variable approach like in act_mirred.c to track nesting, then wouldn't this break in the case of having multiple normal qdisc setups run in parallel across multiple interfaces?

A single skb cannot enter netem via multiple cpus. You can have
multiple cpus entering the netem but they would be different skbs - am
i missing something? Note mirred uses per-cpu counter which should
suffice for being per skb counters.
Ah right, you are correct. This approach will be fine then. We were originally concerned about another netem qdisc being interwoven during the operations, but that is not possible.
quoted
quoted
This brings us back to the approach where we don't allow duplication in netem if a parent qdisc is a netem with duplication enabled. However, one issue we are worried about is in regards to qdisc_replace. This means this check would have to happen everytime we want to duplicate something in enqueue right? That isn't ideal either, but let me know if you know of a better place to add the check.

I didnt follow - can you be more specific?
Oh, I meant that disablement of duplication due to an ancestor netem in the qdisc tree isn't possible at class instantiation time because
qdiscs can be replaced. So we would have to do the check at every skb enqueue, which would be far from ideal. The per cpu approach is better.

Anyways, please let us know how the patch below looks - it passes all the netem category test cases in tc-testing and avoids the problem for me. Thank you for all the help so far!
looks ok from 30k feet. Comments:
You dont need a count variable anymore because the per-cpu variable
serves the same goal - so please get rid of it.
Submit a formal patchset - including at least one tdc test(your
validation tests are sufficient) then we can do a better review.
And no netdev comments are complete if we dont talk about the xmas
tree variable layout. Not your fault on the state of the lights on
that tree but dont make it worse;-> move the nest_level assignment as
the first line in the function.

cheers,
jamal
quoted hunk ↗ jump to hunk
From c96d94ab2155e18318a29510f0ee8f3983a14274 Mon Sep 17 00:00:00 2001
From: William Liu <redacted>
Date: Thu, 22 May 2025 13:08:30 -0700
Subject: [PATCH] net/sched: Fix duplication logic in netem_enqueue

netem_enqueue's duplication prevention logic is broken when multiple
netem qdiscs with duplication enabled are stacked together as seen
in [1]. Ensure that duplication does not happen more than once in a
qdisc tree with netem qdiscs. Also move the duplication logic to
happen after the initial packet has finished the enqueue process
to preserve the accuracy of the limit check.

[1] https://lore.kernel.org/netdev/8DuRWwfqjoRDLDmBMlIfbrsZg9Gx50DHJc1ilxsEBNe2D6NMoigR_eIRIG0LOjMc3r10nUUZtArXx4oZBIdUfZQrwjcQhdinnMis_0G7VEk=@willsroot.io/ (local)

Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
Reported-by: William Liu <redacted>
Reported-by: Savino Dicanosa <redacted>
Signed-off-by: William Liu <redacted>
Signed-off-by: Savino Dicanosa <redacted>
---
 net/sched/sch_netem.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index fdd79d3ccd8c..651fae1cd7d6 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -167,6 +167,8 @@ struct netem_skb_cb {
    u64         time_to_send;
 };

+static DEFINE_PER_CPU(unsigned int, enqueue_nest_level);
+
 static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
 {
    /* we assume we can use skb next/prev/tstamp as storage for rb_node */
@@ -454,13 +456,21 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
    struct sk_buff *skb2 = NULL;
    struct sk_buff *segs = NULL;
    unsigned int prev_len = qdisc_pkt_len(skb);
+   int nest_level = __this_cpu_inc_return(enqueue_nest_level);
+   int retval;
    int count = 1;

    /* Do not fool qdisc_drop_all() */
    skb->prev = NULL;

-   /* Random duplication */
-   if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
+   /*
+    * Random duplication
+    * We must avoid duplicating a duplicated packet, but there is no
+    * good way to track this. The nest_level check disables duplication
+    * if a netem qdisc duplicates the skb in the call chain already
+    */
+   if (q->duplicate && nest_level < 1 &&
+       q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
        ++count;

    /* Drop packet? */
@@ -473,7 +483,8 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
    if (count == 0) {
        qdisc_qstats_drop(sch);
        __qdisc_drop(skb, to_free);
-       return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+       retval = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+       goto dec_nest_level;
    }

    /* If a delay is expected, orphan the skb. (orphaning usually takes
@@ -528,7 +539,8 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
        qdisc_drop_all(skb, sch, to_free);
        if (skb2)
            __qdisc_drop(skb2, to_free);
-       return NET_XMIT_DROP;
+       retval = NET_XMIT_DROP;
+       goto dec_nest_level;
    }

    /*
@@ -642,9 +654,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
        /* Parent qdiscs accounted for 1 skb of size @prev_len */
        qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
    } else if (!skb) {
-       return NET_XMIT_DROP;
+       retval = NET_XMIT_DROP;
+       goto dec_nest_level;
    }
-   return NET_XMIT_SUCCESS;
+   retval = NET_XMIT_SUCCESS;
+
+dec_nest_level:
+   __this_cpu_dec(enqueue_nest_level);
+   return retval;
 }

 /* Delay the next round with a new future slot with a
--
2.43.0
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help