Re: [Linuxarm] Re: [PATCH net-next 2/3] net: sched: implement... | linux-can

Re: [Linuxarm] Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

From: Yunsheng Lin <hidden>
Date: 2021-05-31 12:40:21
Also in: bpf, lkml, netdev

On 2021/5/31 9:10, Yunsheng Lin wrote:

On 2021/5/31 8:40, Yunsheng Lin wrote:

quoted

On 2021/5/31 4:21, Jakub Kicinski wrote:

quoted

On Sun, 30 May 2021 09:37:09 +0800 Yunsheng Lin wrote:

quoted

On 2021/5/30 2:49, Jakub Kicinski wrote:

quoted

The fact that MISSED is only cleared under q->seqlock does not matter,
because setting it and ->enqueue() are not under any lock. If the thread
gets interrupted between:

	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
	    qdisc_run_begin(q)) {

and ->enqueue() we can't guarantee that something else won't come in,
take q->seqlock and clear MISSED.

thread1                thread2             thread3
# holds seqlock
                       qdisc_run_begin(q)
                       set(MISSED)
pfifo_fast_dequeue
  clear(MISSED)
  # recheck the queue
qdisc_run_end()  
                       ->enqueue()  
                                            q->flags & TCQ_F_CAN_BYPASS..
                                            qdisc_run_begin() # true
                                            sch_direct_xmit()
                       qdisc_run_begin()
                       set(MISSED)

Or am I missing something?

Re-checking nolock_qdisc_is_empty() may or may not help.
But it doesn't really matter because there is no ordering
requirement between thread2 and thread3 here.

I were more focued on explaining that using MISSED is reliable
as sch_may_need_requeuing() checking in RFCv3 [1] to indicate a
empty qdisc, and forgot to mention the data race described in
RFCv3, which is kind of like the one described above:

"There is a data race as below:

      CPU1                                   CPU2
qdisc_run_begin(q)                            .
        .                                q->enqueue()
sch_may_need_requeuing()                      .
    return true                               .
        .                                     .
        .                                     .
    q->enqueue()                              .

When above happen, the skb enqueued by CPU1 is dequeued after the
skb enqueued by CPU2 because sch_may_need_requeuing() return true.
If there is not qdisc bypass, the CPU1 has better chance to queue
the skb quicker than CPU2.

This patch does not take care of the above data race, because I
view this as similar as below:

Even at the same time CPU1 and CPU2 write the skb to two socket
which both heading to the same qdisc, there is no guarantee that
which skb will hit the qdisc first, becuase there is a lot of
factor like interrupt/softirq/cache miss/scheduling afffecting
that."

Does above make sense? Or any idea to avoid it?

1. https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/

We agree on this one.

Could you draw a sequence diagram of different CPUs (like the one
above) for the case where removing re-checking nolock_qdisc_is_empty()
under q->seqlock leads to incorrect behavior?

When nolock_qdisc_is_empty() is not re-checking under q->seqlock, we
may have:


        CPU1                                   CPU2
  qdisc_run_begin(q)                            .
          .                                enqueue skb1
deuqueue skb1 and clear MISSED                  .
          .                        nolock_qdisc_is_empty() return true
    requeue skb                                 .
   q->enqueue()                                 .
    set MISSED                                  .
        .                                       .
 qdisc_run_end(q)                               .
        .                              qdisc_run_begin(q)
        .                             transmit skb2 directly
        .                           transmit the requeued skb1

The problem here is that skb1 and skb2  are from the same CPU, which
means they are likely from the same flow, so we need to avoid this,
right?


         CPU1                                   CPU2
   qdisc_run_begin(q)                            .
           .                                enqueue skb1
     dequeue skb1                                .
           .                                     .
netdevice stopped and MISSED is clear            .
           .                        nolock_qdisc_is_empty() return true
     requeue skb                                 .
           .                                     .
           .                                     .
           .                                     .
  qdisc_run_end(q)                               .
           .                              qdisc_run_begin(q)
           .                             transmit skb2 directly
           .                           transmit the requeued skb1

The above sequence diagram seems more correct, it is basically about how to
avoid transmitting a packet directly bypassing the requeued packet.

quoted

If there is no such case would you be willing to repeat the benchmark
with and without this test?

I had did some interesting testing to show how adjust a small number
of code has some notiable performance degrade.

1. I used below patch to remove the nolock_qdisc_is_empty() testing
   under q->seqlock.

@@ -3763,17 +3763,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
        if (q->flags & TCQ_F_NOLOCK) {
                if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
                    qdisc_run_begin(q)) {
-                       /* Retest nolock_qdisc_is_empty() within the protection
-                        * of q->seqlock to ensure qdisc is indeed empty.
-                        */
-                       if (unlikely(!nolock_qdisc_is_empty(q))) {
-                               rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-                               __qdisc_run(q);
-                               qdisc_run_end(q);
-
-                               goto no_lock_out;
-                       }
-
                        qdisc_bstats_cpu_update(q, skb);
                        if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
                            !nolock_qdisc_is_empty(q))
@@ -3786,7 +3775,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
                qdisc_run(q);

-no_lock_out:
                if (unlikely(to_free))
                        kfree_skb_list(to_free);
                return rc;

which has the below performance improvement:

 threads      v1             v1 + above patch          delta
    1       3.21Mpps            3.20Mpps               -0.3%
    2       5.56Mpps            5.94Mpps               +4.9%
    4       5.58Mpps            5.60Mpps               +0.3%
    8       2.76Mpps            2.77Mpps               +0.3%
   16       2.23Mpps            2.23Mpps               +0.0%

v1 = this patchset.


2. After the above testing, it seems worthwhile to remove the
   nolock_qdisc_is_empty() testing under q->seqlock, so I used below
   patch to make sure nolock_qdisc_is_empty() always return false for
   netdev queue stopped case。

--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c

@@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
 static void qdisc_maybe_clear_missed(struct Qdisc *q,
                                     const struct netdev_queue *txq)
 {
+       set_bit(__QDISC_STATE_DRAINING, &q->state);
+
+       /* Make sure DRAINING is set before clearing MISSED
+        * to make sure nolock_qdisc_is_empty() always return
+        * false for aoviding transmitting a packet directly
+        * bypassing the requeued packet.
+        */
+       smp_mb__after_atomic();
+
        clear_bit(__QDISC_STATE_MISSED, &q->state);

        /* Make sure the below netif_xmit_frozen_or_stopped()

@@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
         */
        if (!netif_xmit_frozen_or_stopped(txq))
                set_bit(__QDISC_STATE_MISSED, &q->state);
-       else
-               set_bit(__QDISC_STATE_DRAINING, &q->state);
 }

which has the below performance data:

 threads      v1          v1 + above two patch          delta
    1       3.21Mpps            3.20Mpps               -0.3%
    2       5.56Mpps            5.94Mpps               +4.9%
    4       5.58Mpps            5.02Mpps                -10%
    8       2.76Mpps            2.77Mpps               +0.3%
   16       2.23Mpps            2.23Mpps               +0.0%

So the adjustment in qdisc_maybe_clear_missed() seems to have
caused about 10% performance degradation for 4 threads case.

And the cpu topdown perf data suggested that icache missed and
bad Speculation play the main factor to those performance difference.

I tried to control the above factor by removing the inline function
and add likely and unlikely tag for netif_xmit_frozen_or_stopped()
in sch_generic.c.

And after removing the inline mark for function in sch_generic.c
and add likely/unlikely tag for netif_xmit_frozen_or_stopped()
checking in in sch_generic.c, we got notiable performance improvement
for 1/2 threads case(some performance improvement for ip forwarding
test too), but not for 4 threads case.

So it seems we need to ignore the performance degradation for 4
threads case? or any idea?

quoted

Sorry for dragging the review out..

.

_______________________________________________
Linuxarm mailing list -- linuxarm@openeuler.org
To unsubscribe send an email to linuxarm-leave@openeuler.org

_______________________________________________
Linuxarm mailing list -- linuxarm@openeuler.org
To unsubscribe send an email to linuxarm-leave@openeuler.org

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help