Re: [PATCH net-next v2] tun: support NAPI for packets received from batched XDP buffs

From: Jason Wang <jasowang@redhat.com>
Date: 2022-02-28 04:20:28
Also in: bpf, lkml

On Mon, Feb 28, 2022 at 12:06 PM Eric Dumazet [off-list ref] wrote:


On 2/25/22 01:02, Harold Huang wrote:

quoted

In tun, NAPI is supported and we can also use NAPI in the path of
batched XDP buffs to accelerate packet processing. What is more, after
we use NAPI, GRO is also supported. The iperf shows that the throughput of
single stream could be improved from 4.5Gbps to 9.2Gbps. Additionally, 9.2
Gbps nearly reachs the line speed of the phy nic and there is still about
15% idle cpu core remaining on the vhost thread.

Test topology:

[iperf server]<--->tap<--->dpdk testpmd<--->phy nic<--->[iperf client]

Iperf stream:

Before:
...
[  5]   5.00-6.00   sec   558 MBytes  4.68 Gbits/sec    0   1.50 MBytes
[  5]   6.00-7.00   sec   556 MBytes  4.67 Gbits/sec    1   1.35 MBytes
[  5]   7.00-8.00   sec   556 MBytes  4.67 Gbits/sec    2   1.18 MBytes
[  5]   8.00-9.00   sec   559 MBytes  4.69 Gbits/sec    0   1.48 MBytes
[  5]   9.00-10.00  sec   556 MBytes  4.67 Gbits/sec    1   1.33 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  5.39 GBytes  4.63 Gbits/sec   72          sender
[  5]   0.00-10.04  sec  5.39 GBytes  4.61 Gbits/sec               receiver

After:
...
[  5]   5.00-6.00   sec  1.07 GBytes  9.19 Gbits/sec    0   1.55 MBytes
[  5]   6.00-7.00   sec  1.08 GBytes  9.30 Gbits/sec    0   1.63 MBytes
[  5]   7.00-8.00   sec  1.08 GBytes  9.25 Gbits/sec    0   1.72 MBytes
[  5]   8.00-9.00   sec  1.08 GBytes  9.25 Gbits/sec   77   1.31 MBytes
[  5]   9.00-10.00  sec  1.08 GBytes  9.24 Gbits/sec    0   1.48 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.8 GBytes  9.28 Gbits/sec  166          sender
[  5]   0.00-10.04  sec  10.8 GBytes  9.24 Gbits/sec               receiver
....

Reported-at: https://lore.kernel.org/all/CACGkMEvTLG0Ayg+TtbN4q4pPW-ycgCCs3sC3-TF8cuRTf7Pp1A@mail.gmail.com (local)
Signed-off-by: Harold Huang <redacted>
---
v1 -> v2
  - fix commit messages
  - add queued flag to avoid void unnecessary napi suggested by Jason

  drivers/net/tun.c | 20 ++++++++++++++++----
  1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index fed85447701a..c7d8b7c821d8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c

@@ -2379,7 +2379,7 @@ static void tun_put_page(struct tun_page *tpage)
  }

  static int tun_xdp_one(struct tun_struct *tun,
-                    struct tun_file *tfile,
+                    struct tun_file *tfile, int *queued,
                     struct xdp_buff *xdp, int *flush,
                     struct tun_page *tpage)
  {

@@ -2388,6 +2388,7 @@ static int tun_xdp_one(struct tun_struct *tun,
      struct virtio_net_hdr *gso = &hdr->gso;
      struct bpf_prog *xdp_prog;
      struct sk_buff *skb = NULL;
+     struct sk_buff_head *queue;
      u32 rxhash = 0, act;
      int buflen = hdr->buflen;
      int err = 0;

@@ -2464,7 +2465,15 @@ static int tun_xdp_one(struct tun_struct *tun,
          !tfile->detached)
              rxhash = __skb_get_hash_symmetric(skb);

-     netif_receive_skb(skb);
+     if (tfile->napi_enabled) {
+             queue = &tfile->sk.sk_write_queue;
+             spin_lock(&queue->lock);
+             __skb_queue_tail(queue, skb);
+             spin_unlock(&queue->lock);
+             (*queued)++;
+     } else {
+             netif_receive_skb(skb);
+     }

      /* No need to disable preemption here since this function is
       * always called with bh disabled

@@ -2492,7 +2501,7 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
      if (ctl && (ctl->type == TUN_MSG_PTR)) {
              struct tun_page tpage;
              int n = ctl->num;
-             int flush = 0;
+             int flush = 0, queued = 0;

              memset(&tpage, 0, sizeof(tpage));

@@ -2501,12 +2510,15 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)

              for (i = 0; i < n; i++) {
                      xdp = &((struct xdp_buff *)ctl->ptr)[i];
-                     tun_xdp_one(tun, tfile, xdp, &flush, &tpage);
+                     tun_xdp_one(tun, tfile, &queued, xdp, &flush, &tpage);


How big n can be ?

BTW I could not find where m->msg_controllen was checked in tun_sendmsg().

struct tun_msg_ctl *ctl = m->msg_control;

if (ctl && (ctl->type == TUN_MSG_PTR)) {

     int n = ctl->num;  // can be set to values in [0..65535]

     for (i = 0; i < n; i++) {

         xdp = &((struct xdp_buff *)ctl->ptr)[i];


I really do not understand how we prevent malicious user space from
crashing the kernel.

It looks to me the only user for this is vhost-net which limits it to
64, userspace can't use sendmsg() directly on tap.

Thanks

quoted

              }

              if (flush)
                      xdp_do_flush();

+             if (tfile->napi_enabled && queued > 0)
+                     napi_schedule(&tfile->napi);
+
              rcu_read_unlock();
              local_bh_enable();

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help