Re: [RFC PATCH 00/24] Introducing AF_XDP support

From: William Tu <hidden>
Date: 2018-03-28 00:07:32

On Tue, Mar 27, 2018 at 2:37 AM, Jesper Dangaard Brouer
[off-list ref] wrote:

On Mon, 26 Mar 2018 14:58:02 -0700
William Tu [off-list ref] wrote:

Again high count for NMI ?!?

Maybe you just forgot to tell perf that you want it to decode the
bpf_prog correctly?

https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols

Enable via:
 $ sysctl net/core/bpf_jit_kallsyms=1

And use perf report (while BPF is STILL LOADED):

 $ perf report --kallsyms=/proc/kallsyms

E.g. for emailing this you can use this command:

 $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40

Thanks, I followed the steps. Here is the result for l2fwd:
# Total Lost Samples: 119
#
# Samples: 2K of event 'cycles:ppp'
# Event count (approx.): 25675705627
#
# Overhead  CPU  Command  Shared Object       Symbol
# ........  ...  .......  ..................  ..................................
#
    10.48%  013  xdpsock  xdpsock             [.] main
     9.77%  013  xdpsock  [kernel.vmlinux]    [k] clflush_cache_range
     8.45%  013  xdpsock  [kernel.vmlinux]    [k] nmi
     8.07%  013  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
     7.81%  013  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
     4.95%  013  xdpsock  [kernel.vmlinux]    [k] ixgbe_xmit_frame_ring
     4.66%  013  xdpsock  [kernel.vmlinux]    [k] skb_store_bits
     4.39%  013  xdpsock  [kernel.vmlinux]    [k] syscall_return_via_sysret
     3.93%  013  xdpsock  [kernel.vmlinux]    [k] pfn_to_dma_pte
     2.62%  013  xdpsock  [kernel.vmlinux]    [k] __intel_map_single
     2.53%  013  xdpsock  [kernel.vmlinux]    [k] __alloc_skb
     2.36%  013  xdpsock  [kernel.vmlinux]    [k] iommu_no_mapping
     2.21%  013  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
     2.07%  013  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
     1.98%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_node_track_caller
     1.94%  013  xdpsock  [kernel.vmlinux]    [k] ksize
     1.84%  013  xdpsock  [kernel.vmlinux]    [k] validate_xmit_skb_list
     1.62%  013  xdpsock  [kernel.vmlinux]    [k] kmem_cache_alloc_node
     1.48%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_reserve.isra.37
     1.21%  013  xdpsock  xdpsock             [.] xq_enq
     1.08%  013  xdpsock  [kernel.vmlinux]    [k] intel_alloc_iova

You did set net/core/bpf_jit_kallsyms=1 and used the correct perf commands to decode
bpf_prog, so the third entry in the perf output, 'nmi', is likely a real NMI call... which looks wrong.

Thanks, you're right. Let me dig more on this NMI behavior.

And l2fwd under "perf stat" looks OK to me. There are few context
switches, the CPU is fully utilized, and 1.17 insn per cycle seems OK.

Performance counter stats for 'CPU(s) 6':
  10000.787420      cpu-clock (msec)          #    1.000 CPUs utilized
            24      context-switches          #    0.002 K/sec
             0      cpu-migrations            #    0.000 K/sec
             0      page-faults               #    0.000 K/sec
22,361,333,647      cycles                    #    2.236 GHz
13,458,442,838      stalled-cycles-frontend   #   60.19% frontend cycles idle
26,251,003,067      instructions              #    1.17  insn per cycle
                                              #    0.51  stalled cycles per insn
 4,938,921,868      branches                  #  493.853 M/sec
     7,591,739      branch-misses             #    0.15% of all branches
  10.000835769 seconds time elapsed

This perf stat output also indicates that something is wrong.

The 1.17 insn per cycle is NOT okay; it is too low (compared to what
I usually see, e.g. 2.36 insn per cycle).

It clearly reports 'stalled-cycles-frontend' at '60.19% frontend
cycles idle'. This means your CPU has a bottleneck fetching
instructions, as explained by Andi Kleen here [1].

[1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
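
For reference, the derived figures in the right-hand column of the perf stat output can be recomputed from the raw counters. The following is a minimal sketch, assuming the usual definitions (instructions divided by cycles, and stalled frontend cycles divided by cycles or by instructions); the function name `derived_metrics` is made up for illustration and is not part of perf.

```python
# Raw counters copied from the 10-second `perf stat` run on CPU 6 quoted above.
cycles = 22_361_333_647
instructions = 26_251_003_067
stalled_frontend = 13_458_442_838


def derived_metrics(cycles, instructions, stalled_frontend):
    """Return (IPC, frontend-idle fraction, stalled cycles per instruction).

    Hypothetical helper: it only mirrors the ratios printed by perf stat.
    """
    ipc = instructions / cycles
    frontend_idle = stalled_frontend / cycles
    stalls_per_insn = stalled_frontend / instructions
    return ipc, frontend_idle, stalls_per_insn


ipc, frontend_idle, stalls_per_insn = derived_metrics(
    cycles, instructions, stalled_frontend
)
print(f"{ipc:.2f} insn per cycle")                      # 1.17
print(f"{frontend_idle:.2%} frontend cycles idle")      # 60.19%
print(f"{stalls_per_insn:.2f} stalled cycles per insn")  # 0.51
```

The numbers match the perf stat output, so the low IPC and the 60% frontend-idle figure describe the same thing from two angles: more than half of all cycles pass without the frontend delivering instructions.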

Thanks for the link!
It's definitely weird that my frontend (fetch and decode) stall
rate is so high.
I assumed the xdpsock code is small and should all fit into the icache.
However, another perf stat run on xdpsock l2fwd shows:

13,720,109,581      stalled-cycles-frontend   #    60.01% frontend cycles idle      (23.82%)
 <not supported>      stalled-cycles-backend
     7,994,837      branch-misses             #     0.16% of all branches           (23.80%)
   996,874,424      bus-cycles                #    99.679 M/sec                     (23.80%)
18,942,220,445      ref-cycles                #  1894.067 M/sec                     (28.56%)
   100,983,226      LLC-loads                 #    10.097 M/sec                     (23.80%)
     4,897,089      LLC-load-misses           #     4.85% of all LL-cache hits      (23.80%)
    66,659,889      LLC-stores                #     6.665 M/sec                     (9.52%)
         8,373      LLC-store-misses          #     0.837 K/sec                     (9.52%)
   158,178,410      LLC-prefetches            #    15.817 M/sec                     (9.52%)
     3,011,180      LLC-prefetch-misses       #     0.301 M/sec                     (9.52%)
 8,190,383,109      dTLB-loads                #   818.971 M/sec                     (9.52%)
    20,432,204      dTLB-load-misses          #     0.25% of all dTLB cache hits    (9.52%)
 3,729,504,674      dTLB-stores               #   372.920 M/sec                     (9.52%)
       992,231      dTLB-store-misses         #     0.099 M/sec                     (9.52%)
 <not supported>      dTLB-prefetches
 <not supported>      dTLB-prefetch-misses
        11,619      iTLB-loads                #     0.001 M/sec                     (9.52%)
     1,874,756      iTLB-load-misses          # 16135.26% of all iTLB cache hits    (14.28%)

The iTLB-load-misses count is extremely high, which is probably the cause of the
frontend stalls.
Do you know of any way to improve the iTLB hit rate?
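
To show where the 16135.26% comes from, here is a small sketch that recomputes the TLB miss percentages from the counters above, assuming perf simply divides the miss count by the corresponding load count. The helper name `miss_ratio` is made up for illustration. Note that the trailing percentages in parentheses indicate the events were multiplexed (each counter was active for only part of the run, and the two iTLB events for different shares), so the counts are scaled estimates and a ratio above 100% should be read with caution.

```python
# Counters copied from the second `perf stat` run above.
itlb_loads = 11_619
itlb_load_misses = 1_874_756
dtlb_loads = 8_190_383_109
dtlb_load_misses = 20_432_204


def miss_ratio(misses, loads):
    """Hypothetical helper: misses as a fraction of loads."""
    return misses / loads


itlb_ratio = miss_ratio(itlb_load_misses, itlb_loads)
dtlb_ratio = miss_ratio(dtlb_load_misses, dtlb_loads)

print(f"iTLB: {itlb_ratio:.2%} of all iTLB loads")  # 16135.26%
print(f"dTLB: {dtlb_ratio:.2%} of all dTLB loads")  # 0.25%
```

A miss count roughly 161 times the load count suggests the two iTLB events are not counting the same population of accesses, so the absolute miss count (about 1.9 million over the run) may be a more useful figure than the percentage.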

Thanks
William
