Thread (69 messages) 69 messages, 10 authors, 2009-04-08

Re: Multicast packet loss

From: Eric Dumazet <hidden>
Date: 2009-02-02 16:57:35

Neil Horman a écrit :
On Sun, Feb 01, 2009 at 01:40:39PM +0100, Eric Dumazet wrote:
quoted
Eric Dumazet a écrit :
quoted
Kenny Chang a écrit :
quoted
Ah, sorry, here's the test program attached.

We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
2.6.29.-rcX.

Right now, we are trying to step through the kernel versions until we
see where the performance drops significantly.  We'll try 2.6.29-rc soon
and post the result.
I tried your program on my dev machines and 2.6.29 (each machine : two quad core cpus, 32bits kernel)

With 8 clients, about 10% packet loss, 

Might be a scheduling problem, not sure... 50.000 packets per second, x 8 cpus = 400.000
wakeups per second... But at least UDP receive path seems OK.

Thing is the receiver (softirq that queues the packet) seems to fight on socket lock with
readers...

I tried to setup IRQ affinities, but it doesnt work any more on bnx2 (unless using msi_disable=1)

I tried playing with ethtool -C|c G|g params...
And /proc/net/core/rmem_max (and setsockopt(RCVBUF) to set bigger receive buffers in your program)

I can have 0% packet loss if booting with msi_disable and

echo 1 >/proc/irq/16/smp_affinities

(16 being interrupt of eth0 NIC)

then, a second run gave me errors, about 2%, oh well...


oprofile numbers without playing IRQ affinities:

CPU: Core 2, speed 2999.89 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
327928   10.1427  schedule
259625    8.0301  mwait_idle
187337    5.7943  __skb_recv_datagram
109854    3.3977  lock_sock_nested
104713    3.2387  tick_nohz_stop_sched_tick
98831     3.0568  select_nohz_load_balancer
88163     2.7268  skb_release_data
78552     2.4296  update_curr
75241     2.3272  getnstimeofday
71400     2.2084  set_next_entity
67629     2.0917  get_next_timer_interrupt
67375     2.0839  sched_clock_tick
58112     1.7974  enqueue_entity
56462     1.7463  udp_recvmsg
55049     1.7026  copy_to_user
54277     1.6788  sched_clock_cpu
54031     1.6712  __copy_skb_header
51859     1.6040  __slab_free
51786     1.6017  prepare_to_wait_exclusive
51776     1.6014  sock_def_readable
50062     1.5484  try_to_wake_up
42182     1.3047  __switch_to
41631     1.2876  read_tsc
38337     1.1857  tick_nohz_restart_sched_tick
34358     1.0627  cpu_idle
34194     1.0576  native_sched_clock
33812     1.0458  pick_next_task_fair
33685     1.0419  resched_task
33340     1.0312  sys_recvfrom
33287     1.0296  dst_release
32439     1.0033  kmem_cache_free
32131     0.9938  hrtimer_start_range_ns
29807     0.9219  udp_queue_rcv_skb
27815     0.8603  task_rq_lock
26875     0.8312  __update_sched_clock
23912     0.7396  sock_queue_rcv_skb
21583     0.6676  __wake_up_sync
21001     0.6496  effective_load
20531     0.6350  hrtick_start_fair




With IRQ affinities and msi_disable (no packet drops)

CPU: Core 2, speed 3000.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
79788    10.3815  schedule
69422     9.0328  mwait_idle
44877     5.8391  __skb_recv_datagram
28629     3.7250  tick_nohz_stop_sched_tick
27252     3.5459  select_nohz_load_balancer
24320     3.1644  lock_sock_nested
20833     2.7107  getnstimeofday
20666     2.6889  skb_release_data
18612     2.4217  set_next_entity
17785     2.3141  get_next_timer_interrupt
17691     2.3018  udp_recvmsg
17271     2.2472  sched_clock_tick
16032     2.0860  copy_to_user
14785     1.9237  update_curr
12512     1.6280  prepare_to_wait_exclusive
12498     1.6262  __slab_free
11380     1.4807  read_tsc
11145     1.4501  sched_clock_cpu
10598     1.3789  __switch_to
9588      1.2475  pick_next_task_fair
9480      1.2335  cpu_idle
9218      1.1994  sys_recvfrom
9008      1.1721  tick_nohz_restart_sched_tick
8977      1.1680  dst_release
8930      1.1619  native_sched_clock
8392      1.0919  kmem_cache_free
8124      1.0570  hrtimer_start_range_ns
7274      0.9464  bnx2_interrupt
7175      0.9336  __copy_skb_header
7006      0.9116  try_to_wake_up
6949      0.9042  sock_def_readable
6787      0.8831  enqueue_entity
6772      0.8811  __update_sched_clock
6349      0.8261  finish_task_switch
6164      0.8020  copy_from_user
5096      0.6631  resched_task
5007      0.6515  sysenter_past_esp


I will try to investigate a litle bit more in following days if time permits.
I'm not 100% versed on this, but IIRC, some hardware simply can't set irq
affinity when operating in msi interrupt mode.  If this is the case with this
particular bnx2 card, then I would expect some packet loss, simply due to the
constant cache misses.  It would be interesting to re-run your oprofile cases,
counting L2 cache hits/misses (if your cpu supports that class of counter) for
both bnx2 running in msi enabled mode and msi disabled mode.  It would also be
interesting to use a different card, that can set irq affinity, and compare loss
with irqbalance on, and irqbalance off with irq afninty set to all cpus.
booted with msi_disable=1, IRQ of eth0 handled by CPU0 only, so that
oprofile results sorted on CPU0 numbers.

We can see scheduler has hard time to cope with this workload with more of two CPUS

OK up to 30.000 (* 8 sockets) packets per second. 

CPU0 is 100% handling softirq (ksoftirqd/0)


CPU: Core 2, speed 3000.31 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
Samples on CPU 0
Samples on CPU 1
Samples on CPU 2
Samples on CPU 3
Samples on CPU 4
Samples on CPU 5
Samples on CPU 6
Samples on CPU 7
samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        symbol name
6152     12.5595  3         0.0098  3         0.0090  5         0.0156  1         0.0582  0              0  2         0.0065  3         0.0169  enqueue_entity
4453      9.0909  2         0.0065  3         0.0090  4         0.0125  5         0.2910  0              0  1         0.0033  2         0.0113  try_to_wake_up
3837      7.8333  3         0.0098  8         0.0241  0              0  0              0  0              0  0              0  0              0  sock_def_readable
3694      7.5414  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __copy_skb_header
2320      4.7363  1         0.0033  2         0.0060  2         0.0062  1         0.0582  1         0.0028  2         0.0065  0              0  resched_task
1818      3.7115  6         0.0196  32        0.0962  0              0  0              0  0              0  0              0  0              0  sock_queue_rcv_skb
1776      3.6257  0              0  0              0  0              0  0              0  0              0  0              0  0              0  udp_queue_rcv_skb
1677      3.4236  0              0  1         0.0030  0              0  1         0.0582  1         0.0028  0              0  0              0  __slab_alloc
1658      3.3848  260       0.8496  303       0.9109  289       0.9021  24        1.3970  418       1.1730  326       1.0626  173       0.9733  sched_clock_cpu
1614      3.2950  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __wake_up_sync
1600      3.2664  0              0  1         0.0030  0              0  1         0.0582  1         0.0028  0              0  0              0  select_task_rq_fair
1569      3.2032  1299      4.2447  1530      4.5996  1271      3.9675  6         0.3492  1677      4.7062  1275      4.1559  759       4.2703  update_curr
1532      3.1276  4         0.0131  4         0.0120  0              0  2         0.1164  1         0.0028  1         0.0033  1         0.0056  task_rq_lock
1325      2.7050  1         0.0033  7         0.0210  0              0  0              0  0              0  0              0  0              0  skb_queue_tail
1273      2.5989  1         0.0033  1         0.0030  1         0.0031  0              0  0              0  1         0.0033  0              0  enqueue_task_fair
1227      2.5050  0              0  0              0  0              0  0              0  0              0  0              0  0              0  effective_load
1071      2.1865  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __udp4_lib_rcv
1009      2.0599  0              0  0              0  0              0  2         0.1164  1         0.0028  0              0  0              0  activate_task
940       1.9190  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __wake_up_common
930       1.8986  0              0  2         0.0060  0              0  0              0  0              0  0              0  1         0.0056  account_scheduler_latency
859       1.7537  0              0  0              0  0              0  0              0  0              0  0              0  1         0.0056  __skb_clone
609       1.2433  0              0  0              0  0              0  1         0.0582  1         0.0028  1         0.0033  0              0  enqueue_task
588       1.2004  3         0.0098  2         0.0060  5         0.0156  8         0.4657  2         0.0056  3         0.0098  2         0.0113  kmem_cache_alloc
477       0.9738  307       1.0032  322       0.9680  358       1.1175  27        1.5716  338       0.9485  315       1.0268  203       1.1421  native_sched_clock
441       0.9003  0              0  0              0  0              0  0              0  0              0  0              0  0              0  skb_clone
408       0.8329  0              0  0              0  0              0  0              0  0              0  0              0  0              0  ip_route_input
375       0.7656  0              0  0              0  0              0  0              0  0              0  0              0  0              0  bnx2_poll_work
366       0.7472  248       0.8104  269       0.8087  293       0.9146  22        1.2806  289       0.8110  332       1.0822  157       0.8833  __update_sched_clock
327       0.6676  1         0.0033  0              0  0              0  0              0  0              0  2         0.0065  1         0.0056  place_entity
265       0.5410  54        0.1765  62        0.1864  39        0.1217  3         0.1746  84        0.2357  61        0.1988  12        0.0675  rb_insert_color
194       0.3961  2662      8.6985  3291      9.8936  3231     10.0858  372      21.6531  2994      8.4021  3299     10.7533  1719      9.6714  mwait_idle



This problem completely disapears if I launch all clients bounded to CPU1
(and NIC irq still on CPU0)

taskset -p2 ./mcasttest.sh

(No packet loss, while CPU1 has 0% idle time...)
We have less context switches. Once awakened, a task can read several packets in its socket.

CPU: Core 2, speed 3000.31 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
Samples on CPU 0
Samples on CPU 1
Samples on CPU 2
Samples on CPU 3
Samples on CPU 4
Samples on CPU 5
Samples on CPU 6
Samples on CPU 7
samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        symbol name
25316    13.6664  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __copy_skb_header
14584     7.8729  13        0.0083  2         0.2670  0              0  0              0  1         0.0218  1         0.1577  1         0.2604  task_rq_lock
11624     6.2750  10657     6.7650  3         0.4005  2         0.5013  5         0.7278  36        0.7862  2         0.3155  1         0.2604  update_curr
10038     5.4188  318       0.2019  0              0  0              0  0              0  0              0  0              0  0              0  sock_def_readable
10021     5.4097  0              0  0              0  0              0  0              0  0              0  0              0  0              0  bnx2_interrupt
7777      4.1983  11        0.0070  1         0.1335  2         0.5013  2         0.2911  8         0.1747  3         0.4732  1         0.2604  try_to_wake_up
6559      3.5408  0              0  0              0  0              0  0              0  0              0  0              0  0              0  udp_queue_rcv_skb
6389      3.4490  257       0.1631  0              0  0              0  0              0  0              0  0              0  0              0  sock_queue_rcv_skb
6305      3.4036  6         0.0038  0              0  0              0  0              0  0              0  0              0  1         0.2604  __slab_alloc
5661      3.0560  44        0.0279  2         0.2670  1         0.2506  5         0.7278  0              0  1         0.1577  0              0  kmem_cache_alloc
5529      2.9847  5         0.0032  1         0.1335  4         1.0025  0              0  14        0.3057  3         0.4732  1         0.2604  enqueue_entity
4706      2.5404  64        0.0406  0              0  0              0  0              0  0              0  0              0  0              0  skb_queue_tail
4390      2.3699  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __udp4_lib_rcv
4043      2.1825  0              0  0              0  0              0  0              0  0              0  0              0  0              0  uhci_irq
3897      2.1037  0              0  0              0  0              0  0              0  0              0  0              0  0              0  bnx2_poll_work
3556      1.9196  0              0  296      39.5194  261      65.4135  258      37.5546  650      14.1952  263      41.4826  257      66.9271  mwait_idle
3449      1.8619  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __skb_clone
3348      1.8074  0              0  0              0  0              0  0              0  0              0  0              0  0              0  skb_clone
3243      1.7507  1653      1.0493  7         0.9346  3         0.7519  2         0.2911  63        1.3758  1         0.1577  1         0.2604  sched_clock_cpu
3068      1.6562  1        6.3e-04  0              0  0              0  0              0  0              0  0              0  0              0  __wake_up_sync
2923      1.5779  1        6.3e-04  0              0  0              0  0              0  0              0  0              0  0              0  check_preempt_wakeup
2588      1.3971  1        6.3e-04  4         0.5340  2         0.5013  1         0.1456  1         0.0218  2         0.3155  0              0  enqueue_task_fair
2399      1.2951  0              0  0              0  0              0  0              0  0              0  0              0  0              0  ip_route_input
1986      1.0721  5         0.0032  0              0  0              0  0              0  1         0.0218  0              0  0              0  __wake_up_common
1777      0.9593  132       0.0838  0              0  0              0  3         0.4367  8         0.1747  1         0.1577  1         0.2604  rb_insert_color
1754      0.9469  34        0.0216  0              0  0              0  0              0  0              0  0              0  0              0  __sk_mem_schedule
1550      0.8367  0              0  0              0  0              0  0              0  0              0  0              0  0              0  irq_entries_start
1527      0.8243  0              0  13        1.7356  2         0.5013  0              0  104       2.2712  6         0.9464  0              0  get_next_timer_interrupt
1398      0.7547  6         0.0038  0              0  0              0  1         0.1456  1         0.0218  0              0  0              0  select_task_rq_fair
1159      0.6257  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __alloc_skb


Exchanging CPU0/CPU1 to get oprofile number sorted on the CPU used by user application :

CPU: Core 2, speed 3000.31 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
Samples on CPU 0
Samples on CPU 1
Samples on CPU 2
Samples on CPU 3
Samples on CPU 4
Samples on CPU 5
Samples on CPU 6
Samples on CPU 7
samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        symbol name
6040     10.1815  8         0.0134  4         0.3208  6         1.3699  1         0.1580  3         1.0909  5         0.6039  0              0  schedule
6014     10.1377  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __skb_recv_datagram
4730      7.9733  36        0.0604  0              0  0              0  0              0  0              0  0              0  0              0  skb_release_data
4014      6.7663  2         0.0034  33        2.6464  1         0.2283  6         0.9479  0              0  1         0.1208  0              0  copy_to_user
3732      6.2910  3708      6.2167  4         0.3208  1         0.2283  3         0.4739  3         1.0909  2         0.2415  0              0  update_curr
3446      5.8089  0              0  0              0  0              0  0              0  0              0  0              0  0              0  lock_sock_nested
2430      4.0962  0              0  0              0  0              0  0              0  0              0  0              0  0              0  udp_recvmsg
2028      3.4186  73        0.1224  0              0  0              0  2         0.3160  0              0  1         0.1208  1         0.3861  __slab_free
1898      3.1994  43        0.0721  0              0  0              0  0              0  0              0  0              0  0              0  dst_release
1645      2.7730  0              0  0              0  0              0  0              0  0              0  0              0  0              0  memcpy_toiovec
1635      2.7561  0              0  0              0  0              0  1         0.1580  0              0  0              0  0              0  copy_from_user
1407      2.3718  2         0.0034  0              0  2         0.4566  6         0.9479  0              0  5         0.6039  1         0.3861  sysenter_past_esp
1389      2.3414  58        0.0972  3         0.2406  3         0.6849  4         0.6319  2         0.7273  3         0.3623  0              0  kmem_cache_free
1135      1.9133  1         0.0017  0              0  0              0  0              0  0              0  0              0  0              0  release_sock
1069      1.8020  0              0  0              0  0              0  0              0  0              0  0              0  0              0  prepare_to_wait_exclusive
1031      1.7379  3         0.0050  0              0  0              0  0              0  0              0  1         0.1208  0              0  put_prev_task_fair
1007      1.6975  0              0  0              0  0              0  0              0  0              0  0              0  0              0  sock_rfree
926       1.5609  0              0  0              0  0              0  0              0  0              0  0              0  0              0  sys_recvfrom
838       1.4126  0              0  0              0  0              0  0              0  0              0  0              0  0              0  skb_copy_datagram_iovec
697       1.1749  27        0.0453  1         0.0802  0              0  0              0  1         0.3636  0              0  0              0  add_partial
697       1.1749  0              0  0              0  0              0  0              0  0              0  1         0.1208  0              0  dequeue_task
604       1.0182  0              0  0              0  0              0  0              0  0              0  0              0  0              0  sock_recvmsg
582       0.9811  933       1.5642  17        1.3633  1         0.2283  2         0.3160  1         0.3636  6         0.7246  0              0  sched_clock_cpu
525       0.8850  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __sk_mem_reclaim
512       0.8631  0              0  0              0  1         0.2283  0              0  0              0  1         0.1208  1         0.3861  __switch_to
489       0.8243  0              0  0              0  0              0  0              0  0              0  0              0  0              0  fget_light
450       0.7586  1         0.0017  2         0.1604  1         0.2283  0              0  0              0  4         0.4831  0              0  rb_erase
409       0.6894  95        0.1593  0              0  0              0  0              0  0              0  0              0  0              0  wakeup_preempt_entity
360       0.6068  0              0  0              0  0              0  0              0  0              0  0              0  0              0  move_addr_to_user
348       0.5866  0              0  0              0  0              0  0              0  0              0  0              0  0              0  sys_socketcall
347       0.5849  1         0.0017  0              0  0              0  0              0  0              0  0              0  1         0.3861  set_next_entity
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help