Re: Multicast packet loss

Multicast packet loss · Kenny Chang <hidden> · 2009-01-30
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-01-30
Re: Multicast packet loss · Denys Fedoryschenko <hidden> · 2009-01-30
Re: Multicast packet loss · Neil Horman <nhorman@tuxdriver.com> · 2009-01-30
Re: Multicast packet loss · Kenny Chang <hidden> · 2009-01-30
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-01-30
Re: Multicast packet loss · Neil Horman <nhorman@tuxdriver.com> · 2009-01-31
Re: Multicast packet loss · Kenny Chang <hidden> · 2009-02-02
Re: Multicast packet loss · Kenny Chang <hidden> · 2009-02-02
Re: Multicast packet loss · Neil Horman <nhorman@tuxdriver.com> · 2009-02-03
Re: Multicast packet loss · Kenny Chang <hidden> · 2009-02-03
Re: Multicast packet loss · Neil Horman <nhorman@tuxdriver.com> · 2009-02-04
Re: Multicast packet loss · Kenny Chang <hidden> · 2009-02-04
Re: Multicast packet loss · Wesley Chow <hidden> · 2009-02-04
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-02-04
Re: Multicast packet loss · Neil Horman <nhorman@tuxdriver.com> · 2009-02-05
Re: Multicast packet loss · Wesley Chow <hidden> · 2009-02-05
Re: Multicast packet loss · Neil Horman <nhorman@tuxdriver.com> · 2009-02-05
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-02-01
Re: Multicast packet loss · Neil Horman <nhorman@tuxdriver.com> · 2009-02-02
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-02-02
Re: Multicast packet loss · Neil Horman <nhorman@tuxdriver.com> · 2009-02-02
Re: Multicast packet loss · Wes Chow <hidden> · 2009-02-02
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-02-02
Re: Multicast packet loss · Wes Chow <hidden> · 2009-02-02
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-02-02
Re: Multicast packet loss · Kenny Chang <hidden> · 2009-02-03
Re: Multicast packet loss · Neil Horman <nhorman@tuxdriver.com> · 2009-02-04
Re: Multicast packet loss · Kenny Chang <hidden> · 2009-02-26
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-02-28
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-01
Re: Multicast packet loss · David Miller <davem@davemloft.net> · 2009-03-04
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-04
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-07
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-08
Re: Multicast packet loss · David Miller <davem@davemloft.net> · 2009-03-09
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-09
Re: Multicast packet loss · David Miller <davem@davemloft.net> · 2009-03-13
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-13
Re: Multicast packet loss · David Miller <davem@davemloft.net> · 2009-03-13
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-13
[PATCH] net: reorder fields of struct socket · Eric Dumazet <hidden> · 2009-03-14
Re: [PATCH] net: reorder fields of struct socket · David Miller <davem@davemloft.net> · 2009-03-16
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-16
Re: Multicast packet loss · Peter Zijlstra <peterz@infradead.org> · 2009-03-17
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-17
Re: Multicast packet loss · Peter Zijlstra <peterz@infradead.org> · 2009-03-17
Re: Multicast packet loss · Brian Bloniarz <hidden> · 2009-03-17
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-17
Re: Multicast packet loss · David Stevens <hidden> · 2009-03-17
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-17
Re: Multicast packet loss · Brian Bloniarz <hidden> · 2009-04-03
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-04-05
Re: Multicast packet loss · Brian Bloniarz <hidden> · 2009-04-06
Re: Multicast packet loss · Brian Bloniarz <hidden> · 2009-04-06
Re: Multicast packet loss · Brian Bloniarz <hidden> · 2009-04-07
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-04-08
Re: Multicast packet loss · Brian Bloniarz <hidden> · 2009-03-09
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-10
Re: Multicast packet loss · Brian Bloniarz <hidden> · 2009-03-10
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-11
Re: Multicast packet loss · Brian Bloniarz <hidden> · 2009-03-12
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-03-12
Re: Multicast packet loss · Christoph Lameter <hidden> · 2009-02-27
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-02-27
Re: Multicast packet loss · Christoph Lameter <hidden> · 2009-02-27
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-02-27
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-02-27
Re: Multicast packet loss · Eric Dumazet <hidden> · 2009-02-02

From: Peter Zijlstra <peterz@infradead.org>
Date: 2009-03-17 10:12:14

On Mon, 2009-03-16 at 23:22 +0100, Eric Dumazet wrote:

Here is the last incantation of the patch, that of course should be
split in two parts and better Changelog for further discussion on lkml.

I read the entire thread up to now, and I still don't really understand
the Changelog, sorry :(

[PATCH] softirq: Introduce mechanism to defer wakeups

Some network workloads need to call scheduler too many times. For example,
each received multicast frame can wakeup many threads. ksoftirqd is then
not able to drain NIC RX queues in time and we get frame losses and high
latencies.

This patch adds an infrastructure to delay work done in
sock_def_readable() at end of do_softirq(). This needs to
make available current->softirq_context even if !CONFIG_TRACE_IRQFLAGS

How does that solve the wakeup issue?

Signed-off-by: Eric Dumazet <redacted>
---

quoted hunk ↗ jump to hunk

--- a/kernel/softirq.c
+++ b/kernel/softirq.c

@@ -158,6 +158,42 @@ void local_bh_enable_ip(unsigned long ip)
 }
 EXPORT_SYMBOL(local_bh_enable_ip);
 
+
+#define SOFTIRQ_DELAY_END (struct softirq_delay *)1L
+static DEFINE_PER_CPU(struct softirq_delay *, softirq_delay_head) = {
+	SOFTIRQ_DELAY_END
+};

Why the magic termination value? Can't we NULL terminate the list

+
+/*
+ * Caller must disable preemption, and take care of appropriate
+ * locking and refcounting
+ */

Shouldn't we call it __softirq_delay_queue() if the caller needs to
disabled preemption?

Futhermore, don't we always require the caller to take care of lifetime
issues when we queue something?

+int softirq_delay_queue(struct softirq_delay *sdel)
+{
+	if (!sdel->next) {
+		sdel->next = __get_cpu_var(softirq_delay_head);
+		__get_cpu_var(softirq_delay_head) = sdel;
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * Because locking is provided by subsystem, please note
+ * that sdel->func(sdel) is responsible for setting sdel->next to NULL
+ */
+static void softirq_delay_exec(void)
+{
+	struct softirq_delay *sdel;
+
+	while ((sdel = __get_cpu_var(softirq_delay_head)) != SOFTIRQ_DELAY_END) {
+		__get_cpu_var(softirq_delay_head) = sdel->next;
+		sdel->func(sdel);	/*	sdel->next = NULL;*/
+		}
+}

Why can't we write:

  struct softirq_delay *sdel, *next;

  sdel = __get_cpu_var(softirq_delay_head);
  __get_cpu_var(softirq_delay_head) = NULL;

  while (sdel) {
    next = sdel->next;
    sdel->func(sdel);
    sdel = next;
  }

Why does it matter what happens to sdel->next? we've done the callback.

Aah, the crux is in the re-use policy.. that most certainly does deserve
a comment.

How about we make sdel->next point to itself in the init case?

Then we can write:

  while (sdel) {
    next = sdel->next;
    sdel->next = sdel;
    sdel->func(sdel);
    sdel = next;
  }

and have the enqueue bit look like:

int __softirq_delay_queue(struct softirq_delay *sdel)
{
  struct softirq_delay **head;

  if (sdel->next != sdel)
    return 0;

  head = &__get_cpu_var(softirq_delay_head);
  sdel->next = *head;
  *head = sdel;
  return 1;
}

quoted hunk ↗ jump to hunk

@@ -1691,6 +1694,43 @@ static void sock_def_readable(struct sock *sk, int len)
 	read_unlock(&sk->sk_callback_lock);
 }
 
+/*
+ * helper function called by softirq_delay_exec(),
+ * if inet_def_readable() queued us.
+ */
+static void sock_readable_defer(struct softirq_delay *sdel)
+{
+	struct sock *sk = container_of(sdel, struct sock, sk_delay);
+
+	sdel->next = NULL;
+	/*
+	 * At this point, we dont own a lock on socket, only a reference.
+	 * We must commit above write, or another cpu could miss a wakeup
+	 */
+	smp_wmb();

Where's the matching barrier?

+	sock_def_readable(sk, 0);
+	sock_put(sk);
+}
+
+/*
+ * Custom version of sock_def_readable()
+ * We want to defer scheduler processing at the end of do_softirq()
+ * Called with socket locked.
+ */
+void inet_def_readable(struct sock *sk, int len)
+{
+	if (running_from_softirq()) {
+		if (softirq_delay_queue(&sk->sk_delay))
+			/*
+			 * If we queued this socket, take a reference on it
+			 * Caller owns socket lock, so write to sk_delay.next
+			 * will be committed before unlock.
+			 */
+			sock_hold(sk);
+	} else
+		sock_def_readable(sk, len);
+}

OK, so the idea is to handle a bunch of packets and instead of waking N
threads for each packet, only wake them once at the end of the batch?

Sounds like a sensible idea..

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help