Re: Multicast packet loss
From: Peter Zijlstra <peterz@infradead.org>
Date: 2009-03-17 10:12:14
On Mon, 2009-03-16 at 23:22 +0100, Eric Dumazet wrote:
Here is the last incantation of the patch, that of course should be split in two parts and better Changelog for further discussion on lkml.
I read the entire thread up to now, and I still don't really understand the Changelog, sorry :(
[PATCH] softirq: Introduce mechanism to defer wakeups Some network workloads need to call scheduler too many times. For example, each received multicast frame can wakeup many threads. ksoftirqd is then not able to drain NIC RX queues in time and we get frame losses and high latencies. This patch adds an infrastructure to delay work done in sock_def_readable() at end of do_softirq(). This needs to make available current->softirq_context even if !CONFIG_TRACE_IRQFLAGS
How does that solve the wakeup issue?
Signed-off-by: Eric Dumazet <redacted> ---
quoted hunk ↗ jump to hunk
--- a/kernel/softirq.c +++ b/kernel/softirq.c@@ -158,6 +158,42 @@ void local_bh_enable_ip(unsigned long ip) } EXPORT_SYMBOL(local_bh_enable_ip); + +#define SOFTIRQ_DELAY_END (struct softirq_delay *)1L +static DEFINE_PER_CPU(struct softirq_delay *, softirq_delay_head) = { + SOFTIRQ_DELAY_END +};
Why the magic termination value? Can't we NULL terminate the list
+ +/* + * Caller must disable preemption, and take care of appropriate + * locking and refcounting + */
Shouldn't we call it __softirq_delay_queue() if the caller needs to disabled preemption? Futhermore, don't we always require the caller to take care of lifetime issues when we queue something?
+int softirq_delay_queue(struct softirq_delay *sdel)
+{
+ if (!sdel->next) {
+ sdel->next = __get_cpu_var(softirq_delay_head);
+ __get_cpu_var(softirq_delay_head) = sdel;
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * Because locking is provided by subsystem, please note
+ * that sdel->func(sdel) is responsible for setting sdel->next to NULL
+ */
+static void softirq_delay_exec(void)
+{
+ struct softirq_delay *sdel;
+
+ while ((sdel = __get_cpu_var(softirq_delay_head)) != SOFTIRQ_DELAY_END) {
+ __get_cpu_var(softirq_delay_head) = sdel->next;
+ sdel->func(sdel); /* sdel->next = NULL;*/
+ }
+}
Why can't we write:
struct softirq_delay *sdel, *next;
sdel = __get_cpu_var(softirq_delay_head);
__get_cpu_var(softirq_delay_head) = NULL;
while (sdel) {
next = sdel->next;
sdel->func(sdel);
sdel = next;
}
Why does it matter what happens to sdel->next? we've done the callback.
Aah, the crux is in the re-use policy.. that most certainly does deserve
a comment.
How about we make sdel->next point to itself in the init case?
Then we can write:
while (sdel) {
next = sdel->next;
sdel->next = sdel;
sdel->func(sdel);
sdel = next;
}
and have the enqueue bit look like:
int __softirq_delay_queue(struct softirq_delay *sdel)
{
struct softirq_delay **head;
if (sdel->next != sdel)
return 0;
head = &__get_cpu_var(softirq_delay_head);
sdel->next = *head;
*head = sdel;
return 1;
}
quoted hunk ↗ jump to hunk
@@ -1691,6 +1694,43 @@ static void sock_def_readable(struct sock *sk, int len) read_unlock(&sk->sk_callback_lock); } +/* + * helper function called by softirq_delay_exec(), + * if inet_def_readable() queued us. + */ +static void sock_readable_defer(struct softirq_delay *sdel) +{ + struct sock *sk = container_of(sdel, struct sock, sk_delay); + + sdel->next = NULL; + /* + * At this point, we dont own a lock on socket, only a reference. + * We must commit above write, or another cpu could miss a wakeup + */ + smp_wmb();
Where's the matching barrier?
+ sock_def_readable(sk, 0);
+ sock_put(sk);
+}
+
+/*
+ * Custom version of sock_def_readable()
+ * We want to defer scheduler processing at the end of do_softirq()
+ * Called with socket locked.
+ */
+void inet_def_readable(struct sock *sk, int len)
+{
+ if (running_from_softirq()) {
+ if (softirq_delay_queue(&sk->sk_delay))
+ /*
+ * If we queued this socket, take a reference on it
+ * Caller owns socket lock, so write to sk_delay.next
+ * will be committed before unlock.
+ */
+ sock_hold(sk);
+ } else
+ sock_def_readable(sk, len);
+}OK, so the idea is to handle a bunch of packets and instead of waking N threads for each packet, only wake them once at the end of the batch? Sounds like a sensible idea..