Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
From: Eric Dumazet <hidden>
Date: 2014-06-23 08:33:38
Subsystem:
networking [general], networking [ipv4/ipv6], networking [sockets], the rest · Maintainers:
"David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, David Ahern, Ido Schimmel, Kuniyuki Iwashima, Willem de Bruijn, Linus Torvalds
On Sun, 2014-06-22 at 12:07 -0700, dormando wrote:
Update on testing: I only have two machines that crash on their own frequently (more like one, even). Unfortunately something happened to the datacenter it's in and it was offline for a week. The machine normally crashes after 1.5-4d, averaging 2d. It's done about three days total time without a new crash. I also have the kernel running in another datacenter for ~10 days.. but it takes 30-150 days to crash in that one. So, inconclusive, but still promising. If the machine survives the week it probably means it's fixed, or at least greatly reduced. I saw that one of your patches got queued for stable, but all three were necessary to fix udpkill. What's your plan for cleanup/upstreaming? Did you folks end up running udpkill under the tester thing?
I did not test udpkill, as the known problem is the DST_NOCACHE flag. We end up calling sk_dst_set(sk, dst) with a dst having this flag set. So maybe DST_NOCACHE should be renamed, if we _can _ cache a dst like this. Its meaning is really that dst_release() has to track when refcount reaches 0 so that last owner fress dst, but we need to respect rcu grace period. Fixing sk_dst_set() as I did is not enough, as it is only reducing race window. Something like following : include/net/sock.h | 4 ++-- net/core/dst.c | 16 +++++++++++----- net/ipv4/ip_tunnel.c | 7 +------ 3 files changed, 14 insertions(+), 13 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 07b7fcd60d80..173cae485de1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h@@ -1730,8 +1730,8 @@ sk_dst_get(struct sock *sk) rcu_read_lock(); dst = rcu_dereference(sk->sk_dst_cache); - if (dst) - dst_hold(dst); + if (dst && !atomic_inc_not_zero(&dst->__refcnt)) + dst = NULL; rcu_read_unlock(); return dst; }
diff --git a/net/core/dst.c b/net/core/dst.c
index 80d6286c8b62..a028409ee438 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c@@ -269,6 +269,15 @@ again: } EXPORT_SYMBOL(dst_destroy); +static void dst_destroy_rcu(struct rcu_head *head) +{ + struct dst_entry *dst = container_of(head, struct dst_entry, rcu_head); + + dst = dst_destroy(dst); + if (dst) + __dst_free(dst); +} + void dst_release(struct dst_entry *dst) { if (dst) {
@@ -276,11 +285,8 @@ void dst_release(struct dst_entry *dst) newrefcnt = atomic_dec_return(&dst->__refcnt); WARN_ON(newrefcnt < 0); - if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt) { - dst = dst_destroy(dst); - if (dst) - __dst_free(dst); - } + if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt) + call_rcu(&dst->rcu_head, dst_destroy_rcu); } } EXPORT_SYMBOL(dst_release);
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 097b3e7c1e8f..cc3b7fd34555 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c@@ -73,12 +73,7 @@ static void __tunnel_dst_set(struct ip_tunnel_dst *idst, { struct dst_entry *old_dst; - if (dst) { - if (dst->flags & DST_NOCACHE) - dst = NULL; - else - dst_clone(dst); - } + dst_clone(dst); old_dst = xchg((__force struct dst_entry **)&idst->dst, dst); dst_release(old_dst); }