Re: [PATCH 2/2] udp: RCU handling for Unicast packets.
From: Paul E. McKenney <hidden>
Date: 2008-10-29 20:18:01
On Wed, Oct 29, 2008 at 09:00:13PM +0100, Eric Dumazet wrote:
Paul E. McKenney a écrit :quoted
On Wed, Oct 29, 2008 at 01:28:15PM -0500, Corey Minyard wrote:quoted
Eric Dumazet wrote:quoted
Corey Minyard a écrit :quoted
Paul E. McKenney wrote:quoted
On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:quoted
Corey Minyard a écrit :quoted
Eric Dumazet wrote:quoted
Corey Minyard found a race added in commit 271b72c7fa82c2c7a795bc16896149933110672d (udp: RCU handling for Unicast packets.) "If the socket is moved from one list to another list in-between the time the hash is calculated and the next field is accessed, and the socket has moved to the end of the new list, the traversal will not complete properly on the list it should have, since the socket will be on the end of the new list and there's not a way to tell it's on a new list and restart the list traversal. I think that this can be solved by pre-fetching the "next" field (with proper barriers) before checking the hash." This patch corrects this problem, introducing a new sk_for_each_rcu_safenext() macro.You also need the appropriate smp_wmb() in udp_lib_get_port() after sk_hash is set, I think, so the next field is guaranteed to be changed after the hash value is changed.Not sure about this one Corey. If a reader catches previous value of item->sk_hash, two cases are to be taken into : 1) its udp_hashfn(net, sk->sk_hash) is != hash -> goto begin : Reader will redo its scan 2) its udp_hashfn(net, sk->sk_hash) is == hash -> next pointer is good enough : it points to next item in same hash chain. No need to rescan the chain at this point. Yes we could miss the fact that a new port was bound and this UDP message could be lost.3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was removed, freed, reallocated, and then readded with the same hash value, possibly carrying the reader to a new position in the same list.If I understand this, without the smp_wmb(), it is possible that the next field can be written to main memory before the hash value is written. If that happens, the following can occur: CPU1 CPU2 next is set to NULL (end of new list)Well, if this item is injected to the same chain, next wont be set to NULL. That would mean previous writers deleted all items from the chain.I put my comment in the wrong place, I wasn't talking about being injected into the same chain.quoted
In this case, readers can see NULL, it is not a problem at all. List is/was empty. An application cannot complain a packet is not handled if its bind() syscall is not yet completed :) If item is injected on another chain, we will detect hash mismatch and redo full scan.If the item is injected onto the end of another chain, the next field will be NULL and you won't detect a hash mismatch. It's basically the same issue as the previous race, though a lot more subtle and unlikely. If you get (from the previous socket) an old value of "sk_hash" and (from the new socket) a new value of "next" that is NULL, you will terminate the loop when you should have restarted it. I'm pretty sure that can occur without the write barrier.One way of dealing with this is to keep a tail pointer. Then, if the element containing the NULL pointer doesn't match the tail pointer seen at the start of the search, or if the tail pointer has changed, restart the search. Memory barriers will be needed. ;-)Hum... Another way of handling all those cases and avoid memory barriers would be to have different "NULL" pointers. Each hash chain should have a unique "NULL" pointer (in the case of UDP, it can be the 128 values : [ (void*)0 .. (void *)127 ] Then, when performing a lookup, a reader should check the "NULL" pointer it get at the end of its lookup has is the "hash" value of its chain. If not -> restart the loop, aka "goto begin;" :) We could avoid memory barriers then. In the two cases Corey mentioned, this trick could let us avoid memory barriers. (existing one in sk_add_node_rcu(sk, &hslot->head); should be enough) What do you think ?
Kinky!!! ;-) Then the rcu_dereference() would be supplying the needed memory barriers. Hmmm... I guess that the only confusion would be if the element got removed and then added to the same list. But then if its pointer was pseudo-NULL, then that would mean that all subsequent elements had been removed, and all preceding ones added after the scan started. Which might well be harmless, but I must defer to you on this one at the moment. If you need a larger hash table, another approach would be to set the pointer's low-order bit, allowing the upper bits to be a full-sized index -- or even a pointer to the list header. Just make very sure to clear the pointer when freeing, or an element on the freelist could end up looking like a legitimate end of list... Which again might well be safe, but why inflict this on oneself? Thanx, Paul