Re: [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
From: Kuniyuki Iwashima <hidden>
Date: 2025-03-25 17:54:25
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Date: Tue, 25 Mar 2025 10:18:30 -0400
Kuniyuki Iwashima wrote:quoted
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Date: Mon, 24 Mar 2025 15:44:40 -0400quoted
Kuniyuki Iwashima wrote:quoted
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Date: Mon, 24 Mar 2025 10:59:49 -0400quoted
Kuniyuki Iwashima wrote:quoted
__udp_enqueue_schedule_skb() has the following condition: if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) goto drop; sk->sk_rcvbuf is initialised by net.core.rmem_default and later can be configured by SO_RCVBUF, which is limited by net.core.rmem_max, or SO_RCVBUFFORCE. If we set INT_MAX to sk->sk_rcvbuf, the condition is always false as sk->sk_rmem_alloc is also signed int. Then, the size of the incoming skb is added to sk->sk_rmem_alloc unconditionally. This results in integer overflow (possibly multiple times) on sk->sk_rmem_alloc and allows a single socket to have skb up to net.core.udp_mem[1]. For example, if we set a large value to udp_mem[1] and INT_MAX to sk->sk_rcvbuf and flood packets to the socket, we can see multiple overflows: # cat /proc/net/sockstat | grep UDP: UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15 ^- PAGE_SHIFT # ss -uam State Recv-Q ... UNCONN -1757018048 ... <-- flipping the sign repeatedly skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0) Previously, we had a boundary check for INT_MAX, which was removed by commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc"). A complete fix would be to revert it and cap the right operand by INT_MAX: rmem = atomic_add_return(size, &sk->sk_rmem_alloc); if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX)) goto uncharge_drop; but we do not want to add the expensive atomic_add_return() back just for the corner case. So, let's perform the first check as unsigned int to detect the integer overflow. Note that we still allow a single wraparound, which can be observed from userspace, but it's acceptable considering it's unlikely that no recv() is called for a long period, and the negative value will soon flip back to positive after a few recv() calls.Can we do better than this?Another approach I had in mind was to restore the original validation under the recvq lock but without atomic ops like 1. add another u32 as union of sk_rmem_alloc (only for UDP) 2. access it with READ_ONCE() or under the recvq lock 3. perform the validation under the lock But it requires more changes around the error queue handling and the general socket impl, so will be too invasive for net.git but maybe worth a try for net-next ?Definitely not net material. Adding more complexity here would also need some convincing benchmark data probably.quoted
quoted
Is this because of the "Always allow at least one packet" below, and due to testing the value of the counter without skb->truesize added?Yes, that's the reason although we don't receive a single >INT_MAX packet.I was surprised that we don't take the current skb size into account when doing this calculation. Turns out that this code used to do that. commit 363dc73acacb ("udp: be less conservative with sock rmem accounting") made this change: - if (rmem && (rmem + size > sk->sk_rcvbuf)) + if (rmem > sk->sk_rcvbuf) goto drop; The special consideration to allow one packet is to avoid starvation with small rcvbuf, judging also from this review comment: https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/ (local)Interesting, thanks for the info ! Now it's allowed to exceed by the total size of the incoming skb on every CPUs, and a user may notice that rmem > rcvbuf via ss, but I guess it's allowed because the fast recovery is expected.quoted
That clearly doesn't apply when rcvbuf is near INT_MAX. Can we separate the tiny budget case and hard drop including the skb->truesize for normal buffer sizes?Maybe like this ? if (rcvbuf < UDP_MIN_RCVBUF) { if (rmem > rcvbuf) goto drop; } else { if (rmem + size > rcvbuf) goto drop; } SOCK_MIN_RCVBUF is 2K + skb since 2013, but the regression was reported after that in 2016, so UDP_MIN_RCVBUF would be more ?Since the only issue is the overflow, could use a higher bound like INT_MAX >> 1.quoted
But I wonder if adding new branches in the fast path is worth for the corner case, and that's why I chose integrating the cast into the exisintg branch, allowing a small overflow, which is observable only when no thread calls recv() and skbs are queued more than INT_MAX.Okay. Though it can probably be structured that the likely path does not even see this?
Ah exactly, will use this and update commit message in v2. Thanks!
if (rmem + size > rcvbuf) {
if (rcvbuf > INT_MAX << 1)
goto drop;
if (rmem > rcvbuf)
goto drop;
}