Re: [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.

From: Kuniyuki Iwashima <hidden>
Date: 2025-03-25 17:54:25

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Tue, 25 Mar 2025 10:18:30 -0400

Kuniyuki Iwashima wrote:

quoted

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Mon, 24 Mar 2025 15:44:40 -0400

quoted

Kuniyuki Iwashima wrote:

quoted

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Mon, 24 Mar 2025 10:59:49 -0400

quoted

Kuniyuki Iwashima wrote:

quoted

__udp_enqueue_schedule_skb() has the following condition:

  if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
          goto drop;

sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
or SO_RCVBUFFORCE.

If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
as sk->sk_rmem_alloc is also signed int.

Then, the size of the incoming skb is added to sk->sk_rmem_alloc
unconditionally.

This results in integer overflow (possibly multiple times) on
sk->sk_rmem_alloc and allows a single socket to have skb up to
net.core.udp_mem[1].

For example, if we set a large value to udp_mem[1] and INT_MAX to
sk->sk_rcvbuf and flood packets to the socket, we can see multiple
overflows:

  # cat /proc/net/sockstat | grep UDP:
  UDP: inuse 3 mem 7956736  <-- (7956736 << 12) bytes > INT_MAX * 15
                                             ^- PAGE_SHIFT
  # ss -uam
  State  Recv-Q      ...
  UNCONN -1757018048 ...    <-- flipping the sign repeatedly
         skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)

Previously, we had a boundary check for INT_MAX, which was removed by
commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").

A complete fix would be to revert it and cap the right operand by
INT_MAX:

  rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
  if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
          goto uncharge_drop;

but we do not want to add the expensive atomic_add_return() back just
for the corner case.

So, let's perform the first check as unsigned int to detect the
integer overflow.

Note that we still allow a single wraparound, which can be observed
from userspace, but it's acceptable considering it's unlikely that
no recv() is called for a long period, and the negative value will
soon flip back to positive after a few recv() calls.

Can we do better than this?

Another approach I had in mind was to restore the original validation
under the recvq lock but without atomic ops like

  1. add another u32 as union of sk_rmem_alloc (only for UDP)
  2. access it with READ_ONCE() or under the recvq lock
  3. perform the validation under the lock

But it requires more changes around the error queue handling and
the general socket impl, so will be too invasive for net.git but
maybe worth a try for net-next ?

Definitely not net material. Adding more complexity here
would also need some convincing benchmark data probably.

quoted

Is this because of the "Always allow at least one packet" below, and
due to testing the value of the counter without skb->truesize added?

Yes, that's the reason although we don't receive a single >INT_MAX
packet.

I was surprised that we don't take the current skb size into
account when doing this calculation.

Turns out that this code used to do that.

commit 363dc73acacb ("udp: be less conservative with sock rmem
accounting") made this change:

-       if (rmem && (rmem + size > sk->sk_rcvbuf))
+       if (rmem > sk->sk_rcvbuf)
                goto drop;

The special consideration to allow one packet is to avoid starvation
with small rcvbuf, judging also from this review comment:

https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/ (local)

Interesting, thanks for the info !

Now it's allowed to exceed by the total size of the incoming skb
on every CPUs, and a user may notice that rmem > rcvbuf via ss,
but I guess it's allowed because the fast recovery is expected.

quoted

That clearly doesn't apply when rcvbuf is near INT_MAX.
Can we separate the tiny budget case and hard drop including the
skb->truesize for normal buffer sizes?

Maybe like this ?

        if (rcvbuf < UDP_MIN_RCVBUF) {
                if (rmem > rcvbuf)
                        goto drop;
        } else {
                if (rmem + size > rcvbuf)
                        goto drop;
        }

SOCK_MIN_RCVBUF is 2K + skb since 2013, but the regression was
reported after that in 2016, so UDP_MIN_RCVBUF would be more ?

Since the only issue is the overflow, could use a higher bound like
INT_MAX >> 1.

quoted

But I wonder if adding new branches in the fast path is worth for
the corner case, and that's why I chose integrating the cast into
the exisintg branch, allowing a small overflow, which is observable
only when no thread calls recv() and skbs are queued more than INT_MAX.

Okay. Though it can probably be structured that the likely path does
not even see this?

Ah exactly, will use this and update commit message in v2.

Thanks!

    if (rmem + size > rcvbuf) {
            if (rcvbuf > INT_MAX << 1)
                    goto drop;
            if (rmem > rcvbuf)
                    goto drop;
    }

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help