Re: [PATCH RFC net-next 1/4] tcp: implement RFC 7323 window retraction receiver requirements
From: Stefano Brivio <hidden>
Date: 2026-02-25 21:33:43
Also in:
linux-doc, linux-kselftest, lkml
On Tue, 24 Feb 2026 19:07:45 +0100 Simon Baatz [off-list ref] wrote:
Hi Stefano, On Mon, Feb 23, 2026 at 11:26:40PM +0100, Stefano Brivio wrote:quoted
Hi Simon, It all makes sense to me at a quick look, I have just some nits and one more substantial worry, below: On Fri, 20 Feb 2026 00:55:14 +0100 Simon Baatz via B4 Relay [off-list ref] wrote:quoted
From: Simon Baatz <redacted> By default, the Linux TCP implementation does not shrink the advertised window (RFC 7323 calls this "window retraction") with the following exceptions: - When an incoming segment cannot be added due to the receive buffer running out of memory. Since commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze") a zero window will be advertised in this case. It turns out that reaching the required "memory pressure" is very easy when window scaling is in use. In the simplest case, sending a sufficient number of segments smaller than the scale factor to a receiver that does not read data is enough. Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this happens much earlier than before, leading to regressions (the test suite of the Valkey project does not pass because of a TCP connection that is no longer bi-directional).Ouch. By the way, that same commit helped us unveil an issue (at least in the sense of RFC 9293, 3.8.6) we fixed in passt: https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3This looks concerning: It seems as if just filling the advertised window triggered the out of memory condition(?).
Right, even if it's not so much a general "out of memory" condition:
it's just that the socket might simply refuse to queue more data at
that point (we run out of window space, rather than memory).
Together with commit e2142825c120 ("net: tcp: send zero-window ACK when
no memory"), we will even get zero-window updates in that case. Jon
raised the issue here:
https://lore.kernel.org/r/20240406182107.261472-3-jmaloy@redhat.com/ (local)
but it was not really fixed. Anyway:
Am I right in assuming that this happened with the original 1d2fbaad7cd8, not the relaxed version of tcp_can_ingest() from f017c1f768b?
...you're right. I wasn't even aware of f017c1f768b, thanks for pointing that out. That seems to make things saner, and I don't expect further issues at this point. By the way of which, passt struggled talking to applications entirely written in the 21st century. That's socat, I think started in 2001, being used in Podman tests, and its only SO_RCVBUF-related fault is that it uses the default 208 KiB value (from rmem_default) as a starting value by... not doing anything. Applications can set SO_RCVBUF and SO_SNDBUF to bigger values (depending on rmem_max and wmem_max), but if they do, automatic tuning of TCP buffer sizes (which allows exceeding rmem_max and wmem_max!) is disabled. We used to do that in passt itself, and I eventually dropped it here: https://passt.top/passt/commit/?id=71249ef3f9bcf1dbb2d6c13cdbc41ba88c794f06 because we might really need automatic tuning and the resulting big buffers for high latency, high throughput connections.
quoted
quoted
- Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink") addressed the "eating memory" problem by introducing a sysctl knob that allows shrinking the window before running out of memory. However, RFC 7323 does not only state that shrinking the window is necessary in some cases, it also formulates requirements for TCP implementations when doing so (Section 2.4). This commit addresses the receiver-side requirements: After retracting the window, the peer may have a snd_nxt that lies within a previously advertised window but is now beyond the retracted window. This means that all incoming segments (including pure ACKs) will be rejected until the application happens to read enough data to let the peer's snd_nxt be in window again (which may be never). To comply with RFC 7323, the receiver MUST honor any segment that would have been in window for any ACK sent by the receiver and, when window scaling is in effect, SHOULD track the maximum window sequence number it has advertised. This patch tracks that maximum window sequence number throughout the connection and uses it in tcp_sequence() when deciding whether a segment is acceptable. Acceptability of data is not changed. Fixes: 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze") Fixes: b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink") Signed-off-by: Simon Baatz <redacted> --- Documentation/networking/net_cachelines/tcp_sock.rst | 1 + include/linux/tcp.h | 1 + include/net/tcp.h | 14 ++++++++++++++ net/ipv4/tcp_fastopen.c | 1 + net/ipv4/tcp_input.c | 6 ++++-- net/ipv4/tcp_minisocks.c | 1 + net/ipv4/tcp_output.c | 12 ++++++++++++ .../selftests/net/packetdrill/tcp_rcv_big_endseq.pkt | 2 +- 8 files changed, 35 insertions(+), 3 deletions(-)diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst index 563daea10d6c5c074f004cb1b8574f5392157abb..fecf61166a54ee2f64bcef5312c81dcc4aa9a124 100644 --- a/Documentation/networking/net_cachelines/tcp_sock.rst +++ b/Documentation/networking/net_cachelines/tcp_sock.rst@@ -121,6 +121,7 @@ u64 delivered_mstamp read_write u32 rate_delivered read_mostly tcp_rate_gen u32 rate_interval_us read_mostly rate_delivered,rate_app_limited u32 rcv_wnd read_write read_mostly tcp_select_window,tcp_receive_window,tcp_fast_path_check +u32 rcv_mwnd_seq read_write tcp_select_window u32 write_seq read_write tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push u32 notsent_lowat read_mostly tcp_stream_memory_free u32 pushed_seq read_write tcp_mark_push,forced_pushdiff --git a/include/linux/tcp.h b/include/linux/tcp.h index f72eef31fa23cc584f2f0cefacdc35cae43aa52d..5a943b12d4c050a980b4cf81635b9fa2f0036283 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h@@ -271,6 +271,7 @@ struct tcp_sock { u32 lsndtime; /* timestamp of last sent data packet (for restart window) */ u32 mdev_us; /* medium deviation */ u32 rtt_seq; /* sequence number to update rttvar */ + u32 rcv_mwnd_seq; /* Maximum window sequence number (RFC 7323, section 2.4) */Nit: tab between ; and /* for consistency (I would personally prefer the comment style as you see on 'highest_sack' but I don't think it's enforced anymore).Thanks, I missed that.quoted
Second nit: mentioning RFC 7323, section 2.4 could be a bit misleading here because the relevant paragraph there covers a very specific case of window retraction, caused by quantisation error from window scaling, which is not the most common case here. I couldn't quickly find a better reference though.I agree, but there is a part that, I think, is more generally applicable: 2.4. Addressing Window Retraction [ specific window retraction case introduction removed ] ... Implementations MUST ensure that they handle a shrinking window, as specified in Section 4.2.2.16 of [RFC1122]. For the receiver, this implies that: 1) The receiver MUST honor, as in window, any segment that would have been in window for any <ACK> sent by the receiver. 2) When window scaling is in effect, the receiver SHOULD track the actual maximum window sequence number (which is likely to be greater than the window announced by the most recent <ACK>, if more than one segment has arrived since the application consumed any data in the receive buffer). There is no "When window scaling is in effect," on the first requirement. And it "happens" to be implementable by the second requirement (with or without window scaling).
Right, I saw that, but the first requirement doesn't mention the "actual maximum sequence number" which this new field represents.
I think an improvement could be to refer to the receiver requirements specifically here.
Ah, yes, that sounds like a good idea.
quoted
More importantly: do we need to restore this on a connection that's being dumped and recreated using TCP_REPAIR, or will things still work (even though sub-optimally) if we lose this value? Other window values that *need* to be dumped and restored are currently available via TCP_REPAIR_WINDOW socket option, and they are listed in do_tcp_getsockopt(), net/ipv4/tcp.c: opt.snd_wl1 = tp->snd_wl1; opt.snd_wnd = tp->snd_wnd; opt.max_window = tp->max_window; opt.rcv_wnd = tp->rcv_wnd; opt.rcv_wup = tp->rcv_wup; CRIU uses it to checkpoint and restore established connections, and passt uses it to migrate them to a different host: https://criu.org/TCP_connection https://passt.top/passt/tree/tcp.c?id=02af38d4177550c086bae54246fc3aaa33ddc018#n3063 If it's strictly needed to preserve functionality, we would need to add it to struct tcp_repair_window, notify CRIU maintainers (or send them a patch), and add this in passt as well (I can take care of it).Thanks for the pointer, I missed that tp->rcv_wnd update. Could the following happen when checkpointing/restoring? 1. A client app opens a connection and writes (blocking) a specific amount of data before doing any reads. (Not very clever, but this is supposed to work; this is what caused the problem in the Valkey tests.) 2. The traffic pattern causes an out-of-memory condition for the receive buffer; we see the RWIN 0 segments that do not ack the last data segment(s). 3. TCP connection is checkpointed and restored (on the client side) without restoring rcv_mwnd_seq. 4. If the receive buffer is still full at the new location, the acceptable sequence numbers in the receive window will not change (restored client is still blocked on write) and we no longer have the larger max receive window -> the client's kernel will reject all incoming packets and the connection is stuck. If this scenario is possible, I'd argue that rcv_mwnd_seq is necessary.
It really sounds like a corner case, especially 1. in combination with 2., but the outcome would be pretty bad, and I think it's possible. Typically, once the connection is restored (with TCP_REPAIR_OFF, not with TCP_REPAIR_OFF_NO_WP), the kernel sends out an empty segment as a window probe / keepalive, but as far as I understand that wouldn't be enough to fix the situation. And even if it did, we still have the TCP_REPAIR_OFF_NO_WP case, even though I'm not aware of any usage.
quoted
Strictly speaking, in case, this could be considered a breaking change for userspace, but I don't see how to avoid it, so I'd just make sure it doesn't impact users as TCP_REPAIR has just a couple of (known!) projects relying on it. An alternative would be to have a special, initial value representing the fact that this value was lost, but it looks really annoying to not be able to use a u32 for it.Do we need a dedicated value indicating that rcv_mwnd_seq is not present, or is it enough to choose an initial rcv_mwnd_seq based on the size of the struct passed? Both seem doable to me: Missing: Initialize rcv_mwnd_seq = rcv_wup + rcv_wnd (possibly leading to the problem described above, of course)
Well but if we might run into the problem described above, we need to dump / restore rcv_mwnd_seq in any case, so we wouldn't have an issue at all. Except for a compatibility issue, but what you describe looks like a reasonable fallback.
Default value 0: Store how much we retracted the window, i.e. rcv_mwnd_seq - (rcv_wup + rcv_wnd). 0 means the window was not retracted and could double as the "we don't know" value. For the time being, I will just initialize rcv_mwnd_seq to rcv_wup + rcv_wnd in tcp_repair_set_window() to keep status quo. Of course, I am happy to discuss enhancements.
That makes sense to me at a glance, but I should still review / test it as a whole though.
quoted
Disregard all this if the correct value is not strictly needed for functionality, of course. I haven't tested things (not yet, at least).quoted
u64 tcp_wstamp_ns; /* departure time for next sent data packet */ u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */ struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */diff --git a/include/net/tcp.h b/include/net/tcp.h index 40e72b9cb85f08714d3f458c0bd1402a5fb1eb4e..e1944d504823d5f8754d85bfbbf3c9630d2190ac 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h@@ -912,6 +912,20 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp) return (u32) win; } +/* Compute the maximum receive window we ever advertised. + * Rcv_nxt can be after the window if our peer push more datas/push/pushes/ s/Rcv_nxt/rcv_nxt/ (useful for grepping)tcp_max_receive_window() is an adapted copy of tcp_receive_window() above. But it makes sense to improve it.
Ah, sorry, I didn't notice.
quoted
quoted
+ * than the offered window. + */ +static inline u32 tcp_max_receive_window(const struct tcp_sock *tp) +{ + s32 win = tp->rcv_mwnd_seq - tp->rcv_nxt; + + if (win < 0) + win = 0;I must be missing something but... if the sequence is about to wrap, we'll return 0 here. Is that intended? Doing the subtraction unsigned would have looked more natural to me, but I didn't really think it through.The substraction is unsigned and the outcome is interpreted as signed. And as mentioned, it is copied with pride ;-)
Oh, wow, I mean, "of course"! How could anybody ever miss that! Pride, you say. :) ...but sure, if it's taken from there, it makes sense to keep it like that I guess.
quoted
quoted
+ return (u32) win;Kernel coding style doesn't usually include a space between cast and identifier.Yes, same reason as above and I will change it.
-- Stefano