Thread (13 messages) 13 messages, 4 authors, 2026-02-26

Re: [PATCH RFC net-next 1/4] tcp: implement RFC 7323 window retraction receiver requirements

From: Stefano Brivio <hidden>
Date: 2026-02-25 21:33:43
Also in: linux-doc, linux-kselftest, lkml

On Tue, 24 Feb 2026 19:07:45 +0100
Simon Baatz [off-list ref] wrote:
Hi Stefano,

On Mon, Feb 23, 2026 at 11:26:40PM +0100, Stefano Brivio wrote:
quoted
Hi Simon,

It all makes sense to me at a quick look, I have just some nits and one
more substantial worry, below:

On Fri, 20 Feb 2026 00:55:14 +0100
Simon Baatz via B4 Relay [off-list ref] wrote:
  
quoted
From: Simon Baatz <redacted>

By default, the Linux TCP implementation does not shrink the
advertised window (RFC 7323 calls this "window retraction") with the
following exceptions:

- When an incoming segment cannot be added due to the receive buffer
  running out of memory. Since commit 8c670bdfa58e ("tcp: correct
  handling of extreme memory squeeze") a zero window will be
  advertised in this case. It turns out that reaching the required
  "memory pressure" is very easy when window scaling is in use. In the
  simplest case, sending a sufficient number of segments smaller than
  the scale factor to a receiver that does not read data is enough.

  Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
  happens much earlier than before, leading to regressions (the test
  suite of the Valkey project does not pass because of a TCP
  connection that is no longer bi-directional).  
Ouch. By the way, that same commit helped us unveil an issue (at least
in the sense of RFC 9293, 3.8.6) we fixed in passt:

  https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3  
This looks concerning: It seems as if just filling the advertised
window triggered the out of memory condition(?).
Right, even if it's not so much a general "out of memory" condition:
it's just that the socket might simply refuse to queue more data at
that point (we run out of window space, rather than memory).

Together with commit e2142825c120 ("net: tcp: send zero-window ACK when
no memory"), we will even get zero-window updates in that case. Jon
raised the issue here:

  https://lore.kernel.org/r/20240406182107.261472-3-jmaloy@redhat.com/ (local)

but it was not really fixed. Anyway:
Am I right in
assuming that this happened with the original 1d2fbaad7cd8, not the
relaxed version of tcp_can_ingest() from f017c1f768b?
...you're right. I wasn't even aware of f017c1f768b, thanks for
pointing that out. That seems to make things saner, and I don't expect
further issues at this point.

By the way of which, passt struggled talking to applications entirely
written in the 21st century. That's socat, I think started in 2001,
being used in Podman tests, and its only SO_RCVBUF-related fault is
that it uses the default 208 KiB value (from rmem_default) as a
starting value by... not doing anything.

Applications can set SO_RCVBUF and SO_SNDBUF to bigger values
(depending on rmem_max and wmem_max), but if they do, automatic tuning
of TCP buffer sizes (which allows exceeding rmem_max and wmem_max!) is
disabled. We used to do that in passt itself, and I eventually dropped
it here:

  https://passt.top/passt/commit/?id=71249ef3f9bcf1dbb2d6c13cdbc41ba88c794f06

because we might really need automatic tuning and the resulting big
buffers for high latency, high throughput connections.
quoted
quoted
- Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
  allowing the tcp window to shrink") addressed the "eating memory"
  problem by introducing a sysctl knob that allows shrinking the
  window before running out of memory.

However, RFC 7323 does not only state that shrinking the window is
necessary in some cases, it also formulates requirements for TCP
implementations when doing so (Section 2.4).

This commit addresses the receiver-side requirements: After retracting
the window, the peer may have a snd_nxt that lies within a previously
advertised window but is now beyond the retracted window. This means
that all incoming segments (including pure ACKs) will be rejected
until the application happens to read enough data to let the peer's
snd_nxt be in window again (which may be never).

To comply with RFC 7323, the receiver MUST honor any segment that
would have been in window for any ACK sent by the receiver and, when
window scaling is in effect, SHOULD track the maximum window sequence
number it has advertised. This patch tracks that maximum window
sequence number throughout the connection and uses it in
tcp_sequence() when deciding whether a segment is acceptable.
Acceptability of data is not changed.

Fixes: 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze")
Fixes: b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink")
Signed-off-by: Simon Baatz <redacted>
---
 Documentation/networking/net_cachelines/tcp_sock.rst       |  1 +
 include/linux/tcp.h                                        |  1 +
 include/net/tcp.h                                          | 14 ++++++++++++++
 net/ipv4/tcp_fastopen.c                                    |  1 +
 net/ipv4/tcp_input.c                                       |  6 ++++--
 net/ipv4/tcp_minisocks.c                                   |  1 +
 net/ipv4/tcp_output.c                                      | 12 ++++++++++++
 .../selftests/net/packetdrill/tcp_rcv_big_endseq.pkt       |  2 +-
 8 files changed, 35 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
index 563daea10d6c5c074f004cb1b8574f5392157abb..fecf61166a54ee2f64bcef5312c81dcc4aa9a124 100644
--- a/Documentation/networking/net_cachelines/tcp_sock.rst
+++ b/Documentation/networking/net_cachelines/tcp_sock.rst
@@ -121,6 +121,7 @@ u64                           delivered_mstamp        read_write
 u32                           rate_delivered                              read_mostly         tcp_rate_gen
 u32                           rate_interval_us                            read_mostly         rate_delivered,rate_app_limited
 u32                           rcv_wnd                 read_write          read_mostly         tcp_select_window,tcp_receive_window,tcp_fast_path_check
+u32                           rcv_mwnd_seq            read_write                              tcp_select_window
 u32                           write_seq               read_write                              tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
 u32                           notsent_lowat           read_mostly                             tcp_stream_memory_free
 u32                           pushed_seq              read_write                              tcp_mark_push,forced_push
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f72eef31fa23cc584f2f0cefacdc35cae43aa52d..5a943b12d4c050a980b4cf81635b9fa2f0036283 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -271,6 +271,7 @@ struct tcp_sock {
 	u32	lsndtime;	/* timestamp of last sent data packet (for restart window) */
 	u32	mdev_us;	/* medium deviation			*/
 	u32	rtt_seq;	/* sequence number to update rttvar	*/
+	u32	rcv_mwnd_seq; /* Maximum window sequence number (RFC 7323, section 2.4) */  
Nit: tab between ; and /* for consistency (I would personally prefer
the comment style as you see on 'highest_sack' but I don't think it's
enforced anymore).  
Thanks, I missed that.
 
quoted
Second nit: mentioning RFC 7323, section 2.4 could be a bit misleading
here because the relevant paragraph there covers a very specific case of
window retraction, caused by quantisation error from window scaling,
which is not the most common case here. I couldn't quickly find a better
reference though.  
I agree, but there is a part that, I think, is more generally
applicable:

2.4.  Addressing Window Retraction

   [ specific window retraction case introduction removed ]
   ... Implementations MUST ensure that they handle a shrinking
   window, as specified in Section 4.2.2.16 of [RFC1122].

   For the receiver, this implies that:

   1)  The receiver MUST honor, as in window, any segment that would
       have been in window for any <ACK> sent by the receiver.

   2)  When window scaling is in effect, the receiver SHOULD track the
       actual maximum window sequence number (which is likely to be
       greater than the window announced by the most recent <ACK>, if
       more than one segment has arrived since the application consumed
       any data in the receive buffer).

There is no "When window scaling is in effect," on the first
requirement. And it "happens" to be implementable by the second
requirement (with or without window scaling).
Right, I saw that, but the first requirement doesn't mention the
"actual maximum sequence number" which this new field represents.
I think an improvement could be to refer to the receiver requirements
specifically here.
Ah, yes, that sounds like a good idea.
quoted
More importantly: do we need to restore this on a connection that's
being dumped and recreated using TCP_REPAIR, or will things still work
(even though sub-optimally) if we lose this value?

Other window values that *need* to be dumped and restored are currently
available via TCP_REPAIR_WINDOW socket option, and they are listed in
do_tcp_getsockopt(), net/ipv4/tcp.c:

		opt.snd_wl1	= tp->snd_wl1;
		opt.snd_wnd	= tp->snd_wnd;
		opt.max_window	= tp->max_window;
		opt.rcv_wnd	= tp->rcv_wnd;
		opt.rcv_wup	= tp->rcv_wup;

CRIU uses it to checkpoint and restore established connections, and
passt uses it to migrate them to a different host:

  https://criu.org/TCP_connection

  https://passt.top/passt/tree/tcp.c?id=02af38d4177550c086bae54246fc3aaa33ddc018#n3063

If it's strictly needed to preserve functionality, we would need to add
it to struct tcp_repair_window, notify CRIU maintainers (or send them a
patch), and add this in passt as well (I can take care of it).  
Thanks for the pointer, I missed that tp->rcv_wnd update.  Could the
following happen when checkpointing/restoring?

1. A client app opens a connection and writes (blocking) a specific amount
   of data before doing any reads.  (Not very clever, but this is
   supposed to work; this is what caused the problem in the Valkey
   tests.)
2. The traffic pattern causes an out-of-memory condition for the
   receive buffer; we see the RWIN 0 segments that do not ack the
   last data segment(s).
3. TCP connection is checkpointed and restored (on the client side) without
   restoring rcv_mwnd_seq.
4. If the receive buffer is still full at the new location, the
   acceptable sequence numbers in the receive window will not change
   (restored client is still blocked on write) and we no longer have
   the larger max receive window -> the client's kernel will reject
   all incoming packets and the connection is stuck.

If this scenario is possible, I'd argue that rcv_mwnd_seq is
necessary.
It really sounds like a corner case, especially 1. in combination with
2., but the outcome would be pretty bad, and I think it's possible.

Typically, once the connection is restored (with TCP_REPAIR_OFF, not
with TCP_REPAIR_OFF_NO_WP), the kernel sends out an empty segment as a
window probe / keepalive, but as far as I understand that wouldn't be
enough to fix the situation. And even if it did, we still have the
TCP_REPAIR_OFF_NO_WP case, even though I'm not aware of any usage.
quoted
Strictly speaking, in case, this could be considered a breaking change
for userspace, but I don't see how to avoid it, so I'd just make sure
it doesn't impact users as TCP_REPAIR has just a couple of (known!)
projects relying on it.

An alternative would be to have a special, initial value representing
the fact that this value was lost, but it looks really annoying to not
be able to use a u32 for it.  
Do we need a dedicated value indicating that rcv_mwnd_seq is not
present, or is it enough to choose an initial rcv_mwnd_seq based on
the size of the struct passed?  Both seem doable to me:

Missing: Initialize rcv_mwnd_seq = rcv_wup + rcv_wnd (possibly
leading to the problem described above, of course)
Well but if we might run into the problem described above, we need to
dump / restore rcv_mwnd_seq in any case, so we wouldn't have an issue
at all.

Except for a compatibility issue, but what you describe looks like a
reasonable fallback.
Default value 0: Store how much we retracted the window, i.e. 
rcv_mwnd_seq - (rcv_wup + rcv_wnd). 0 means the window was not
retracted and could double as the "we don't know" value.

For the time being, I will just initialize rcv_mwnd_seq to rcv_wup +
rcv_wnd in tcp_repair_set_window() to keep status quo. Of course,
I am happy to discuss enhancements.
That makes sense to me at a glance, but I should still review / test it
as a whole though.
quoted
Disregard all this if the correct value is not strictly needed for
functionality, of course. I haven't tested things (not yet, at least).
  
quoted
 	u64	tcp_wstamp_ns;	/* departure time for next sent data packet */
 	u64	accecn_opt_tstamp;	/* Last AccECN option sent timestamp */
 	struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 40e72b9cb85f08714d3f458c0bd1402a5fb1eb4e..e1944d504823d5f8754d85bfbbf3c9630d2190ac 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -912,6 +912,20 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp)
 	return (u32) win;
 }
 
+/* Compute the maximum receive window we ever advertised.
+ * Rcv_nxt can be after the window if our peer push more data  
s/push/pushes/

s/Rcv_nxt/rcv_nxt/ (useful for grepping)  
tcp_max_receive_window() is an adapted copy of
tcp_receive_window() above. But it makes sense to improve it.
Ah, sorry, I didn't notice.
quoted
quoted
+ * than the offered window.
+ */
+static inline u32 tcp_max_receive_window(const struct tcp_sock *tp)
+{
+	s32 win = tp->rcv_mwnd_seq - tp->rcv_nxt;
+
+	if (win < 0)
+		win = 0;  
I must be missing something but... if the sequence is about to wrap,
we'll return 0 here. Is that intended?

Doing the subtraction unsigned would have looked more natural to me,
but I didn't really think it through.  
The substraction is unsigned and the outcome is interpreted as
signed. And as mentioned, it is copied with pride ;-)
Oh, wow, I mean, "of course"! How could anybody ever miss that! Pride,
you say. :) ...but sure, if it's taken from there, it makes sense to
keep it like that I guess.
quoted
quoted
+	return (u32) win;  
Kernel coding style doesn't usually include a space between cast and
identifier.  
Yes, same reason as above and I will change it.
-- 
Stefano
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help