Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

From: Neal Cardwell <ncardwell@google.com>
Date: 2023-05-22 15:05:18
Also in: lkml

On Sun, May 21, 2023 at 10:55 PM Menglong Dong [off-list ref] wrote:

On Sat, May 20, 2023 at 10:28 PM Neal Cardwell [off-list ref] wrote:

quoted

On Sat, May 20, 2023 at 5:08 AM Menglong Dong [off-list ref] wrote:

quoted

On Fri, May 19, 2023 at 12:03 AM Neal Cardwell [off-list ref] wrote:

quoted

On Thu, May 18, 2023 at 10:12 AM Menglong Dong [off-list ref] wrote:

quoted

On Thu, May 18, 2023 at 9:40 PM Neal Cardwell [off-list ref] wrote:

quoted

On Wed, May 17, 2023 at 10:35 PM Menglong Dong [off-list ref] wrote:

quoted

On Wed, May 17, 2023 at 10:47 PM Eric Dumazet [off-list ref] wrote:

quoted

On Wed, May 17, 2023 at 2:42 PM [off-list ref] wrote:

quoted

From: Menglong Dong <redacted>

Window shrink is not allowed and also not handled for now, but it's
needed in some case.

In the origin logic, 0 probe is triggered only when there is no any
data in the retrans queue and the receive window can't hold the data
of the 1th packet in the send queue.

Now, let's change it and trigger the 0 probe in such cases:

- if the retrans queue has data and the 1th packet in it is not within
the receive window
- no data in the retrans queue and the 1th packet in the send queue is
out of the end of the receive window

Sorry, I do not understand.

Please provide packetdrill tests for new behavior like that.

Yes. The problem can be reproduced easily.

1. choose a server machine, decrease it's tcp_mem with:
    echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
2. call listen() and accept() on a port, such as 8888. We call
    accept() looply and without call recv() to make the data stay
    in the receive queue.
3. choose a client machine, and create 100 TCP connection
    to the 8888 port of the server. Then, every connection sends
    data about 1M.
4. we can see that some of the connection enter the 0-probe
    state, but some of them keep retrans again and again. As
    the server is up to the tcp_mem[2] and skb is dropped before
    the recv_buf full and the connection enter 0-probe state.
    Finially, some of these connection will timeout and break.

With this series, all the 100 connections will enter 0-probe
status and connection break won't happen. And the data
trans will recover if we increase tcp_mem or call 'recv()'
on the sockets in the server.

quoted

Also, such fundamental change would need IETF discussion first.
We do not want linux to cause network collapses just because billions
of devices send more zero probes.

I think it maybe a good idea to make the connection enter
0-probe, rather than drop the skb silently. What 0-probe
meaning is to wait for space available when the buffer of the
receive queue is full. And maybe we can also use 0-probe
when the "buffer" of "TCP protocol" (which means tcp_mem)
is full?

Am I right?

Thanks!
Menglong Dong

Thanks for describing the scenario in more detail. (Some kind of
packetdrill script or other program to reproduce this issue would be
nice, too, as Eric noted.)

You mention in step (4.) above that some of the connections keep
retransmitting again and again. Are those connections receiving any
ACKs in response to their retransmissions? Perhaps they are receiving
dupacks?

Actually, these packets are dropped without any reply, even dupacks.
skb will be dropped directly when tcp_try_rmem_schedule()
fails in tcp_data_queue(). That's reasonable, as it's
useless to reply a ack to the sender, which will cause the sender
fast retrans the packet, because we are out of memory now, and
retrans can't solve the problem.

I'm not sure I see the problem. If retransmits can't solve the
problem, then why are you proposing that data senders keep
retransmitting forever (via 0-window-probes) in this kind of scenario?

Because the connection will break if the count of
retransmits up to tcp_retires2, but probe-0 can keep
for a long time.

I see. So it sounds like you agree that retransmits can solve the
problem, as long as the retransmits are using the zero-window probe
state machine (ICSK_TIME_PROBE0, tcp_probe_timer()), which continues
as long as the receiver is sending ACKs. And it sounds like when you
said "retrans can't solve the problem" you didn't literally mean that
retransmits can't solve the problem, but rather you meant that the RTO
state machine, specifically (ICSK_TIME_RETRANS,
tcp_retransmit_timer(), etc) can't solve the problem. I agree with
that assessment that in this scenario tcp_probe_timer() seems like a
solution but tcp_retransmit_timer() does not.

Yes, that is indeed what I want to express.

quoted

A single dupack without SACK blocks will not cause the sender to fast
retransmit. (Only 3 dupacks would trigger fast retransmit.)

Three or more dupacks without SACK blocks will cause the sender to
fast retransmit the segment above SND.UNA once if the sender doesn't
have SACK support. But in this case AFAICT fast-retransmitting once is
a fine strategy, since the sender should keep retrying transmits (with
backoff) until the receiver potentially has memory available to
receive the packet.

quoted

If so, then perhaps we could solve this problem without
depending on a violation of the TCP spec (which says the receive
window should not be retracted) in the following way: when a data
sender suffers a retransmission timeout, and retransmits the first
unacknowledged segment, and receives a dupack for SND.UNA instead of
an ACK covering the RTO-retransmitted segment, then the data sender
should estimate that the receiver doesn't have enough memory to buffer
the retransmitted packet. In that case, the data sender should enter
the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
call tcp_probe_timer().

Basically we could try to enhance the sender-side logic to try to
distinguish between two kinds of problems:

(a) Repeated data packet loss caused by congestion, routing problems,
or connectivity problems. In this case, the data sender uses
ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
retries sysctl_tcp_retries2 times before timing out the connection

(b) A receiver that is repeatedly sending dupacks but not ACKing
retransmitted data because it doesn't have any memory. In this case,
the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
off but keeps retrying as long as the data sender receives ACKs.

I'm not sure if this is an ideal method, as it may be not rigorous
to conclude that the receiver is oom with dupacks. A packet can
loss can also cause multi dupacks.

When a data sender suffers an RTO and retransmits a single data
packet, it would be very rare for the data sender to receive multiple
pure dupacks without SACKs. This would only happen in the rare case
where (a) the connection did not have SACK enabled, and (b) there was
a hole in the received sequence space and there were still packets in
flight when the (spurioius) RTO fired.

But if we want to be paranoid, then this new response could be written
to only trigger if SACK is enabled (the vast, vast majority of cases).
If SACK is enabled, and an RTO of a data packet starting at sequence
S1 results in the receiver sending only a dupack for S1 without SACK
blocks, then this clearly shows the issue is not packet loss but
suggests a receiver unable to buffer the given data packet, AFAICT.

Yeah, you are right on this point, multi pure dupacks can
mean out of memory of the receiver. But we still need to
know if the receiver recovers from OOM. Without window
shrink, the window in the ack of zero-window probe packet
is not zero on OOM.

But do we need a protocol-violating zero-window in this case? Why not
use my approach suggested above: conveying the OOM condition by
sending an ACK but not ACKing the retransmitted packet?

I agree with you about the approach you mentioned
about conveying the OOM condition. But that approach
can't convey the recovery from OOM, can it?

Yes, my suggested approach can convey the recovery from OOM. The data
receiver conveys the recovery from OOM by buffering and ACKing the
retransmitted data packet.

Let's see the process. With 3 pure dupack for SND.UNA,
we deem the OOM of the receiver and make the sender
enter zero-window probe state.

AFAICT the data sender does not need to wait for 3 pure dpacks for
SND.UNA. AFAICT a data sender that suffers an RTO and finds that its
RTO gets a response that is a single dupack for SND.UNA could estimate
that the receiver is OOM, and enter the zero-window probe state.

The sender will keep sending probe0 packets, and the
receiver will reply an ack. However, as we don't
shrink the window actually, the window in the ack is
not zero on OOM, so we can't know if the receiver has
recovered from OOM and retransmit the data in retransmit
queue.

As I noted above, in my proposal the data receiver conveys the
recovery from OOM by buffering and ACKing the retransmitted data
packet.

BTW, the probe0 will send the last byte that was already
acked, so the ack of the probe0 will be a pure dupack.

Did I miss something?

I don't think that's the case. My read of tcp_write_wakeup() is that
it will send an skb that is whatever prefix of 1 MSS is allowed by the
receiver window. In the scenario we are discussing, that would mean
that it sends a full 1 MSS. So AFAICT tcp_write_wakeup() could be used
for sending a data "probe" packet for this OOM case.

BTW, a previous patch has explained the need to
support window shrink, which should satisfy the RFC
of TCP protocol:

https://lore.kernel.org/netdev/20230308053353.675086-1-mfreemon@cloudflare.com/ (local)

Let's see what Eric thinks about that patch.

neal

Thanks!
Menglong Dong

quoted

Thanks,
neal

quoted

Hi, Eric and kuba, do you have any comments on this
case?

Thanks!
Menglong Dong

quoted

thanks,
neal

quoted

Thanks!
Menglong Dong

quoted

AFAICT that would be another way to reach the happy state you mention:
"all the 100 connections will enter 0-probe status and connection
break won't happen", and we could reach that state without violating
the TCP protocol spec and without requiring changes on the receiver
side (so that this fix could help in scenarios where the
memory-constrained receiver is an older stack without special new
behavior).

Eric, Yuchung, Menglong: do you think something like that would work?

neal

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help