Thread (60 messages) 60 messages, 8 authors, 2021-08-10

Re: [PATCH v5 net-next 01/36] net: Introduce direct data placement tcp offload

From: Boris Pismenny <hidden>
Date: 2021-07-22 13:42:24
Also in: linux-nvme

On 22/07/2021 16:10, Eric Dumazet wrote:
On Thu, Jul 22, 2021 at 2:18 PM Boris Pismenny [off-list ref] wrote:
quoted
On 22/07/2021 14:26, Eric Dumazet wrote:
quoted
On Thu, Jul 22, 2021 at 1:04 PM Boris Pismenny [off-list ref] wrote:
quoted
From: Boris Pismenny <redacted>
...
quoted
 };

 const char
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6ca5a1f3b59..4a7160bba09b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5149,6 +5149,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
                memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
 #ifdef CONFIG_TLS_DEVICE
                nskb->decrypted = skb->decrypted;
+#endif
+#ifdef CONFIG_ULP_DDP
+               nskb->ddp_crc = skb->ddp_crc;
Probably you do not want to attempt any collapse if skb->ddp_crc is
set right there.
Right.
quoted
quoted
 #endif
                TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
                if (list)
@@ -5182,6 +5185,11 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 #ifdef CONFIG_TLS_DEVICE
                                if (skb->decrypted != nskb->decrypted)
                                        goto end;
+#endif
+#ifdef CONFIG_ULP_DDP
+
+                               if (skb->ddp_crc != nskb->ddp_crc)
This checks only the second, third, and remaining skbs.
Right, as we handle the head skb above. Could you clarify?
I was simply saying you missed the first skb.
But, the first SKB got handled in the change above. The code here is the
same for TLS, if it is wrong, then we already have an issue here.
quoted
quoted
quoted
+                                       goto end;
 #endif
                        }
                }

tcp_collapse() is copying data from small skbs to pack it to bigger
skb (one page of payload), in case
of memory emergency/pressure (socket queues are full)

If your changes are trying to avoid 'needless'  copies, maybe you
should reconsider and let the emergency packing be done.

If the copy is not _possible_, you should rephrase your changelog to
clearly state the kernel _cannot_ access this memory in any way.
The issue is that skb_condense also gets called on many skbs in
tcp_add_backlog and it will identify skbs that went through DDP as ideal
for packing, even though they are not small and packing is
counter-productive as data already resides in its destination.

As mentioned above, it is possible to copy, but it is counter-productive
in this case. If there was a real need to access this memory, then it is
allowed.
Standard GRO packets from high perf drivers have no room in their
skb->head (ie skb_tailroom() should be 0)

If you have a driver using GRO and who pulled some payload in
skb->head, it is already too late for DDP.

So I think you are trying to add code in TCP that should not be
needed. Perhaps mlx5 driver is doing something it should not ?
(If this is ' copybreak'  this has been documented as being
suboptimal, transports have better strategies)

Secondly, tcp_collapse() should absolutely not be called under regular
workloads.

Trying to optimize this last-resort thing is a lost cause:
If an application is dumb enough to send small packets back-to-back,
it should be fixed (sender side has this thing called autocork, for
applications that do not know about MSG_MORE or TC_CORK.)

(tcp_collapse is a severe source of latencies)

Sorry. My response above was about skb_condense which I've confused with
tcp_collapse.

In tcp_collapse, we could allow the copy, but the problem is CRC, which
like TLS's skb->decrypted marks that the data passed the digest
validation in the NIC. If we allow collapsing SKBs with mixed marks, we
will need to force software copy+crc verification. As TCP collapse is
indeed rare and the offload is opportunistic in nature, we can make this
change and submit another version, but I'm confused; why was it OK for
TLS, while it is not OK for DDP+CRC?
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help