Re: [PATCH v5 net-next 01/36] net: Introduce direct data placement tcp offload
From: Boris Pismenny <hidden>
Date: 2021-07-22 13:42:24
Also in:
linux-nvme
On 22/07/2021 16:10, Eric Dumazet wrote:
On Thu, Jul 22, 2021 at 2:18 PM Boris Pismenny [off-list ref] wrote:quoted
On 22/07/2021 14:26, Eric Dumazet wrote:quoted
On Thu, Jul 22, 2021 at 1:04 PM Boris Pismenny [off-list ref] wrote:quoted
From: Boris Pismenny <redacted>...quoted
}; const chardiff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index e6ca5a1f3b59..4a7160bba09b 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c@@ -5149,6 +5149,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, memcpy(nskb->cb, skb->cb, sizeof(skb->cb)); #ifdef CONFIG_TLS_DEVICE nskb->decrypted = skb->decrypted; +#endif +#ifdef CONFIG_ULP_DDP + nskb->ddp_crc = skb->ddp_crc;Probably you do not want to attempt any collapse if skb->ddp_crc is set right there.Right.quoted
quoted
#endif TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start; if (list)@@ -5182,6 +5185,11 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, #ifdef CONFIG_TLS_DEVICE if (skb->decrypted != nskb->decrypted) goto end; +#endif +#ifdef CONFIG_ULP_DDP + + if (skb->ddp_crc != nskb->ddp_crc)This checks only the second, third, and remaining skbs.Right, as we handle the head skb above. Could you clarify?I was simply saying you missed the first skb.
But, the first SKB got handled in the change above. The code here is the same for TLS, if it is wrong, then we already have an issue here.
quoted
quoted
quoted
+ goto end; #endif } }tcp_collapse() is copying data from small skbs to pack it to bigger skb (one page of payload), in case of memory emergency/pressure (socket queues are full) If your changes are trying to avoid 'needless' copies, maybe you should reconsider and let the emergency packing be done. If the copy is not _possible_, you should rephrase your changelog to clearly state the kernel _cannot_ access this memory in any way.The issue is that skb_condense also gets called on many skbs in tcp_add_backlog and it will identify skbs that went through DDP as ideal for packing, even though they are not small and packing is counter-productive as data already resides in its destination. As mentioned above, it is possible to copy, but it is counter-productive in this case. If there was a real need to access this memory, then it is allowed.Standard GRO packets from high perf drivers have no room in their skb->head (ie skb_tailroom() should be 0) If you have a driver using GRO and who pulled some payload in skb->head, it is already too late for DDP. So I think you are trying to add code in TCP that should not be needed. Perhaps mlx5 driver is doing something it should not ? (If this is ' copybreak' this has been documented as being suboptimal, transports have better strategies) Secondly, tcp_collapse() should absolutely not be called under regular workloads. Trying to optimize this last-resort thing is a lost cause: If an application is dumb enough to send small packets back-to-back, it should be fixed (sender side has this thing called autocork, for applications that do not know about MSG_MORE or TC_CORK.) (tcp_collapse is a severe source of latencies)
Sorry. My response above was about skb_condense which I've confused with tcp_collapse. In tcp_collapse, we could allow the copy, but the problem is CRC, which like TLS's skb->decrypted marks that the data passed the digest validation in the NIC. If we allow collapsing SKBs with mixed marks, we will need to force software copy+crc verification. As TCP collapse is indeed rare and the offload is opportunistic in nature, we can make this change and submit another version, but I'm confused; why was it OK for TLS, while it is not OK for DDP+CRC?