Re: [PATCH v2 2/3] nvme-tcp: support specifying the congestion-control
From: Sagi Grimberg <sagi@grimberg.me>
Date: 2022-03-29 07:46:32
Also in: linux-nvme, lkml
> > As I said, TCP can be tuned in various ways, congestion being just one of them. I'm sure you can find a workload where rmem/wmem will make a difference.
>
> Agreed. But the rmem/wmem knob is different: we could enlarge rmem/wmem for NVMe/TCP via sysctl, and that would bring no downside to any other sockets whose rmem/wmem are not explicitly specified.
It can most certainly affect them, positively or negatively, depending on the use case.
> > In addition, based on my knowledge, application specific TCP level tuning (like congestion) is not really a common thing to do. So why in nvme-tcp?
> >
> > So to me at least, it is not clear why we should add it to the driver.
>
> As mentioned in the commit message, we can specify the congestion-control of NVMe/TCP via sysctl or by writing '/proc/sys/net/ipv4/tcp_congestion_control', but this also changes the congestion-control of all future TCP sockets on the same host that have not been explicitly assigned one, which can affect their performance.
>
> For example, consider a server in a data center with the following 2 NICs:
>
> - NIC_front-end, for interacting with clients through WAN (high latency, ms-level)
> - NIC_back-end, for interacting with the NVMe/TCP target through LAN (low latency, ECN-enabled, ideal for dctcp)
>
> This server interacts with clients (handling requests) via the front-end network and accesses the NVMe/TCP storage via the back-end network. This is a normal use case, right?
>
> For the client devices, we can’t determine their congestion-control, but normally it’s cubic by default (per CONFIG_DEFAULT_TCP_CONG). So if we change the default congestion-control on the server to dctcp on behalf of the NVMe/TCP traffic on the LAN side, the front-end sockets also switch to dctcp while the client side stays on cubic. This is an unexpected scenario.
>
> In addition, distributed storage products like the following have the same problem:
>
> - The product consists of a cluster of servers.
> - Each server serves clients via its front-end NIC (WAN, high latency).
> - All servers interact with each other via NVMe/TCP over the back-end NIC (LAN, low latency, ECN-enabled, ideal for dctcp).
Separate networks are still not application (nvme-tcp) specific and as mentioned, we have a way to control that. IMO, this still does not qualify as solid justification to add this to nvme-tcp. What do others think?
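
For readers following the thread, the two scopes under discussion (host-wide default versus a single socket) can be illustrated from userspace. This is only a rough sketch, assuming Linux and Python 3.6 or later, where `socket.TCP_CONGESTION` is available. It is not the code from the patch, which works inside the kernel driver. The helper names are made up for the illustration, and "reno" is used only because it is normally permitted for unprivileged processes.

```python
import socket

# Path quoted in the thread: the host-wide default for new TCP sockets.
SYSCTL_PATH = "/proc/sys/net/ipv4/tcp_congestion_control"

# Kernel limit on the algorithm name length (TCP_CA_NAME_MAX).
CA_NAME_MAX = 16


def system_default_cc():
    """Return the host-wide default algorithm (hypothetical helper)."""
    with open(SYSCTL_PATH) as f:
        return f.read().strip()


def set_socket_cc(sock, name):
    """Select the algorithm for this one socket only (hypothetical helper)."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, name.encode())


def get_socket_cc(sock):
    """Return the algorithm currently attached to this socket."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, CA_NAME_MAX)
    # The kernel returns a NUL-padded name; keep the part before the first NUL.
    return raw.split(b"\x00", 1)[0].decode()


if __name__ == "__main__":
    back_end = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    front_end = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        set_socket_cc(back_end, "reno")  # per-socket override
        print("system default:", system_default_cc())
        print("back-end socket:", get_socket_cc(back_end))
        print("front-end socket:", get_socket_cc(front_end))
    finally:
        back_end.close()
        front_end.close()
```

The per-socket option leaves sockets that never call it on the host-wide default, which is the isolation property being argued about. Whether nvme-tcp should expose such a knob is the open question in this thread.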