Thread: 20 messages, 4 authors, 2022-03-31

Re: [PATCH v2 2/3] nvme-tcp: support specifying the congestion-control

From: Mingbao Sun <hidden>
Date: 2022-03-31 04:14:01
Also in: linux-nvme, lkml

On Tue, 29 Mar 2022 10:46:08 +0300
Sagi Grimberg wrote:
> > > As I said, TCP can be tuned in various ways, congestion being just one
> > > of them. I'm sure you can find a workload where rmem/wmem will make
> > > a difference.
> >
> > Agreed.
> > But the difference with the rmem/wmem knob is that
> > we could enlarge rmem/wmem for NVMe/TCP via sysctl,
> > and it would not bring a downside to any other sockets whose
> > rmem/wmem are not explicitly specified.
>
> It can most certainly affect them, positively or negatively, depending
> on the use-case.

Agreed.
You are right; that is the more rigorous way to put it.

> > > In addition, based on my knowledge, application specific TCP level
> > > tuning (like congestion) is not really a common thing to do. So why in
> > > nvme-tcp?
> > >
> > > So to me at least, it is not clear why we should add it to the driver.
> >
> > As mentioned in the commit message, although we can specify the
> > congestion control of NVMe/TCP via sysctl or by writing
> > '/proc/sys/net/ipv4/tcp_congestion_control', doing so also
> > changes the congestion control of all future TCP sockets on
> > the same host that have not been explicitly assigned one,
> > which can hurt their performance.
> >
> > For example:
> >
> > A server in a data center has the following two NICs:
> >
> >      - NIC_front-end, for interacting with clients over the WAN
> >        (high latency, ms-level)
> >
> >      - NIC_back-end, for interacting with the NVMe/TCP target over the LAN
> >        (low latency, ECN-enabled, ideal for dctcp)
> >
> > This server handles client requests via the front-end network
> > and accesses the NVMe/TCP storage via the back-end network.
> > This is a normal use case, right?
> >
> > We cannot determine which congestion control the client devices use,
> > but normally it is cubic by default (per CONFIG_DEFAULT_TCP_CONG).
> > So if we change the default congestion control on the server to dctcp
> > for the sake of the NVMe/TCP traffic on the LAN side, we would at the
> > same time switch the newly created front-end sockets to dctcp, while
> > the clients keep using cubic. This is an unintended outcome.
> >
> > In addition, distributed storage products like the following have
> > the same problem:
> >
> >      - The product consists of a cluster of servers.
> >
> >      - Each server serves clients via its front-end NIC
> >        (WAN, high latency).
> >
> >      - All servers talk to each other over NVMe/TCP via the back-end NIC
> >        (LAN, low latency, ECN-enabled, ideal for dctcp).
>
> Separate networks are still not application (nvme-tcp) specific and as
> mentioned, we have a way to control that. IMO, this still does not
> qualify as solid justification to add this to nvme-tcp.
>
> What do others think?

Well, given that the approach proposed by Jakub (‘ip route …’)
largely covers the per-link congestion-control requirement,
the usefulness of this patchset is really not that significant.
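
For reference, a minimal user-space sketch of the per-socket idea (an illustration only, not the patch's in-kernel driver code): on Linux, a single TCP socket can be given its own algorithm with the TCP_CONGESTION socket option, while the host-wide net.ipv4.tcp_congestion_control default stays untouched for every other socket. The helper names set_socket_cc(), get_socket_cc() and show_per_socket_cc() are hypothetical. The sketch assumes "reno" as the requested algorithm because it is always built in and unprivileged callers may select it; "dctcp" would additionally need the tcp_dctcp module and, unless it is listed in net.ipv4.tcp_allowed_congestion_control, CAP_NET_ADMIN.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_CONGESTION (Linux-specific) */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: attach congestion-control algorithm `name` to this
 * one socket only. Returns 0 on success, or -1 with errno set (e.g. ENOENT
 * if the algorithm is unavailable, EPERM if an unprivileged caller asks for
 * one outside net.ipv4.tcp_allowed_congestion_control). */
int set_socket_cc(int fd, const char *name)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name,
                      (socklen_t)strlen(name));
}

/* Hypothetical helper: read the socket's current algorithm into buf
 * (size >= 1). One byte is reserved so the result is always NUL-terminated.
 * Returns 0 on success, -1 on error. */
int get_socket_cc(int fd, char *buf, size_t size)
{
    socklen_t len = (socklen_t)(size - 1);
    memset(buf, 0, size);
    return getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len);
}

/* Give one socket its own algorithm, then check that a socket created
 * afterwards still inherits the unchanged host-wide default.
 * Returns 0 if so, -1 on any error. */
int show_per_socket_cc(const char *want)
{
    char dflt[16], mine[16], fresh[16]; /* kernel CC names fit in 16 bytes */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    int ok = get_socket_cc(fd, dflt, sizeof dflt) == 0 &&
             set_socket_cc(fd, want) == 0 &&
             get_socket_cc(fd, mine, sizeof mine) == 0;
    close(fd);
    if (!ok)
        return -1;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    ok = get_socket_cc(fd, fresh, sizeof fresh) == 0;
    close(fd);
    if (!ok)
        return -1;

    printf("host default:    %s\n", dflt);
    printf("this socket:     %s\n", mine);
    printf("next new socket: %s\n", fresh);
    return strcmp(dflt, fresh) == 0 ? 0 : -1;
}
```

Calling show_per_socket_cc("reno") from a main() prints the three names; the first and last lines both show the host default, which is cubic on a kernel built with the usual CONFIG_DEFAULT_TCP_CONG.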

So I am hereby dropping this patchset and closing all of its threads.

Finally, many thanks to all of you for reviewing this patchset.