Re: [PATCH] vsock: Enable H2G override
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: 2026-03-03 20:53:07
On Tue, Mar 03, 2026 at 09:47:26PM +0100, Alexander Graf wrote:
On 03.03.26 15:17, Bryan Tan wrote:
On Tue, Mar 3, 2026 at 9:49 AM Stefano Garzarella [off-list ref] wrote:
On Mon, Mar 02, 2026 at 08:04:22PM +0100, Alexander Graf wrote:
On 02.03.26 17:25, Stefano Garzarella wrote:
On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
On 02.03.26 13:06, Stefano Garzarella wrote:
CCing Bryan, Vishnu, and Broadcom list.
On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
Please target net-next tree for this new feature.
On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
Vsock maintains a single CID number space which can be used to communicate to the host (G2H) or to a child-VM (H2G). The current logic trivially assumes that G2H is only relevant for CID <= 2 because these target the hypervisor. However, in environments like Nitro Enclaves, an instance that hosts vhost_vsock powered VMs may still want to communicate to Enclaves that are reachable at higher CIDs through virtio-vsock-pci.

That means that for CID > 2, we really want an overlay. By default, all CIDs are owned by the hypervisor. But if vhost registers a CID, it takes precedence. Implement that logic. Vhost already knows which CIDs it supports anyway.

With this logic, I can run a Nitro Enclave as well as a nested VM with vhost-vsock support in parallel, with the parent instance able to communicate to both simultaneously.

I honestly don't understand why VMADDR_FLAG_TO_HOST (added specifically for Nitro IIRC) isn't enough for this scenario and we have to add this change. Can you elaborate a bit more about the relationship between this change and VMADDR_FLAG_TO_HOST we added?

The main problem I have with VMADDR_FLAG_TO_HOST for connect() is that it punts the complexity to the user. Instead of a single CID address space, you now effectively create 2 spaces: One for TO_HOST (needs a flag) and one for TO_GUEST (no flag). But every user space tool needs to learn about this flag. That may work for super special-case applications. But propagating that all the way into socat, iperf, etc etc? It's just creating friction.

Okay, I would like to have this (or part of it) in the commit message to better explain why we want this change.
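For readers following the routing argument, here is a rough sketch of the two lookups being compared, written as a toy Python model rather than kernel code. It assumes the pre-patch rule is "CID <= 2, no H2G transport, or VMADDR_FLAG_TO_HOST set goes to G2H; everything else goes to H2G", and that the patch adds a fallback to G2H when vhost has not registered the CID. The function names and the set of registered CIDs are hypothetical and exist only for illustration.

```python
VMADDR_CID_HOST = 2         # well-known host CID
VMADDR_FLAG_TO_HOST = 0x01  # flag value as defined in linux/vm_sockets.h


def route_current(cid, flags, h2g_loaded):
    """Assumed pre-patch rule: the flag (or a low CID) selects G2H."""
    if cid <= VMADDR_CID_HOST or not h2g_loaded or (flags & VMADDR_FLAG_TO_HOST):
        return "g2h"
    return "h2g"


def route_overlay(cid, flags, h2g_cids):
    """Assumed post-patch rule: vhost-registered CIDs take precedence,
    every other CID falls back to G2H.

    h2g_cids is the (hypothetical) set of CIDs vhost has registered,
    or None when no H2G transport is loaded.
    """
    if cid <= VMADDR_CID_HOST or h2g_cids is None or (flags & VMADDR_FLAG_TO_HOST):
        return "g2h"
    return "h2g" if cid in h2g_cids else "g2h"


# A nested VM owns CID 3; CID 42 is an enclave reachable only via G2H.
nested = {3}
print(route_current(42, 0, True))                    # needs the flag today
print(route_current(42, VMADDR_FLAG_TO_HOST, True))  # flag forces G2H
print(route_overlay(42, 0, nested))                  # falls back without a flag
print(route_overlay(3, 0, nested))                   # vhost still wins for its CID
```

In this model the only difference between the two functions is the last line of `route_overlay`, which is the "single CID space" behavior the patch description argues for.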
IMHO the most natural experience is to have a single CID space, potentially manually segmented by launching VMs of one kind within a certain range.

I see, but at this point, should the kernel set VMADDR_FLAG_TO_HOST in the remote address if that path is taken "automagically"? That way, user space can tell whether it is talking to a nested guest or a sibling guest.

That said, I'm concerned about the scenario where an application does not even consider communicating with a sibling VM.

If that's really a realistic concern, then we should add a VMADDR_FLAG_TO_GUEST that the application can set. Default behavior of an application that provides no flags is "route to whatever you can find": If vhost is loaded, it routes to vhost. If a vsock backend driver is loaded, it routes there. But the application has no say in where it goes: It's purely a system configuration thing.

mmm, we have always documented this simple behavior:
- CID = 2 talks to the host
- CID >= 3 talks to the guest

Now we are changing this by adding fallback. I don't think we should change the default behavior, but rather provide new ways to enable this new behavior.

I find it strange that an application running on Linux 7.0 has a default behavior where using CID=42 always talks to a nested VM, but starting with Linux 7.1, it also starts talking to a sibling VM.

This is true for complex things like IP, but for VSOCK we have always wanted to keep the default behavior very simple (as written above). Everything else must be explicitly enabled IMHO.
Until now, it knew that by not setting that flag, it could only talk to nested VMs, so if there was no VM with that CID, the connection simply failed. Whereas from this patch onwards, if the device in the host supports sibling VMs and there is a VM with that CID, the application finds itself talking to a sibling VM instead of a nested one, without having any idea.

I'd say an application that attempts to talk to a CID without knowing whether it is vhost routed is running into "undefined" territory. If you rmmod the vhost driver, it would also talk to the hypervisor provided vsock.

Oh, I missed that. And I also fixed that behaviour with commit 65b422d9b61b ("vsock: forward all packets to the host when no H2G is registered") after I implemented the multi-transport support.

mmm, this could change my position ;-) (although, to be honest, I don't understand why it was like that in the first place, but that's how it is now).

Please also document this in the new commit message; it's a good point.

Although when H2G is loaded, we behave differently. However, it is true that sysctl helps us standardize this behavior. I don't know whether to see it as a regression or not.
Should we make this feature opt-in in some way, such as sockopt or sysctl? (I understand that there is the previous problem, but honestly, it seems like a significant change to the behavior of AF_VSOCK).

We can create a sysctl to enable this behavior with default=on. But I'm against making the cumbersome does-not-work-out-of-the-box experience the default. Will include it in v2.

The opposite point of view is that we would not want to have different default behavior between 7.0 and 7.1 when H2G is loaded.

From a VMCI perspective, we only allow communication from guest to host CIDs 0 and 2. With has_remote_cid implemented for VMCI, we end up attempting guest to guest communication. As mentioned, this does already happen if there isn't an H2G transport registered, so we should be handling this anyway. But I'm not too fond of the change in behaviour for when H2G is present, so at the very least I'd prefer if has_remote_cid is not implemented for VMCI. Or perhaps there could be a way for a G2H transport to explicitly note that it supports CIDs greater than 2? With this, it would be easier to see this patch as preserving the default behaviour for some transports and fixing a bug for others.

I understand what you want, but beware that it's actually a change in behavior. Today, whether Linux will send vsock connects to VMCI depends on whether the vhost kernel module is loaded: If it's loaded, you don't see the connect attempt. If it's not loaded, the connect will come through to VMCI.

I agree that it makes sense to limit VMCI to only ever see connects to <= 2 consistently. But as I said above, it's actually a change in behavior.

Alex
I think it was unintentional, but if you really think people want a special module that changes the kernel's behaviour on load, we can certainly do that. But any hack like this will not be namespace safe.
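To make the VMCI point concrete, here is a minimal sketch of the alternative floated above, in which each transport states which CIDs it is willing to carry, so the result no longer depends on whether the vhost module happens to be loaded. This is a toy Python model, not kernel code: the `Transport` class, the `pick_transport` helper and the example CID sets are all hypothetical, and only the name has_remote_cid comes from the discussion.

```python
class Transport:
    """Toy transport: 'claims' is a predicate saying which CIDs it carries."""

    def __init__(self, name, claims):
        self.name = name
        self._claims = claims

    def has_remote_cid(self, cid):
        return self._claims(cid)


def pick_transport(cid, h2g, g2h):
    """H2G wins if it claims the CID, then G2H if it claims it, else None."""
    if h2g is not None and h2g.has_remote_cid(cid):
        return h2g.name
    if g2h is not None and g2h.has_remote_cid(cid):
        return g2h.name
    return None  # models a failed connect()


# Assumed policies: VMCI only carries CIDs 0 and 2, a virtio G2H device
# carries anything, and vhost carries the guests it has registered.
vmci = Transport("vmci", lambda cid: cid in (0, 2))
virtio = Transport("virtio", lambda cid: True)
vhost = Transport("vhost", lambda cid: cid in {3})

# With VMCI as G2H, CID 42 fails whether or not vhost is loaded.
print(pick_transport(42, vhost, vmci), pick_transport(42, None, vmci))
# With virtio as G2H, CID 42 falls through to the hypervisor side.
print(pick_transport(42, vhost, virtio))
```

Under these assumptions VMCI sees connects to CIDs 0 and 2 only, in both the module-loaded and module-unloaded cases, which is the consistency being asked for. The sketch says nothing about namespaces.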