RE: [PATCH v4,1/2] PCI: hv: Detect and fix Hyper-V PCI domain number collision
From: Haiyang Zhang <haiyangz@microsoft.com>
Date: 2019-08-14 15:33:18
Also in:
linux-pci, lkml
-----Original Message-----
From: Bjorn Helgaas <helgaas@kernel.org>
Sent: Wednesday, August 14, 2019 12:34 AM
To: Haiyang Zhang <haiyangz@microsoft.com>
Cc: sashal@kernel.org; lorenzo.pieralisi@arm.com; linux-hyperv@vger.kernel.org; linux-pci@vger.kernel.org; KY Srinivasan [off-list ref]; Stephen Hemminger [off-list ref]; olaf@aepfle.de; vkuznets [off-list ref]; linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4,1/2] PCI: hv: Detect and fix Hyper-V PCI domain number collision

Thanks for splitting these; I think that makes more sense.

On Wed, Aug 14, 2019 at 12:38:54AM +0000, Haiyang Zhang wrote:
Currently in Azure cloud, for passthrough devices including GPU, the host sets the device instance ID's bytes 8 - 15 to a value derived from the host HWID, which is the same on all devices in a VM. So, the device instance ID's bytes 8 and 9 provided by the host are no longer unique. This can cause device passthrough to VMs to fail because the bytes 8 and 9 are used as PCI domain number. Collision of domain numbers will cause the second device with the same domain number to fail to load.

I think this patch is fine. I could be misunderstanding the commit log, but when you say "the ID bytes 8 and 9 are *no longer* unique", that suggests that they *used* to be unique but stopped being unique at some point, which of course raises the question of *when* they became non-unique. The specific information about that point would be useful to have in the commit log, e.g., is this related to a specific version of Azure, a configuration change, etc?
The host-side change happened last year and was rolled out to all Azure hosts. I will put "all current Azure hosts" in the commit log.
Does this problem affect GPUs more than other passthrough devices? If all passthrough devices are affected, why mention GPUs in particular? I can't tell whether that information is relevant or superfluous.
We found this issue initially on multiple passthrough GPUs; I mentioned them just as an example. I will remove the word, because any PCI device may be affected.

Thanks,
- Haiyang
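
For readers following the thread, the collision described in the quoted commit log can be illustrated with a small userspace C sketch. It is not the code from this patch. The helper names (dom_from_instance_id, dom_claim, dom_release), the byte order used to combine bytes 8 and 9, the use of 0 as an "invalid" value, and the lowest-free-slot fallback are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define DOM_MAP_SIZE 65536u /* one slot per possible 16-bit domain number */
#define DOM_INVALID  0u     /* assumption: 0 is reserved and never handed out */

/* Tracks which domain numbers are currently claimed. */
static bool dom_in_use[DOM_MAP_SIZE];

/*
 * Hypothetical helper: derive a 16-bit domain number from bytes 8 and 9
 * of a 16-byte device instance ID. The byte order (byte 8 as the high
 * byte) is an assumption for this sketch.
 */
static uint16_t dom_from_instance_id(const uint8_t id[16])
{
    return (uint16_t)(((uint16_t)id[8] << 8) | id[9]);
}

/*
 * Claim the wanted domain number if it is free; otherwise fall back to
 * the lowest free non-zero number. Returns DOM_INVALID if none is left.
 */
static uint16_t dom_claim(uint16_t wanted)
{
    if (wanted != DOM_INVALID && !dom_in_use[wanted]) {
        dom_in_use[wanted] = true;
        return wanted;
    }

    for (uint32_t i = 1; i < DOM_MAP_SIZE; i++) {
        if (!dom_in_use[i]) {
            dom_in_use[i] = true;
            return (uint16_t)i;
        }
    }

    return DOM_INVALID;
}

/* Release a previously claimed domain number, e.g. on device removal. */
static void dom_release(uint16_t dom)
{
    if (dom != DOM_INVALID)
        dom_in_use[dom] = false;
}
```

With two instance IDs that share bytes 8 and 9, the first call to dom_claim() keeps the requested number and the second receives a different, unused one, so neither device is rejected for a duplicate domain.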