Thread: 4 messages, 2 authors, 2019-08-14

RE: [PATCH v4,1/2] PCI: hv: Detect and fix Hyper-V PCI domain number collision

From: Haiyang Zhang <haiyangz@microsoft.com>
Date: 2019-08-14 15:33:18
Also in: linux-pci, lkml

-----Original Message-----
From: Bjorn Helgaas <helgaas@kernel.org>
Sent: Wednesday, August 14, 2019 12:34 AM
To: Haiyang Zhang <haiyangz@microsoft.com>
Cc: sashal@kernel.org; lorenzo.pieralisi@arm.com;
linux-hyperv@vger.kernel.org; linux-pci@vger.kernel.org; KY Srinivasan
[off-list ref]; Stephen Hemminger [off-list ref];
olaf@aepfle.de; vkuznets [off-list ref];
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4,1/2] PCI: hv: Detect and fix Hyper-V PCI domain
number collision

> Thanks for splitting these; I think that makes more sense.
>
> On Wed, Aug 14, 2019 at 12:38:54AM +0000, Haiyang Zhang wrote:
> > Currently in Azure cloud, for passthrough devices including GPU, the host
> > sets the device instance ID's bytes 8 - 15 to a value derived from the host
> > HWID, which is the same on all devices in a VM. So, the device instance
> > ID's bytes 8 and 9 provided by the host are no longer unique. This can
> > cause device passthrough to VMs to fail because the bytes 8 and 9 are used
> > as PCI domain number. Collision of domain numbers will cause the second
> > device with the same domain number fail to load.
>
> I think this patch is fine.  I could be misunderstanding the commit
> log, but when you say "the ID bytes 8 and 9 are *no longer* unique",
> that suggests that they *used* to be unique but stopped being unique
> at some point, which of course raises the question of *when* they
> became non-unique.
>
> The specific information about that point would be useful to have in
> the commit log, e.g., is this related to a specific version of Azure,
> a configuration change, etc?

The host-side change happened last year and was rolled out to all Azure hosts.
I will put "all current Azure hosts" in the commit log.

> Does this problem affect GPUs more than other passthrough devices?  If
> all passthrough devices are affected, why mention GPUs in particular?
> I can't tell whether that information is relevant or superfluous.

We initially found this issue on multiple passthrough GPUs; I mentioned GPUs
just as an example. I will remove this word, because any PCI device may
be affected.

Thanks,
- Haiyang
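
Editorial illustration (not part of the original message): the collision
described in the quoted commit log can be sketched in a few lines of
user-space Python. This is only a sketch under stated assumptions and is not
the kernel patch. The helper names (requested_domain, DomainAllocator), the
byte order used to combine bytes 8 and 9, the sample instance IDs, and the
"fall back to the first free number" policy are all hypothetical choices made
for the example.

```python
from typing import Set


def requested_domain(instance_id: bytes) -> int:
    """Derive a 16-bit domain number from bytes 8 and 9 of a 16-byte ID.

    Assumption: byte 9 is treated as the high byte. The thread only says
    that bytes 8 and 9 are used, not how they are combined.
    """
    if len(instance_id) != 16:
        raise ValueError("expected a 16-byte device instance ID")
    return (instance_id[9] << 8) | instance_id[8]


class DomainAllocator:
    """Hypothetical allocator that hands out unique domain numbers."""

    def __init__(self, limit: int = 0x10000) -> None:
        self._limit = limit
        self._used: Set[int] = set()

    def allocate(self, requested: int) -> int:
        # Prefer the number derived from the instance ID when it is free.
        if 0 <= requested < self._limit and requested not in self._used:
            self._used.add(requested)
            return requested
        # Collision: fall back to the lowest unused number (assumed policy).
        for candidate in range(self._limit):
            if candidate not in self._used:
                self._used.add(candidate)
                return candidate
        raise RuntimeError("no free PCI domain numbers")


# Two made-up devices in one VM: bytes 0-7 differ, but bytes 8-15 carry the
# same host-derived value, so both ask for the same domain number.
HOST_DERIVED = bytes.fromhex("3412aabbccddeeff")
gpu_a = bytes.fromhex("0001020304050607") + HOST_DERIVED
gpu_b = bytes.fromhex("1011121314151617") + HOST_DERIVED

allocator = DomainAllocator()
domain_a = allocator.allocate(requested_domain(gpu_a))
domain_b = allocator.allocate(requested_domain(gpu_b))
```

Without the collision check, both devices would end up with the same domain
number, which is the failure the commit log describes for the second device.
With the check, the second device is given a different, unused number.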