RE: [PATCH v3] PCI: hv: Detect and fix Hyper-V PCI domain number collision
From: Haiyang Zhang <haiyangz@microsoft.com>
Date: 2019-08-13 14:39:26
Also in:
linux-pci, lkml
> -----Original Message-----
> From: Lorenzo Pieralisi <redacted>
> Sent: Tuesday, August 13, 2019 10:26 AM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: sashal@kernel.org; bhelgaas@google.com; linux-hyperv@vger.kernel.org;
> linux-pci@vger.kernel.org; KY Srinivasan [off-list ref]; Stephen Hemminger
> [off-list ref]; olaf@aepfle.de; vkuznets [off-list ref];
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3] PCI: hv: Detect and fix Hyper-V PCI domain number
> collision
>
> On Tue, Aug 13, 2019 at 12:55:59PM +0000, Haiyang Zhang wrote:
> > > -----Original Message-----
> > > From: Lorenzo Pieralisi <redacted>
> > > Sent: Tuesday, August 13, 2019 6:14 AM
> > > To: Haiyang Zhang <haiyangz@microsoft.com>
> > > Cc: sashal@kernel.org; bhelgaas@google.com; linux-hyperv@vger.kernel.org;
> > > linux-pci@vger.kernel.org; KY Srinivasan [off-list ref]; Stephen Hemminger
> > > [off-list ref]; olaf@aepfle.de; vkuznets [off-list ref];
> > > linux-kernel@vger.kernel.org
> > > Subject: Re: [PATCH v3] PCI: hv: Detect and fix Hyper-V PCI domain number
> > > collision
> > >
> > > On Mon, Aug 12, 2019 at 06:20:53PM +0000, Haiyang Zhang wrote:
> > > > Currently in Azure cloud, for passthrough devices including GPU, the host
> > > > sets the device instance ID's bytes 8 - 15 to a value derived from the host
> > > > HWID, which is the same on all devices in a VM. So, the device instance
> > > > ID's bytes 8 and 9 provided by the host are no longer unique. This can
> > > > cause device passthrough to VMs to fail because the bytes 8 and 9 are used
> > > > as PCI domain number. Collision of domain numbers will cause the second
> > > > device with the same domain number fail to load.
> > > >
> > > > As recommended by Azure host team, the bytes 4, 5 have more uniqueness
> > > > (info entropy) than bytes 8, 9. So now we use bytes 4, 5 as the PCI domain
> > > > numbers. On older hosts, bytes 4, 5 can also be used -- no backward
> > > > compatibility issues here. The chance of collision is greatly reduced. In
> > > > the rare cases of collision, we will detect and find another number that is
> > > > not in use.
> > >
> > > I have not explained what I meant correctly. This patch fixes an issue
> > > and the "find another number" fallback can be also applied to the
> > > current kernel without changing the bytes you use for domain numbers.
> > >
> > > This patch would leave old kernels susceptible to breakage.
> > >
> > > Again, I have no Azure knowledge but it seems better to me to add a
> > > fallback "find another number" allocation on top of mainline and send
> > > it to stable kernels. Then we can add another patch to change the
> > > bytes you use to reduce the number of collision.
> > >
> > > Please let me know what you think, thanks.
> >
> > Thanks for your clarification. Actually, I hope the stable kernel will be
> > patched to use bytes 4,5 too, because host provided numbers are persistent
> > across reboots, we like to use them if possible.
> >
> > I think we can either --
> > 1) Apply this patch for mainline and stable kernels as well.
> > 2) Or, break this patch into two patches, and apply both of them for
> > Mainline and stable kernels.
>
> (2) since one patch is a fix and the other one an (optional - however
> important it is) change. This way if the optional change needs reverting
> we still have a working kernel.
>
> In the end it is up to you - I am just expressing what I think is the
> most sensible way forward.

Sure, I agree with you, and will break the patch into two, and resubmit.

Thanks,
- Haiyang
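
For readers following the thread, the two changes being split can be sketched in user-space C. This is only an illustration and not the patch itself. All names (dom_from_instance_id, dom_alloc, dom_release, DOM_MAP_SIZE, DOM_INVALID) are hypothetical. The byte order used to combine bytes 4 and 5, and the choice to scan for a free number starting at 0, are assumptions the thread does not state.

```c
#include <assert.h>
#include <stdint.h>

#define DOM_MAP_SIZE 0x10000u    /* assumed: domain numbers fit in 16 bits */
#define DOM_INVALID  0xFFFFFFFFu /* hypothetical "no free number" marker */

/* One bit per domain number; a set bit means the number is in use. */
static uint8_t dom_in_use[DOM_MAP_SIZE / 8];

/* Mark dom as used and report whether it was already taken. */
static int dom_test_and_set(uint32_t dom)
{
    assert(dom < DOM_MAP_SIZE);
    uint8_t mask = (uint8_t)(1u << (dom % 8));
    int was_set = (dom_in_use[dom / 8] & mask) != 0;
    dom_in_use[dom / 8] |= mask;
    return was_set;
}

/* Optional change: derive the requested number from bytes 4 and 5 of the
 * 16-byte device instance ID (byte 5 as the high byte is an assumption). */
static uint16_t dom_from_instance_id(const uint8_t id[16])
{
    return (uint16_t)((id[5] << 8) | id[4]);
}

/* Fix: use the requested number if it is free; on a collision, fall back
 * to the lowest number that is not in use. */
static uint32_t dom_alloc(uint16_t requested)
{
    if (!dom_test_and_set(requested))
        return requested;

    for (uint32_t d = 0; d < DOM_MAP_SIZE; d++) {
        if (!dom_test_and_set(d))
            return d;
    }
    return DOM_INVALID;
}

/* Give a number back when its device goes away. */
static void dom_release(uint32_t dom)
{
    assert(dom < DOM_MAP_SIZE);
    dom_in_use[dom / 8] &= (uint8_t)~(1u << (dom % 8));
}
```

In this sketch, two devices whose instance IDs share bytes 4 and 5 both request the same number. The first call to dom_alloc() returns it and the second returns a different, unused number, so neither device fails to load. The fallback in dom_alloc() does not depend on which bytes produce the request, which is why it can go in first as a separate fix.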