Thread (5 messages), 2 authors, 2019-08-13

RE: [PATCH v3] PCI: hv: Detect and fix Hyper-V PCI domain number collision

From: Haiyang Zhang <haiyangz@microsoft.com>
Date: 2019-08-13 14:39:26
Also in: linux-pci, lkml

-----Original Message-----
From: Lorenzo Pieralisi <redacted>
Sent: Tuesday, August 13, 2019 10:26 AM
To: Haiyang Zhang <haiyangz@microsoft.com>
Cc: sashal@kernel.org; bhelgaas@google.com; linux-hyperv@vger.kernel.org;
linux-pci@vger.kernel.org; KY Srinivasan [off-list ref]; Stephen Hemminger
[off-list ref]; olaf@aepfle.de; vkuznets [off-list ref];
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] PCI: hv: Detect and fix Hyper-V PCI domain number
collision

On Tue, Aug 13, 2019 at 12:55:59PM +0000, Haiyang Zhang wrote:
-----Original Message-----
From: Lorenzo Pieralisi <redacted>
Sent: Tuesday, August 13, 2019 6:14 AM
To: Haiyang Zhang <haiyangz@microsoft.com>
Cc: sashal@kernel.org; bhelgaas@google.com; linux-hyperv@vger.kernel.org;
linux-pci@vger.kernel.org; KY Srinivasan [off-list ref]; Stephen Hemminger
[off-list ref]; olaf@aepfle.de; vkuznets [off-list ref];
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] PCI: hv: Detect and fix Hyper-V PCI domain number
collision

On Mon, Aug 12, 2019 at 06:20:53PM +0000, Haiyang Zhang wrote:
Currently in Azure cloud, for passthrough devices including GPU, the host
sets the device instance ID's bytes 8 - 15 to a value derived from the host
HWID, which is the same on all devices in a VM. So, the device instance
ID's bytes 8 and 9 provided by the host are no longer unique. This can
cause device passthrough to VMs to fail because the bytes 8 and 9 are used
as PCI domain number. Collision of domain numbers will cause the second
device with the same domain number to fail to load.

As recommended by Azure host team, the bytes 4, 5 have more uniqueness
(info entropy) than bytes 8, 9. So now we use bytes 4, 5 as the PCI domain
numbers. On older hosts, bytes 4, 5 can also be used -- no backward
compatibility issues here. The chance of collision is greatly reduced. In
the rare cases of collision, we will detect and find another number that is
not in use.
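
The scheme described above (prefer a number taken from bytes 4, 5 of the instance ID, and fall back to any unused number on collision) can be sketched roughly as follows. This is a minimal userspace illustration, not the patch itself: the names `dom_from_instance_id`, `dom_claim` and `dom_release`, the byte order, the search direction, and the treatment of domain 0 as reserved are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the driver's actual code. */
#define DOM_MAP_SIZE 65536u   /* a 16-bit domain number has 65536 values */
#define DOM_INVALID  0u       /* assumption: domain 0 is never handed out */

/* One flag per possible domain number; true means "already claimed". */
static bool dom_in_use[DOM_MAP_SIZE];

/*
 * Build a candidate domain number from bytes 4 and 5 of a 16-byte
 * device instance ID. Using byte 5 as the high byte is an assumption.
 */
static uint16_t dom_from_instance_id(const uint8_t id[16])
{
    return (uint16_t)((id[5] << 8) | id[4]);
}

/*
 * Claim `wanted` if it is free. On collision, claim the lowest free
 * number instead. Returns DOM_INVALID when every number is taken.
 */
static uint16_t dom_claim(uint16_t wanted)
{
    if (wanted != DOM_INVALID && !dom_in_use[wanted]) {
        dom_in_use[wanted] = true;
        return wanted;
    }

    for (uint32_t d = 1; d < DOM_MAP_SIZE; d++) {
        if (!dom_in_use[d]) {
            dom_in_use[d] = true;
            return (uint16_t)d;
        }
    }

    return DOM_INVALID;
}

/* Give a domain number back, e.g. when the device is removed. */
static void dom_release(uint16_t dom)
{
    if (dom != DOM_INVALID)
        dom_in_use[dom] = false;
}
```

With this shape, two devices whose instance IDs share bytes 4, 5 still end up with distinct domain numbers: the first keeps the host-derived value, and the second receives a fallback. Only the fallback number is at risk of changing across reboots, which is why the host-derived value is tried first.
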
I have not explained what I meant correctly. This patch fixes an
issue and the "find another number" fallback can also be applied
to the current kernel without changing the bytes you use for
domain numbers.

This patch would leave old kernels susceptible to breakage.

Again, I have no Azure knowledge but it seems better to me to
add a fallback "find another number" allocation on top of mainline
and send it to stable kernels. Then we can add another patch to
change the bytes you use to reduce the number of collisions.

Please let me know what you think, thanks.
Thanks for your clarification.
Actually, I hope the stable kernel will be patched to use bytes 4, 5 too:
host-provided numbers are persistent across reboots, so we would like
to use them if possible.

I think we can either --
1) Apply this patch for mainline and stable kernels as well.
2) Or, break this patch into two patches, and apply both of them for
mainline and stable kernels.
(2) since one patch is a fix and the other one an (optional - however
important it is) change.

This way if the optional change needs reverting we still have a working
kernel.

In the end it is up to you - I am just expressing what I think is the
most sensible way forward.
Sure, I agree with you, and will break the patch into two, and resubmit.

Thanks,
- Haiyang