Re: [RFC PATCH 00/21] Secure VFIO, TDISP, SEV TIO
From: Alexey Kardashevskiy <hidden>
Date: 2024-08-30 04:38:28
Also in:
kvm, linux-iommu, linux-pci
On 30/8/24 09:41, Dan Williams wrote:
Alexey Kardashevskiy wrote:
[..]
- skipping various enforcements of non-SME or SWIOTLB in the guest;
Is this based on some concept of private vs shared mode devices?
No mixed share+private DMA supported within the same IOMMU.
What does this mean? A device may not have mixed mappings (makes sense),
Currently devices do not have an idea about private host memory (but it is being worked on afaik).
Worked on where? You mean the PCI core indicating that a device is private or not? Is that not indicated by guest-side TSM connection state?
or an IOMMU can not host devices that do not all agree on whether DMA is private or shared?
The hardware allows that via hardware-assisted vIOMMU and I/O page tables in the guest, with the C-bit taken into account by the IOMMU, but the software support is missing right now. So for this initial drop, vTOM is used for DMA - this thing says "everything below <addr> is private, above <addr> - shared" so nothing needs to bother with the C-bit, and in my exercise I set the <addr> to the allowed maximum. So each IOMMUFD instance in the VM is either "all private mappings" or "all shared". Could be half/half by moving that <addr> :)
I thought existing use cases assume that the CC-VM can trigger page conversions at will without regard to a vTOM concept? It would be nice to have that address-map separation arrangement, has not that ship already sailed?
Mmm. I am either confusing you too much or not following you :) Any page
can be converted; would the proposed arrangement require that
conversion-candidate pages are allocated from a specific pool?
There are two vTOMs - one in the IOMMU to decide on the C-bit for DMA
traffic (I use this one), one in the VMSA ("VIRTUAL_TOM") for guest memory
(this exercise is not using it). Which one do you mean?
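
The IOMMU-side vTOM rule quoted above ("everything below <addr> is private, above <addr> - shared") can be sketched as a single comparison. This is only an illustrative model: the struct and function names are made up here and are not from the patch set, and treating an address exactly equal to the boundary as shared is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the vTOM rule: one boundary address per domain.
 * Names are illustrative only, not taken from the SEV-TIO patches.
 */
struct vtom_domain {
	uint64_t vtom;	/* the "<addr>" boundary from the text */
};

/*
 * DMA below the boundary is treated as private (C-bit implied),
 * DMA at or above it as shared. The "at" case is an assumption.
 */
bool vtom_dma_is_private(const struct vtom_domain *d, uint64_t dma_addr)
{
	return dma_addr < d->vtom;
}
```

With the boundary set to the maximum, as in this exercise, every usable DMA address falls on the private side; lowering it would give the "half/half" split.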
[..]
Would the device not just launch in "shared" mode until it is later converted to private? I am missing the detail of why passing the device on the command line requires that private memory be mapped early.
A sequencing problem. QEMU "realizes" a VFIO device, it creates an iommufd instance which creates a domain and writes to a DTE (an IOMMU descriptor for PCI BDFn). And the DTE is not updated after that. For secure stuff, the DTE needs to be slightly different. So right then I tell IOMMUFD that it will handle private memory. Then, the same VFIO "realize" handler maps the guest memory in iommufd. I use the same flag (well, pointer to kvm) in the iommufd pinning code, private memory is pinned and mapped (and the related page state change happens as the guest memory is made guest-owned in RMP). QEMU goes to machine_reset() and calls "SNP LAUNCH UPDATE" (the actual place changed recently, huh) and the latter will measure the guest and try making all guest memory private but it already happened => error. I think I have to decouple the pinning and the IOMMU/DTE setting.
That said, the implication that private device assignment requires hotplug events is a useful property. This matches nicely with initial thoughts that device conversion events are violent and might as well be unplug/replug events to match all the assumptions around what needs to be updated.
For the initial drop, I tell QEMU via "-device vfio-pci,x-tio=true" that it is going to be private so there should be no massive conversion.
That's a SEV-TIO RFC-specific hack, or a proposal?
Not sure at the moment :)
An approach that aligns more closely with the VFIO operational model, where it maps and waits for guest faults / usages, is that QEMU would be told that the device is "bind capable", because the host is not in a position to assume how the guest will use the device. A "bind capable" device operates in shared mode unless and until the guest triggers private conversion.
True. I just started this exercise without QEMU DiscardManager. Now I rely on it but it either needs to allow dynamic flip from discarded==private to discarded==shared (should do for now) or allow 3 states for guest pages.
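
To make the "3 states for guest pages" option concrete, here is a rough sketch of what a tri-state could look like. It is purely hypothetical: the enum, the helper and the mapping policy are assumptions for illustration and do not correspond to existing QEMU code.

```c
#include <stdbool.h>

/* Hypothetical tri-state for a guest page (not an existing QEMU type). */
enum guest_page_state {
	GUEST_PAGE_DISCARDED,	/* no usable backing, never mapped for DMA */
	GUEST_PAGE_SHARED,	/* visible to the host */
	GUEST_PAGE_PRIVATE,	/* guest-owned */
};

/*
 * Assumed policy: a page is mapped into an IOMMU domain only when its
 * state matches the kind of domain (all-private or all-shared).
 */
bool page_should_be_mapped(enum guest_page_state state, bool domain_is_private)
{
	switch (state) {
	case GUEST_PAGE_SHARED:
		return !domain_is_private;
	case GUEST_PAGE_PRIVATE:
		return domain_is_private;
	case GUEST_PAGE_DISCARDED:
	default:
		return false;
	}
}
```

The two-state alternative (flipping the meaning of "discarded" between private and shared) would avoid the extra state but ties the meaning to the kind of device being assigned.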
This requires the BME hack as MMIO and
Not sure what the "BME hack" is, I guess this is foreshadowing for later in this story.
BusMaster enable bits cannot be 0 after MMIO validation is done
It would be useful to call out what is a TDISP requirement, vs device-specific DSM vs host-specific TSM requirement. In this case I assume you are referring to PCI 6.2 11.2.6 where it notes that TDIs must
Oh there is 6.2 already.
enter the TDISP ERROR state if BME is cleared after the device is locked? ...but this begs the question of whether it needs to be avoided outright
Well, besides a couple of avoidable places (like testing INTx support which we know is not going to work on VFs anyway), a standard driver enables MSE first (and the value for the command register does not have 1 for BME) and only then BME. TBH I do not think writing BME=0 when BME=0 already is "clearing" but my test device disagrees.
...but we should not be creating kernel policy around test devices. What matters is real devices. Now, if it is likely that real / production devices will go into the TDISP ERROR state by not coalescing MSE + BME updates then we need a solution.
True but I do not even know who to ask this question :)
Given it is unlikely that TDISP support will be widespread any time soon
it is likely tenable to assume TDISP compatible drivers call a new:
pci_enable(pdev, PCI_ENABLE_TARGET | PCI_ENABLE_INITIATOR);
...or something like that to coalesce command register writes.
Otherwise if that retrofit ends up being too much work or confusion then
the ROI of teaching the PCI core to recover this scenario needs to be
evaluated.
Agree.
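
To illustrate why coalescing matters, below is a toy model of the command register behaviour discussed above. Everything prefixed with toy_ is made up for this sketch, and the PCI_ENABLE_* values are placeholders for the proposed (not existing) flags. PCI_COMMAND_MEMORY and PCI_COMMAND_MASTER use the standard bit values. The toy device mimics my test device: once locked, any command register write with BME=0 counts as "clearing" and triggers the error state, which may or may not be what real devices do.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY	0x2	/* MSE: memory space enable */
#define PCI_COMMAND_MASTER	0x4	/* BME: bus master enable */

/* Placeholder values for the flags proposed above (hypothetical). */
#define PCI_ENABLE_TARGET	0x1
#define PCI_ENABLE_INITIATOR	0x2

/* Toy TDI: tracks the command register and a sticky error flag. */
struct toy_tdi {
	uint16_t command;
	bool locked;
	bool error;
	unsigned int writes;
};

/* Assumed device behaviour: a write with BME=0 after lock is an error. */
void toy_write_command(struct toy_tdi *t, uint16_t val)
{
	t->writes++;
	if (t->locked && !(val & PCI_COMMAND_MASTER))
		t->error = true;
	t->command = val;
}

/* Coalesce MSE and BME into a single command register write. */
void toy_pci_enable(struct toy_tdi *t, unsigned int flags)
{
	uint16_t cmd = t->command;

	if (flags & PCI_ENABLE_TARGET)
		cmd |= PCI_COMMAND_MEMORY;
	if (flags & PCI_ENABLE_INITIATOR)
		cmd |= PCI_COMMAND_MASTER;
	toy_write_command(t, cmd);
}
```

In this model the usual "MSE first, then BME" sequence trips the error on the first write to a locked device, while toy_pci_enable(t, PCI_ENABLE_TARGET | PCI_ENABLE_INITIATOR) does one write and leaves the device usable.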
or handled as an error recovery case depending on policy.
Avoiding seems more straightforward unless we actually want enlightened device drivers which want to examine the interface report before enabling the device. Not sure.
If TDISP capable devices trend towards a handful of devices in the near term then some driver fixups seem reasonable. Otherwise if every PCI device driver Linux has ever seen needs to be ready for that device to have a TDISP capable flavor then mitigating this in the PCI core makes more sense than playing driver whack-a-mole.
the guest OS booting process when this happens. SVSM could help address these (not implemented at the moment).
At first though avoiding SVSM entanglements where the kernel can be enlightened should be the policy. I would only expect SVSM hacks to cover for legacy OSes that will never be TDISP enlightened, but in that case we are likely talking about fully unaware L2. Let's assume fully enlightened L1 for now.
Well, I could also tweak OVMF to make necessary calls to the PSP and hack QEMU to postpone the command register updates to get this going, just a matter of ugliness.
Per above, the tradeoff should be in ROI, not ugliness. I don't see how OVMF helps when devices might be being virtually hotplugged or reset.
I have no clue how exactly hotplug works on x86, is the BIOS not playing a role in it? Thanks,

--
Alexey