Thread: 268 messages, 15 authors, 2021-06-08

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

From: Liu Yi L <hidden>
Date: 2021-05-11 13:26:14
Also in: linux-iommu, lkml

On Tue, 11 May 2021 09:10:03 +0000, Tian, Kevin wrote:
From: Jason Gunthorpe
Sent: Monday, May 10, 2021 8:37 PM
  
[...] 
gPASID!=hPASID has a problem when assigning a physical device which
supports both shared work queue (ENQCMD with PASID in MSR)
and dedicated work queue (PASID in device register) to a guest
process which is associated to a gPASID. Say the host kernel has set up
the hPASID entry with nested translation through /dev/ioasid. For
shared work queue the CPU is configured to translate gPASID in MSR
into **hPASID** before the payload goes out to the wire. However
for dedicated work queue the device MMIO register is directly mapped
to and programmed by the guest, thus containing a **gPASID** value
implying DMA requests through this interface will hit IOMMU faults
due to invalid gPASID entry. Having gPASID==hPASID is a simple
workaround here. mdev doesn't have this problem because the
PASID register is in emulated control-path thus can be translated
to hPASID manually by mdev driver.  
This all must be explicit too.

If a PASID is allocated and it is going to be used with ENQCMD then
everything needs to know it is actually quite different than a PASID
that was allocated to be used with a normal SRIOV device, for
instance.

The former case can accept that the guest PASID is virtualized, while
the latter can not.

This is also why PASID per RID has to be an option. When I assign a
full SRIOV function to the guest then that entire RID space needs to
also be assigned to the guest. Upon migration I need to take all the
physical PASIDs and rebuild them in another hypervisor exactly as is.

If you force all RIDs into a global PASID pool then normal SRIOV
migration w/PASID becomes impossible. ie ENQCMD breaks everything else
that should work.

This is why you need to sort all this out and why it feels like some
of the specs here have been mis-designed.

I'm not sure carving out ranges is really workable for migration.

I think the real answer is to carve out entire RIDs as being in the
global pool or not. Then the ENQCMD HW can be bundled together and
everything else can live in the natural PASID per RID world.
  
OK. Here is the revised scheme, with everything made explicit.

There are three scenarios to be considered:

1) SR-IOV (AMD/ARM):
	- "PASID per RID" with guest-allocated PASIDs;
	- PASID table managed by guest (in GPA space);
	- the entire PASID space delegated to guest;
	- no need to explicitly register guest-allocated PASIDs to host;
	- uAPI for attaching PASID table:

    // set to "PASID per RID"
    ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);

    // When Qemu captures a new PASID table through vIOMMU;
    pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
    ioctl(device_fd, VFIO_ATTACH_IOASID, pasidtbl_ioasid);

    // Set the PASID table to the RID associated with pasidtbl_ioasid;
    ioctl(ioasid_fd, IOASID_SET_PASID_TABLE, pasidtbl_ioasid, gpa_addr);

2) SR-IOV, no ENQCMD (Intel):
	- "PASID per RID" with guest-allocated PASIDs;
	- PASID table managed by host (in HPA space);
	- the entire PASID space delegated to guest too;
	- host must be explicitly notified for guest-allocated PASIDs;
	- uAPI for binding user-allocated PASIDs:

    // set to "PASID per RID"
    ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);

    // When Qemu captures a new PASID allocated through vIOMMU;

Is this achieved by VCMD or by capturing guest's PASID cache invalidation?

    pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
    ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);

    // Tell the kernel to associate pasid to pgtbl_ioasid in internal structure;
    // &pasid being a pointer due to a requirement in scenario-3
    ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &pasid);

    // Set guest page table to the RID+pasid associated to pgtbl_ioasid
    ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);

3) SR-IOV, ENQCMD (Intel):
	- "PASID global" with host-allocated PASIDs;
	- PASID table managed by host (in HPA space);
	- all RIDs bound to this ioasid_fd use the global pool;
	- however, exposing global PASID into guest breaks migration;
	- hybrid scheme: split local PASID range and global PASID range;
	- force guest to use only local PASID range (through vIOMMU);
	- for ENQCMD, configure CPU to translate local->global;
	- for non-ENQCMD, setup both local/global pasid entries;
	- uAPI for range split and CPU pasid mapping:

    // set to "PASID global"
    ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);

    // split local/global range, applying to all RIDs in this fd
    // Example: local [0, 1024), global [1024, max)
    // local PASID range is managed by guest and migrated as VM state
    // global PASIDs are re-allocated and mapped to local PASIDs post migration
    ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);

    // When Qemu captures a new local_pasid allocated through vIOMMU;
    pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
    ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);

    // Tell the kernel to associate local_pasid to pgtbl_ioasid in internal structure;
    // Due to HWID_GLOBAL, the kernel also allocates a global_pasid from the
    // global pool. From now on, every hwid-related operation must be applied
    // to both PASIDs for this page table;
    // global_pasid is returned to userspace in the same field as local_pasid;
    ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &local_pasid);

    // Qemu then registers local_pasid/global_pasid pair to KVM for setting up
    // CPU PASID translation table;
    ioctl(kvm_fd, KVM_SET_PASID_MAPPING, local_pasid, global_pasid);

    // Set guest page table to the RID+{local_pasid, global_pasid} associated 
    // to pgtbl_ioasid;
    ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);
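
The local/global bookkeeping in scenario 3 can be modelled in plain C as
follows. This is only a userspace sketch of the idea, not the proposed kernel
implementation: the names pasid_map_init(), pasid_set_hwid() and
pasid_translate(), the simple bump allocator, and the fixed split at 1024
(taken from the example above) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GLOBAL_MIN    1024u        /* split point from the example: local [0, 1024) */
#define PASID_MAX     (1u << 20)   /* PCIe PASIDs are at most 20 bits wide */
#define INVALID_PASID UINT32_MAX

/* Hypothetical per-VM table, indexed by local (guest-visible) PASID. */
static uint32_t local_to_global[GLOBAL_MIN];
static uint32_t next_global = GLOBAL_MIN;

/* Mark every local PASID as unmapped and reset the global allocator. */
static void pasid_map_init(void)
{
    for (uint32_t i = 0; i < GLOBAL_MIN; i++)
        local_to_global[i] = INVALID_PASID;
    next_global = GLOBAL_MIN;
}

/*
 * Models the HWID_GLOBAL behaviour of IOASID_SET_HWID: the caller passes a
 * local PASID and gets back the global PASID paired with it.
 */
static uint32_t pasid_set_hwid(uint32_t local)
{
    assert(next_global >= GLOBAL_MIN);

    if (local >= GLOBAL_MIN)
        return INVALID_PASID;              /* guest must stay in the local range */
    if (local_to_global[local] != INVALID_PASID)
        return local_to_global[local];     /* pair already exists */
    if (next_global >= PASID_MAX)
        return INVALID_PASID;              /* global pool exhausted */

    local_to_global[local] = next_global++;
    return local_to_global[local];
}

/* Models the CPU-side local->global lookup used on the ENQCMD path. */
static uint32_t pasid_translate(uint32_t local)
{
    return local < GLOBAL_MIN ? local_to_global[local] : INVALID_PASID;
}
```

After migration, only the local PASIDs would be carried over as VM state; the
destination would call pasid_map_init() and then pasid_set_hwid() again for
each live local PASID to obtain fresh global values.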

-----
Notes:

I tried to keep common commands in generic format for all scenarios, while
introducing new commands for usage-specific requirements. Everything is
made explicit now.

The userspace has sufficient information to choose its desired scheme based 
on vIOMMU types and platform information (e.g. whether ENQCMD is exposed 
in virtual CPUID, whether assigned devices support DMWr, etc.). 

The above example assumes one RID per bound page table, because vIOMMU
identifies new guest page tables per-RID. If there are other usages requiring
multiple RIDs per page table, SET_HWID/BIND_PGTABLE could accept
another device_handle parameter to specify which RID is targeted for this
operation.

When considering SIOV/mdev there is no change to the above uAPI sequence.
It's n/a for 1) as SIOV requires the PASID table in HPA space, nor does it
cause any change to 3) regarding the split-range scheme. The only
conceptual change is in 2), where although it's still "PASID per RID" the
PASIDs must be managed by host because the parent driver also allocates 
PASIDs from per-RID space to mark mdev (RID+PASID). But this difference 
doesn't change the uAPI flow - just treat user-provisioned PASID as 'virtual' 
and then allocate a 'real' PASID at IOASID_SET_HWID. Later always use the 
real one when programming PASID entry (IOASID_BIND_PGTABLE) or device 
PASID register (converted in the mediation path).

If all of the above can work reasonably, we don't even need the special VCMD
interface in VT-d for the guest to allocate PASIDs from the host. Just always
let the guest manage its PASIDs (restricted to the available local PASIDs),
whether it uses a global allocator or a per-RID allocator. The vIOMMU side
just sticks to the per-RID emulation according to the spec.

Yeah, if this scheme for scenario 3) is good, we may limit the range of
local PASIDs by limiting the PASID bit width of the vIOMMU. QEMU can get the
local PASID allocated by the guest IOMMU when the guest does PASID cache
invalidation.
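
As a rough sketch of that width limit: given the split point passed to
IOASID_HWID_SET_GLOBAL_MIN, the largest PASID width the vIOMMU could advertise
is the largest w with 2^w <= global_min. The helper name viommu_pasid_width()
is hypothetical, and how the width is actually encoded in the vIOMMU
capability registers is not covered here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the largest PASID bit width w such that the local range [0, 2^w)
 * stays below global_min, or -1 if no local PASID fits (global_min == 0).
 * The result is capped at 20 bits, the maximum PCIe PASID width.
 */
static int viommu_pasid_width(uint32_t global_min)
{
    int width = 0;

    if (global_min == 0)
        return -1;

    while (width < 20 && (UINT32_C(1) << (width + 1)) <= global_min)
        width++;

    /* Every PASID representable in 'width' bits is below global_min. */
    assert((UINT32_C(1) << width) <= global_min);
    return width;
}
```

With the earlier example split of 1024 this gives a 10-bit local PASID space,
i.e. [0, 1024). A split point that is not a power of two, such as 1000, would
round down to 9 bits and leave PASIDs 512..999 unused.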

-- 
Regards,
Yi Liu