[PATCH v2 01/40] iommu: Introduce Shared Virtual Addressing API
From: christian.koenig@amd.com (Christian König)
Date: 2018-09-12 12:56:21
Also in:
kvm, linux-acpi, linux-devicetree, linux-iommu, linux-mm, linux-pci
Am 12.09.2018 um 14:40 schrieb Jean-Philippe Brucker:
On 08/09/2018 08:29, Christian K?nig wrote:quoted
Yes, exactly. I just need a PASID which is never used by the OS for a process and we can easily give that back when the last FD reference is closed.Alright, iommu-sva can get its PASID from this external allocator as well, as long as it has an interface similar to idr. Where would it go, drivers/base/, mm/, kernel/...?
Good question, my initial instinct was to put it under drivers/pci. But AFAIKS now you are supporting SVA implementations which are not based on PCI. So drivers/base sounds like a good place to me.
quoted
quoted
quoted
quoted
The process dies, iommu-sva is notified and calls the mm_exit() function passed by the device driver to iommu_sva_device_init(). In mm_exit() the device driver needs to clear any reference to the PASID in hardware and in its own structures. When the device driver returns from mm_exit(), it effectively tells the core that it has finished using the PASID, and iommu-sva can reuse the PASID for another process. mm_exit() is allowed to block, so the device driver has time to clean up and flush the queues. If the device driver finishes using the PASID before the process exits, it just calls unbind().Exactly that's what Michal Hocko is probably going to not like at all. Can we have a different approach where each driver is informed by the mm_exit(), but needs to explicitly call unbind() before a PASID is reused?It's awful from the IOMMU driver perspective. In addition to "enabled" and "disabled" PASID states, you add "disabled but DMA still running normally". Between that new state and "disabled", the IOMMU will be flooded by translation faults (non-recoverable ones), which it needs to ignore instead of reporting to the kernel. Not all IOMMUs can deal with this in hardware (SMMU and VT-d can quiesce translation faults per-PASID, but I don't think AMD IOMMU can.) Some drivers will have to filter fault events themselves, depending on the PASID state.
Puh, yeah that is probably true. Ok let us skip that for a moment, we just need to invest more work in killing DMA operations quickly when the process address space is teared down.
quoted
quoted
quoted
During that teardown transition it would be ideal if that PASID only points to a dummy root page directory with only invalid entries.I guess this can be vendor specific, In VT-d I plan to mark PASID entry not present and disable fault reporting while draining remaining activities.Sounds good to me. Point is at least in the case where the process was killed by the OOM killer we should not block in mm_exit(). Instead operations issued by the process to a device driver which uses SVA needs to be terminated as soon as possible to make sure that the OOM killer can advance.I don't see how we're preventing the OOM killer from advancing, so I'm looking for a stronger argument that justifies adding this complexity to IOMMU drivers. Time limit of the release MMU notifier, locking requirement, a concrete example where things break, even a comment somewhere in mm/ would do... In my tests I can't manage to disturb the OOM killer, but I could be missing some special case. Even if the mm_exit() callback (unrealistically) sleeps for 60 seconds,
Well you are *COMPLETELY* under estimating this. A compute task with a huge wave launch can take multiple minutes to tear down. That's why I'm so concerned about that, but to be honest I think that just the hardware needs to become better and we need to be able to block dead tasks from spawning threads again. Regards, Christian.
the OOM killer isn't affected because oom_reap_task_mm() wipes the victim's address space in another thread, either before or while the release notifier is running. Thanks, Jean