Thread (26 messages), 4 authors, 2021-02-27

Re: [PATCH v12 10/10] iommu/arm-smmu-v3: Add stall support for platform devices

From: Zhou Wang <wangzhou1@hisilicon.com>
Date: 2021-02-27 03:40:53
Also in: linux-acpi, linux-devicetree, linux-iommu

On 2021/2/27 0:29, Jean-Philippe Brucker wrote:
Hi Zhou,

On Fri, Feb 26, 2021 at 05:43:27PM +0800, Zhou Wang wrote:
On 2021/2/1 19:14, Jean-Philippe Brucker wrote:
Hi Zhou,

On Mon, Feb 01, 2021 at 09:18:42AM +0800, Zhou Wang wrote:
@@ -1033,8 +1076,7 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
 			FIELD_PREP(CTXDESC_CD_0_ASID, cd->asid) |
 			CTXDESC_CD_0_V;
 
-		/* STALL_MODEL==0b10 && CD.S==0 is ILLEGAL */
-		if (smmu->features & ARM_SMMU_FEAT_STALL_FORCE)
+		if (smmu_domain->stall_enabled)
Could we add an SSID check here, like: if (smmu_domain->stall_enabled && ssid)?
Otherwise CD.S will also be set when ssid is 0, which is not needed.
Some drivers may want to get stall events on SSID 0:
https://lore.kernel.org/kvm/20210125090402.1429-1-lushenming@huawei.com/#t

Are you seeing an issue with stall events on ssid 0?  Normally there
shouldn't be any fault on this context, but if they happen and no handler
is registered, the SMMU driver will just abort them and report them like a
non-stall event.
Hi Jean,

I noticed that there is a problem. In my case, I expect CD0 to be used for the
kernel and the other CDs for user space. Normally there shouldn't be any fault
in the kernel; however, we have a RAS case where, for some reason, the hardware
device may make an invalid address access.

So there are at least two different kinds of address access failure: 1. a hardware
RAS problem; 2. a software fault failure (e.g. a process is killed while doing DMA).
They need different handling: for 1, we should reset the hardware device; for 2,
stopping the related DMA is enough.
Right, and in case 2 there should be no report printed since it can be
triggered by user, while you probably want to be loud in case 1.
Currently, if the SMMU returns the same signal (via SMMU resume abort), the master
device driver cannot tell these two cases apart.
This part I don't understand. So the SMMU sends a RESUME(abort) command,
and then the master reports the DMA error to the device driver, which
cannot differentiate 1 from 2?  (I guess there is no SSID in this report?)
But how does disabling stall change this?  The invalid DMA access will
still be aborted by the SMMU.
This is about the hardware design. On the D06 board, an invalid DMA access from
accelerator devices will be aborted, and a hardware error signal will be
returned to the accelerator devices, which report it as a RAS error irq.
For the stall case, the error signal triggered by SMMU resume abort is
also reported as the same RAS error irq. This is the problem on the D06 board.

In the next generation of hardware, a new irq will be added to report SMMU resume
abort information. It works with related registers in the accelerator devices to
identify the hardware queue that needs to be stopped.

So if CD0.S is 1, an invalid DMA access in the kernel will be reported through the
newly added irq above, which does not carry enough information to tell RAS errors
(there are 10+ hardware RAS errors) from SMMU resume abort.
Hypothetically, would it work if all stall events that could not be
handled went to the device driver?  Those reports would contain the SSID
(or lack thereof), so you could reset the device in case 1 and ignore case
2. Though resetting the device in the middle of a stalled transaction
As above, it is hard to tell RAS errors from SMMU resume abort now :(
probably comes with its own set of problems.
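
As an illustration of the hypothetical above, a device driver that received
unhandled stall reports could branch on the SSID roughly as follows. This is
only a sketch: the struct, enum, and function names are invented for
illustration and are not part of the SMMU driver or of this patch. Treating
SSID 0 the same as "no SSID" relies on the assumption stated earlier in the
thread that CD0 is for the kernel and the other CDs are for user space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical report passed to the device driver for an unhandled stall. */
struct sketch_stall_report {
	bool has_ssid;   /* false when the transaction carried no SSID */
	uint32_t ssid;   /* only meaningful when has_ssid is true */
};

/* Hypothetical outcomes, matching "case 1" and "case 2" in the thread. */
enum sketch_action {
	SKETCH_RESET_DEVICE,  /* case 1: fault in the kernel context */
	SKETCH_IGNORE,        /* case 2: fault in a user context */
};

/*
 * Pick an action for an unhandled stall report.
 * Assumption: no SSID, or SSID 0, means the kernel context (CD0).
 */
static enum sketch_action
sketch_handle_unhandled_stall(const struct sketch_stall_report *report)
{
	assert(report);

	if (!report->has_ssid || report->ssid == 0)
		return SKETCH_RESET_DEVICE;

	return SKETCH_IGNORE;
}
```

Whether a reset is safe while a transaction is still stalled is not covered
by this sketch.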
Conceptually, if a CD is used for the kernel, its S bit should not be set.
How about we also add an iommu domain check here: for a DMA domain we do not set
the S bit for CD0, and for an unmanaged domain we set the S bit for all CDs?
I think disabling stall for CD0 of a DMA domain makes sense in general,
even though I don't really understand how that fixes your issue. But
As above, if stall is disabled for CD0, an invalid DMA access will be handled
by the RAS error irq.
someone might come up with a good use-case for receiving stall events on
If a DMA access in the kernel fails, I think there must be a RAS issue :)
So it is better to disable CD0 stall for the DMA domain.
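
The rule being proposed (no S bit for CD0 of a DMA domain, S bit for all CDs
of an unmanaged domain) could be sketched as a small predicate. This is a
standalone illustration, not the patch itself: the types and the function name
are hypothetical stand-ins, and only the stall_enabled flag mirrors a field
that appears in the quoted diff. It also ignores the forced-stall constraint
from the removed comment (STALL_MODEL==0b10 && CD.S==0 is ILLEGAL), which a
real change would have to respect.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the iommu domain type. */
enum sketch_domain_type {
	SKETCH_DOMAIN_DMA,        /* kernel DMA domain, CD0 used by the kernel */
	SKETCH_DOMAIN_UNMANAGED,  /* domain managed by another user of the IOMMU API */
};

/* Hypothetical stand-in for the fields of the SMMU domain used here. */
struct sketch_domain {
	enum sketch_domain_type type;
	bool stall_enabled;       /* mirrors smmu_domain->stall_enabled */
};

/* Return true if CD.S should be set for the context descriptor at ssid. */
static bool sketch_cd_wants_stall(const struct sketch_domain *domain,
				  unsigned int ssid)
{
	assert(domain);

	if (!domain->stall_enabled)
		return false;

	/* Proposed exception: CD0 of a DMA domain never stalls. */
	if (domain->type == SKETCH_DOMAIN_DMA && ssid == 0)
		return false;

	return true;
}
```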

Best,
Zhou
DMA mappings, so I'm wondering whether the alternative solution where we
report unhandled stall events to the device driver would also work for
you.

Thanks,
Jean

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel