Thread: 53 messages, 5 authors, 2018-02-23

Re: [PATCH v7 6/6] drm/msm: iommu: Replace runtime calls with runtime suppliers

From: Tomasz Figa <tfiga@chromium.org>
Date: 2018-02-22 14:12:19
Also in: dri-devel, linux-arm-msm, linux-iommu, linux-pm, lkml

On Thu, Feb 22, 2018 at 10:45 PM, Robin Murphy [off-list ref] wrote:
[sorry, I had intended to reply sooner but clearly forgot]


On 16/02/18 00:13, Tomasz Figa wrote:
On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy [off-list ref] wrote:
On 15/02/18 04:17, Tomasz Figa wrote:
[...]

Could you elaborate on what kind of locking you are concerned about?
As I explained before, the normally happening fast path would lock
dev->power_lock only for the brief moment of incrementing the runtime
PM usage counter.


My bad, that's not even it.

The atomic usage counter is incremented beforehand, without any
locking [1], and the spinlock is acquired only for the sake of
validating that the device's runtime PM state indeed remained valid [2],
which would be the case in the fast path of the same driver doing two
mappings in parallel, with the master powered on (and so the SMMU,
through device links; if master was not powered on already, powering
on the SMMU is unavoidable anyway and it would add much more latency
than the spinlock itself).


We now have no locking at all in the map path, and only a per-domain lock
around TLB sync in unmap which is unfortunately necessary for correctness;
the latter isn't too terrible, since in "serious" hardware it should only be
serialising a few CPUs serving the same device against each other (e.g. for
multiple queues on a single NIC).

Putting in a global lock which serialises *all* concurrent map and unmap
calls for *all* unrelated devices makes things worse. Period. Even if the
lock itself were held for the minimum possible time, i.e. trivially
"spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing that
one cache line around between 96 CPUs across two sockets is not negligible.

Fair enough. Note that we're in a quite interesting situation now:
  a) We need to have runtime PM enabled on Qualcomm SoC to have power
properly managed,
  b) We need to have lock-free map/unmap on such distributed systems,
  c) If runtime PM is enabled, we need to call into runtime PM from any
code that does hardware accesses, otherwise the IOMMU API (and so DMA
API and then any V4L2 driver) becomes unusable.

I can see one more way that could potentially let us have all the
three. How about enabling runtime PM only on selected implementations
(e.g. qcom,smmu) and then having all the runtime PM calls surrounded
with if (pm_runtime_enabled()), which is lockless?

Yes, that's the kind of thing I was gravitating towards - my vague thought
was adding some flag to the smmu_domain, but pm_runtime_enabled() does look
conceptually a lot cleaner.

Great, thanks. Looks like we're in agreement now. \o/
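
For reference, the guarded-call idea could look roughly like the sketch
below. It is a userspace mock-up rather than a driver patch: struct
mock_device, the mock_pm_* functions and the smmu_rpm_* wrappers are all
hypothetical names standing in for struct device, pm_runtime_enabled(),
pm_runtime_get_sync() and pm_runtime_put().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct device. */
struct mock_device {
	bool pm_enabled;  /* what pm_runtime_enabled() would report */
	int usage_count;  /* models the runtime PM usage counter */
	int pm_calls;     /* how often the runtime PM core was entered */
};

/* Stand-in for pm_runtime_enabled(): a plain read, no lock taken. */
static bool mock_pm_runtime_enabled(const struct mock_device *dev)
{
	return dev->pm_enabled;
}

/* Stand-in for pm_runtime_get_sync(): the call we want to skip. */
static int mock_pm_runtime_get_sync(struct mock_device *dev)
{
	dev->pm_calls++;
	dev->usage_count++;
	return 0;
}

/* Stand-in for pm_runtime_put(). */
static void mock_pm_runtime_put(struct mock_device *dev)
{
	dev->pm_calls++;
	dev->usage_count--;
}

/* Hypothetical wrapper: only enter runtime PM when it is enabled. */
static int smmu_rpm_get(struct mock_device *dev)
{
	assert(dev != NULL);
	if (mock_pm_runtime_enabled(dev))
		return mock_pm_runtime_get_sync(dev);
	return 0;
}

static void smmu_rpm_put(struct mock_device *dev)
{
	assert(dev != NULL);
	if (mock_pm_runtime_enabled(dev))
		mock_pm_runtime_put(dev);
}

/* Example of a hardware-touching path, e.g. TLB sync during unmap. */
static int smmu_tlb_sync_guarded(struct mock_device *dev)
{
	int ret = smmu_rpm_get(dev);

	if (ret < 0)
		return ret;
	/* ... register accesses would go here ... */
	smmu_rpm_put(dev);
	return 0;
}
```

With pm_enabled left false (the default for implementations that do not
opt in), smmu_tlb_sync_guarded() never reaches the mock runtime PM calls,
so the map/unmap paths stay free of the extra lock.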

Vivek, does this sound reasonable to you?

Best regards,
Tomasz
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel