Re: [PATCH] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support | linux-arm-kernel

On Thu, Jan 16, 2020 at 3:23 PM Auger Eric [off-list ref] wrote:
Hi Rob,

On 1/16/20 5:57 PM, Rob Herring wrote:
On Wed, Jan 15, 2020 at 10:33 AM Auger Eric [off-list ref] wrote:
Hi Rob,

On 1/15/20 3:02 PM, Rob Herring wrote:
On Wed, Jan 15, 2020 at 3:21 AM Auger Eric [off-list ref] wrote:
Hi Rob,

On 1/13/20 3:39 PM, Rob Herring wrote:
Arm SMMUv3.2 adds support for TLB range invalidate operations.
Support for range invalidate is determined by the RIL bit in the IDR3
register.

The range invalidate is in units of the leaf page size and operates on
1-32 chunks of a power of 2 multiple pages. First we determine from the
size what power of 2 multiple we can use and then adjust the granule to
32x that size.

@@ -2022,12 +2043,39 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
              cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
      }

+     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
+             unsigned long tg, scale;
+
+             /* Get the leaf page size */
+             tg = __ffs(smmu_domain->domain.pgsize_bitmap);
it is unclear to me why you can't set tg with the granule parameter.
granule could be 2MB sections if THP is enabled, right?
Ah OK I thought it was a page size and not a block size.

I requested this feature a long time ago for virtual SMMUv3. With
DPDK/VFIO the guest was sending page TLB invalidation for each page
(granule=4K or 64K) part of the hugepage buffer and those were trapped
by the VMM. This stalled qemu.
I did some more testing to make sure THP is enabled, but haven't been
able to get granule to be anything but 4K. I only have the Fast Model
with AHCI on PCI to test this with. Maybe I'm hitting some place where
THPs aren't supported yet.

+             /* Determine the power of 2 multiple number of pages */
+             scale = __ffs(size / (1UL << tg));
+             cmd.tlbi.scale = scale;
+
+             cmd.tlbi.num = CMDQ_TLBI_RANGE_NUM_MAX - 1;
Also could you explain why you use CMDQ_TLBI_RANGE_NUM_MAX.
How's this:
/* The invalidation loop defaults to the maximum range */
I would have expected num=0 directly. Don't we invalidate the @size in
one shot as 2^scale * pages of granularity @tg? I fail to understand
when NUM > 0.
NUM is > 0 anytime size is not a power of 2. For example, if size is
33 pages, then it takes 2 loops doing 32 pages and then 1 page. If
size is 34 pages, then NUM is (17-1) and SCALE is 1.
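
To make the 33- and 34-page examples concrete, here is a minimal user-space sketch of the scheme as described in this reply: a single scale taken from the lowest set bit of the page count, with each command covering up to 32 chunks. The names plan_fixed_scale, struct range_plan, lowest_set_bit and RANGE_NUM_MAX are hypothetical and are not taken from the patch; lowest_set_bit is only a portable stand-in for the kernel's __ffs().

```c
#include <assert.h>

#define RANGE_NUM_MAX 32UL	/* assumption: at most 32 chunks per command */

struct range_plan {
	unsigned int scale;	/* log2 of the chunk size, in pages */
	unsigned long cmds;	/* number of range commands needed */
	unsigned long last_num;	/* NUM field of the last command (chunks - 1) */
};

/* Portable stand-in for __ffs(): index of the lowest set bit. */
static unsigned int lowest_set_bit(unsigned long x)
{
	unsigned int i = 0;

	assert(x != 0);
	while (!(x & 1UL)) {
		x >>= 1;
		i++;
	}
	return i;
}

/* Plan an invalidation of 'pages' leaf pages using one fixed scale. */
static struct range_plan plan_fixed_scale(unsigned long pages)
{
	struct range_plan p;
	unsigned long chunks;

	p.scale = lowest_set_bit(pages);
	chunks = pages >> p.scale;
	/* Every command but the last covers RANGE_NUM_MAX chunks. */
	p.cmds = (chunks + RANGE_NUM_MAX - 1) / RANGE_NUM_MAX;
	p.last_num = (chunks - 1) % RANGE_NUM_MAX;
	return p;
}
```

Under these assumptions, 33 pages gives scale 0 and two commands, while 34 pages gives scale 1 and a single command with NUM = 16.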
OK I get it now. I misread the scale computation as log2() :-(.

I still have a doubt about the scale choice. What if you invalidate a
large number of pages such as 1025 pages. scale is 0 and you end up with
32 * 32 * 2^0 + 1 * 2 * 2^0  invalidations (33). Whereas you could
invalidate the whole range with 2 invalidation commands: 1 x 2^10 +
1 x 2^0 (packing the invalidations by largest scale). Am I correct or am
I still missing something?
No, that's correct. 33 is a lot better than 1025 though. :) 1023 pages
is about the worst case if we assume we get 2MB blocks, but maybe not
a good assumption given our testing so far...

So thinking out loud, I guess we could iterate on power of 2 chunks of
size (in units of pages) like this:

while (size) {
  scale = __fls(size);    /* index of the highest set bit; fls() is 1-based */
  range = 1UL << scale;
  size &= ~range;

  iova += range;
}

But that means NUM is always 0, so also not ideal. So we need to
extract 5 bits from size for NUM on each iteration:

while (size) {
  scale = __ffs(size);
  num = ((size >> scale) & 0x1f) - 1;    /* NUM encodes chunks - 1 */
  size -= (num + 1) * (1UL << scale);

  ...
}

So worst case, we'd have 4 invalidates for up to 4G.
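
As a rough check of that bound, the following user-space sketch counts the commands produced by the 5-bit extraction idea. It assumes size is given in pages, that the extracted 5-bit value is the chunk count (so the hardware NUM field would be that count minus one), and that the granule is 4K, so 4G corresponds to 2^20 pages. count_range_cmds and lowest_set_bit are hypothetical names, with lowest_set_bit standing in for __ffs().

```c
#include <assert.h>

/* Portable stand-in for __ffs(): index of the lowest set bit. */
static unsigned int lowest_set_bit(unsigned long x)
{
	unsigned int i = 0;

	assert(x != 0);
	while (!(x & 1UL)) {
		x >>= 1;
		i++;
	}
	return i;
}

/* Count the range commands needed to cover 'pages' leaf pages. */
static unsigned int count_range_cmds(unsigned long pages)
{
	unsigned int cmds = 0;

	while (pages) {
		unsigned int scale = lowest_set_bit(pages);
		/* 1..31 chunks of 2^scale pages; NUM would be chunks - 1 */
		unsigned long chunks = (pages >> scale) & 0x1f;

		/* Clear the bits just consumed before the next iteration. */
		pages -= chunks << scale;
		cmds++;
	}
	return cmds;
}
```

With these assumptions, 1025 pages takes two commands instead of 33, and a page count with all of its low 20 bits set takes four.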

Besides in the patch I think in the while loop the iova should be
incremented with the actual number of invalidated bytes and not the max
sized granule variable.
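
One way to picture the iova point is a loop that advances iova by exactly the bytes each command covers, so that it ends at the start address plus size. This is only a sketch under the same assumptions as the loop discussed earlier in the thread (5-bit chunk count, tg being log2 of the leaf page size); walk_range is a hypothetical name and no command is actually issued.

```c
#include <assert.h>

/*
 * Walk [iova, iova + size) in range-command sized steps and return the
 * address reached. 'size' must be a multiple of the leaf page size.
 */
static unsigned long long walk_range(unsigned long long iova,
				     unsigned long long size,
				     unsigned int tg)
{
	unsigned long long pages = size >> tg;

	assert((size & ((1ULL << tg) - 1)) == 0);

	while (pages) {
		unsigned int scale = 0;
		unsigned long long chunks, bytes;

		/* Lowest set bit of the remaining page count. */
		while (!((pages >> scale) & 1ULL))
			scale++;
		chunks = (pages >> scale) & 0x1f;

		/* A driver would issue the command for 'iova' here. */
		bytes = chunks << (scale + tg);
		iova += bytes;		/* advance by what was invalidated */
		pages -= chunks << scale;
	}
	return iova;
}
```

Advancing by a fixed maximum-sized step instead would overshoot whenever the last command covers fewer than 32 chunks.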
Ok.

Rob

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
