Re: [PATCH v2] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support | linux-arm-kernel

quoted

On Thu, Jan 30, 2020 at 9:06 AM Auger Eric [off-list ref] wrote:
Hi Rob,
On 1/17/20 10:16 PM, Rob Herring wrote:
Arm SMMUv3.2 adds support for TLB range invalidate operations.
Support for range invalidate is determined by the RIL bit in the IDR3
register.

The range invalidate is in units of the leaf page size and operates on
1-32 chunks of a power of 2 multiple pages. First, we determine from the
size what power of 2 multiple we can use. Then we calculate how many
chunks (1-31) of the power of 2 size for the range on the iteration. On
each iteration, we move up in size by at least 5 bits.

Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <redacted>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Rob Herring <robh@kernel.org>
---
 drivers/iommu/arm-smmu-v3.c | 66 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

@@ -2003,7 +2024,7 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
 {
      u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
      struct arm_smmu_device *smmu = smmu_domain->smmu;
-     unsigned long start = iova, end = iova + size;
+     unsigned long start = iova, end = iova + size, num_pages = 0, tg = 0;
      int i = 0;
      struct arm_smmu_cmdq_ent cmd = {
              .tlbi = {
@@ -2022,12 +2043,50 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
              cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
      }

+     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
+             /* Get the leaf page size */
+             tg = __ffs(smmu_domain->domain.pgsize_bitmap);
+
+             /* Convert page size of 12,14,16 (log2) to 1,2,3 */
+             cmd.tlbi.tg = ((tg - ilog2(SZ_4K)) / 2) + 1;
+
+             /* Determine what level the granule is at */
+             cmd.tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));
+
+             num_pages = size / (1UL << tg);
+     }
+
      while (iova < end) {
              if (i == CMDQ_BATCH_ENTRIES) {
                      arm_smmu_cmdq_issue_cmdlist(smmu, cmds, i, false);
                      i = 0;
              }

+             if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
+                     /*
+                      * On each iteration of the loop, the range is 5 bits
+                      * worth of the aligned size remaining.
+                      * The range in pages is:
+                      *
+                      * range = (num_pages & (0x1f << __ffs(num_pages)))
+                      */
+                     unsigned long scale, num;
+
+                     /* Determine the power of 2 multiple number of pages */
+                     scale = __ffs(num_pages);
+                     cmd.tlbi.scale = scale;
+
+                     /* Determine how many chunks of 2^scale size we have */
+                     num = (num_pages >> scale) & CMDQ_TLBI_RANGE_NUM_MAX;
+                     cmd.tlbi.num = num - 1;
+
+                     /* range is num * 2^scale * pgsize */
+                     granule = num << (scale + tg);
+
+                     /* Clear out the lower order bits for the next iteration */
+                     num_pages -= num << scale;
Regarding the 2 options given in
https://lore.kernel.org/linux-arm-kernel/CAL_JsqKABoE+0crGwyZdNogNgEoG=MOOpf6deQgH6s73c0UNdA@mail.gmail.com/raw (local),

I understand you implemented 2) but I still do not understand why you
preferred that one against 1).

In your case of 1023*4k pages this will invalidate by 31 32*2^0*4K +
31*2^0*4K pages
whereas you could achieve that with 10 invalidations with the 1st algo.
I did not get the case where it is more efficient. Please can you detail.
No, it's only 2 commands. We do 31*4K and then 31*2^5*4K. Here's a the
output of a test case:

iova=10001000, num_pages=0x3e0, granule=1f000, num=31, scale=0, ttl=3
iova=10020000, num_pages=0x0, granule=3e0000, num=31, scale=5, ttl=3

(num_pages being what's left at end of the loop)

As I mentioned on v1, worst case is 4 commands for up to 4GB. It's
20-bits of size (32-12) and each loop processes a minimum of 5 bits.
Each loop becomes a larger aligned size, so scale goes up each pass.
This is what I tried to explain in the top comment.

Rob

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help