Thread: 30 messages, 5 authors, 2021-08-10

Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node

From: Ming Lei <hidden>
Date: 2021-07-23 10:21:29
Also in: linux-iommu, linux-nvme, lkml

On Thu, Jul 22, 2021 at 06:40:18PM +0100, Robin Murphy wrote:
On 2021-07-22 16:54, Ming Lei wrote:
[...]
If you are still keen to investigate more, then can try either of these:

- add iommu.strict=0 to the cmdline

- use perf record+annotate to find the hotspot
   - For this you need to enable pseudo-NMI in two steps:
     CONFIG_ARM64_PSEUDO_NMI=y in defconfig
     Add irqchip.gicv3_pseudo_nmi=1

     See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Kconfig#n1745
     Your kernel log should show:
     [    0.000000] GICv3: Pseudo-NMIs enabled using forced ICC_PMR_EL1 synchronisation
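
As a quick way to confirm that the boot parameters suggested above actually took effect, something like the following sketch could be used. The helper names parse_cmdline() and check_params() are made up for illustration, and the parsing is deliberately simple: it splits on whitespace and does not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {parameter: value or None}."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so "root=UUID=..." keeps its value intact.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def check_params(cmdline: str, wanted: dict) -> dict:
    """Return {parameter: True/False} for each wanted parameter/value pair."""
    params = parse_cmdline(cmdline)
    return {key: params.get(key) == value for key, value in wanted.items()}


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        current = f.read()
    # The two parameters mentioned in the suggestions above.
    wanted = {"iommu.strict": "0", "irqchip.gicv3_pseudo_nmi": "1"}
    for key, ok in check_params(current, wanted).items():
        print(f"{key}={wanted[key]}: {'set' if ok else 'NOT set'}")
```

This only checks what was passed on the command line; the kernel log line quoted above is still the confirmation that pseudo-NMI is really enabled.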
OK, will try the above tomorrow.
Thanks, I was also going to suggest the latter, since it's what
arm_smmu_cmdq_issue_cmdlist() does with IRQs masked that should be most
indicative of where the slowness most likely stems from.
The improvement from 'iommu.strict=0' is very small:

[root@ampere-mtjade-04 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd2,gpt2)/vmlinuz-5.14.0-rc2_linus root=UUID=cff79b49-6661-4347-b366-eb48273fe0c1 ro nvme.poll_queues=2 iommu.strict=0

[root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=1530MiB/s][r=392k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=2999: Fri Jul 23 06:05:15 2021
  read: IOPS=392k, BW=1530MiB/s (1604MB/s)(14.9GiB/10001msec)

[root@ampere-mtjade-04 ~]# taskset -c 80 ~/git/tools/test/nvme/io_uring 20 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=150MiB/s][r=38.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3063: Fri Jul 23 06:05:49 2021
  read: IOPS=38.4k, BW=150MiB/s (157MB/s)(3000MiB/20002msec)
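
To put a number on the gap between the two runs above, here is a minimal sketch that pulls the IOPS figure out of the fio summary lines and compares them. parse_iops() is a hypothetical helper, and it assumes fio's IOPS suffixes are decimal (k = 1000).

```python
import re

# Decimal multipliers assumed for fio's IOPS suffixes.
_SUFFIX = {"": 1, "k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_iops(summary_line: str) -> float:
    """Extract the IOPS value from a fio summary line such as
    '  read: IOPS=392k, BW=1530MiB/s (1604MB/s)(14.9GiB/10001msec)'."""
    match = re.search(r"IOPS=([0-9.]+)([kKmMgG]?)", summary_line)
    if match is None:
        raise ValueError(f"no IOPS field in: {summary_line!r}")
    return float(match.group(1)) * _SUFFIX[match.group(2).lower()]


if __name__ == "__main__":
    local = parse_iops("  read: IOPS=392k, BW=1530MiB/s (1604MB/s)(14.9GiB/10001msec)")
    remote = parse_iops("  read: IOPS=38.4k, BW=150MiB/s (157MB/s)(3000MiB/20002msec)")
    print(f"CPU 0: {local:.0f} IOPS, CPU 80: {remote:.0f} IOPS, "
          f"ratio {local / remote:.1f}x")
```

With the two results above, the run pinned to CPU 0 comes out roughly 10x faster than the one pinned to CPU 80, even with iommu.strict=0.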
FWIW I would expect iommu.strict=0 to give a proportional reduction in SMMU
overhead for both cases since it should effectively mean only 1/256 as many
invalidations are issued.

Could you also check whether the SMMU platform devices have "numa_node"
properties exposed in sysfs (and if so whether the values look right), and
share all the SMMU output from the boot log?
No numa_node attribute was found for the SMMU platform devices; the whole
dmesg log is attached.
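
For anyone wanting to repeat the numa_node check, a rough sketch follows. The "*smmu*" name pattern and the helper name smmu_numa_nodes() are assumptions made for illustration; the actual device names depend on the platform and firmware.

```python
import glob
import os


def smmu_numa_nodes(root: str = "/sys/bus/platform/devices") -> dict:
    """Map each platform device whose name contains 'smmu' to its NUMA node.

    The value is None when the device has no numa_node attribute.
    """
    result = {}
    for dev_path in sorted(glob.glob(os.path.join(root, "*smmu*"))):
        attr = os.path.join(dev_path, "numa_node")
        if os.path.isfile(attr):
            with open(attr) as f:
                result[os.path.basename(dev_path)] = int(f.read().strip())
        else:
            result[os.path.basename(dev_path)] = None
    return result


if __name__ == "__main__":
    nodes = smmu_numa_nodes()
    if not nodes:
        print("no platform devices matching *smmu* found")
    for name, node in nodes.items():
        print(f"{name}: numa_node={'<missing>' if node is None else node}")
```

A missing attribute is reported as "<missing>", which matches the situation described above.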


Thanks, 
Ming
