Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20

From: Matthew Rosato <mjrosato@linux.ibm.com>
Date: 2024-08-13 18:49:55
Also in: linux-iommu, lkml

On 8/6/24 9:35 AM, Niklas Schnelle wrote:

On Mon, 2024-08-05 at 20:19 +0800, Yunsheng Lin wrote:

On 2024/7/31 16:42, Somnath Kotur wrote:

On Tue, Jul 30, 2024 at 10:51 PM Jesper Dangaard Brouer [off-list ref] wrote:

+cc iommu maintainers and list

On 30/07/2024 15.08, Yonglong Liu wrote:

I found a bug when running the hns3 driver with page pool enabled; the
log is below:

[ 4406.956606] Unable to handle kernel NULL pointer dereference at
virtual address 00000000000000a8

struct iommu_domain *iommu_get_dma_domain(struct device *dev)
{
        return dev->iommu_group->default_domain;
}

$ pahole -C iommu_group --hex | grep default_domain
        struct iommu_domain *      default_domain;   /*  0xa8   0x8 */

Looks like iommu_group is a NULL pointer, so dereferencing its member
'default_domain' (at offset 0xa8) causes this fault.
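
To make the arithmetic behind that conclusion concrete, here is a minimal userspace sketch. It assumes only what the pahole output above shows, namely that default_domain sits at offset 0xa8. The names mock_iommu_group, mock_iommu_domain and member_load_address are hypothetical stand-ins, and the padding array replaces the real members of struct iommu_group.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct iommu_domain (contents not needed). */
struct mock_iommu_domain;

/*
 * Hypothetical stand-in for struct iommu_group. Only the offset of
 * default_domain (0xa8, taken from the pahole output) is modelled;
 * the real structure has named members where the padding is.
 */
struct mock_iommu_group {
	unsigned char other_members[0xa8];
	struct mock_iommu_domain *default_domain;
};

/* Compile-time check that the mock reproduces the reported offset. */
static_assert(offsetof(struct mock_iommu_group, default_domain) == 0xa8,
	      "mock layout must match the offset reported by pahole");

/*
 * Address that a load of group->default_domain touches, given the
 * value of the group pointer. Done with integer arithmetic so that no
 * invalid pointer is ever dereferenced.
 */
uintptr_t member_load_address(uintptr_t group_base)
{
	return group_base + offsetof(struct mock_iommu_group, default_domain);
}
```

With a NULL group pointer, member_load_address(0) yields 0xa8, which matches the faulting virtual address 00000000000000a8 in the oops.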

[ 4406.965379] Mem abort info:
[ 4406.968160]   ESR = 0x0000000096000004
[ 4406.971906]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 4406.977218]   SET = 0, FnV = 0
[ 4406.980258]   EA = 0, S1PTW = 0
[ 4406.983404]   FSC = 0x04: level 0 translation fault
[ 4406.988273] Data abort info:
[ 4406.991154]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 4406.996632]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 4407.001681]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 4407.006985] user pgtable: 4k pages, 48-bit VAs, pgdp=0000202828326000
[ 4407.013430] [00000000000000a8] pgd=0000000000000000,
p4d=0000000000000000
[ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
xhci_pci_renesas uacce libsas [last unloaded: hnae3]
[ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
[ 4407.093343] Workqueue: events page_pool_release_retry
[ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
BTYPE=--)
[ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
[ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
[ 4407.114255] sp : ffff80008bacbc80
[ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
ffffc31806be7000
[ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
0000000000000002
[ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
00000000fcd7c000
[ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
ffff8000d3503c58
[ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
0000000000000001
[ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
000006b10004e7fb
[ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
ffffc3180405cd20
[ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
0000000000000010
[ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
0000000000000002
[ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
0000000000000000
[ 4407.188589] Call trace:
[ 4407.191027]  iommu_get_dma_domain+0xc/0x20
[ 4407.195105]  dma_unmap_page_attrs+0x38/0x1d0
[ 4407.199361]  page_pool_return_page+0x48/0x180
[ 4407.203699]  page_pool_release+0xd4/0x1f0
[ 4407.207692]  page_pool_release_retry+0x28/0xe8

I suspect that the DMA IOMMU part was deallocated and freed by the
driver even though page_pool still has inflight packets.

When you say driver, which 'driver' do you mean?
I suspect this could be because of the VF instance going away with
this cmd - disable the vf: echo 0 >
/sys/class/net/eno1/device/sriov_numvfs, what do you think?

The page_pool bumps refcnt via get_device() + put_device() on the DMA
'struct device', to avoid it going away, but I guess there is also some
IOMMU code that we need to make sure doesn't go away (until all inflight
pages are returned) ???

I guess the above is why things went wrong here; the question is which
IOMMU code needs to be called here to stop it from going away.

What I am also curious about is that there should be a pool of allocated
IOVAs in the IOMMU corresponding to the in-flight pages of page_pool.
Shouldn't the IOMMU wait for those allocated IOVAs to be freed, similarly
to how page_pool waits for its in-flight pages?


Is it possible you're using an IOMMU whose driver doesn't yet support
blocking_domain? I'm currently working on an issue on s390 that also
occurs during device removal and is fixed by implementing a blocking
domain in the s390 IOMMU driver (patch forthcoming). The root cause
there is that our domain->ops->attach_dev() fails when, during
hot-unplug, the device is already gone from the platform's point of
view. We then end up with a NULL domain unless we have a blocking
domain, which can handle nonexistent devices and gets set as a fallback
in __iommu_device_set_domain(). In the case I can reproduce, the
backtrace is different[0], but we also saw at least two cases with the
exact same call trace as in the first mail of this thread. So far I
suspected them to be due to the blocking domain issue, but it could be
a separate issue too.

Thanks,
Niklas

A couple of things to follow up on regarding Niklas' statement above...

So first, after further testing on my end, I wanted to clarify that the implementation of a blocked domain is unrelated to this bug.  Sorry for the noise.

Second, it looks like Niklas copied an unrelated backtrace in his report (I snipped it from my reply).

But I wanted to be clear: we can reproduce this same sort of error on s390 using a different device driver (mlx5_core), and the backtrace is almost identical to what this thread is reporting and in the same area.  The most reliable repro method I've found so far is to use a few Mellanox VFs and power one down (echo 0 > /sys/bus/pci/slots/.../power) during or shortly after a TCP workload (iperf3).  I verified that I can reproduce on a kernel as old as the 6.7 release tag; I stopped there because that's the release in which s390 converted to dma-iommu, but I assume the problem has existed longer than that.

Here's a backtrace of a repro on s390 (6.11-rc3):

[  691.860855] Unable to handle kernel pointer dereference in virtual kernel address space
[  691.861089] Failing address: 0706c00180000000 TEID: 0706c00180000803
[  691.861097] Fault in home space mode while using kernel ASCE.
[  691.861118] AS:0000000154fbc007 R3:0000000000000024 
[  691.861153] Oops: 0038 ilc:2 [#1] PREEMPT SMP 
[  691.861161] Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables rpcrdma rdma_ucm rdma_cm iw_cm ib_cm uvdevice s390_trng ism eadm_sch sunrpc tape_34xx tape tape_class mlx5_ib vfio_ap ib_uverbs kvm ib_core zcrypt_cex4 vfio_ccw mdev vfio_iommu_type1 vfio sch_fq_codel loop dm_multipath nfnetlink lcs ctcm fsm mlx5_core ghash_s390 prng chacha_s390 aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common zfcp scsi_transport_fc scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkey zcrypt rng_core autofs4
[  691.861319] CPU: 9 UID: 0 PID: 283 Comm: kworker/9:2 Not tainted 6.11.0-rc3 #1
[  691.861325] Hardware name: IBM 8561 T01 772 (LPAR)
[  691.861329] Workqueue: events page_pool_release_retry
[  691.861342] Krnl PSW : 0704e00180000000 0000013bd3a6ee26 (iommu_iova_to_phys+0x6/0x40)
[  691.861355]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[  691.861363] Krnl GPRS: 0000000080000000 0000000000000000 0706c00180000000 0000000afba90000
[  691.861369]            0000000000001000 0000000000000002 0000000000000022 0000000000000002
[  691.861374]            0000000000001000 0000000afba90000 00000039c0e19000 00000039c0bc9098
[  691.861379]            000000427ac33400 00000039c0e19068 0000013bd3a7655c 000000bbd8d1bb28
[  691.861389] Krnl Code: 0000013bd3a6ee1c: 0707                bcr     0,%r7
[  691.861389]            0000013bd3a6ee1e: 0707                bcr     0,%r7
[  691.861389]           #0000013bd3a6ee20: c004002a8d6c        brcl    0,0000013bd3fc08f8
[  691.861389]           >0000013bd3a6ee26: 58502000            l       %r5,0(%r2)
[  691.861389]            0000013bd3a6ee2a: ec580014047e        cij     %r5,4,8,0000013bd3a6ee52
[  691.861389]            0000013bd3a6ee30: ec58000c007e        cij     %r5,0,8,0000013bd3a6ee48
[  691.861389]            0000013bd3a6ee36: e31020080004        lg      %r1,8(%r2)
[  691.861389]            0000013bd3a6ee3c: e31010400004        lg      %r1,64(%r1)
[  691.861426] Call Trace:
[  691.861430]  [<0000013bd3a6ee26>] iommu_iova_to_phys+0x6/0x40 
[  691.861436]  [<0000013bd2f47a32>] dma_unmap_page_attrs+0x1a2/0x1e0 
[  691.861443]  [<0000013bd3c2b81a>] page_pool_return_page+0x5a/0x130 
[  691.861449]  [<0000013bd3c2cb68>] page_pool_release+0xb8/0x1f0 
[  691.861455]  [<0000013bd3c2ce9c>] page_pool_release_retry+0x2c/0x120 
[  691.861461]  [<0000013bd2e95652>] process_one_work+0x2b2/0x5d0 
[  691.861467]  [<0000013bd2e9625e>] worker_thread+0x20e/0x3f0 
[  691.861473]  [<0000013bd2ea25e2>] kthread+0x152/0x170 
[  691.861478]  [<0000013bd2e135ac>] __ret_from_fork+0x3c/0x60 
[  691.861484]  [<0000013bd3ecf0ca>] ret_from_fork+0xa/0x38 
[  691.861491] INFO: lockdep is turned off.
[  691.861495] Last Breaking-Event-Address:
[  691.861499]  [<0000013bd3a76556>] iommu_dma_unmap_page+0x36/0xb0
[  691.861507] Kernel panic - not syncing: Fatal exception: panic_on_oops
