Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20

From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: 2024-08-06 09:51:36
Also in: linux-acpi, linux-iommu, lkml
Subsystem: networking [general], page pool, the rest · Maintainers: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas, Linus Torvalds

Cc. IOMMU list+maintainers, question below...


On 02/08/2024 04.06, Yonglong Liu wrote:

On 2024/7/31 19:32, Yonglong Liu wrote:

quoted

On 2024/7/31 16:42, Somnath Kotur wrote:

quoted

On Tue, Jul 30, 2024 at 10:51 PM Jesper Dangaard Brouer 
[off-list ref] wrote:

quoted


On 30/07/2024 15.08, Yonglong Liu wrote:

quoted

I found a bug when running hns3 driver with page pool enabled, the log
as below:

[ 4406.956606] Unable to handle kernel NULL pointer dereference at
virtual address 00000000000000a8

struct iommu_domain *iommu_get_dma_domain(struct device *dev)
{
         return dev->iommu_group->default_domain;
}

$ pahole -C iommu_group --hex | grep default_domain
         struct iommu_domain *      default_domain;   /* 0xa8   0x8 */

Looks like iommu_group is a NULL pointer (that when deref member
'default_domain' cause this fault).

quoted

[ 4406.965379] Mem abort info:
[ 4406.968160]   ESR = 0x0000000096000004
[ 4406.971906]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 4406.977218]   SET = 0, FnV = 0
[ 4406.980258]   EA = 0, S1PTW = 0
[ 4406.983404]   FSC = 0x04: level 0 translation fault
[ 4406.988273] Data abort info:
[ 4406.991154]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 4406.996632]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 4407.001681]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 4407.006985] user pgtable: 4k pages, 48-bit VAs, 
pgdp=0000202828326000
[ 4407.013430] [00000000000000a8] pgd=0000000000000000,
p4d=0000000000000000
[ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser 
libiscsi
scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
xhci_pci_renesas uacce libsas [last unloaded: hnae3]
[ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
[ 4407.093343] Workqueue: events page_pool_release_retry
[ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
BTYPE=--)
[ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
[ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
[ 4407.114255] sp : ffff80008bacbc80
[ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
ffffc31806be7000
[ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
0000000000000002
[ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
00000000fcd7c000
[ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
ffff8000d3503c58
[ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
0000000000000001
[ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
000006b10004e7fb
[ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
ffffc3180405cd20
[ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
0000000000000010
[ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
0000000000000002
[ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
0000000000000000
[ 4407.188589] Call trace:
[ 4407.191027]  iommu_get_dma_domain+0xc/0x20
[ 4407.195105]  dma_unmap_page_attrs+0x38/0x1d0
[ 4407.199361]  page_pool_return_page+0x48/0x180
[ 4407.203699]  page_pool_release+0xd4/0x1f0
[ 4407.207692]  page_pool_release_retry+0x28/0xe8

I suspect that the DMA IOMMU part was deallocated and freed by the
driver even-though page_pool still have inflight packets.

When you say driver, which 'driver' do you mean?
I suspect this could be because of the VF instance going away with
this cmd - disable the vf: echo 0 >
/sys/class/net/eno1/device/sriov_numvfs, what do you think?

I found that this happen when the vf enabled and running some packets, 
below is more infomation:


[ 4391.596558] pci 0000:7d:01.0: page_pool_release_retry() stalled 
pool shutdown: id 1145, 33 inflight 906 sec
[ 4397.111484] hns3 0000:bd:00.0: SRIOV setting: 0
[ 4397.118155] hns3 0000:bd:01.0 enp189s0f0v0: link down
[ 4397.416623] hns3 0000:bd:01.0: finished uninitializing hclgevf driver
[ 4397.480572] pci 0000:7d:01.0: page_pool_release_retry() stalled 
pool shutdown: id 1279, 98 inflight 362 sec
[ 4400.948362] hns3 0000:7d:00.0: SRIOV setting: 1
[ 4401.060569] pci 0000:7d:01.0: [19e5:a22f] type 00 class 0x020000 
PCIe Endpoint
[ 4401.067795] pci 0000:7d:01.0: enabling Extended Tags
[ 4401.073090] hns3 0000:7d:01.0: Adding to iommu group 48
[ 4401.078494] hns3 0000:7d:01.0: enabling device (0000 -> 0002)
[ 4401.084348] hns3 0000:7d:01.0: The firmware version is 1.20.0.17
[ 4401.102911] hns3 0000:7d:01.0: finished initializing hclgevf driver
[ 4401.111212] hns3 0000:7d:01.0: using random MAC address 
da:**:**:**:a3:47
[ 4401.138375] hns3 0000:7d:01.0 eno1v0: renamed from eth0
[ 4403.939449] hns3 0000:7d:01.0 eno1v0: link up
[ 4403.940237] 8021q: adding VLAN 0 to HW filter on device eno1v0
[ 4406.956606] Unable to handle kernel NULL pointer dereference at 
virtual address 00000000000000a8


another log:

[11550.197002] hns3 0000:bd:01.0 enp189s0f0v0: link up
[11550.197034] hns3 0000:bd:01.0 enp189s0f0v0: net open
[11550.206910] 8021q: adding VLAN 0 to HW filter on device enp189s0f0v0
[11564.872929] page_pool_release_retry() stalled pool shutdown: id 
2330, 99 inflight 60 sec
[11568.353723] hns3 0000:bd:01.0 enp189s0f0v0: net stop
[11568.360723] hns3 0000:bd:01.0 enp189s0f0v0: link down
[11568.519899] hns3 0000:bd:01.0 enp189s0f0v0: link up
[11568.519935] hns3 0000:bd:01.0 enp189s0f0v0: net open
[11568.529840] 8021q: adding VLAN 0 to HW filter on device enp189s0f0v0
[11589.640930] page_pool_release_retry() stalled pool shutdown: id 
1996, 50 inflight 2054 sec
[11592.554875] hns3 0000:bd:01.0 enp189s0f0v0: net stop
[11592.560930] hns3 0000:bd:01.0 enp189s0f0v0: link down
[11596.684935] pci 0000:7d:01.0: [19e5:a22f] type 00 class 0x020000 
PCIe Endpoint
[11596.692140] pci 0000:7d:01.0: enabling Extended Tags
[11596.697324] hns3 0000:7d:01.0: Adding to iommu group 48
[11596.702988] hns3 0000:7d:01.0: enabling device (0000 -> 0002)
[11596.708808] hns3 0000:7d:01.0: The firmware version is 1.20.0.17
[11596.727263] hns3 0000:7d:01.0: finished initializing hclgevf driver
[11596.735561] hns3 0000:7d:01.0: using random MAC address 
72:**:**:**:55:d7
[11596.760621] hns3 0000:7d:01.0 eno1v0: renamed from eth0
[11599.545341] hns3 0000:7d:01.0 eno1v0: link up
[11599.545409] hns3 0000:7d:01.0 eno1v0: net open
[11599.554858] 8021q: adding VLAN 0 to HW filter on device eno1v0
[11608.908922] Unable to handle kernel NULL pointer dereference at 
virtual address 00000000000000a8

I adds more debug info, and found that:

[ 4573.356891] pci 0000:7d:01.0: page_pool_release_retry() stalled pool 
shutdown: id 1046, 82 inflight 60 sec, iommu_group=0x0

The iommu_group will release whether the page_pool is using it or not, 
so if once page_pool_return_page() was called(why does this occur when 
the device is reloaded and packets are transmitted?) , this crash will 
happen.

I try the follow patch, but doesn't work :(

The idea of taking a refcnt on IOMMU to avoid dev->iommu_group getting
freed, make sense to me.

The question is if API iommu_group_get() and iommu_group_put() is the
correct API to use in this case?

quoted hunk ↗ jump to hunk

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index f4444b4e39e6..d03a87407ca8 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c

@@ -21,6 +21,7 @@
  #include <linux/poison.h>
  #include <linux/ethtool.h>
  #include <linux/netdevice.h>
+#include <linux/iommu.h>

The page_pool already have a system/workqueue that waits for inflight
"packet" pages, and calls struct device API get_device() and put_device().

Why didn't the patch add code together with struct device API?
Like this:

$ git diff

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 2abe6e919224..686ff1d31aff 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c

@@ -265,8 +265,10 @@ static int page_pool_init(struct page_pool *pool,
         /* Driver calling page_pool_create() also call

page_pool_destroy() */
         refcount_set(&pool->user_cnt, 1);

-       if (pool->dma_map)
+       if (pool->dma_map) {
+               iommu_group_get(pool->p.dev);
                 get_device(pool->p.dev);
+       }

         return 0;
  }

@@ -275,8 +277,10 @@ static void page_pool_uninit(struct page_pool *pool)
  {
         ptr_ring_cleanup(&pool->ring, NULL);

-       if (pool->dma_map)
+       if (pool->dma_map) {
+               iommu_group_put(pool->p.dev->iommu_group);
                 put_device(pool->p.dev);
+       }


--Jesper

quoted hunk ↗ jump to hunk

  #include <trace/events/page_pool.h>
 > @@ -306,6 +307,9 @@ page_pool_create_percpu(const struct
page_pool_params *params, int cpuid)
         if (err)
                 goto err_uninit;

+       if (pool->dma_map)
+               iommu_group_get(pool->p.dev);
+
         return pool;

  err_uninit:

@@ -974,8 +978,11 @@ static int page_pool_release(struct page_pool *pool)

         page_pool_scrub(pool);
         inflight = page_pool_inflight(pool, true);
-       if (!inflight)
+       if (!inflight) {
                 __page_pool_destroy(pool);
+               if (pool->dma_map)
+ iommu_group_put(pool->p.dev->iommu_group);
+       }

         return inflight;
  }

quoted

The page_pool bumps refcnt via get_device() + put_device() on the DMA
'struct device', to avoid it going away, but I guess there is also some
IOMMU code that we need to make sure doesn't go away (until all 
inflight
pages are returned) ???

quoted

[ 4407.212119] process_one_work+0x164/0x3e0
[ 4407.216116]  worker_thread+0x310/0x420
[ 4407.219851]  kthread+0x120/0x130
[ 4407.223066]  ret_from_fork+0x10/0x20
[ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
[ 4407.232697] ---[ end trace 0000000000000000 ]---


The hns3 driver use page pool like this, just call once when the 
driver
initialize:

static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
{
      struct page_pool_params pp_params = {
          .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
                  PP_FLAG_DMA_SYNC_DEV,
          .order = hns3_page_order(ring),
          .pool_size = ring->desc_num * hns3_buf_size(ring) /
                  (PAGE_SIZE << hns3_page_order(ring)),
          .nid = dev_to_node(ring_to_dev(ring)),
          .dev = ring_to_dev(ring),
          .dma_dir = DMA_FROM_DEVICE,
          .offset = 0,
          .max_len = PAGE_SIZE << hns3_page_order(ring),
      };

      ring->page_pool = page_pool_create(&pp_params);
      if (IS_ERR(ring->page_pool)) {
          dev_warn(ring_to_dev(ring), "page pool creation failed: 
%ld\n",
               PTR_ERR(ring->page_pool));
          ring->page_pool = NULL;
      }
}

And call page_pool_destroy(ring->page_pool)  when the driver 
uninitialized.


We use two devices, the net port connect directory, and the step of 
the
test case like below:

1. enable a vf of '7d:00.0':  echo 1 >
/sys/class/net/eno1/device/sriov_numvfs

2. use iperf to produce some flows(the problem happens to the side 
which
runs 'iperf -s')

3. use ifconfig down/up to the vf

4. kill iperf

5. disable the vf: echo 0 > /sys/class/net/eno1/device/sriov_numvfs

6. run 1~5 with another port bd:00.0

7. repeat 1~6


And when running this test case, we can found another related 
message (I
replaced pr_warn() to dev_warn()):

pci 0000:7d:01.0: page_pool_release_retry() stalled pool shutdown: id
949, 98 inflight 1449 sec


Even when stop the traffic, stop the test case, disable the vf, this
message is still being printed.

We must run the test case for about two hours to reproduce the 
problem.
Is there some advise to solve or debug the problem?

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help