Re: PROBLEM: double fault in md_end_io

From: Paweł Wiejacha <hidden>
Date: 2021-04-15 15:36:19

Pawel, have you tried to repro with md-next branch?

Yes, and it also crashes:

$ git log --decorate --format=oneline
ca882ba4c0478c68f927fa7b622ec17bde9361ce (HEAD -> md-next-pw2) using
slab_pool for md_io_pool
2715e61834586cef8292fcaa457cbf2da955a3b8 (song-md/md-next) md/bitmap:
wait for external bitmap writes to complete during tear down
c84917aa5dfc4c809634120fb429f0ff590a1f75 md: do not return existing
mddevs from mddev_find_or_alloc

<0>[ 2086.596361] traps: PANIC: double fault, error_code: 0x0
<4>[ 2086.596364] double fault: 0000 [#1] SMP NOPTI
<4>[ 2086.596365] CPU: 40 PID: 0 Comm: swapper/40 Tainted: G
OE     5.12.0-rc3-md-next-pw-fix2 #1
<4>[ 2086.596365] Hardware name: empty S8030GM2NE/S8030GM2NE, BIOS
V4.00 03/11/2021
<4>[ 2086.596366] RIP: 0010:__slab_free+0x26/0x380
<4>[ 2086.596367] Code: 1f 44 00 00 0f 1f 44 00 00 55 49 89 ca 45 89
c3 48 89 e5 41 57 41 56 49 89 fe 41 55 41 54 49 89 f4 53 48 83 e4 f0
48 83 ec 70 <48> 89 54 24 28 0f 1f 44 00 00 41 8b 46 28 4d 8b 6c 24 20
49 8b 5c
<4>[ 2086.596368] RSP: 0018:ffff96fa4d1c8f90 EFLAGS: 00010086
<4>[ 2086.596369] RAX: ffff892f4fc35300 RBX: ffff890fdca3ff78 RCX:
ffff890fdca3ff78
<4>[ 2086.596370] RDX: ffff890fdca3ff78 RSI: fffff393c1728fc0 RDI:
ffff892fd3a9b300
<4>[ 2086.596370] RBP: ffff96fa4d1c9030 R08: 0000000000000001 R09:
ffffffffb66500a7
<4>[ 2086.596371] R10: ffff890fdca3ff78 R11: 0000000000000001 R12:
fffff393c1728fc0
<4>[ 2086.596371] R13: fffff393c1728fc0 R14: ffff892fd3a9b300 R15:
0000000000000000
<4>[ 2086.596372] FS:  0000000000000000(0000)
GS:ffff892f4fc00000(0000) knlGS:0000000000000000
<4>[ 2086.596373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 2086.596373] CR2: ffff96fa4d1c8f88 CR3: 00000010c8210000 CR4:
0000000000350ee0
<4>[ 2086.596373] Call Trace:
<4>[ 2086.596374]  <IRQ>
<4>[ 2086.596374]  ? kmem_cache_free+0x3d2/0x420
<4>[ 2086.596374]  ? mempool_free_slab+0x17/0x20
<4>[ 2086.596375]  ? mempool_free_slab+0x17/0x20
<4>[ 2086.596375]  ? mempool_free+0x2f/0x80
<4>[ 2086.596376]  ? md_end_io+0x47/0x60
<4>[ 2086.596376]  ? bio_endio+0xee/0x140
<4>[ 2086.596376]  ? bio_chain_endio+0x2d/0x40
<4>[ 2086.596377]  ? md_end_io+0x59/0x60
<4>[ 2086.596377]  ? bio_endio+0xee/0x140
<4>[ 2086.596378]  ? bio_chain_endio+0x2d/0x40
...
<4>[ 2086.596485] Lost 340 message(s)!

Dumping last: `watch -n0.1 ...`:

stat 451 0 13738 76 67823521 0 17162428974 801228700 351 1886784
801228776 0 0 0 0 0 0
inflight 0 354

md_io/aliases 4
md_io/align 8
md_io/cache_dma 0
md_io/cpu_partial 30
md_io/cpu_slabs 1235 N0=288 N1=239 N2=463 N3=245
md_io/destroy_by_rcu 0
md_io/hwcache_align 0
md_io/min_partial 5
md_io/objects 126582 N0=26928 N1=28254 N2=44064 N3=27336
md_io/object_size 40
md_io/objects_partial 0
md_io/objs_per_slab 102
md_io/order 0
md_io/partial 8 N2=4 N3=4
md_io/poison 0
md_io/reclaim_account 0
md_io/red_zone 0
md_io/remote_node_defrag_ratio 100
md_io/sanity_checks 0
md_io/slabs 1249 N0=264 N1=277 N2=436 N3=272
md_io/slabs_cpu_partial 1172(1172) C0=14(14) C1=17(17) C2=17(17)
C3=17(17) C4=18(18) C5=17(17) C6=18(18) C7=17(17) C8=21(21) C9=4(4)
C10=21(21) C11=22(22) C12=21(21) C13=25(25) C14=15(15) C15
=19(19) C16=24(24) C17=19(19) C18=22(22) C19=15(15) C20=28(28)
C21=18(18) C22=18(18) C23=24(24) C24=17(17) C25=27(27) C26=8(8)
C27=18(18) C28=17(17) C29=18(18) C30=21(21) C31=21(21) C32=9(9)
C33=9(9) C34=24(24) C35=11(11) C36=19(19) C37=14(14) C38=18(18)
C39=4(4) C40=23(23) C41=20(20) C42=18(18) C43=22(22) C44=21(21)
C45=24(24) C46=20(20) C47=15(15) C48=17(17) C49=15(15) C50=16(1
6) C51=26(26) C52=21(21) C53=19(19) C54=16(16) C55=18(18) C56=26(26)
C57=14(14) C58=18(18) C59=23(23) C60=18(18) C61=20(20) C62=19(19)
C63=17(17)
md_io/slab_size 40
md_io/store_user 0
md_io/total_objects 127398 N0=26928 N1=28254 N2=44472 N3=27744
md_io/trace 0
md_io/usersize 0

I am not able to reproduce the issue [...]

It crashes on two different machines. Next week I'm going to upgrade a
distro on an older machine (with Intel NVMe disks, different
motherboard and Xeon instead of EPYC 2 CPU) running currently
linux-5.4 without this issue. So I will let you know if switching to a
newer kernel with "improved io stats accounting" makes it unstable or
not.

Best regards,
Pawel Wiejacha.

On Thu, 15 Apr 2021 at 08:35, Song Liu [off-list ref] wrote:

On Wed, Apr 14, 2021 at 5:36 PM Song Liu [off-list ref] wrote:

quoted

On Tue, Apr 13, 2021 at 5:05 AM Paweł Wiejacha
[off-list ref] wrote:

quoted

Hello Song,

That code does not compile, but I guess that what you meant was
something like this:

Yeah.. I am really sorry for the noise.

quoted

diff --git drivers/md/md.c drivers/md/md.c
index 04384452a..cbc97a96b 100644
--- drivers/md/md.c
+++ drivers/md/md.c

@@ -78,6 +78,7 @@ static DEFINE_SPINLOCK(pers_lock);

 static struct kobj_type md_ktype;

+struct kmem_cache *md_io_cache;
 struct md_cluster_operations *md_cluster_ops;
 EXPORT_SYMBOL(md_cluster_ops);
 static struct module *md_cluster_mod;

@@ -5701,8 +5702,8 @@ static int md_alloc(dev_t dev, char *name)
         */
        mddev->hold_active = UNTIL_STOP;

-   error = mempool_init_kmalloc_pool(&mddev->md_io_pool, BIO_POOL_SIZE,
-                     sizeof(struct md_io));
+   error = mempool_init_slab_pool(&mddev->md_io_pool, BIO_POOL_SIZE,
+                     md_io_cache);
    if (error)
        goto abort;

@@ -9542,6 +9543,10 @@ static int __init md_init(void)
 {
    int ret = -ENOMEM;

+   md_io_cache = KMEM_CACHE(md_io, 0);
+   if (!md_io_cache)
+       goto err_md_io_cache;
+
    md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
    if (!md_wq)
        goto err_wq;

@@ -9578,6 +9583,8 @@ static int __init md_init(void)
 err_misc_wq:
    destroy_workqueue(md_wq);
 err_wq:
+   kmem_cache_destroy(md_io_cache);
+err_md_io_cache:
    return ret;
 }

@@ -9863,6 +9870,7 @@ static __exit void md_exit(void)
    destroy_workqueue(md_rdev_misc_wq);
    destroy_workqueue(md_misc_wq);
    destroy_workqueue(md_wq);
+   kmem_cache_destroy(md_io_cache);
 }

 subsys_initcall(md_init);

[...]

quoted

$ watch -n0.2 'cat /proc/meminfo | paste - - | tee -a ~/meminfo'
MemTotal:       528235648 kB    MemFree:        20002732 kB
MemAvailable:   483890268 kB    Buffers:            7356 kB
Cached:         495416180 kB    SwapCached:            0 kB
Active:         96396800 kB     Inactive:       399891308 kB
Active(anon):      10976 kB     Inactive(anon):   890908 kB
Active(file):   96385824 kB     Inactive(file): 399000400 kB
Unevictable:       78768 kB     Mlocked:           78768 kB
SwapTotal:             0 kB     SwapFree:              0 kB
Dirty:          88422072 kB     Writeback:        948756 kB
AnonPages:        945772 kB     Mapped:            57300 kB
Shmem:             26300 kB     KReclaimable:    7248160 kB
Slab:            7962748 kB     SReclaimable:    7248160 kB
SUnreclaim:       714588 kB     KernelStack:       18288 kB
PageTables:        10796 kB     NFS_Unstable:          0 kB
Bounce:                0 kB     WritebackTmp:          0 kB
CommitLimit:    264117824 kB    Committed_AS:   21816824 kB
VmallocTotal:   34359738367 kB  VmallocUsed:      561588 kB
VmallocChunk:          0 kB     Percpu:            65792 kB
HardwareCorrupted:     0 kB     AnonHugePages:         0 kB
ShmemHugePages:        0 kB     ShmemPmdMapped:        0 kB
FileHugePages:         0 kB     FilePmdMapped:         0 kB
HugePages_Total:       0        HugePages_Free:        0
HugePages_Rsvd:        0        HugePages_Surp:        0
Hugepagesize:       2048 kB     Hugetlb:               0 kB
DirectMap4k:      541000 kB     DirectMap2M:    11907072 kB
DirectMap1G:    525336576 kB

And thanks for these information.

I have set up a system to run the test, the code I am using is the top of the
md-next branch. I will update later tonight on the status.

I am not able to reproduce the issue after 6 hours. Maybe it is because I run
tests on 3 partitions of the same nvme SSD. I will try on a different host with
multiple SSDs.

Pawel, have you tried to repro with md-next branch?

Thanks,
Song

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help