Re: [PATCH 1/1] NFSD: fix WARN_ON_ONCE in __queue_delayed_work
From: Mike Galbraith <hidden>
Date: 2023-01-11 02:35:22
On Tue, 2023-01-10 at 11:58 -0800, dai.ngo@oracle.com wrote:
> On 1/10/23 11:30 AM, Jeff Layton wrote:
> > Looking over the traces that Mike posted, I suspect this is the real
> > bug, particularly if the server is being restarted during this test.
>
> Yes, I noticed the WARN_ON_ONCE(timer->function != delayed_work_timer_fn)
> too and this seems to indicate some kind of corruption. However, I'm not
> sure if Mike's test restarts the nfs-server service. This could be a bug
> in work queue module when it's under stress.
My reproducer was merely to mount and traverse/md5sum and, while that
was going on, fire up LTP's min_free_kbytes testcase (memory hog from
hell) on the server. Systemthing may well be restarting the server
service in response to an oom-kill. In fact, the struct delayed_work in
question at WARN_ON_ONCE() time didn't look the least bit ready for
business.
FWIW, I had noticed the missing cancel while eyeballing the code, and
stuck one next to the existing one as a Hail Mary, but that helped not
at all.
crash> delayed_work ffff8881601fab48
struct delayed_work {
  work = {
    data = {
      counter = 1
    },
    entry = {
      next = 0x0,
      prev = 0x0
    },
    func = 0x0
  },
  timer = {
    entry = {
      next = 0x0,
      pprev = 0x0
    },
    expires = 0,
    function = 0x0,
    flags = 0
  },
  wq = 0x0,
  cpu = 0
}
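To illustrate why a struct in this state trips the warning in the subject line, here is a small userspace model in C. It is a sketch under stated assumptions, not kernel code: the types and the helpers model_init_delayed_work(), model_queue_delayed_work() and sanity_warnings() are hypothetical stand-ins. They mirror the fields in the dump above and the WARN_ON_ONCE() checks at the top of __queue_delayed_work() (!wq, timer->function != delayed_work_timer_fn, timer_pending(), !list_empty()), assumed here to match recent kernels. The model also has queue_delayed_work() set the PENDING bit (bit 0 of data) before those checks run. That offers one plausible reading of counter = 1 in an otherwise all-zero struct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins for the kernel types in the dump;
 * field names follow the crash output, not the exact kernel layout. */
struct list_head { struct list_head *next, *prev; };
struct hlist_node { struct hlist_node *next, **pprev; };

struct timer_list {
	struct hlist_node entry;
	unsigned long expires;
	void (*function)(struct timer_list *);
	unsigned int flags;
};

struct work_struct {
	unsigned long data;             /* bit 0 models the PENDING bit */
	struct list_head entry;
	void (*func)(struct work_struct *);
};

struct delayed_work {
	struct work_struct work;
	struct timer_list timer;
	void *wq;
	int cpu;
};

/* Stand-in for the kernel's delayed_work_timer_fn(). */
static void delayed_work_timer_fn(struct timer_list *t) { (void)t; }

/* Rough model of INIT_DELAYED_WORK(): empty list head, work callback,
 * and the timer callback that the queueing path expects to find. */
static void model_init_delayed_work(struct delayed_work *dw,
				    void (*fn)(struct work_struct *))
{
	assert(dw != NULL);
	memset(dw, 0, sizeof(*dw));
	dw->work.entry.next = &dw->work.entry;
	dw->work.entry.prev = &dw->work.entry;
	dw->work.func = fn;
	dw->timer.function = delayed_work_timer_fn;
}

/* Counts how many of the modeled sanity checks would warn. */
static int sanity_warnings(const struct delayed_work *dw, const void *wq)
{
	int n = 0;

	if (wq == NULL)
		n++;                            /* WARN_ON_ONCE(!wq) */
	if (dw->timer.function != delayed_work_timer_fn)
		n++;                            /* wrong or missing timer fn */
	if (dw->timer.entry.pprev != NULL)
		n++;                            /* timer_pending() */
	if (dw->work.entry.next != &dw->work.entry)
		n++;                            /* !list_empty() */
	return n;
}

/* Model of queue_delayed_work(): set PENDING first, then run the checks.
 * Returns -1 if the work was already pending (nothing queued). */
static int model_queue_delayed_work(struct delayed_work *dw, void *wq)
{
	if (dw->work.data & 1UL)
		return -1;
	dw->work.data |= 1UL;
	return sanity_warnings(dw, wq);
}
```

Run against a zeroed struct, the model reports two warnings (the timer->function mismatch and the non-empty list check) and leaves data at 1. Against a struct set up by model_init_delayed_work() it reports none, and a second queue attempt is refused because PENDING is already set.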