Thread: 27 messages, 4 authors, 2023-01-11

Re: [PATCH 1/1] NFSD: fix WARN_ON_ONCE in __queue_delayed_work

From: Mike Galbraith <hidden>
Date: 2023-01-11 02:35:22

On Tue, 2023-01-10 at 11:58 -0800, dai.ngo@oracle.com wrote:
> On 1/10/23 11:30 AM, Jeff Layton wrote:
> > Looking over the traces that Mike posted, I suspect this is the real
> > bug, particularly if the server is being restarted during this test.
>
> Yes, I noticed the WARN_ON_ONCE(timer->function != delayed_work_timer_fn)
> too and this seems to indicate some kind of corruption. However, I'm not
> sure if Mike's test restarts the nfs-server service. This could be a bug
> in work queue module when it's under stress.

My reproducer was merely to mount and traverse/md5sum and, while that was
going on, fire up LTP's min_free_kbytes testcase (memory hog from hell)
on the server.  Systemthing may well be restarting the server service
in response to oomkill.  In fact, the struct delayed_work in question
at WARN_ON_ONCE() time didn't look the least bit ready for business.

FWIW, I had noticed the missing cancel while eyeballing, and stuck one
next to the existing one as a hail-mary, but that helped not at all.

crash> delayed_work ffff8881601fab48
struct delayed_work {
  work = {
    data = {
      counter = 1
    },
    entry = {
      next = 0x0,
      prev = 0x0
    },
    func = 0x0
  },
  timer = {
    entry = {
      next = 0x0,
      pprev = 0x0
    },
    expires = 0,
    function = 0x0,
    flags = 0
  },
  wq = 0x0,
  cpu = 0
}
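
For anyone following along, here is a minimal userspace sketch (not kernel
code) of why a fully zeroed struct like the one above would trip the check
quoted earlier, WARN_ON_ONCE(timer->function != delayed_work_timer_fn).
Everything with a _model suffix, plus example_work_fn, is a hypothetical
stand-in that keeps only the fields relevant here; it assumes that
initialization is what installs the timer callback, and it does not
reproduce the real workqueue logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; NOT the real definitions. */
struct timer_model {
    void (*function)(struct timer_model *);
    unsigned long expires;
};

struct delayed_work_model {
    void (*func)(struct delayed_work_model *);
    struct timer_model timer;
};

/* Stand-in for delayed_work_timer_fn: only its address matters here. */
static void delayed_work_timer_fn_model(struct timer_model *t)
{
    (void)t;
}

/* Hypothetical work handler, present only so there is something to install. */
static void example_work_fn(struct delayed_work_model *dw)
{
    (void)dw;
}

/* Models initialization: installs both the work and the timer callbacks. */
static void init_delayed_work_model(struct delayed_work_model *dw,
                                    void (*fn)(struct delayed_work_model *))
{
    assert(dw != NULL);
    dw->func = fn;
    dw->timer.function = delayed_work_timer_fn_model;
    dw->timer.expires = 0;
}

/* True when the sanity check quoted in the thread would fire on queueing. */
static bool queue_would_warn(const struct delayed_work_model *dw)
{
    return dw->timer.function != delayed_work_timer_fn_model;
}
```

With a zero-filled struct delayed_work_model (timer.function == NULL, as in
the dump), queue_would_warn() returns true; after
init_delayed_work_model(&dw, example_work_fn) it returns false. That only
shows the struct was not in an initialized state when it was queued; it says
nothing about how it got that way.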