Re: [PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq.

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: 2017-08-28 22:15:13

Tejun Heo wrote:

On Mon, Aug 28, 2017 at 02:10:56PM +0200, Michal Hocko wrote:

quoted

I am not sure I understand how WQ_HIGHPRI actually helps. The work item
will get served by a thread with higher priority and from a different
pool than regular WQs. But what prevents the same issue as described
above when the highprio pool gets congested? In other words what make
WQ_HIGHPRI less prone to long stalls when we are under low memory
situation and new workers cannot be allocated?

So, the problem wasn't new worker not getting allocated due to memory
pressure.  Rescuer can handle that.  The problem is that the regular
worker pool is occupied with something which is constantly in runnable
state - most likely writeback / reclaim, so the workqueue doesn't
schedule the other work items.

Setting WQ_HIGHPRI works as highpri worker pool isn't likely to be
contended that way but might not be the best solution.  The right
thing to do would be setting WQ_CPU_INTENSIVE on the work items which
can burn a lot of CPU cycles so that it doesn't get in the way of
other work items (workqueue should probably trigger a warning on these
work items too).

Tetuso, can you please try to find which work items are occupying the
worker pool for an extended period time under memory pressure and set
WQ_CPU_INTENSIVE on them?

Isn't it any work item which does __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
allocation, for doing __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation
burns a lot of CPU cycles under memory pressure? In other words, won't we end up
with setting WQ_CPU_INTENSIVE to almost all workqueues?

----------
[  605.720125] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 208s!
[  605.736025] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 209s!
[  605.746669] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 209s!
[  605.755091] BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 64s!
[  605.763390] Showing busy workqueues and worker pools:
[  605.769436] workqueue events: flags=0x0
[  605.772204]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[  605.775548]     pending: console_callback{197431}, vmw_fb_dirty_flush [vmwgfx]{174896}, sysrq_reinject_alt_sysrq{174440}, push_to_pool{162245}
[  605.780761]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256
[  605.783603]     pending: e1000_watchdog [e1000]{207984}, check_corruption{166511}, rht_deferred_worker{28894}
[  605.787725]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  605.790682]     pending: vmpressure_work_fn{209065}, e1000_watchdog [e1000]{207615}
[  605.794271]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.797150]     pending: vmstat_shepherd{208067}
[  605.799610] workqueue events_long: flags=0x0
[  605.801951]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.805098]     pending: gc_worker [nf_conntrack]{208961}
[  605.807976] workqueue events_freezable: flags=0x4
[  605.810391]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.813151]     pending: vmballoon_work [vmw_balloon]{208085}
[  605.815851] workqueue events_power_efficient: flags=0x80
[  605.818382]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  605.821124]     pending: check_lifetime{64453}
[  605.823337]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.826091]     pending: neigh_periodic_work{199329}
[  605.828426]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  605.831068]     pending: fb_flashcursor{209042}, do_cache_clean{201882}
[  605.833902]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.836545]     pending: neigh_periodic_work{195234}
[  605.838838] workqueue events_freezable_power_: flags=0x84
[  605.841295]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.843824]     pending: disk_events_workfn{208625}
[  605.846084] workqueue mm_percpu_wq: flags=0x8
[  605.848145]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.850667]     pending: drain_local_pages_wq{209047} BAR(4561){209047}
[  605.853368] workqueue writeback: flags=0x4e
[  605.855382]   pwq 128: cpus=0-63 flags=0x4 nice=0 active=1/256
[  605.857793]     in-flight: 354:wb_workfn{182977}
[  605.860182] workqueue xfs-data/sda1: flags=0xc
[  605.862314]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=23/256 MAYDAY
[  605.865010]     in-flight: 3221:xfs_end_io [xfs]{209041}, 3212:xfs_end_io [xfs]{209158}, 29:xfs_end_io [xfs]{209200}, 3230:xfs_end_io [xfs]{209171}, 3229:xfs_end_io [xfs]{209099}, 50:xfs_end_io [xfs]{209099}, 3223:xfs_end_io [xfs]{209045}, 165:xfs_end_io [xfs]{209052}, 3215:xfs_end_io [xfs]{209046}
[  605.874362]     pending: xfs_end_io [xfs]{209011}, xfs_end_io [xfs]{209007}, xfs_end_io [xfs]{209007}, xfs_end_io [xfs]{208999}, xfs_end_io [xfs]{208977}, xfs_end_io [xfs]{208975}, xfs_end_io [xfs]{208970}, xfs_end_io [xfs]{208963}, xfs_end_io [xfs]{208963}, xfs_end_io [xfs]{208950}, xfs_end_io [xfs]{208948}, xfs_end_io [xfs]{208948}, xfs_end_io [xfs]{208946}, xfs_end_io [xfs]{30655}
[  605.886882]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=28/256 MAYDAY
[  605.889661]     in-flight: 3225:xfs_end_io [xfs]{209249}, 236:xfs_end_io [xfs]{209163}, 23:xfs_end_io [xfs]{209151}, 3228:xfs_end_io [xfs]{209151}, 4380:xfs_end_io [xfs]{209259}, 3214:xfs_end_io [xfs]{209240}, 3220:xfs_end_io [xfs]{209212}, 3227:xfs_end_io [xfs]{209233}
[  605.898706]     pending: xfs_end_io [xfs]{209159}, xfs_end_io [xfs]{209149}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209116}, xfs_end_io [xfs]{209110}, xfs_end_io [xfs]{209096}, xfs_end_io [xfs]{209096}, xfs_end_io [xfs]{209092}, xfs_end_io [xfs]{209082}, xfs_end_io [xfs]{209061}, xfs_end_io [xfs]{209058}, xfs_end_io [xfs]{209051}, xfs_end_io [xfs]{209040}, xfs_end_io [xfs]{209021}, xfs_end_io [xfs]{209014}, xfs_end_io [xfs]{30678}
[  605.917299]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=24/256 MAYDAY
[  605.920254]     in-flight: 375(RESCUER):xfs_end_io [xfs]{209194}, 42:xfs_end_io [xfs]{209278}, 3222:xfs_end_io [xfs]{209286}, 17:xfs_end_io [xfs]{209195}, 65:xfs_end_io [xfs]{209241}, 122:xfs_end_io [xfs]{209230}
[  605.927845]     pending: xfs_end_io [xfs]{209187}, xfs_end_io [xfs]{209154}, xfs_end_io [xfs]{209113}, xfs_end_io [xfs]{209088}, xfs_end_io [xfs]{209081}, xfs_end_io [xfs]{209071}, xfs_end_io [xfs]{209070}, xfs_end_io [xfs]{209067}, xfs_end_io [xfs]{209062}, xfs_end_io [xfs]{209053}, xfs_end_io [xfs]{209051}, xfs_end_io [xfs]{209047}, xfs_end_io [xfs]{209032}, xfs_end_io [xfs]{209027}, xfs_end_io [xfs]{209017}, xfs_end_io [xfs]{209016}, xfs_end_io [xfs]{209014}, xfs_end_io [xfs]{209011}
[  605.944773]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=13/256
[  605.947616]     in-flight: 3218:xfs_end_io [xfs]{209268}, 3216:xfs_end_io [xfs]{209265}, 3:xfs_end_io [xfs]{209223}, 33:xfs_end_io [xfs]{209181}, 101:xfs_end_io [xfs]{209159}, 4381:xfs_end_io [xfs]{209294}, 3219:xfs_end_io [xfs]{209181}
[  605.956005]     pending: xfs_end_io [xfs]{209149}, xfs_end_io [xfs]{209141}, xfs_end_io [xfs]{209133}, xfs_end_io [xfs]{209057}, xfs_end_io [xfs]{209026}, xfs_end_io [xfs]{209025}
[  605.963018] workqueue xfs-sync/sda1: flags=0x4
[  605.965455]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.968299]     pending: xfs_log_worker [xfs]{202031}
[  605.970872] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=209s workers=8 manager: 3224
[  605.974252] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=209s workers=6 manager: 3213
[  605.977682] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=209s workers=9 manager: 47
[  605.981015] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=64s workers=10 manager: 3217
[  605.984382] pool 128: cpus=0-63 flags=0x4 nice=0 hung=183s workers=3 idle: 355 356
----------

quoted

If we do want to make
sure that work items on mm_percpu_wq workqueue are executed without delays,
we need to consider using kthread_workers instead of workqueue. (Or, maybe
somehow we can share one kthread with constantly manipulating cpumask?)

Hmm, that doesn't sound like a bad idea to me. We already have a rescuer
thread that basically sits idle all the time so having a dedicated
kernel thread will not be more expensive wrt. resources. So I think this
is a more reasonable approach than playing with WQ_HIGHPRI which smells
like a quite obscure workaround than a real fix to me.

Well, there's one rescuer in the whole system and you'd need
nr_online_cpus kthreads if you wanna avoid constant cacheline
bouncing.

Excuse me, one rescuer kernel thread per each WQ_MEM_RECLAIM workqueue, doesn't it?

My thought is to stop using WQ_MEM_RECLAIM workqueue for mm_percpu_wq and use a
dedicated kernel thread like oom_reaper. Since the frequency of calling handler
function seems to be once per a second for each online CPU, I thought switching
cpumask for NR_CPUS times per a second is tolerable.

Or, yet another approach would be to use split counters

  Each CPU writes up-to-date values to per-CPU counters.
  The aggregator kernel thread reads up-to-date values from per-CPU counters,
  calculates diff between up-to-date values and previous values, saves up-to-date
  values as previous values, and reflects the diff to global counters.

if cost of reading per-CPU counters of online CPUs is smaller than cost of
switching cpumask for each online CPU.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help