Thread (40 messages) 40 messages, 6 authors, 2022-01-20

Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

From: Peter Zijlstra <peterz@infradead.org>
Date: 2021-12-15 17:55:13
Also in: linux-api, lkml

On Wed, Dec 15, 2021 at 01:49:28PM +0000, Matthew Wilcox wrote:
On Wed, Dec 15, 2021 at 11:44:49AM +0100, Peter Zijlstra wrote:
quoted
On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote:
quoted
Anyway, I'll test your patchset over the next week or so and let you
know if anything really needed is missing (other than waking an idle
server if there is one on a worker wakeup; this piece is definitely
needed).
Right, so the problem I'm having is that a single idle server ptr like
before can trivially miss waking annother idle server.

Suppose:

	umcg::idle_server_tid_ptr

Then the enqueue_and_wake() thing from the last patch would:

	idle_server_tid = xchg((pid_t __user *)self->idle_server_tid_ptr, 0);

to consume the tid, and then use that to enqueue and wake. But what if a
second wakeup happens right after that? There might be a second idle
server, but we'll never find it, because userspace hasn't had time to
update the field again.

Alternatively, we do a linked list of servers, but then every such
wakeup needs to iterate the whole list, looking for one that has
UMCG_TF_IDLE set, or something like that, but that lookup is bad for
performance.

So I'm really not sure what way to go yet.
1. Linked lists are fugly and bad for the CPU.
Absolutely.. although a stack might work, except for that ABA issue (and
contention).
2. I'm not sure how big the 'N' in 'M:N' is supposed to be.  Might be
one per hardware thread?  So it could be hundreds-to-thousands,
depending on the scale of system.
Typically yes, one server task per hardware thread. Now, I'm also fairly
sure you don't want excessive cross-node traffic for this stuff, so that
puts a limit on things as well.
3. The interface between user-kernel could be an array of idle tids,
maybe 16 entries long (16 * 4 = 64 bytes, just one cacheline).  As a
server finishes work, it looks for a 0 tid in the batch and stores
its tid in the slot (cmpxchg, I guess, since the array will be shared
between processes).  If there are no free slots in the array, then we
definitely have 16 threads already waiting for work, so it can park itself
in whatever data structure userspace wants to use to manage idle servers.
It's up to userspace to decide when to repopulate the array of available
servers from its data structure of idle servers.
Right, a tid array might work. Could even have userspace specify the
length, then it can do the trade-offs all on it's own. Either a fixed
location for each server and a larger array, or clever things, whatever
they want.

I suppose I'll code up the variable length array, we have space for
that.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help