Thread (44 messages) 44 messages, 5 authors, 2022-01-20

Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls

From: Peter Oskolkov <hidden>
Date: 2021-11-29 22:43:41
Also in: linux-mm, lkml

On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra [off-list ref] wrote:
On Sun, Nov 28, 2021 at 04:29:11PM -0800, Peter Oskolkov wrote:
quoted
wait_wake_only is not needed if you have both next_tid and server_tid,
as your patch has. In my version of the patch, next_tid is the same as
server_tid, so the flag is needed to indicate to the kernel that
next_tid is the wakee, not the server.
Ah, okay.
quoted
re: (idle_)server_tid_ptr: it seems that you assume that blocked
workers keep their servers, while in my patch they "lose them" once
they block, and so there should be a global idle server pointer to
wake the server in my scheme (if there is an idle one). The main
difference is that in my approach a server has only a single, running,
worker assigned to it, while in your approach it can have a number of
blocked/idle workers to take care of as well.
Correct; I've been thinking in analogues of the way we schedule CPUs.
Each CPU has a ready/run queue along with the current task.
fundamentally the RUNNABLE tasks need to go somewhere when all servers
are busy. So at that point the previous server is as good a place as
any.

Now, I sympathise with a blocked task not having a relation; I often
argue this same, since we have wakeup balancing etc. And I've not really
thought about how to best do wakeup-balancing, also see below.
quoted
The main difference between our approaches, as I see it: in my
approach if a worker is running, its server is sleeping, period. If we
have N servers, and N running workers, there are no servers to wake
when a previously blocked worker finishes its blocking op. In your
approach, it seems that N servers have each a bunch of workers
pointing at them, and a single worker running. If a previously blocked
worker wakes up, it wakes the server it was assigned to previously,
Right; it does that. It can check the ::state of it's current task,
possibly set TF_PREEMPT or just go back to sleep.
quoted
and so now we have more than N physical tasks/threads running: N
workers and the woken server. This is not ideal: if the process is
affined to only N CPUs, that means a worker will be preempted to let
the woken server run, which is somewhat against the goal of letting
the workers run more or less uninterrupted. This is not deal breaking,
but maybe something to keep in mind.
I suppose it's easy enough to make this behaviour configurable though;
simply enqueue and not wake.... Hmm.. how would this worker know if the
server was 'busy' or not? The whole 'current' thing is a user-space
construct. I suppose that's what your pointer was for? Puts an actual
idle server in there, if there is one. Let me ponder that a bit.
Yes, the idle_server_ptr was there to point to an idle server; this
naturally did wakeup balancing.
However, do note this whole scheme fundamentally has some of that, the
moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
all tasks, they can consume however much time the syscall needs there.

Also, timeout on sys_umcg_wait() gets you the exact same situation (or
worse, multiple running workers).
It should not. Timed out workers should be added to the runnable list
and not become running unless a server chooses so. So sys_umcg_wait()
with a timeout should behave similarly to a normal sleep, in that the
server is woken upon the worker blocking, and upon the worker wakeup
the worker is added to the woken workers list and waits for a server
to run it. The only difference is that in a sleep the worker becomes
BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
time.

Why then have sys_umcg_wait() with a timeout at all, instead of
calling nanosleep()? Because the worker in sys_umcg_wait() can be
context-switched into by another worker, or made running by a server;
if the worker is in nanosleep(), it just sleeps.
quoted
Another big concern I have is that you removed UMCG_TF_LOCKED. I
OOh yes, I forgot to mention that. I couldn't figure out what it was
supposed to do.
quoted
definitely needed it to guard workers during "sched work" in the
userspace in my approach. I'm not sure if the flag is absolutely
needed with your approach, but most likely it is - the kernel-side
scheduler does lock tasks and runqueues and disables interrupts and
migrations and other things so that the scheduling logic is not
hijacked by concurrent stuff. Why do you assume that the userspace
scheduling code does not need similar protections?
I've not yet come across a case where this is needed. Migration for
instance is possible when RUNNABLE, simply write ::server_tid before
::state. Userspace just needs to make sure who actually owns the task,
but it can do that outside of this state.

But like I said; I've not yet done the userspace part (and I lost most
of today trying to install a new machine), so perhaps I'll run into it
soon enough.
The most obvious scenario where I needed locking is when worker A
wants to context switch into worker B, while another worker C wants to
context switch into worker A, and worker A pagefaults. This involves:

worker A context: worker A context switches into worker B:

- worker B::server_tid = worker A::server_tid
- worker A::server_tid = none
- worker A::state = runnable
- worker B::state = running
- worker A::next_tid = worker B
- worker A calls sys_umcg_wait()

worker B context: before the above completes, worker C wants to
context switch into worker A, with similar steps.

"interrupt context": in the middle of the mess above, worker A pagefaults

Too many moving parts. UMCG_TF_LOCKED helped me make this mess
manageable. Maybe without pagefaults clever ordering of the operations
listed above could make things work, but pagefaults mess things badly,
so some kind of "preempt_disable()" for the userspace scheduling code
was needed, and UMCG_TF_LOCKED was the solution I had.


Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help