Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

From: Peter Oskolkov <hidden>
Date: 2021-12-15 23:31:22
Also in: linux-api, lkml

On Wed, Dec 15, 2021 at 3:16 PM Peter Zijlstra [off-list ref] wrote:

On Wed, Dec 15, 2021 at 01:04:33PM -0800, Peter Oskolkov wrote:

On Wed, Dec 15, 2021 at 10:25 AM Peter Zijlstra [off-list ref] wrote:

On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote:

On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra [off-list ref] wrote:

 /*
+ * Enqueue tsk to its server's runnable list and wake the server for pickup if
+ * so desired. Notably, LAZY workers will not wake the server and rely on the
+ * server to do pickup whenever it naturally runs next.

No, I never suggested we needed per-server runnable queues: in all my
patchsets I had a single list of idle (runnable) workers.

This is not about the idle servers..

So without the LAZY thing on, a previously blocked task hitting sys_exit
will enqueue itself on the runnable list and wake the server for pickup.

How can a blocked task hit sys_exit()? Shouldn't it be RUNNING?

Task was RUNNING, hits schedule() after passing through sys_enter().
This marks it BLOCKED. Task wakes again and proceeds to sys_exit(), at
which point it's marked RUNNABLE and put on the runnable list. After
which it'll kick the server to process said list.

Ah, you are talking about the sys_exit hook; sorry, I thought you were
talking about the exit() syscall.
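
For readers following along, the lifecycle described above (RUNNING, then BLOCKED in schedule(), then RUNNABLE at the sys_exit hook, followed by a server kick unless LAZY) can be sketched roughly as below. This is only an illustration in plain userspace C: the names worker_state, struct worker, struct server, on_block() and on_sys_exit() are hypothetical and are not taken from either patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only; not the patch's definitions. */
enum worker_state { W_RUNNING, W_BLOCKED, W_RUNNABLE };

struct worker {
	enum worker_state state;
	struct worker *next;		/* link in the server's runnable list */
};

struct server {
	struct worker *runnable;	/* workers waiting for pickup */
	int wakeups;			/* how often the server was kicked */
};

/* Worker hits schedule() after passing through sys_enter(). */
static void on_block(struct worker *w)
{
	assert(w->state == W_RUNNING);
	w->state = W_BLOCKED;
}

/*
 * Worker woke up and reached the sys_exit hook: mark it RUNNABLE, put it
 * on the runnable list and, unless it is a LAZY worker, kick the server.
 */
static void on_sys_exit(struct server *s, struct worker *w, int lazy)
{
	if (w->state != W_BLOCKED)
		return;			/* never blocked, nothing to do */

	w->state = W_RUNNABLE;
	w->next = s->runnable;
	s->runnable = w;
	if (!lazy)
		s->wakeups++;
}
```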

[...]

Well, that's *your* use-case. I'm fairly sure there's more people that
want to use this thing.

multiple priorities and work isolation: these are easy to address directly with
a scheduler that has a global view rather than multiple
per-cpu/per-server schedulers/queues that try to coordinate.

You can trivially create this, even if the underlying thing is
per-server. Simply have a lock and shared data structure between the
servers.

Even in the kernel, it should be mostly trivial to create a global
policy. The only tricky bit (in the kernel) is the whole affinity muck,
but userspace doesn't *need* to do even that.

LAZY enables that.. *however* it does need to wake the server when it is
idle, otherwise they'll all sit there waiting for one another.

If all servers are busy running workers, then it is not up to the
kernel to "preempt" them in my model: the userspace can set up another
thread/task to preempt a misbehaving worker, which will wake the
server attached to it.

So the way I'm seeing things is that the server *is* the 'CPU'. A UP
machine cannot rely on another CPU to make preemption happen.

Also, preemption is very much not about misbehaviour. Wakeup can cause a
preemption event if the woken task is deemed higher priority than the
current running one for example.

And time based preemption is definitely also a thing wrt resource
distribution.
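
As a rough illustration of the two preemption triggers mentioned above (wakeup preemption by priority, and time-based preemption), a userspace policy could look something like the sketch below. The struct utask type, its fields, and the convention that a lower prio value means higher priority are assumptions made for this example only.

```c
/* Hypothetical task descriptor; lower prio value means higher priority. */
struct utask {
	int prio;
	unsigned int slice_left;	/* remaining ticks in the time slice */
};

/* Wakeup preemption: the woken task outranks the currently running one. */
static int wakeup_preempts(const struct utask *curr, const struct utask *woken)
{
	return woken->prio < curr->prio;
}

/* Tick preemption: charge one tick, preempt once the slice is used up. */
static int tick_preempts(struct utask *curr)
{
	if (curr->slice_left > 0)
		curr->slice_left--;
	return curr->slice_left == 0;
}
```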

But in practice there are always workers
blocking in the kernel, which wakes their servers, which then reap the
woken/runnable workers list, so well-behaving code does not need this.

This seems to discount pure computational workloads.

And so we need to figure out this high-level thing first: do we go
with the per-server worker queues/lists, or do we go with the approach
I use in my patchset? It seems to me that the kernel-side code in my
patchset is not more complicated than your patchset is shaping up to
be, and some things are actually easier to accomplish, like having a
single idle_server_ptr vs this LAZY and/or server "preemption"
behavior that you have.

Again, I'm OK with having it your way if all needed features are
covered, but I think we should be explicit about why
per-server/per-cpu model is chosen vs the one I proposed, especially
as it seems the kernel side code is not really simpler in the end.

So I went with a UP first approach. I made single server preemption
driven scheduling work first (both tick and wakeup-preemption are
supported).

I agree that the UP approach is better than the LAZY one if we have
per-server/per-cpu worker queues.

The whole LAZY thing is only meant to suppress some of that (notably
wakeup preemption), but you're right in that it's not really nice. I got
it working, but I'm not particularly happy with it either.

Having the sys_enter/sys_exit hooks also made the page pins short lived,
and signals much simpler to handle. You're destroying signals IIUC.


So I see no fundamental reason why userspace cannot do something like:

        struct umcg_task *current = NULL;

        for (;;) {
                self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;

                runnable_ptr = (void *)__atomic_exchange_n(&self->runnable_workers_ptr,
                                                           NULL, __ATOMIC_SEQ_CST);

                pthread_mutex_lock(&global_queue.lock);
                while (runnable_ptr) {
                        next = (void *)runnable_ptr->runnable_workers_ptr;
                        enqueue_task(&global_queue, runnable_ptr);
                        runnable_ptr = next;
                }

                /* complicated bit about current already running goes here */

                current = pick_task(&global_queue);
                self->next_tid = current ? current->tid : 0;
unlock:
                pthread_mutex_unlock(&global_queue.lock);

                ret = sys_umcg_wait(0, 0);

                pthread_mutex_lock(&global_queue.lock);
                /* umcg_wait() didn't switch, make sure to return the task */
                if (self->next_tid) {
                        enqueue_task(&global_queue, current);
                        current = NULL;
                }
                pthread_mutex_unlock(&global_queue.lock);

                // do something with @ret
        }

to get global scheduling and all the contention^Wgoodness related to it.
Except, of course, it's more complicated, but I think the idea's clear
enough.
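
The core of the loop above, taking a per-server pending list with one atomic exchange and splicing it into a shared queue under a lock, can be tried out in isolation. The sketch below is a minimal standalone version under stated assumptions: struct task, struct global_queue, push_pending(), drain_pending() and this pick_task() are hypothetical stand-ins written for the example, and nothing here calls sys_umcg_wait() or touches the real UMCG structures.

```c
#include <stddef.h>
#include <pthread.h>

/* Hypothetical stand-ins for the structures used in the loop above. */
struct task {
	struct task *next;
	int tid;
};

struct global_queue {
	pthread_mutex_t lock;
	struct task *head, *tail;
};

static struct global_queue gq = { PTHREAD_MUTEX_INITIALIZER, NULL, NULL };

/* Producer side: lock-free push onto a server's pending list (LIFO). */
static void push_pending(struct task **list, struct task *t)
{
	struct task *old = __atomic_load_n(list, __ATOMIC_RELAXED);

	do {
		t->next = old;
	} while (!__atomic_compare_exchange_n(list, &old, t, 0,
					      __ATOMIC_RELEASE, __ATOMIC_RELAXED));
}

/* Append to the tail of the shared queue; caller holds q->lock. */
static void enqueue_locked(struct global_queue *q, struct task *t)
{
	t->next = NULL;
	if (q->tail)
		q->tail->next = t;
	else
		q->head = t;
	q->tail = t;
}

/* Take the whole pending list at once and move it to the shared queue. */
static int drain_pending(struct task **list, struct global_queue *q)
{
	struct task *t = __atomic_exchange_n(list, NULL, __ATOMIC_SEQ_CST);
	int n = 0;

	pthread_mutex_lock(&q->lock);
	while (t) {
		struct task *next = t->next;

		enqueue_locked(q, t);
		t = next;
		n++;
	}
	pthread_mutex_unlock(&q->lock);
	return n;
}

/* Pop the head of the shared queue, or return NULL if it is empty. */
static struct task *pick_task(struct global_queue *q)
{
	struct task *t;

	pthread_mutex_lock(&q->lock);
	t = q->head;
	if (t) {
		q->head = t->next;
		if (!q->head)
			q->tail = NULL;
		t->next = NULL;
	}
	pthread_mutex_unlock(&q->lock);
	return t;
}
```

Since the pending list is a LIFO stack, the drain hands tasks to the shared queue most-recently-pushed first; a real policy would presumably reorder them by priority or arrival time.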

Let me spend some time and see if I can make all of this work together
beyond simple tests. With the upcoming holidays and some other things
I am busy with, this may take more than a week, I'm afraid...
