Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups
From: Peter Oskolkov <hidden>
Date: 2021-12-15 23:31:22
Also in:
linux-api, lkml
On Wed, Dec 15, 2021 at 3:16 PM Peter Zijlstra [off-list ref] wrote:
On Wed, Dec 15, 2021 at 01:04:33PM -0800, Peter Oskolkov wrote:
On Wed, Dec 15, 2021 at 10:25 AM Peter Zijlstra [off-list ref] wrote:
On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote:
On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra [off-list ref] wrote:
/*
+ * Enqueue tsk to it's server's runnable list and wake the server for pickup if
+ * so desired. Notable LAZY workers will not wake the server and rely on the
+ * server to do pickup whenever it naturally runs next.

No, I never suggested we needed per-server runnable queues: in all my patchsets I had a single list of idle (runnable) workers.

This is not about the idle servers.. So without the LAZY thing on, a previously blocked task hitting sys_exit will enqueue itself on the runnable list and wake the server for pickup.

How can a blocked task hit sys_exit()? Shouldn't it be RUNNING?

Task was RUNNING, hits schedule() after passing through sys_enter(). this marks it BLOCKED. Task wakes again and proceeds to sys_exit(), at which point it's marked RUNNABLE and put on the runnable list. After which it'll kick the server to process said list.
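The sequence described above (RUNNING, then BLOCKED in schedule() after sys_enter(), then RUNNABLE at sys_exit() and onto the server's runnable list, with an optional kick) can be modelled in a few lines of plain C. This is only a sketch of the described behaviour, not code from either patchset: the struct layouts, the helper names, the W_FLAG_LAZY value and the `woken` field standing in for a real wakeup are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout: this models the description, not the patch. */
enum worker_state { W_RUNNING, W_BLOCKED, W_RUNNABLE };

#define W_FLAG_LAZY 0x1u	/* assumed flag: do not wake the server on enqueue */

struct server;

struct worker {
	enum worker_state state;
	unsigned int flags;
	struct server *server;
	struct worker *next;		/* link in the server's runnable list */
};

struct server {
	struct worker *runnable;	/* workers waiting for pickup */
	bool woken;			/* stands in for an actual wakeup */
};

/* Worker blocks in schedule() after having passed through sys_enter(). */
static void model_block(struct worker *w)
{
	assert(w->state == W_RUNNING);
	w->state = W_BLOCKED;
}

/* Worker was woken and reaches sys_exit(): mark RUNNABLE, enqueue, kick. */
static void model_sys_exit(struct worker *w)
{
	if (w->state != W_BLOCKED)
		return;			/* never blocked, nothing to hand over */

	w->state = W_RUNNABLE;
	w->next = w->server->runnable;
	w->server->runnable = w;

	/* LAZY workers rely on the server's next natural run for pickup. */
	if (!(w->flags & W_FLAG_LAZY))
		w->server->woken = true;
}
```

Calling model_block() followed by model_sys_exit() on a worker without the LAZY flag leaves it at the head of its server's list with `woken` set; with the flag set, the worker is queued but the server is left alone.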
Ah, you are talking about the sys_exit hook; sorry, I thought you were talking about the exit() syscall. [...]
Well, that's *your* use-case. I'm fairly sure there's more people that want to use this thing.

multiple priorities and work isolation: these are easy to address directly with a scheduler that has a global view rather than multiple per-cpu/per-server schedulers/queues that try to coordinate.

You can trivially create this, even if the underlying thing is per-server. Simply have a lock and shared data structure between the servers. Even in the kernel, it should be mostly trivial to create a global policy. The only tricky bit (in the kernel) is the whole affinity muck, but userspace doesn't *need* to do even that.
LAZY enables that.. *however* it does need to wake the server when it is idle, otherwise they'll all sit there waiting for one another.

If all servers are busy running workers, then it is not up to the kernel to "preempt" them in my model: the userspace can set up another thread/task to preempt a misbehaving worker, which will wake the server attached to it.

So the way I'm seeing things is that the server *is* the 'CPU'. A UP machine cannot rely on another CPU to make preemption happen. Also, preemption is very much not about misbehaviour. Wakeup can cause a preemption event if the woken task is deemed higher priority than the current running one for example. And time based preemption is definitely also a thing wrt resource distribution.
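The two preemption triggers named above, wakeup preemption and time based preemption, can be written down as two small predicates that a server-side scheduler might consult. This is a sketch under stated assumptions, not anything from the patches: the task descriptor, the "lower value wins" priority convention and the 4 ms slice are all invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task descriptor; a lower prio value means more important. */
struct task {
	int prio;
	uint64_t slice_start_ns;	/* when the task was last switched in */
};

#define SLICE_NS 4000000ull		/* assumed 4 ms time slice */

/* Wakeup preemption: the woken task outranks whatever is running now. */
static bool wakeup_preempts(const struct task *curr, const struct task *woken)
{
	assert(woken);
	return !curr || woken->prio < curr->prio;
}

/* Tick preemption: the running task has used up its slice. */
static bool tick_preempts(const struct task *curr, uint64_t now_ns)
{
	return curr && now_ns - curr->slice_start_ns >= SLICE_NS;
}
```

Neither predicate involves misbehaviour on the part of the running task: one is purely about relative priority at wakeup time, the other about dividing run time between tasks.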
But in practice there are always workers blocking in the kernel, which wakes their servers, which then reap the woken/runnable workers list, so well-behaving code does not need this.

This seems to discount pure computational workloads.

And so we need to figure out this high-level thing first: do we go with the per-server worker queues/lists, or do we go with the approach I use in my patchset? It seems to me that the kernel-side code in my patchset is not more complicated than your patchset is shaping up to be, and some things are actually easier to accomplish, like having a single idle_server_ptr vs this LAZY and/or server "preemption" behavior that you have. Again, I'm OK with having it your way if all needed features are covered, but I think we should be explicit about why per-server/per-cpu model is chosen vs the one I proposed, especially as it seems the kernel side code is not really simpler in the end.

So I went with a UP first approach. I made single server preemption driven scheduling work first (both tick and wakeup-preemption are supported).
I agree that the UP approach is better than the LAZY one if we have per-server/per-cpu worker queues.
The whole LAZY thing is only meant to suppress some of that (notably
wakeup preemption), but you're right in that it's not really nice. I got
it working, but I'm not particularly happy with it either.
Having the sys_enter/sys_exit hooks also made the page pins short lived,
and signals much simpler to handle. You're destroying signals IIUC.
So I see no fundamental reason why userspace cannot do something like:

	struct umcg_task *current = NULL;

	for (;;) {
		self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;

		runnable_ptr = (void *)__atomic_exchange_n(&self->runnable_workers_ptr,
							   NULL, __ATOMIC_SEQ_CST);

		pthread_mutex_lock(&global_queue.lock);
		while (runnable_ptr) {
			next = (void *)runnable_ptr->runnable_workers_ptr;
			enqueue_task(&global_queue, runnable_ptr);
			runnable_ptr = next;
		}

		/* complicated bit about current already running goes here */

		current = pick_task(&global_queue);
		self->next_tid = current ? current->tid : 0;
	unlock:
		pthread_mutex_unlock(&global_queue.lock);

		ret = sys_umcg_wait(0, 0);

		pthread_mutex_lock(&global_queue.lock);
		/* umcg_wait() didn't switch, make sure to return the task */
		if (self->next_tid) {
			enqueue_task(&global_queue, current);
			current = NULL;
		}
		pthread_mutex_unlock(&global_queue.lock);

		// do something with @ret
	}

to get global scheduling and all the contention^Wgoodness related to it.
Except, of course, it's more complicated, but I think the idea's clear
enough.

Let me spend some time and see if I can make all of this work together beyond simple tests. With the upcoming holidays and some other things I am busy with, this may take more than a week, I'm afraid...
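The server loop quoted above uses global_queue, enqueue_task() and pick_task() without defining them. One possible shape for those helpers, assuming the simplest policy (a single FIFO shared by all servers, protected by one mutex), is sketched here. The struct layout and the FIFO policy are assumptions for illustration; a real userspace scheduler could equally keep per-priority lists behind the same lock, and the task type would be whatever bookkeeping structure wraps struct umcg_task.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task bookkeeping userspace would keep. */
struct task {
	int tid;
	struct task *next;		/* link inside the global queue */
};

/* One queue shared by all servers; callers must hold ->lock. */
struct task_queue {
	pthread_mutex_t lock;
	struct task *head;
	struct task *tail;
};

static struct task_queue global_queue = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
};

/* Append @t at the tail. Caller holds q->lock. */
static void enqueue_task(struct task_queue *q, struct task *t)
{
	assert(t);
	t->next = NULL;
	if (q->tail)
		q->tail->next = t;
	else
		q->head = t;
	q->tail = t;
}

/* Remove and return the oldest task, or NULL if empty. Caller holds q->lock. */
static struct task *pick_task(struct task_queue *q)
{
	struct task *t = q->head;

	if (!t)
		return NULL;
	q->head = t->next;
	if (!q->head)
		q->tail = NULL;
	t->next = NULL;
	return t;
}
```

Because every server takes the same lock around both helpers, the queue gives a global view of runnable work, at the cost of the lock contention mentioned in the thread.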