Re: [PATCH v15 04/13] task_isolation: add initial support
From: Chris Metcalf <hidden>
Date: 2016-09-02 15:39:51
Also in: linux-mm, lkml
On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf [off-list ref] wrote:
>> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>> What if we did it the other way around: set a percpu flag saying "going quiescent; disallow new deferred work", then finish all existing work and return to userspace. Then, on the next entry, clear that flag. With the flag set, vmstat would just flush anything that it accumulates immediately, nothing would be added to the LRU list, etc.
>>
>> This is an interesting idea! However, there are a number of implementation ideas that make me worry that it might be a trickier approach overall.
>>
>> First, "on the next entry" hides a world of hurt in four simple words. Some platforms (arm64 and tile, that I'm familiar with) have a common chunk of code that always runs on every entry to the kernel. It would not be too hard to poke at the assembly and make those platforms always run some task-isolation specific code on entry. But x86 scares me - there seem to be a whole lot of ways to get into the kernel, and I'm not convinced there is a lot of shared macrology or whatever that would make it straightforward to intercept all of them.
>
> Just use the context tracking entry hook. It's 100% reliable. The relevant x86 function is enter_from_user_mode(), but I would just hook into user_exit() in the common code. (This code had better be reliable, because context tracking depends on it, and, if context tracking doesn't work on a given arch, then isolation isn't going to work regardless.)
This looks a lot cleaner than last time I looked at the x86 code. So yes, I think we could do an entry-point approach plausibly now. This is also good for when we want to look at deferring the kernel TLB flush, since it's the same mechanism that would be required for that.
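As a rough illustration of the percpu-flag idea discussed above, here is a minimal userspace model in C. It is not kernel code and not taken from the patch series: `cpu_state`, `account_event()`, `prepare_return_to_user()`, `on_kernel_entry()` and the single `deferred_stat` counter are all hypothetical stand-ins, assumed only for the sake of the sketch. It shows the ordering being proposed: set the flag, drain existing work, and clear the flag on the next entry.

```c
/* Userspace model only; every name here is hypothetical. */
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

struct cpu_state {
    bool quiescent;      /* "going quiescent; disallow new deferred work" */
    long deferred_stat;  /* stand-in for a per-CPU vmstat delta */
};

static struct cpu_state cpus[NR_CPUS];
static long global_stat; /* stand-in for the folded global counter */

/* Fold this CPU's pending delta into the global counter. */
static void flush_deferred(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    global_stat += cpus[cpu].deferred_stat;
    cpus[cpu].deferred_stat = 0;
}

/* Account an event; it stays deferred unless the CPU is marked quiescent. */
static void account_event(int cpu, long delta)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    cpus[cpu].deferred_stat += delta;
    if (cpus[cpu].quiescent)
        flush_deferred(cpu); /* flush immediately, leave nothing behind */
}

/* Return-to-userspace path: set the flag first, then drain existing work. */
static void prepare_return_to_user(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    cpus[cpu].quiescent = true;
    flush_deferred(cpu);
}

/* Next kernel entry, e.g. where a context tracking hook would run. */
static void on_kernel_entry(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    cpus[cpu].quiescent = false;
}

/* True when this CPU has no deferred work outstanding. */
static bool cpu_is_clean(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return cpus[cpu].deferred_stat == 0;
}
```

In this model, an event accounted while the flag is set leaves the CPU clean, whereas the same event after `on_kernel_entry()` is deferred again. The cost that the thread is debating is visible in `account_event()`: each subsystem that defers work has to grow a check like this one.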
>> But it does seem like we are adding noticeable maintenance cost on the mainline kernel to support task isolation by doing this. My guess is that it is easier to support the kind of "are you clean?" / "get clean" APIs for subsystems, rather than weaving a whole set of "stay clean" mechanisms into each subsystem.
>
> My intuition is that it's the other way around. For the mainline kernel, having a nice clean well-integrated implementation is nicer than having a bolted-on implementation that interacts in potentially complicated ways. Once quiescence support is in mainline, the size of the diff or the degree to which it's scattered around is irrelevant because it's not a diff any more.
I'm not concerned with the size of the diff, just with the intrusiveness into the various subsystems. That said, code talks, so let me take a swing at doing it the way you suggest for vmstat/lru and we'll see what it looks like.
>> So to pop up a level, what is your actual concern about the existing "do it in a loop" model? The macrology currently in use means there is zero cost if you don't configure TASK_ISOLATION, and the software maintenance cost seems low since the idioms used for task isolation in the loop are generally familiar to people reading that code.
>
> My concern is that it's not obvious to readers of the code that the loop ever terminates. It really ought to, but it's doing something very odd. Normally we can loop because we get scheduled out, but actually blocking in the return-to-userspace path, especially blocking on a condition that doesn't have a wakeup associated with it, is odd.
True, although, comments :-) Regardless, this doesn't seem at all weird to me in the context of the vmstat and lru stuff. It's exactly parallel to the fact that we loop around on checking need_resched and signal, and in some cases you could imagine multiple loops around when we schedule out and get a signal, so loop around again, and then another reschedule event happens during signal processing so we go around again, etc. Eventually it settles down. It's the same with the vmstat/lru stuff.
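To make the "eventually it settles down" argument concrete, here is a small userspace model in C of a return-to-userspace work loop. It is only a sketch under stated assumptions: the `WORK_*` bits, `struct task_model`, `exit_to_user_loop()` and the `max_passes` cap are hypothetical and do not correspond to the real kernel flags or to the patch. The model assumes that rescheduling or signal delivery can dirty the per-CPU state again, and that a bounded number of reschedule events arrive during signal handling.

```c
/* Userspace model only; every name here is hypothetical. */
#include <assert.h>
#include <stddef.h>

enum {
    WORK_RESCHED   = 1 << 0, /* models need_resched */
    WORK_SIGNAL    = 1 << 1, /* models a pending signal */
    WORK_ISOLATION = 1 << 2  /* models dirty vmstat/lru state */
};

struct task_model {
    unsigned pending;           /* bitmask of outstanding work */
    int resched_during_signal;  /* reschedule events still to arrive */
};

/*
 * Loop until no work is pending. Returns the number of passes taken,
 * or -1 if the loop did not settle within max_passes (the cap exists
 * only so that the model cannot spin forever).
 */
static int exit_to_user_loop(struct task_model *t, int max_passes)
{
    int passes = 0;

    assert(t != NULL);
    while (t->pending != 0) {
        if (++passes > max_passes)
            return -1;

        if (t->pending & WORK_RESCHED) {
            t->pending &= ~WORK_RESCHED;   /* schedule() would run here */
            t->pending |= WORK_ISOLATION;  /* other tasks may dirty state */
        }
        if (t->pending & WORK_SIGNAL) {
            t->pending &= ~WORK_SIGNAL;    /* signal delivery would run here */
            t->pending |= WORK_ISOLATION;
            if (t->resched_during_signal > 0) {
                t->resched_during_signal--;
                t->pending |= WORK_RESCHED; /* forces another pass */
            }
        }
        if (t->pending & WORK_ISOLATION)
            t->pending &= ~WORK_ISOLATION; /* "get clean": flush vmstat/lru */
    }
    return passes;
}
```

With both `WORK_RESCHED` and `WORK_SIGNAL` pending and one reschedule event arriving during signal handling, the model takes two passes. It terminates because each pass only re-raises work from a finite budget, which is the property a reader of the real loop would want spelled out in a comment.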
>>> Also, this cond_resched stuff doesn't worry me too much at a fundamental level -- if we're really going quiescent, shouldn't we be able to arrange that there are no other schedulable tasks on the CPU in question?
>>
>> We aren't currently planning to enforce things in the scheduler, so if the application affinitizes another task on top of an existing task isolation task, by default the task isolation task just dies. (Unless it's using NOSIG mode, in which case it just ends up stuck in the kernel trying to wait out the dyntick until you either kill it, or re-affinitize the offending task.) But I'm reluctant to guarantee every possible way that you might (perhaps briefly) have some schedulable task, and the current approach seems pretty robust if that sort of thing happens.
>
> This kind of waiting out the dyntick scares me. Why is there ever a dyntick that you're waiting out? If quiescence is to be a supported mainline feature, shouldn't the scheduler be integrated well enough with it that you don't need to wait like this?
Well, this is certainly the funkiest piece of the task isolation stuff. The problem is that the dyntick stuff may, for example, need one more tick 4us from now (or whatever) just to close out the current RCU period. We can't return to userspace until that happens. So what else can we do when the task is ready to return to userspace? We could punt into the idle task instead of waiting in this task, which was my earlier schedule_time() suggestion. Do you think that's cleaner?
> Have you confirmed that this works correctly wrt PTRACE_SYSCALL? It should result in an even number of events (like raise(2) or an async signal) and that should have a test case.
I have not tested PTRACE_SYSCALL. I'll see about adding that to the selftest code.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com