Thread (35 messages) 35 messages, 6 authors, 2020-01-23

Re: [PATCH v2 2/5] mm: introduce external memory hinting API

From: Michal Hocko <hidden>
Date: 2020-01-20 08:03:34
Also in: linux-mm, lkml

On Fri 17-01-20 09:25:42, Minchan Kim wrote:
On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
quoted
On Thu 16-01-20 15:59:50, Minchan Kim wrote:
quoted
There is usecase that System Management Software(SMS) want to give
a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
in the case of Android, it is the ActivityManagerService.

It's similar in spirit to madvise(MADV_WONTNEED), but the information
required to make the reclaim decision is not known to the app. Instead,
it is known to the centralized userspace daemon(ActivityManagerService),
and that daemon must be able to initiate reclaim on its own without
any app involvement.

To solve the issue, this patch introduces new syscall process_madvise(2).
It uses pidfd of an external processs to give the hint.

 int process_madvise(int pidfd, void *addr, size_t length, int advise,
			unsigned long flag);

Since it could affect other process's address range, only privileged
process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
gives it the right to ptrace the process could use it successfully.
The flag argument is reserved for future use if we need to extend the
API.

I think supporting all hints madvise has/will supported/support to
process_madvise is rather risky. Because we are not sure all hints make
sense from external process and implementation for the hint may rely on
the caller being in the current context so it could be error-prone.
Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.

If someone want to add other hints, we could hear hear the usecase and
review it for each hint. It's more safe for maintainace rather than
introducing a buggy syscall but hard to fix it later.
I have brought this up when we discussed this in the past but there is
no reflection on that here so let me bring that up again. 

I believe that the interface has an inherent problem that it is racy.
The external entity needs to know the address space layout of the target
process to do anyhing useful on it. The address space is however under
the full control of the target process though and the external entity
has no means to find out that the layout has changed. So
time-to-check-time-to-act is an inherent problem.

This is a serious design flaw and it should be explained why it doesn't
matter or how to use the interface properly to prevent that problem.
Sorry for the missing that part.

It's not a particular problem of this API because other APIs already have
done with that(e.g., move_pages, process_vm_writev).
I am sorry but this is not really an argument.
Point is userspace has several ways for the control of target process
like SIGSTOP, cgroup freezer or even no need to control since platform
is already aware of that the process will never run until he grant it
or it's resilient even though the race happens.
If you have that level of control then you can simply inject the code
via ptrace and you do not need a new syscall in the first place.
In future, if we want to support more fine-grained consistency model
like memory layout, we could provide some API to get cookie(e.g.,
seq count which is updated whenever vma of the process changes).  And then
we could feed the cookie to process_madvise's last argument so that
it can fail if founds it's not matched.
For that API, Daniel already posted RFC - process_getinfo[1].
https://lore.kernel.org/lkml/20190520035254.57579-1-minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
So why do not we start with a clean API since the beginning? I do agree
that a remote madvise is an interesting feature and it opens gates to
all sorts of userspace memory management which is not possible this
days. But the syscall has to have a reasonable semantic to allow that.
We cannot simply start with a half proken symantic first based on an
Android usecase and then hit the wall as soon as others with a different
user space model want to use it as well.
-- 
Michal Hocko
SUSE Labs
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help