Thread: 78 messages, 10 authors, 2021-02-04

Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

From: Michal Hocko <mhocko@suse.com>
Date: 2021-02-02 09:37:47
Also in: linux-api, linux-arch, linux-fsdevel, linux-kselftest, linux-mm, linux-riscv, lkml, nvdimm

On Mon 01-02-21 08:56:19, James Bottomley wrote:
> On Fri, 2021-01-29 at 09:23 +0100, Michal Hocko wrote:
> > On Thu 28-01-21 13:05:02, James Bottomley wrote:
> > > Obviously the API choice could be revisited
> > > but do you have anything to add over the previous discussion, or is
> > > this just to get your access control?
> >
> > Well, access control is certainly one thing which I still believe is
> > missing. But if there is a general agreement that the direct map
> > manipulation is not that critical then this will become much less of
> > a problem of course.
>
> The secret memory is a scarce resource but it's not a facility that
> should only be available to some users.

How do those two objectives go together? Or maybe we differ in our
understanding of what scarce really means here. If the pool of the secret
memory is very limited then you really need a way to stop one party from
depriving others. More on that below.

> > It all boils down to whether secret memory is a scarce resource. With
> > the existing implementation it really is. It is effectively
> > repeating the same design errors as hugetlb did. And look now, we have a
> > subtle and convoluted reservation code to track mmap requests and we
> > have a cgroup controller to, guess what, have at least some control
> > over distribution of the preallocated pool. See where I am coming
> > from?
>
> I'm fairly sure rlimit is the correct way to control this.  The
> subtlety in both rlimit and memcg tracking comes from deciding to
> account under an existing category rather than having our own new one.
> People don't like new stuff in accounting because it requires
> modifications to everything in userspace.  Accounting under an
> existing limit keeps userspace the same but leads to endless arguments
> about which limit it should be under.  It took us several patch set
> iterations to get to a fragile consensus on this which you're now
> disrupting for reasons you're not making clear.

I hoped I had made my points really clear. The existing scheme allows
one user (potentially an adversary) to deplete the preallocated pool
and cause a shitstorm of OOM killer invocations, because the oom killer
has no real way to replenish the pool other than to keep killing tasks
at random until one happens to release its secret memory back to the
pool. Is that more clear now?

And no, rlimit and memcg limit will not save you from that because the
former is per process and the latter is hard to manage under a single limit
which might be an order of magnitude larger than the secret memory pool
size. See the point?
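
To make the arithmetic concrete, here is a toy model of the argument. It is
only a sketch and not the secretmem code: the SecretPool class, the pool size
of 512 pages and the per-process cap of 64 pages are all made-up values for
illustration. It shows that a cap enforced per process bounds what each
process holds but not what a group of processes holds together.

```python
from dataclasses import dataclass, field


@dataclass
class SecretPool:
    """Toy model of a fixed-size, preallocated pool shared by all processes.

    All names and numbers here are hypothetical; this is not kernel code.
    """
    capacity: int            # total pages in the preallocated pool
    per_process_limit: int   # rlimit-style cap, checked per process only
    used: dict = field(default_factory=dict)  # pid -> pages currently held

    def free_pages(self) -> int:
        return self.capacity - sum(self.used.values())

    def alloc(self, pid: int, pages: int) -> bool:
        held = self.used.get(pid, 0)
        if held + pages > self.per_process_limit:
            return False     # the per-process limit says no
        if pages > self.free_pages():
            return False     # the shared pool is exhausted
        self.used[pid] = held + pages
        return True


def deplete(pool: SecretPool, start_pid: int = 1000) -> int:
    """Start 'processes' that each stay within their own limit until the
    pool is empty. Returns how many processes that took."""
    pid = start_pid
    while pool.free_pages() > 0:
        pool.alloc(pid, min(pool.per_process_limit, pool.free_pages()))
        pid += 1
    return pid - start_pid


pool = SecretPool(capacity=512, per_process_limit=64)
attackers = deplete(pool)                 # every one of them obeys its limit
victim_ok = pool.alloc(pid=1, pages=1)    # an unrelated task asks for one page
print(attackers, victim_ok)
```

In this model no single process ever exceeds its limit, yet the unrelated
task cannot get even one page once the pool is drained.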

I have also proposed potential ways out of this. Either the pool is not
fixed in size and you make it regular unevictable memory (if direct map
fragmentation is not considered a major problem), or you need careful
access control, or you need SIGBUS on the mmap failure (to allow at least
some fallback mode for the caller).

I do not see any other way around it. I might be missing some other
ways but so far I keep hearing that the existing scheme is just fine
because this has been discussed in the past and you have agreed it is
ok. Without any specifics...

Please keep in mind this is a user interface and it deserves careful
scrutiny. So rather than pushing back with "you are disrupting a
consensus" kinda feedback, please try to stay technical.

> > If the secret memory is more in line with mlock without any imposed
> > limit (other than available memory) in the end then, sure, using the
> > same access control as mlock sounds reasonable. Btw. if this is
> > really just a more restrictive mlock then is there any reason to not
> > hook this into the existing mlock infrastructure (e.g.
> > MCL_EXCLUSIVE)? Implications would be that direct map would be
> > handled on instantiation/tear down paths, migration would deal with
> > the same (if possible). Other than that it would be mlock like.
>
> In the very first patch set we proposed a mmap flag to do this.  Under
> detailed probing it emerged that this suffers from several design
> problems: the KVM people want VMM to be able to remove the secret
> memory range from the process; there may be situations where sharing is
> useful and some people want to be able to seal the operations.  All of
> this ended up convincing everyone that a file descriptor based approach
> was better than a mmap one.

OK, fair enough. This belongs in the changelog IMHO. It is good to know
why existing interfaces do not match the need.
-- 
Michal Hocko
SUSE Labs

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel