Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

From: Nadav Amit <hidden>
Date: 2018-12-05 02:09:10
Also in: linux-mm, lkml

On Dec 4, 2018, at 5:45 PM, Edgecombe, Rick P [off-list ref] wrote:

On Tue, 2018-12-04 at 16:53 -0800, Nadav Amit wrote:

quoted

On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P [off-list ref]
wrote:

On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:

quoted

On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <
rick.p.edgecombe@intel.com>
wrote:

On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:

quoted

On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
rick.p.edgecombe@intel.com>
wrote:

On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:

quoted

On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:

quoted

On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
rick.p.edgecombe@intel.com>
wrote:

Since vfree will lazily flush the TLB, but not lazily free the
underlying
pages,
it often leaves stale TLB entries to freed pages that could
get
re-
used.
This is
undesirable for cases where the memory being freed has special
permissions
such
as executable.

So I am trying to finish my patch-set for preventing transient
W+X
mappings
from taking space, by handling kprobes & ftrace that I missed
(thanks
again
for
pointing it out).

But all of the sudden, I don’t understand why we have the
problem
that
this
(your) patch-set deals with at all. We already change the
mappings
to
make
the memory writable before freeing the memory, so why can’t we
make
it
non-executable at the same time? Actually, why do we make the
module
memory,
including its data executable before freeing it???

Yeah, this is really confusing, but I have a suspicion it's a
combination
of the various different configurations and hysterical raisins. We
can't
rely on module_alloc() allocating from the vmalloc area (see
nios2)
nor
can we rely on disable_ro_nx() being available at build time.

If we *could* rely on module allocations always using vmalloc(),
then
we could pass in Rick's new flag and drop disable_ro_nx()
altogether
afaict -- who cares about the memory attributes of a mapping
that's
about
to disappear anyway?

Is it just nios2 that does something different?

Will

Yea it is really intertwined. I think for x86, set_memory_nx
everywhere
would
solve it as well, in fact that was what I first thought the solution
should
be
until this was suggested. It's interesting that from the other
thread
Masami
Hiramatsu referenced, set_memory_nx was suggested last year and
would
have
inadvertently blocked this on x86. But, on the other architectures I
have
since
learned it is a bit different.

It looks like actually most arch's don't re-define set_memory_*, and
so
all
of
the frob_* functions are actually just noops. In which case
allocating
RWX
is
needed to make it work at all, because that is what the allocation
is
going
to
stay at. So in these archs, set_memory_nx won't solve it because it
will
do
nothing.

On x86 I think you cannot get rid of disable_ro_nx fully because
there
is
the
changing of the permissions on the directmap as well. You don't want
some
other
caller getting a page that was left RO when freed and then trying to
write
to
it, if I understand this.

The other reasoning was that calling set_memory_nx isn't doing what
we
are
actually trying to do which is prevent the pages from getting
released
too
early.

A more clear solution for all of this might involve refactoring some
of
the
set_memory_ de-allocation logic out into __weak functions in either
modules
or
vmalloc. As Jessica points out in the other thread though, modules
does
a
lot
more stuff there than the other module_alloc callers. I think it may
take
some
thought to centralize AND make it optimal for every
module_alloc/vmalloc_exec
user and arch.

But for now with the change in vmalloc, we can block the executable
mapping
freed page re-use issue in a cross platform way.

Please understand me correctly - I didn’t mean that your patches are
not
needed.

Ok, I think I understand. I have been pondering these same things after
Masami
Hiramatsu's comments on this thread the other day.

quoted

All I did is asking - how come the PTEs are executable when they are
cleared
they are executable, when in fact we manipulate them when the module
is
removed.

I think the directmap used to be RWX so maybe historically its trying to
return
it to its default state? Not sure.

quoted

I think I try to deal with a similar problem to the one you encounter
-
broken W^X. The only thing that bothered me in regard to your patches
(and
only after I played with the code) is that there is still a time-
window in
which W^X is broken due to disable_ro_nx().

Totally agree there is overlap in the fixes and we should sync.

What do you think about Andy's suggestion for doing the vfree cleanup in
vmalloc
with arch hooks? So the allocation goes into vfree fully setup and
vmalloc
frees
it and on x86 resets the direct map.

As long as you do it, I have no problem ;-)

You would need to consider all the callers of module_memfree(), and
probably
to untangle at least part of the mess in pageattr.c . If you are up to it,
just say so, and I’ll drop this patch. All I can say is “good luck with
all
that”.

I thought you were trying to prevent having any memory that at any time was
W+X,
how does vfree help with the module load time issues, where it starts WRX on
x86?

I didn’t say it does. The patch I submitted before [1] should deal with the
issue of module loading, and I still think it is required. I also addressed
the kprobe and ftrace issues that you raised.

Perhaps it makes more sense that I will include the patch I proposed for
module cleanup to make the patch-set “complete”. If you finish the changes
you propose before the patch is applied, it could be dropped. I just want to
get rid of this series, as it keeps collecting more and more patches.

I suspect it will not be the last version anyhow.

[1] https://lkml.org/lkml/2018/11/21/305

That seems fine.

And not to make it any more complicated, but how much different is a W+X mapping
from a RW mapping that is about to turn X? Can't an attacker with the ability to
write to the module space just write and wait a short time? If that is the
threat model, I think there may still be additional work to do here even after
you found all the W+X cases.

I agree that a complete solution may require to block any direct write onto
a code-page. When I say “complete”, I mean for a threat model in which
dangling pointers are used to inject code, but not to run existing ROP/JOP
gadgets. (I didn’t think too deeply on the threat-model, so perhaps it needs
to be further refined).

I think the first stage is to make everybody go through a unified interface
(text_poke() and text_poke_early()). ftrace, for example, uses an
independent mechanism to change the code.

Eventually, after boot text_poke_early() should not be used, and text_poke()
(or something similar) should be used instead. Alternatively, when module
text is loaded, a hash value can be computed and calculated over it.

Since Igor Stoppa wants to use the infrastructure that is included in the
first patches, and since I didn’t intend this patch-set to be a full
solution for W^X (I was pushed there by tglx+Andy [1]), it may be enough
as a first step.

[1] https://lore.kernel.org/patchwork/patch/1006293/#1191341

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help