Thread: 86 messages, 11 authors, 2025-03-26

Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

From: Joel Fernandes <joelagnelf@nvidia.com>
Date: 2025-02-19 15:05:52
Also in: bpf, kvm, linux-arch, linux-hardening, linux-kselftest, linux-mm, linux-riscv, lkml, loongarch, rcu, virtualization, xen-devel

On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
On 17/01/25 16:52, Jann Horn wrote:
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider [off-list ref] wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider [off-list ref] wrote:
vunmap()'s issued from housekeeping CPUs are a relatively common source of
interference for isolated NOHZ_FULL CPUs, as they are hit by the
flush_tlb_kernel_range() IPIs.

Given that CPUs executing in userspace do not access data in the vmalloc
range, these IPIs could be deferred until their next kernel entry.

Deferral vs early entry danger zone
===================================

This requires a guarantee that nothing in the vmalloc range can be vunmap'd
and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that
have been created in the vmalloc region while the CPU was idle can
then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do
correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
deferred anyway.
flush_cache_vmap() is about stuff like flushing data caches on
architectures with virtually indexed caches; that doesn't do TLB
maintenance. When you look for its definition on x86 or arm64, you'll
see that they use the generic implementation which is simply an empty
inline function.
So after vmapping something, I wouldn't expect isolated CPUs to have
invalid TLB entries for the newly vmapped page.

However, upon vunmap'ing something, the TLB flush is deferred, and thus
stale TLB entries can and will remain on isolated CPUs, up until they
execute the deferred flush themselves (IOW for the entire duration of the
"danger zone").

Does that make sense?
The design idea wrt TLB flushes in the vmap code is that you don't do
TLB flushes when you unmap stuff or when you map stuff, because doing
TLB flushes across the entire system on every vmap/vunmap would be a
bit costly; instead you just do batched TLB flushes in between, in
__purge_vmap_area_lazy().

In other words, the basic idea is that you can keep calling vmap() and
vunmap() a bunch of times without ever doing TLB flushes until you run
out of virtual memory in the vmap region; then you do one big TLB
flush, and afterwards you can reuse the free virtual address space for
new allocations again.
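
As a rough illustration of the batching described above, here is a toy userspace model. It is not the __purge_vmap_area_lazy() logic: toy_lazy_flushes(), struct toy_result and the threshold parameter are hypothetical names, and the threshold stands in for whatever condition triggers the real purge.

```c
#include <assert.h>

/*
 * Toy userspace model, not kernel code. toy_lazy_flushes() and its
 * threshold parameter are hypothetical names used for illustration.
 */
struct toy_result {
	int flushes;	/* batched flushes issued */
	int pending;	/* unmapped ranges still awaiting a flush */
};

/* Unmap n ranges lazily; flush once each time the backlog hits threshold. */
static struct toy_result toy_lazy_flushes(int n_unmaps, int threshold)
{
	struct toy_result r = { 0, 0 };

	assert(threshold > 0);
	for (int i = 0; i < n_unmaps; i++) {
		r.pending++;			/* vunmap: no flush yet */
		if (r.pending >= threshold) {
			r.flushes++;		/* one flush covers the whole batch */
			r.pending = 0;		/* freed addresses may be reused now */
		}
	}
	return r;
}
```

With a threshold of 4, ten unmaps cost two flushes rather than ten, and two ranges are still waiting for the next batch.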

So if you "defer" that batched TLB flush for CPUs that are not
currently running in the kernel, I think the consequence is that those
CPUs may end up with incoherent TLB state after a reallocation of the
virtual address space.
Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
CPU accesses it during early entry;
So the issue is:

CPU1: unmaps vmalloc page X, which was previously mapped to physical page
P1.

CPU2: does a whole bunch of vmalloc and vfree, eventually crossing some lazy
threshold and sending out IPIs. It then goes ahead and does an allocation
that maps the same virtual page X to physical page P2.

CPU3 is isolated and executes some early entry code before receiving said IPIs,
which are supposedly deferred by Valentin's patches.

It does not receive the IPI because it is deferred, thus access by early
entry code to page X on this CPU results in a UAF access to P1.

Is that the issue?
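
To make the sequence concrete, here is a toy userspace sketch of it. This is only a model under simplifying assumptions: one virtual page X, a single shared page table entry, and one cached translation per CPU standing in for a TLB entry. All names (toy_mm, toy_flush_all(), toy_access(), toy_scenario()) are hypothetical and none of this is kernel code.

```c
#include <assert.h>

/*
 * Toy userspace model, not kernel code. All names are hypothetical.
 * CPU1, CPU2 and CPU3 from the text are indices 0, 1 and 2 here.
 */
enum { P_NONE = 0, P1 = 1, P2 = 2 };
enum { NR_TOY_CPUS = 3, ISOLATED_CPU = 2 };

struct toy_mm {
	int pte;		/* current mapping of X in the page table */
	int tlb[NR_TOY_CPUS];	/* cached translation of X, P_NONE if empty */
};

/* Flush X on every CPU, optionally skipping (deferring) the isolated one. */
static void toy_flush_all(struct toy_mm *mm, int defer_isolated)
{
	for (int cpu = 0; cpu < NR_TOY_CPUS; cpu++) {
		if (defer_isolated && cpu == ISOLATED_CPU)
			continue;	/* stale entry stays cached */
		mm->tlb[cpu] = P_NONE;
	}
}

/* Translate X: use the cached entry if present, else walk the page table. */
static int toy_access(struct toy_mm *mm, int cpu)
{
	if (mm->tlb[cpu] == P_NONE)
		mm->tlb[cpu] = mm->pte;
	return mm->tlb[cpu];
}

/* Returns the physical page the isolated CPU reaches during early entry. */
static int toy_scenario(int defer_isolated)
{
	struct toy_mm mm = { .pte = P1, .tlb = { P1, P1, P1 } };

	mm.pte = P_NONE;			/* CPU1: vunmap X, no flush yet */
	toy_flush_all(&mm, defer_isolated);	/* CPU2: batched purge */
	mm.pte = P2;				/* CPU2: X reallocated, maps P2 */
	return toy_access(&mm, ISOLATED_CPU);	/* CPU3: early entry touches X */
}
```

In this model toy_scenario(1) returns P1, i.e. the isolated CPU still reaches the old physical page through its stale entry, while toy_scenario(0) returns P2 because the flush was not deferred.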

thanks,

 - Joel