Thread (78 messages) 78 messages, 10 authors, 2021-02-04

Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

From: Mike Rapoport <rppt@kernel.org>
Date: 2021-01-28 09:26:38
Also in: linux-api, linux-arch, linux-fsdevel, linux-kselftest, linux-mm, linux-riscv, lkml, nvdimm

On Tue, Jan 26, 2021 at 01:08:23PM +0100, Michal Hocko wrote:
On Tue 26-01-21 12:56:48, David Hildenbrand wrote:
quoted
On 26.01.21 12:46, Michal Hocko wrote:
quoted
On Thu 21-01-21 14:27:19, Mike Rapoport wrote:
quoted
From: Mike Rapoport <redacted>

Removing a PAGE_SIZE page from the direct map every time such page is
allocated for a secret memory mapping will cause severe fragmentation of
the direct map. This fragmentation can be reduced by using PMD-size pages
as a pool for small pages for secret memory mappings.

Add a gen_pool per secretmem inode and lazily populate this pool with
PMD-size pages.

As pages allocated by secretmem become unmovable, use CMA to back large
page caches so that page allocator won't be surprised by failing attempt to
migrate these pages.

The CMA area used by secretmem is controlled by the "secretmem=" kernel
parameter. This allows explicit control over the memory available for
secretmem and provides upper hard limit for secretmem consumption.
OK, so I have finally had a look at this closer and this is really not
acceptable. I have already mentioned that in a response to other patch
but any task is able to deprive access to secret memory to other tasks
and cause OOM killer which wouldn't really recover ever and potentially
panic the system. Now you could be less drastic and only make SIGBUS on
fault but that would be still quite terrible. There is a very good
reason why hugetlb implements is non-trivial reservation system to avoid
exactly these problems.
So, if I understand your concerns correct this implementation has two
issues:
1) allocation failure at page fault that causes unrecoverable OOM and
2) a possibility for an unprivileged user to deplete secretmem pool and
cause (1) to others

I'm not really familiar with OOM internals, but when I simulated an
allocation failure in my testing only the allocating process and it's
parent were OOM-killed and then the system continued normally. 

You are right, it would be better if we SIGBUS instead of OOM but I don't
agree SIGBUS is terrible. As we started to draw parallels with hugetlbfs
even despite it's complex reservation system, hugetlb_fault() may fail to
allocate pages from CMA and this still will cause SIGBUS.

And hugetlb pools may be also depleted by anybody by calling
mmap(MAP_HUGETLB) and there is no any limiting knob for this, while
secretmem has RLIMIT_MEMLOCK.

That said, simply replacing VM_FAULT_OOM with VM_FAULT_SIGBUS makes
secretmem at least as controllable and robust than hugeltbfs even without
complex reservation at mmap() time.
quoted
quoted
So unless I am really misreading the code
Nacked-by: Michal Hocko [off-list ref]

That doesn't mean I reject the whole idea. There are some details to
sort out as mentioned elsewhere but you cannot really depend on
pre-allocated pool which can fail at a fault time like that.
So, to do it similar to hugetlbfs (e.g., with CMA), there would have to be a
mechanism to actually try pre-reserving (e.g., from the CMA area), at which
point in time the pages would get moved to the secretmem pool, and a
mechanism for mmap() etc. to "reserve" from these secretmem pool, such that
there are guarantees at fault time?
yes, reserve at mmap time and use during the fault. But this all sounds
like a self inflicted problem to me. Sure you can have a pre-allocated
or more dynamic pool to reduce the direct mapping fragmentation but you
can always fall back to regular allocatios. In other ways have the pool
as an optimization rather than a hard requirement. With a careful access
control this sounds like a manageable solution to me.
I'd really wish we had this discussion for earlier spins of this series,
but since this didn't happen let's refresh the history a bit.

One of the major pushbacks on the first RFC [1] of the concept was about
the direct map fragmentation. I tried really hard to find data that shows
what is the performance difference with different page sizes in the direct
map and I didn't find anything.

So presuming that large pages do provide advantage the first implementation
of secretmem used PMD_ORDER allocations to amortise the effect of the
direct map fragmentation and then handed out 4k pages at each fault. In
addition there was an option to reserve a finite pool at boot time and
limit secretmem allocations only to that pool.

At some point David suggested to use CMA to improve overall flexibility
[3], so I switched secretmem to use CMA.

Now, with the data we have at hand (my benchmarks and Intel's report David
mentioned) I'm even not sure this whole pooling even required.

I like the idea to have a pool as an optimization rather than a hard
requirement but I don't see why would it need a careful access control. As
the direct map fragmentation is not necessarily degrades the performance
(and even sometimes it actually improves it) and even then the degradation
is small, trying a PMD_ORDER allocation for a pool and then falling back to
4K page may be just fine.

I think we could have something like this (error handling is mostly
omitted):

static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
{
	struct page *page = alloc_pages(gfp, PMD_PAGE_ORDER);

	if (!page)
		return -ENOMEM;

	/* add large page to pool */
	
	return 0;
}

static struct page *secretmem_alloc_page(struct secretmem_ctx *ctx,
					 gfp_t gfp)
{
	struct page *page;
	...

	if (gen_pool_avail(pool) < PAGE_SIZE) {
		err = secretmem_pool_increase(ctx, gfp);
		if (!err) {
			addr = gen_pool_alloc(pool, PAGE_SIZE);
			if (addr)
				page = virt_to_page(addr);
		}
	}

	if (!page)
		page = alloc_page(gfp);

	return page;	
}

[1] https://lore.kernel.org/lkml/1572171452-7958-1-git-send-email-rppt@kernel.org/ (local)
[2] https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org/ (local)
[3] https://lore.kernel.org/lkml/03ec586d-c00c-c57e-3118-7186acb7b823@redhat.com/#t (local)

-- 
Sincerely yours,
Mike.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help