Re: Runtime Memory Validation in Intel-TDX and AMD-SNP

From: Mike Rapoport <rppt@kernel.org>
Date: 2021-07-21 14:00:31
Also in: linux-mm

On Wed, Jul 21, 2021 at 01:02:06PM +0300, Kirill A. Shutemov wrote:

On Wed, Jul 21, 2021 at 12:20:17PM +0300, Mike Rapoport wrote:

On Tue, Jul 20, 2021 at 08:30:04PM +0300, Kirill A. Shutemov wrote:

On Mon, Jul 19, 2021 at 02:58:22PM +0200, Joerg Roedel wrote:

Hi,

I'd like to get some movement again into the discussion around how to
implement runtime memory validation for confidential guests and wrote up
some thoughts on it.
Below are the results in the form of a proposal I put together. Please let
me know your thoughts on it and whether it fits everyone's requirements.

Thanks for bringing it up. I'm working on the topic for Intel TDX. See
comments below.

Thanks,

	Joerg

Proposal for Runtime Memory Validation in Secure Guests on x86
==============================================================

[ snip ]

	8. When memory is returned to the memblock or page allocators,
	   it is _not_ invalidated. In fact, all memory which is freed
	   needs to be valid. If it was marked invalid in the meantime
	   (e.g. if the memory was used for DMA buffers), the code
	   owning the memory needs to validate it again before freeing
	   it.

	   The benefit of doing memory validation at allocation time is
	   that it keeps the exception handler for invalid memory
	   simple, because no exceptions of this kind are expected under
	   normal operation.

During early boot I treat unaccepted memory as usable RAM. It only
requires special treatment in memblock_reserve(), which is used for early
memory allocation: unaccepted usable RAM has to be accepted before
reserving.
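
The accept-before-reserve idea can be modelled in userspace as a rough
sketch. Everything below is hypothetical (the sim_* names, the per-chunk
flag array, the 2 MiB tracking granularity); it is not the PoC code, and the
real accept step would be the platform operation rather than a flag write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: guest RAM tracked in 2 MiB chunks, one flag each. */
#define CHUNK_SHIFT 21
#define CHUNK_SIZE  (UINT64_C(1) << CHUNK_SHIFT)
#define NR_CHUNKS   64			/* models 128 MiB of guest RAM */

bool chunk_accepted[NR_CHUNKS];		/* true once a chunk was accepted */
unsigned long accept_calls;		/* number of simulated accept operations */

/* Stand-in for the platform accept/validate operation on one chunk. */
static void sim_accept_chunk(size_t idx)
{
	assert(idx < NR_CHUNKS);
	chunk_accepted[idx] = true;
	accept_calls++;
}

/* Accept every not-yet-accepted chunk overlapping [base, base + size). */
static void sim_accept_range(uint64_t base, uint64_t size)
{
	if (size == 0)
		return;

	uint64_t first = base >> CHUNK_SHIFT;
	uint64_t last = (base + size - 1) >> CHUNK_SHIFT;

	assert(last < NR_CHUNKS);
	for (uint64_t i = first; i <= last; i++)
		if (!chunk_accepted[i])		/* skip already accepted memory */
			sim_accept_chunk((size_t)i);
}

/* Model of a reserve call that accepts the range before reserving it. */
void sim_memblock_reserve(uint64_t base, uint64_t size)
{
	sim_accept_range(base, size);
	/* A real implementation would record the reserved region here. */
}
```

In this model, reserving 8 KiB that straddles the first chunk boundary
accepts two chunks, and a later reservation inside an already accepted chunk
triggers no further accept operation.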

memblock_reserve() is not always used for early allocations and some of the
early allocations on x86 don't use memblock at all.

Do you mean any codepath in particular?

I don't have examples handy, but in general there are calls to
e820__range_update() that make memory !RAM and it never gets into memblock.
On the other side, memblock_reserve() can be called to reserve memory owned
by firmware that may already be accepted.

Hooking
validation/acceptance to memblock_reserve() should be fine for a PoC but I
suspect there will be caveats for production.

That's why I'm doing a PoC. We'll see. So far so good. Maybe it will be
visible with a smaller pre-accepted memory size.

Maybe some of my concerns only apply to systems with BIOSes weirder than
usual and for VMs all would be fine.
I'd suggest experimenting with "memmap=" to manually assign various e820
types to memory chunks to see if there are any strange effects.

For fine-grained accepting/validation tracking I use the PageOffline() flag
(it's encoded into mapcount): before adding an unaccepted page to the free
list I set PageOffline() to indicate that the page has to be accepted
before it is returned from the page allocator. Currently, we never have
PageOffline() set for pages on free lists, so we won't have confusion with
ballooning or memory hotplug.

I try to keep pages accepted in 2M or 4M chunks (pageblock_order or
MAX_ORDER). It is a reasonable compromise on speed/latency.
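
A minimal userspace sketch of this lazy scheme, assuming a per-page
"unaccepted" flag that plays the role of PageOffline() on a free page and a
chunk of 512 pages (2 MiB with 4 KiB pages). The struct and all sim_* names
are hypothetical and only illustrate the control flow, not the actual page
allocator changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_CHUNK 512		/* 2 MiB chunk of 4 KiB pages */
#define NR_PAGES	(4 * PAGES_PER_CHUNK)

struct sim_page {
	bool free;		/* page is on the (simulated) free list */
	bool unaccepted;	/* stands in for PageOffline() on a free page */
};

struct sim_page pages[NR_PAGES];
unsigned long chunk_accepts;	/* number of chunk-sized accept operations */

/* Put all pages on the free list; only the first ones are pre-accepted. */
void sim_init(size_t preaccepted_pages)
{
	assert(preaccepted_pages <= NR_PAGES);
	chunk_accepts = 0;
	for (size_t i = 0; i < NR_PAGES; i++) {
		pages[i].free = true;
		pages[i].unaccepted = i >= preaccepted_pages;
	}
}

/* Accept the whole chunk containing pfn and clear the flag on its pages. */
static void sim_accept_chunk_of(size_t pfn)
{
	size_t start = pfn - pfn % PAGES_PER_CHUNK;

	for (size_t i = start; i < start + PAGES_PER_CHUNK; i++)
		pages[i].unaccepted = false;
	chunk_accepts++;
}

/* Hand out one page; accept its chunk first if it is still unaccepted. */
bool sim_alloc_page(size_t pfn)
{
	assert(pfn < NR_PAGES);
	if (!pages[pfn].free)
		return false;
	if (pages[pfn].unaccepted)
		sim_accept_chunk_of(pfn);
	pages[pfn].free = false;
	return true;
}
```

With one pre-accepted chunk, allocating a page from it costs nothing extra;
the first allocation from the second chunk accepts all 512 pages of that
chunk, so neighbouring allocations do not pay again.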

Keeping fine-grained accepting/validation information in the memory map
means it cannot be reused across reboots/kexec and there should be an
additional data structure to carry this information. It could be the same
structure that is used by firmware to inform the kernel about usable memory;
it just needs to outlive boot and get updates about new (in)validations.
Doing those in 2M/4M chunks will help to prevent this structure from
exploding.
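
To put rough numbers on the size concern, here is a small sketch that
assumes the state is kept as a plain bitmap with one bit per chunk. The
bitmap layout and the bitmap_bytes() helper are assumptions made for
illustration; the thread does not specify the data structure.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for a hypothetical one-bit-per-chunk bitmap covering
 * mem_bytes of RAM, with a chunk size of 2^chunk_shift bytes.
 */
uint64_t bitmap_bytes(uint64_t mem_bytes, unsigned int chunk_shift)
{
	assert(chunk_shift < 64);

	uint64_t chunk = UINT64_C(1) << chunk_shift;
	/* Round up so a partial trailing chunk still gets a bit. */
	uint64_t chunks = mem_bytes / chunk + (mem_bytes % chunk != 0);

	/* Round up to whole bytes. */
	return chunks / 8 + (chunks % 8 != 0);
}
```

For a 1 TiB guest this gives 32 MiB at 4 KiB granularity, 64 KiB at 2 MiB
granularity and 32 KiB at 4 MiB granularity.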

Yeah, we would need to reconstruct the EFI map somehow. Or we can give
most of the memory back to the host and accept/validate the memory again
after reboot/kexec. I donno.

BTW, as Dave mentioned, the deferred struct page init can also take care of
the validation.

That was my first thought too and I tried it just to realize that it is
not what we want. If we accepted pages at struct page init, we would make
the host allocate all memory assigned to the guest at boot, even if the
guest actually uses only a small portion of it.

Yep, you are right.

Also, deferred page init only allows scaling validation across multiple
CPUs, but doesn't allow getting to userspace before we are done with it. See
wait_for_completion(&pgdat_init_all_done_comp).

True.

-- 
Sincerely yours,
Mike.
