Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
From: Michael Roth <hidden>
Date: 2022-11-29 19:07:22
Also in:
kvm, linux-api, linux-arch, linux-fsdevel, linux-mm, lkml, qemu-devel
On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > ...
> > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > +				    loff_t offset, loff_t len)
> > > +{
> > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > +	struct file *memfd = data->memfd;
> > > +	int ret;
> > > +
> > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > +			return -EINVAL;
> > > +	}
> > > +
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> >
> > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > loff_t. For SNP we've made this change as part of the following patch
> > and it seems to produce the expected behavior:
>
> That's correct. Thanks.
>
> >   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> >
> > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > +	return ret;
> > > +}
> > > +
> >
> > <snip>
> >
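To make the pgoff_t/loff_t mismatch raised above concrete, here is a minimal userspace sketch of converting a page-aligned byte range into a page-index range before handing it to a notifier. It is not taken from the linked commit; the names bytes_to_pgoff_range, struct pgoff_range and the SKETCH_* macros are hypothetical, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel takes these from its own headers. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Hypothetical page-index range, as a notifier working in pgoff_t would see it. */
struct pgoff_range {
	uint64_t start;	/* first page index, inclusive */
	uint64_t end;	/* last page index, exclusive */
};

/* Returns nonzero if a byte value sits on a page boundary. */
static int sketch_page_aligned(uint64_t v)
{
	return (v & (SKETCH_PAGE_SIZE - 1)) == 0;
}

/*
 * Convert the byte range [offset, offset + len) into page indices.
 * Misaligned input is rejected, mirroring the PAGE_ALIGNED() check in
 * the quoted patch. Returns 0 on success, -1 on misaligned input.
 */
static int bytes_to_pgoff_range(uint64_t offset, uint64_t len,
				struct pgoff_range *out)
{
	assert(out != NULL);

	if (!sketch_page_aligned(offset) || !sketch_page_aligned(len))
		return -1;

	out->start = offset >> SKETCH_PAGE_SHIFT;
	out->end = (offset + len) >> SKETCH_PAGE_SHIFT;
	return 0;
}
```

With these assumptions, a byte range starting at 0x2000 with length 0x3000 maps to page indices 2 up to (but not including) 5.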
> > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > +			   struct page **pagep, int *order)
> > > +{
> > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > +	struct file *memfd = data->memfd;
> > > +	struct page *page;
> > > +	int ret;
> > > +
> > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> >
> > This will result in KVM allocating pages that userspace hasn't necessarily
> > fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> > up the RMP entries when restrictedmem invalidations are issued for a GFN
> > range.
>
> Yes, fallocate() is unnecessary unless someone wants to reserve some
> space (e.g. for determinism or performance purposes); this matches its
> semantics perfectly at:
> https://www.man7.org/linux/man-pages/man2/fallocate.2.html
>
> > If the guest supports lazy-acceptance however, these pages may not have
> > been faulted in yet, and if the VMM defers actually fallocate()'ing space
> > until the guest actually tries to issue a shared->private for that GFN
> > (to support lazy-pinning), then there may never be a need to allocate
> > pages for these backends.
> >
> > However, the restrictedmem invalidations are for GFN ranges so there's
> > no way to know in advance whether it's been allocated yet or not. The
> > xarray is one option but currently it defaults to 'private' so that
> > doesn't help us here. It might if we introduced a 'uninitialized' state
> > or something along that line instead of just the binary
> > 'shared'/'private' though...
>
> How about if we change the default to 'shared' as we discussed at
> https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?

Need to look at this a bit more, but I think that could work as well.
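
As a rough illustration of why the default state matters, the sketch below models per-GFN attributes with a flat array (the kernel code under discussion uses an xarray; this is only a stand-in). The enum, struct and function names are hypothetical. The point is that when untouched entries do not read as 'private', an invalidation over a GFN range can skip everything that was never converted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tri-state; zero means "never touched". */
enum gfn_state {
	GFN_UNINITIALIZED = 0,
	GFN_SHARED,
	GFN_PRIVATE,
};

#define SKETCH_NR_GFNS 16

/* Flat stand-in for the per-GFN attribute store (assumption, not an xarray). */
struct gfn_attrs {
	unsigned char state[SKETCH_NR_GFNS];
};

/* Mark every GFN in [start, end) with the given state. */
static void gfn_attrs_set(struct gfn_attrs *a, size_t start, size_t end,
			  enum gfn_state s)
{
	assert(start <= end && end <= SKETCH_NR_GFNS);

	for (size_t i = start; i < end; i++)
		a->state[i] = (unsigned char)s;
}

/*
 * Count the GFNs in [start, end) that are private, i.e. the only ones
 * an invalidation would need to look up backing pages for.
 */
static size_t gfn_attrs_count_private(const struct gfn_attrs *a,
				      size_t start, size_t end)
{
	size_t n = 0;

	assert(start <= end && end <= SKETCH_NR_GFNS);

	for (size_t i = start; i < end; i++)
		if (a->state[i] == GFN_PRIVATE)
			n++;
	return n;
}
```

A zero-initialized store reports no private GFNs at all, so a teardown walk over the whole range has nothing to look up until some range has explicitly been converted. Defaulting to 'shared' instead of adding a third state would give the same result for this particular check.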

> > But for now we added a restrictedmem_get_page_noalloc() that uses
> > SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> > of memory as part of guest shutdown, and a
> > kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> > maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> > default, and we just propagate an error to userspace if they didn't
> > fallocate() in advance?
>
> This (making fallocate() a hard requirement) not only complicates
> userspace but also forces lazy faulting to go through a long path of
> exiting to userspace. Unless we have no other options I would not go
> this way.

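The "boolean param" alternative to a separate _noalloc() variant could look roughly like the userspace model below. It does not use the shmem API; struct sketch_backing, sketch_get_page and sketch_invalidate_range are hypothetical names, and returning -ENOENT for a missing page is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_PAGES 8

/* Hypothetical backing store: present[i] records whether page i exists. */
struct sketch_backing {
	bool present[SKETCH_NR_PAGES];
	size_t nr_allocated;
};

/*
 * Look up page `index`. With allocate == true a missing page is created
 * (the allocating behaviour discussed above); with allocate == false a
 * missing page is reported as -ENOENT and nothing is allocated.
 */
static int sketch_get_page(struct sketch_backing *b, size_t index,
			   bool allocate)
{
	assert(b != NULL);

	if (index >= SKETCH_NR_PAGES)
		return -EINVAL;

	if (!b->present[index]) {
		if (!allocate)
			return -ENOENT;
		b->present[index] = true;
		b->nr_allocated++;
	}
	return 0;
}

/*
 * Teardown-style walk over [start, end): visits only pages that already
 * exist and never allocates. Returns how many pages were found.
 */
static size_t sketch_invalidate_range(struct sketch_backing *b,
				      size_t start, size_t end)
{
	size_t found = 0;

	for (size_t i = start; i < end && i < SKETCH_NR_PAGES; i++)
		if (sketch_get_page(b, i, false) == 0)
			found++;
	return found;
}
```

In this model the fault path would pass allocate == true and the invalidation path allocate == false, so walking a large, mostly untouched range at guest shutdown leaves nr_allocated unchanged.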
Unless I'm missing something, it's already the case that userspace is
responsible for handling all the shared->private transitions in response
to KVM_EXIT_MEMORY_FAULT or (in our case) KVM_EXIT_VMGEXIT. So it only
places the additional requirement on the VMM that, if they *don't*
preallocate, they'll need to issue the fallocate() prior to issuing the
KVM_MEM_ENCRYPT_REG_REGION ioctl in response to these events.

QEMU for example already has a separate 'prealloc' option for cases where
they want to prefault all the guest memory, so it makes sense to continue
making that an optional thing with regard to UPM.

-Mike

> Chao
>
> > -Mike
> >
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	*pagep = page;
> > > +	if (order)
> > > +		*order = thp_order(compound_head(page));
> > > +
> > > +	SetPageUptodate(page);
> > > +	unlock_page(page);
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > > --
> > > 2.25.1