Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
From: Vlastimil Babka <hidden>
Date: 2022-10-17 13:00:30
Also in:
kvm, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel
On 9/15/22 16:29, Chao Peng wrote:
From: "Kirill A. Shutemov" <redacted>

KVM can use memfd-provided memory for guest memory. For normal userspace accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its virtual address space and then tells KVM to use the virtual address to setup the mapping in the secondary page table (e.g. EPT).

With confidential computing technologies like Intel TDX, the memfd-provided memory may be encrypted with special key for special software domain (e.g. KVM guest) and is not expected to be directly accessed by userspace. Precisely, userspace access to such encrypted memory may lead to host crash so it should be prevented.

This patch introduces userspace inaccessible memfd (created with MFD_INACCESSIBLE). Its memory is inaccessible from userspace through ordinary MMU access (e.g. read/write/mmap) but can be accessed via in-kernel interface so KVM can directly interact with core-mm without the need to map the memory into KVM userspace.

It provides semantics required for KVM guest private(encrypted) memory support that a file descriptor with this flag set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV.

KVM userspace is still in charge of the lifecycle of the memfd. It should pass the opened fd to KVM. KVM uses the kernel APIs newly added in this patch to obtain the physical memory address and then populate the secondary page table entries.

The userspace inaccessible memfd can be fallocate-ed and hole-punched from userspace. When hole-punching happens, KVM can get notified through inaccessible_notifier it then gets chance to remove any mapped entries of the range in the secondary page tables.

The userspace inaccessible memfd itself is implemented as a shim layer on top of real memory file systems like tmpfs/hugetlbfs but this patch only implemented tmpfs. The allocated memory is currently marked as unmovable and unevictable, this is required for current confidential usage. But in future this might be changed.

Signed-off-by: Kirill A. Shutemov <redacted>
Signed-off-by: Chao Peng <redacted>
---
...
+static long inaccessible_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t len)
+{
+ struct inaccessible_data *data = file->f_mapping->private_data;
+ struct file *memfd = data->memfd;
+ int ret;
+
+ if (mode & FALLOC_FL_PUNCH_HOLE) {
+ if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+ return -EINVAL;
+ }
+
+ ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+ inaccessible_notifier_invalidate(data, offset, offset + len);

Wonder if invalidate should precede the actual hole punch, otherwise we open a window where the page tables point to memory no longer valid?

+ return ret;
+}
+
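
To make the ordering concern concrete, here is a small userspace toy model, not kernel code. Every name in it (`toy_memfd`, `toy_punch_hole`, `toy_invalidate`, and so on) is hypothetical and exists only for illustration. It tracks, per page, whether the file still backs it and whether a secondary page table still maps it, and counts how often a "mapped but no longer backed" state becomes observable between the two steps. It deliberately ignores concurrency: in a real kernel, a fault could presumably re-establish a mapping between the invalidate and the punch unless extra synchronization prevents it, so reordering alone may not be the complete fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 8

/* Hypothetical toy state: one flag pair per page of the file. */
struct toy_memfd {
	bool backed[TOY_NPAGES];	/* file still holds memory for this page */
	bool mapped[TOY_NPAGES];	/* secondary page table still points at it */
	int stale_windows;		/* steps after which a stale mapping existed */
};

static void toy_init(struct toy_memfd *m)
{
	for (size_t i = 0; i < TOY_NPAGES; i++) {
		m->backed[i] = true;
		m->mapped[i] = true;
	}
	m->stale_windows = 0;
}

/* Count pages that are still mapped although their backing is gone. */
static int toy_count_stale(const struct toy_memfd *m)
{
	int n = 0;

	for (size_t i = 0; i < TOY_NPAGES; i++)
		if (m->mapped[i] && !m->backed[i])
			n++;
	return n;
}

/* Record whether the state left behind by a step exposes stale mappings. */
static void toy_record_window(struct toy_memfd *m)
{
	if (toy_count_stale(m) > 0)
		m->stale_windows++;
}

/* Stand-in for the hole punch: drop backing for pages [start, end). */
static void toy_punch_hole(struct toy_memfd *m, size_t start, size_t end)
{
	assert(start <= end && end <= TOY_NPAGES);
	for (size_t i = start; i < end; i++)
		m->backed[i] = false;
	toy_record_window(m);
}

/* Stand-in for the notifier: drop mappings for pages [start, end). */
static void toy_invalidate(struct toy_memfd *m, size_t start, size_t end)
{
	assert(start <= end && end <= TOY_NPAGES);
	for (size_t i = start; i < end; i++)
		m->mapped[i] = false;
	toy_record_window(m);
}

/* Order used in the patch as posted: punch first, then invalidate. */
static int toy_punch_then_invalidate(struct toy_memfd *m, size_t start, size_t end)
{
	toy_punch_hole(m, start, end);
	toy_invalidate(m, start, end);
	return m->stale_windows;
}

/* Order suggested in the review: invalidate first, then punch. */
static int toy_invalidate_then_punch(struct toy_memfd *m, size_t start, size_t end)
{
	toy_invalidate(m, start, end);
	toy_punch_hole(m, start, end);
	return m->stale_windows;
}
```

Both orderings end in the same final state; the difference in this model is only in the intermediate state, which is the window the comment above asks about.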
...
+
+static struct file_system_type inaccessible_fs = {
+ .owner = THIS_MODULE,
+ .name = "[inaccessible]",

Dunno where exactly this name is visible, but shouldn't it rather be "[memfd:inaccessible]"?

+ .init_fs_context = inaccessible_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+