Re: [PATCH v2 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
From: Suren Baghdasaryan <surenb@google.com>
Date: 2021-12-07 21:54:23
Also in:
lkml
On Tue, Dec 7, 2021 at 8:50 AM Suren Baghdasaryan [off-list ref] wrote:
On Tue, Dec 7, 2021 at 2:10 AM Michal Hocko [off-list ref] wrote:quoted
On Mon 06-12-21 10:35:03, Suren Baghdasaryan wrote:quoted
On Mon, Nov 29, 2021 at 3:23 AM Michal Hocko [off-list ref] wrote:quoted
On Wed 24-11-21 15:59:05, Suren Baghdasaryan wrote:quoted
oom-reaper and process_mrelease system call should protect against races with exit_mmap which can destroy page tables while they walk the VMA tree. oom-reaper protects from that race by setting MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP before taking and releasing mmap_write_lock. process_mrelease has to elevate mm->mm_users to prevent such race. Both oom-reaper and process_mrelease hold mmap_read_lock when walking the VMA tree. The locking rules and mechanisms could be simpler if exit_mmap takes mmap_write_lock while executing destructive operations such as free_pgtables. Change exit_mmap to hold the mmap_write_lock when calling free_pgtables. Operations like unmap_vmas() and unlock_range() are not destructive and could run under mmap_read_lock but for simplicity we take one mmap_write_lock during almost the entire operation. Note also that because oom-reaper checks VM_LOCKED flag, unlock_range() should not be allowed to race with it. In most cases this lock should be uncontended. Previously, Kirill reported ~4% regression caused by a similar change [1]. We reran the same test and although the individual results are quite noisy, the percentiles show lower regression with 1.6% being the worst case [2]. The change allows oom-reaper and process_mrelease to execute safely under mmap_read_lock without worries that exit_mmap might destroy page tables from under them. [1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/ (local) [2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/ (local) Signed-off-by: Suren Baghdasaryan <surenb@google.com> --- changes in v2 - Moved mmap_write_unlock to cover remove_vma loop as well, per Matthew Wilcox mm/mmap.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)diff --git a/mm/mmap.c b/mm/mmap.c index bfb0ea164a90..f4e09d390a07 100644 --- a/mm/mmap.c +++ b/mm/mmap.c@@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm) * to mmu_notifier_release(mm) ensures mmu notifier callbacks in * __oom_reap_task_mm() will not block. * - * This needs to be done before calling munlock_vma_pages_all(), + * This needs to be done before calling unlock_range(), * which clears VM_LOCKED, otherwise the oom reaper cannot * reliably test it. */ (void)__oom_reap_task_mm(mm); set_bit(MMF_OOM_SKIP, &mm->flags);Why do you keep this in place?Sorry for the delay, I was out last week. I missed your comment about removing MMF_OOM_SKIP at https://lore.kernel.org/all/YYrO%2FPwdsyaxJaNZ@dhcp22.suse.cz (local) I'll look into removing it in a separate patch, which I think would be cleaner.The point of this code was to sync up the oom_repaer and exit_mmap. Now that your patch uses proper locking for that to happen then MMF_OOM_SKIP is not really necessary. IIRC all you need to guarantee is that the vma tree is empty when exit_mmap does all its work - i.e set mm->mmap to NULL. You can do that after remove_vma loop but it would be equally safe at any time after vma = mm->mmap as the loop relies on the vma chain. Doing that after would be slightly nicer if you ask me.Will do. But if you don't mind I'll post the removal of MMF_OOM_SKIP as a separate patch. This patchset has already been extensively tested and it will be easier for me to test MMF_OOM_SKIP removal separately.quoted
quoted
quoted
Other than that looks OK to me. Maybe we want to add an explicit note that vm_ops::close cannot take mmap_sem in any form. The changelog should also mention that you have considered remove_vma and its previous no MM locking assumption. You can argue that fput is async and close callback shouldn't really need mmap_sem.Should I post another version of this patch with the patch description clarifying these points and additional comments as you suggested?Yes please.Will re-post. So, to clarify, we want: - Patch description to include explanation that remove_vma is now being called under MM lock but this should not be a problem because fput and vm_ops->close do not and should not take mmap_sem. - Add a comment for vm_ops->close that the callback should not take mmap_sem, with explanation that __split_vma and exit_mmap use this callback with the mmap_sem write lock taken. Is that correct?
Assuming my understanding was correct, posted v3 at https://lore.kernel.org/all/20211207215031.2251719-1-surenb@google.com/ (local)
quoted
-- Michal Hocko SUSE Labs