Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count
From: Vinayak Menon <hidden>
Date: 2018-11-06 09:29:03
Also in:
linux-mm, lkml
On 11/5/2018 11:52 PM, Laurent Dufour wrote:
Le 05/11/2018 à 08:04, vinayak menon a écrit :quoted
Hi Laurent, On Thu, May 17, 2018 at 4:37 PM Laurent Dufour [off-list ref] wrote:quoted
The VMA sequence count has been introduced to allow fast detection of VMA modification when running a page fault handler without holding the mmap_sem. This patch provides protection against the VMA modification done in : - madvise() - mpol_rebind_policy() - vma_replace_policy() - change_prot_numa() - mlock(), munlock() - mprotect() - mmap_region() - collapse_huge_page() - userfaultd registering services In addition, VMA fields which will be read during the speculative fault path needs to be written using WRITE_ONCE to prevent write to be split and intermediate values to be pushed to other CPUs. Signed-off-by: Laurent Dufour <redacted> --- fs/proc/task_mmu.c | 5 ++++- fs/userfaultfd.c | 17 +++++++++++++---- mm/khugepaged.c | 3 +++ mm/madvise.c | 6 +++++- mm/mempolicy.c | 51 ++++++++++++++++++++++++++++++++++----------------- mm/mlock.c | 13 ++++++++----- mm/mmap.c | 22 +++++++++++++--------- mm/mprotect.c | 4 +++- mm/swap_state.c | 8 ++++++-- 9 files changed, 89 insertions(+), 40 deletions(-) struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, struct vm_fault *vmf)@@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,unsigned long *start, unsigned long *end) { - *start = max3(lpfn, PFN_DOWN(vma->vm_start), + *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)), PFN_DOWN(faddr & PMD_MASK)); - *end = min3(rpfn, PFN_DOWN(vma->vm_end), + *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)), PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE)); } -- 2.7.4I have got a crash on 4.14 kernel with speculative page faults enabled and here is my analysis of the problem. The issue was reported only once.Hi Vinayak, Thanks for reporting this.quoted
[23409.303395] el1_da+0x24/0x84 [23409.303400] __radix_tree_lookup+0x8/0x90 [23409.303407] find_get_entry+0x64/0x14c [23409.303410] pagecache_get_page+0x5c/0x27c [23409.303416] __read_swap_cache_async+0x80/0x260 [23409.303420] swap_vma_readahead+0x264/0x37c [23409.303423] swapin_readahead+0x5c/0x6c [23409.303428] do_swap_page+0x128/0x6e4 [23409.303431] handle_pte_fault+0x230/0xca4 [23409.303435] __handle_speculative_fault+0x57c/0x7c8 [23409.303438] do_page_fault+0x228/0x3e8 [23409.303442] do_translation_fault+0x50/0x6c [23409.303445] do_mem_abort+0x5c/0xe0 [23409.303447] el0_da+0x20/0x24 Process A accesses address ADDR (part of VMA A) and that results in a translation fault. Kernel enters __handle_speculative_fault to fix the fault. Process A enters do_swap_page->swapin_readahead->swap_vma_readahead from speculative path. During this time, another process B which shares the same mm, does a mprotect from another CPU which follows mprotect_fixup->__split_vma, and it splits VMA A into VMAs A and B. After the split, ADDR falls into VMA B, but process A is still using VMA A. Now ADDR is greater than VMA_A->vm_start and VMA_A->vm_end. swap_vma_readahead->swap_ra_info uses start and end of vma to calculate ptes and nr_pte, which goes wrong due to this and finally resulting in wrong "entry" passed to swap_vma_readahead->__read_swap_cache_async, and in turn causing invalid swapper_space being passed to __read_swap_cache_async->find_get_page, causing an abort. The fix I have tried is to cache vm_start and vm_end also in vmf and use it in swap_ra_clamp_pfn. Let me know your thoughts on this. I can send the patch I am a using if you feel that is the right thing to do.I think the best would be to don't do swap readahead during the speculatvive page fault. If the page is found in the swap cache, that's fine, but otherwise, we should f allback to the regular page fault. The attached -untested- patch is doing this, if you want to give it a try. I'll review that for the next series.
Thanks Laurent. I and going to try this patch. With this patch, since all non-SWP_SYNCHRONOUS_IO swapins result in non-speculative fault and a retry, wouldn't this have an impact on some perf numbers ? If so, would caching start and end be a better option ? Also, would it make sense to move the FAULT_FLAG_SPECULATIVE check inside swapin_readahead, in a way that swap_cluster_readahead can take the speculative path ? swap_cluster_readahead doesn't seem to use vma values. Thanks, Vinayak