Re: [PATCH 03/14] mm: Protect operations adding pages to page cache with invalidate_lock
From: Jan Kara <jack@suse.cz>
Date: 2021-06-08 12:19:18
Also in:
ceph-devel, linux-ext4, linux-f2fs-devel, linux-fsdevel, linux-mm, linux-xfs
On Mon 07-06-21 09:09:22, Darrick J. Wong wrote:
On Mon, Jun 07, 2021 at 04:52:13PM +0200, Jan Kara wrote:quoted
Currently, serializing operations such as page fault, read, or readahead against hole punching is rather difficult. The basic race scheme is like: fallocate(FALLOC_FL_PUNCH_HOLE) read / fault / .. truncate_inode_pages_range() <create pages in page cache here> <update fs block mapping and free blocks> Now the problem is in this way read / page fault / readahead can instantiate pages in page cache with potentially stale data (if blocks get quickly reused). Avoiding this race is not simple - page locks do not work because we want to make sure there are *no* pages in given range. inode->i_rwsem does not work because page fault happens under mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes the performance for mixed read-write workloads suffer. So create a new rw_semaphore in the address_space - invalidate_lock - that protects adding of pages to page cache for page faults / reads / readahead. Signed-off-by: Jan Kara <jack@suse.cz>
...
quoted
+->fallocate implementation must be really careful to maintain page cache +consistency when punching holes or performing other operations that invalidate +page cache contents. Usually the filesystem needs to call +truncate_inode_pages_range() to invalidate relevant range of the page cache. +However the filesystem usually also needs to update its internal (and on disk) +view of file offset -> disk block mapping. Until this update is finished, the +filesystem needs to block page faults and reads from reloading now-stale page +cache contents from the disk. VFS provides mapping->invalidate_lock for this +and acquires it in shared mode in paths loading pages from disk +(filemap_fault(), filemap_read(), readahead paths). The filesystem is +responsible for taking this lock in its fallocate implementation and generally +whenever the page cache contents needs to be invalidated because a block is +moving from under a page.Having a page cache invalidation lock isn't optional anymore, so I think these last two sentences could be condensed: "...from reloading now-stale page cache contents from disk. Since VFS acquires mapping->invalidate_lock in shared mode when loading pages from disk (filemap_fault(), filemap_read(), readahead), the fallocate implementation must take the invalidate_lock to prevent reloading."quoted
+ +->copy_file_range and ->remap_file_range implementations need to serialize +against modifications of file data while the operation is running. For +blocking changes through write(2) and similar operations inode->i_rwsem can be +used. For blocking changes through memory mapping, the filesystem can use +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite +implementation.Following the same line of reasoning, if taking the invalidate_lock is no longer optional, then the conditional language in this last sentence is incorrect. How about: "To block changes to file contents via a memory mapping during the operation, the filesystem must take mapping->invalidate_lock to coordinate with ->page_mkwrite." The code changes look fine to me, though I'm no mm expert. ;)
OK, I've updated the documentation as you suggested. Thanks for review. Honza -- Jan Kara [off-list ref] SUSE Labs, CR