Re: [PATCH] mm: save current->journal_info before calling fault/page_mkwrite
From: Andrew Morton <akpm@linux-foundation.org>
Date: 2017-12-14 02:18:47
Also in:
ceph-devel, linux-fsdevel, linux-mm, lkml
On Thu, 14 Dec 2017 10:09:58 +0800 "Yan, Zheng" [off-list ref] wrote:
quoted
On 14 Dec 2017, at 08:59, Andrew Morton [off-list ref] wrote: On Wed, 13 Dec 2017 11:58:36 +0800 "Yan, Zheng" [off-list ref] wrote:quoted
We recently got an Oops report: BUG: unable to handle kernel NULL pointer dereference at (null) IP: jbd2__journal_start+0x38/0x1a2 [...] Call Trace: ext4_page_mkwrite+0x307/0x52b _ext4_get_block+0xd8/0xd8 do_page_mkwrite+0x6e/0xd8 handle_mm_fault+0x686/0xf9b mntput_no_expire+0x1f/0x21e __do_page_fault+0x21d/0x465 dput+0x4a/0x2f7 page_fault+0x22/0x30 copy_user_generic_string+0x2c/0x40 copy_page_to_iter+0x8c/0x2b8 generic_file_read_iter+0x26e/0x845 timerqueue_del+0x31/0x90 ceph_read_iter+0x697/0xa33 [ceph] hrtimer_cancel+0x23/0x41 futex_wait+0x1c8/0x24d get_futex_key+0x32c/0x39a __vfs_read+0xe0/0x130 vfs_read.part.1+0x6c/0x123 handle_mm_fault+0x831/0xf9b __fget+0x7e/0xbf SyS_read+0x4d/0xb5 The reason is that page fault can happen when one filesystem copies data from/to userspace, the filesystem may set current->journal_info. If the userspace memory is mapped to a file on another filesystem, the later filesystem may also want to use current->journal_info.whoops. A cc:stable will be needed here... A filesystem doesn't "copy data from/to userspace". I assume here we're referring to a read() where the source is a pagecache page for filesystem A and the destination is a MAP_SHARED page in filesystem B? But in that case I don't see why filesystem A would have a live ->journal_info? It's just doing a read.Background: when there are multiple cephfs clients read/write a file at time same time, read/write should go directly to object store daemon, using page cache is disabled. ceph_read_iter() uses current->journal_info to pass context information to ceph_readpages(). ceph_readpages() needs to know if its caller has already gotten capability of using page cache (distinguish read from readahead/fadvise). If not, it tries getting the capability by itself. I checked other filesystem, btrfs probably suffers similar problem for its readpages. (verify_parent_transid() uses current->journal_info and it can be called by by btrfs_get_extent())
Ah. Well please let's get all that into the changelog.
quoted
Can you explain why you chose these two sites? Rather than, for example, way up in handle_mm_fault()?
And please answer this?
quoted
It's hard to believe that a fault handler will alter ->journal_info if it is handling a read fault, so perhaps we only need to do this for a write fault? Although such an optimization probably isn't worthwhile. The whole thing is only about three instructions.