Re: [PATCH] mm: save current->journal_info before calling fault/page_mkwrite

From: Andrew Morton <akpm@linux-foundation.org>
Date: 2017-12-14 02:18:47
Also in: ceph-devel, linux-fsdevel, linux-mm, lkml

On Thu, 14 Dec 2017 10:09:58 +0800 "Yan, Zheng" [off-list ref] wrote:

quoted

On 14 Dec 2017, at 08:59, Andrew Morton [off-list ref] wrote:

On Wed, 13 Dec 2017 11:58:36 +0800 "Yan, Zheng" [off-list ref] wrote:

quoted

We recently got an Oops report:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: jbd2__journal_start+0x38/0x1a2
[...]
Call Trace:
 ext4_page_mkwrite+0x307/0x52b
 _ext4_get_block+0xd8/0xd8
 do_page_mkwrite+0x6e/0xd8
 handle_mm_fault+0x686/0xf9b
 mntput_no_expire+0x1f/0x21e
 __do_page_fault+0x21d/0x465
 dput+0x4a/0x2f7
 page_fault+0x22/0x30
 copy_user_generic_string+0x2c/0x40
 copy_page_to_iter+0x8c/0x2b8
 generic_file_read_iter+0x26e/0x845
 timerqueue_del+0x31/0x90
 ceph_read_iter+0x697/0xa33 [ceph]
 hrtimer_cancel+0x23/0x41
 futex_wait+0x1c8/0x24d
 get_futex_key+0x32c/0x39a
 __vfs_read+0xe0/0x130
 vfs_read.part.1+0x6c/0x123
 handle_mm_fault+0x831/0xf9b
 __fget+0x7e/0xbf
 SyS_read+0x4d/0xb5

The reason is that page fault can happen when one filesystem copies
data from/to userspace, the filesystem may set current->journal_info.
If the userspace memory is mapped to a file on another filesystem,
the later filesystem may also want to use current->journal_info.

whoops.

A cc:stable will be needed here...

A filesystem doesn't "copy data from/to userspace".  I assume here
we're referring to a read() where the source is a pagecache page for
filesystem A and the destination is a MAP_SHARED page in filesystem B?

But in that case I don't see why filesystem A would have a live
->journal_info?  It's just doing a read.


Background: when there are multiple cephfs clients read/write a file at time same time, read/write should go directly to object store daemon, using page cache is disabled.

ceph_read_iter() uses current->journal_info to pass context information to ceph_readpages().  ceph_readpages() needs to know if its caller has already gotten capability of using page cache (distinguish read from readahead/fadvise). If not, it tries getting the capability by itself. I checked other filesystem, btrfs probably suffers similar problem for its readpages. (verify_parent_transid() uses current->journal_info and it can be called by by btrfs_get_extent())

Ah.  Well please let's get all that into the changelog.

quoted

Can you explain why you chose these two sites?  Rather than, for
example, way up in handle_mm_fault()?

And please answer this?

quoted

It's hard to believe that a fault handler will alter ->journal_info if
it is handling a read fault, so perhaps we only need to do this for a
write fault?  Although such an optimization probably isn't worthwhile. 
The whole thing is only about three instructions.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help