Thread (7 messages) 7 messages, 3 authors, 2017-12-14

Re: [PATCH] mm: save current->journal_info before calling fault/page_mkwrite

From: Andrew Morton <akpm@linux-foundation.org>
Date: 2017-12-14 02:18:47
Also in: ceph-devel, linux-fsdevel, linux-mm, lkml

On Thu, 14 Dec 2017 10:09:58 +0800 "Yan, Zheng" [off-list ref] wrote:
quoted
On 14 Dec 2017, at 08:59, Andrew Morton [off-list ref] wrote:

On Wed, 13 Dec 2017 11:58:36 +0800 "Yan, Zheng" [off-list ref] wrote:
quoted
We recently got an Oops report:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: jbd2__journal_start+0x38/0x1a2
[...]
Call Trace:
 ext4_page_mkwrite+0x307/0x52b
 _ext4_get_block+0xd8/0xd8
 do_page_mkwrite+0x6e/0xd8
 handle_mm_fault+0x686/0xf9b
 mntput_no_expire+0x1f/0x21e
 __do_page_fault+0x21d/0x465
 dput+0x4a/0x2f7
 page_fault+0x22/0x30
 copy_user_generic_string+0x2c/0x40
 copy_page_to_iter+0x8c/0x2b8
 generic_file_read_iter+0x26e/0x845
 timerqueue_del+0x31/0x90
 ceph_read_iter+0x697/0xa33 [ceph]
 hrtimer_cancel+0x23/0x41
 futex_wait+0x1c8/0x24d
 get_futex_key+0x32c/0x39a
 __vfs_read+0xe0/0x130
 vfs_read.part.1+0x6c/0x123
 handle_mm_fault+0x831/0xf9b
 __fget+0x7e/0xbf
 SyS_read+0x4d/0xb5

The reason is that page fault can happen when one filesystem copies
data from/to userspace, the filesystem may set current->journal_info.
If the userspace memory is mapped to a file on another filesystem,
the later filesystem may also want to use current->journal_info.
whoops.

A cc:stable will be needed here...

A filesystem doesn't "copy data from/to userspace".  I assume here
we're referring to a read() where the source is a pagecache page for
filesystem A and the destination is a MAP_SHARED page in filesystem B?

But in that case I don't see why filesystem A would have a live
->journal_info?  It's just doing a read.

Background: when there are multiple cephfs clients read/write a file at time same time, read/write should go directly to object store daemon, using page cache is disabled.

ceph_read_iter() uses current->journal_info to pass context information to ceph_readpages().  ceph_readpages() needs to know if its caller has already gotten capability of using page cache (distinguish read from readahead/fadvise). If not, it tries getting the capability by itself. I checked other filesystem, btrfs probably suffers similar problem for its readpages. (verify_parent_transid() uses current->journal_info and it can be called by by btrfs_get_extent())
Ah.  Well please let's get all that into the changelog.
quoted
Can you explain why you chose these two sites?  Rather than, for
example, way up in handle_mm_fault()?
And please answer this?
quoted
It's hard to believe that a fault handler will alter ->journal_info if
it is handling a read fault, so perhaps we only need to do this for a
write fault?  Although such an optimization probably isn't worthwhile. 
The whole thing is only about three instructions.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help