Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

From: Andreas Dilger <hidden>
Date: 2019-08-26 19:10:25

On Aug 26, 2019, at 2:39 AM, Jan Kara [off-list ref] wrote:

On Sat 24-08-19 12:18:40, Dave Chinner wrote:

quoted

On Fri, Aug 23, 2019 at 09:08:53PM +0800, Joseph Qi wrote:

quoted


On 19/8/23 18:16, Dave Chinner wrote:

quoted

On Fri, Aug 23, 2019 at 03:57:02PM +0800, Joseph Qi wrote:

quoted

Hi Dave,

On 19/8/22 13:40, Dave Chinner wrote:

quoted

On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:

quoted

Hi Ted，

On 19/8/21 00:08, Theodore Y. Ts'o wrote:

quoted

On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:

quoted

I've tested parallel dio reads with dioread_nolock, it
doesn't have significant performance improvement and still
poor compared with reverting parallel dio reads. IMO, this
is because with parallel dio reads, it take inode shared
lock at the very beginning in ext4_direct_IO_read().

Why is that a problem?  It's a shared lock, so parallel
threads should be able to issue reads without getting
serialized?

The above just tells the result that even mounting with
dioread_nolock, parallel dio reads still has poor performance
than before (w/o parallel dio reads).

quoted

Are you using sufficiently fast storage devices that you're
worried about cache line bouncing of the shared lock?  Or do
you have some other concern, such as some other thread
taking an exclusive lock?

The test case is random read/write described in my first
mail. And

Regardless of dioread_nolock, ext4_direct_IO_read() is taking
inode_lock_shared() across the direct IO call.  And writes in
ext4 _always_ take the inode_lock() in ext4_file_write_iter(),
even though it gets dropped quite early when overwrite &&
dioread_nolock is set.  But just taking the lock exclusively
in write fro a short while is enough to kill all shared
locking concurrency...

quoted

from my preliminary investigation, shared lock consumes more
in such scenario.

If the write lock is also shared, then there should not be a
scalability issue. The shared dio locking is only half-done in
ext4, so perhaps comparing your workload against XFS would be
an informative exercise...

I've done the same test workload on xfs, it behaves the same as
ext4 after reverting parallel dio reads and mounting with
dioread_lock.

Ok, so the problem is not shared locking scalability ('cause
that's what XFS does and it scaled fine), the problem is almost
certainly that ext4 is using exclusive locking during
writes...

Agree. Maybe I've misled you in my previous mails.I meant shared
lock makes worse in case of mixed random read/write, since we
would always take inode lock during write.  And it also conflicts
with dioread_nolock. It won't take any inode lock before with
dioread_nolock during read, but now it always takes a shared
lock.

No, you didn't mislead me. IIUC, the shared locking was added to the
direct IO read path so that it can't run concurrently with
operations like hole punch that free the blocks the dio read might
currently be operating on (use after free).

i.e. the shared locking fixes an actual bug, but the performance
regression is a result of only partially converting the direct IO
path to use shared locking. Only half the job was done from a
performance perspective. Seems to me that the two options here to
fix the performance regression are to either finish the shared
locking conversion, or remove the shared locking on read and re-open
a potential data exposure issue...

We actually had a separate locking mechanism in ext4 code to avoid stale
data exposure during hole punch when unlocked DIO reads were running. But
it was kind of ugly and making things complex. I agree we need to move ext4
DIO path conversion further to avoid taking exclusive lock when we won't
actually need it.

It seems to me that the right solution for the short term is to revert
the patch in question, since that appears to be incomplete, and reverting
it will restore the performance.  I haven't seen any comments posted with
a counter-example that the original patch actually improved performance,
or that reverting it will cause some other performance regression.

We can then leave implementing a more complete solution to a later kernel.

Cheers, Andreas

Attachments

signature.asc [application/pgp-signature] 873 bytes

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help