Re: [PATCH 1/2] direct-io: Implement generic deferred AIO completions
From: Jan Kara <jack@suse.cz>
Date: 2013-08-12 16:14:59
Also in:
linux-fsdevel, linux-xfs
Hi Dave, I remembered about this patch set and realized I didn't get reply from you regarding the following question (see quoted email below for details): Do you really need to defer completion of appending direct IO? Because generic code makes sure appending direct IO isn't async and thus dio_complete() -> xfs_end_io_direct_write() gets called directly from do_blockdev_direct_IO(). I.e. from a normal context and not from interrupt. I've already addressed rest of your comments so this is the only item that is remaining. On Tue 16-07-13 23:00:27, Jan Kara wrote:
quoted
quoted
-static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async) +static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, + bool is_async) { ssize_t transferred = 0;@@ -258,19 +262,26 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is if (ret == 0) ret = transferred; - if (dio->end_io && dio->result) { - dio->end_io(dio->iocb, offset, transferred, - dio->private, ret, is_async); - } else { - inode_dio_done(dio->inode); - if (is_async) - aio_complete(dio->iocb, ret, 0); - } + if (dio->end_io && dio->result) + dio->end_io(dio->iocb, offset, transferred, dio->private);Ok, so by here we are assuming all filesystem IO completion processing is completed.Yes.quoted
quoted
+ + inode_dio_done(dio->inode); + if (is_async) + aio_complete(dio->iocb, ret, 0); + kmem_cache_free(dio_cache, dio); return ret;Hmmm. I started wonder if this is valid, because XFS is supposed to be doing workqueue based IO completion for appending writes and they are supposed to be run in a workqueue. But, looking at the XFS code, there's actually a bug in the direct IO completion code and we are not defering completion to a workqueue like we should be for the append case. This code in xfs_finish_ioend() demonstrates the intent: if (ioend->io_type == XFS_IO_UNWRITTEN) queue_work(mp->m_unwritten_workqueue, &ioend->io_work); else if (ioend->io_append_trans ||quoted
quoted
quoted
quoted
quoted
quoted
(ioend->io_isdirect && xfs_ioend_is_append(ioend)))queue_work(mp->m_data_workqueue, &ioend->io_work); The problem is that we only ever call xfs_finish_ioend() if is_async is true, and that will never be true for direct IO beyond the current EOF. That's a bug, and is Bad(tm). [ Interestingly, it turns out that dio->is_async is never set for writes beyond EOF because of filesystems that update file sizes before data IO completion occurs (stale data exposure issues). For XFS, that can't happen and so dio->is_async could always be set.... ] What this means is that there's a condition for work queue deferral of DIO IO completion that this patch doesn't handle. Setting deferral only on unwritten extents like this:quoted
@@ -1292,8 +1281,10 @@ __xfs_get_blocks( if (create || !ISUNWRITTEN(&imap)) xfs_map_buffer(inode, bh_result, &imap, offset); if (create && ISUNWRITTEN(&imap)) { - if (direct) + if (direct) { bh_result->b_private = inode; + set_buffer_defer_completion(bh_result); + } set_buffer_unwritten(bh_result); } }misses that case. I suspect Christoph's original patch predated the changes to XFS that added transactional file size updates to IO completion and so didn't take it into account.OK, thanks for catching this. Doing the i_size check in _xfs_get_blocks() would be somewhat cumbersome I presume so I guess we can handle that case by adding __blockdev_direct_IO() flag to defer dio completion to a workqueue. XFS can then set the flag when it sees it needs and i_size update. OK?quoted
quoted
@@ -1390,9 +1381,7 @@ xfs_end_io_direct_write( struct kiocb *iocb, loff_t offset, ssize_t size, - void *private, - int ret, - bool is_async) + void *private) { struct xfs_ioend *ioend = iocb->private;@@ -1414,17 +1403,10 @@ xfs_end_io_direct_write( ioend->io_offset = offset; ioend->io_size = size; - ioend->io_iocb = iocb; - ioend->io_result = ret; if (private && size > 0) ioend->io_type = XFS_IO_UNWRITTEN; - if (is_async) { - ioend->io_isasync = 1; - xfs_finish_ioend(ioend); - } else { - xfs_finish_ioend_sync(ioend); - } + xfs_finish_ioend_sync(ioend); }As i mentioned, the existing code here is buggy. What we should be doing here is: if (is_async) { ioend->io_isasync = 1; xfs_finish_ioend(ioend); } else if (xfs_ioend_is_append(ioend))) { xfs_finish_ioend(ioend); } else { xfs_finish_ioend_sync(ioend); }Umm, I started to wonder why do you actually need to defer the completion for appending ioend. Because if DIO isn't async, dio_complete() won't be called from interrupt context but from __blockdev_direct_IO(). So it seems you can do everything you need directly there without deferring to a workqueue. But maybe there's some locking subtlety I'm missing.
Honza -- Jan Kara [off-list ref] SUSE Labs, CR