Thread: 25 messages, 5 authors, 2016-05-08

Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io

From: Boaz Harrosh <hidden>
Date: 2016-05-02 17:44:01
Also in: linux-ext4, linux-fsdevel, linux-mm, linux-xfs, lkml, nvdimm

On 05/02/2016 07:49 PM, Dan Williams wrote:
On Mon, May 2, 2016 at 9:22 AM, Boaz Harrosh [off-list ref] wrote:
quoted
On 05/02/2016 07:01 PM, Dan Williams wrote:
quoted
On Mon, May 2, 2016 at 8:41 AM, Boaz Harrosh [off-list ref] wrote:
quoted
On 04/29/2016 12:16 AM, Vishal Verma wrote:
quoted
All IO in a dax filesystem used to go through dax_do_io, which cannot
handle media errors, and thus cannot provide a recovery path that can
send a write through the driver to clear errors.

Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
path for DAX filesystems, use the same direct_IO path for both DAX and
direct_io iocbs, but use the flags to identify when we are in O_DIRECT
mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
direct_IO path instead of DAX.
Really? What is your thinking here?

What about all the current users of O_DIRECT? You have just made them
4 times slower and "less concurrent*" than "buffered io" users, since the
direct_IO path will queue an IO request and all.
(And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])

I hate it that you overload the semantics of a known and expected
O_DIRECT flag, for special pmem quirks. This is an incompatible
and unrelated overload of the semantics of O_DIRECT.
I think it is the opposite situation, it is undoing the premature
overloading of O_DIRECT that went in without performance numbers.
We have tons of measurements. It is not hard to imagine the results though,
especially in the 1000 threads case.
This implementation clarifies that dax_do_io() handles the lack of a
page cache for buffered I/O and O_DIRECT behaves as it nominally would
by sending an I/O to the driver.
It has the benefit of matching the
error semantics of a typical block device where a buffered write could
hit an error filling the page cache, but an O_DIRECT write potentially
triggers the drive to remap the block.
I fail to see how, for writes, the device error semantics regarding remapping
of blocks are any different between buffered and direct IO. As far as the block
device is concerned, it is the exact same code path. The big difference is
higher up, in the VFS.

And ... so you are willing to sacrifice the 99% hot path for the sake of the
1% error path, piggybacking on poor O_DIRECT?

Again there are tons of O_DIRECT apps out there, why are you forcing them to
change if they want true pmem performance?
This isn't forcing them to change.  This is the path of least surprise
as error semantics are identical to a typical block device.  Yes, an
application can go faster by switching to the "buffered" / dax_do_io()
path; it can go even faster by switching to mmap() I/O and using DAX
directly.  If we can later optimize the O_DIRECT path to bring its
performance more in line with dax_do_io(), great, but the
implementation should be correct first and optimized later.
Why does it need to be either/or? Why not both?
Also, I disagree: if you are correct and dax_do_io is bad and needs fixing,
then you have broken applications, because in the current model:

read => -EIO, write-buffered, sync()
gives you the same error semantics as: read => -EIO, write-direct-io

In fact this is what the delete, restore from backup model does today.
Who said it uses / must use direct IO? Actually I think it does not.
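
For illustration, the two recovery sequences compared above can be written
from an application's point of view roughly as follows. This is only a
userspace sketch, not code from the patch set: the function names
rewrite_buffered() and rewrite_direct_block() and the 4096-byte BLK alignment
are assumptions made for the example, and fsync() stands in for the "sync()"
step.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096                /* assumed O_DIRECT alignment; device-specific in practice */

/* Sequence A: buffered write followed by fsync(). Returns 0 or -errno. */
int rewrite_buffered(const char *path, off_t off, const void *buf, size_t len)
{
    assert(path && buf);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;

    int ret = 0;
    ssize_t n = pwrite(fd, buf, len, off);
    if (n < 0)
        ret = -errno;
    else if ((size_t)n != len)
        ret = -EIO;             /* treat a short write as a failure */
    else if (fsync(fd) < 0)
        ret = -errno;           /* errors may only surface at sync time */
    close(fd);
    return ret;
}

/* Sequence B: O_DIRECT write of one aligned block. Returns 0 or -errno. */
int rewrite_direct_block(const char *path, off_t blkno, const void *block)
{
    assert(path && block);
    void *aligned;
    int err = posix_memalign(&aligned, BLK, BLK);
    if (err)
        return -err;
    memcpy(aligned, block, BLK);

    int ret = 0;
    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd < 0) {
        ret = -errno;           /* e.g. EINVAL if the filesystem lacks O_DIRECT */
    } else {
        ssize_t n = pwrite(fd, aligned, BLK, blkno * BLK);
        if (n < 0)
            ret = -errno;
        else if (n != BLK)
            ret = -EIO;
        close(fd);
    }
    free(aligned);
    return ret;
}
```

In both sequences the application sees the failed read as -EIO and then
rewrites the affected range; the only difference visible to it is whether the
write error, if any, is reported by the write call or by the later sync.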

Two things I can think of which are better:
[1]
Why not go deeper into the dax io loops, and for any page whose WRITE
failed, call bdev_rw_page() to let pmem.c clear / relocate
the error page?

So reads return -EIO, which is what you wanted, no?
Writes get a memory error and retry with bdev_rw_page() to let the bdev
relocate / clear the error, which is what you wanted, no?

In the partial-page WRITE case on bad sectors, we can carefully read-modify-write
sector by sector and zero out the bad sectors that could not be read. What else?
(Or enhance the bdev_rw_page() API.)
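
The control flow of option [1] can be sketched as a small userspace
simulation. This is not kernel code and does not call bdev_rw_page() or the
real dax loops: struct sim_pmem_page, sim_dax_write(), sim_driver_write_page()
and sim_write_with_fallback() are hypothetical stand-ins, and the 512-byte
sector and 8-sector page sizes are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE   512
#define PAGE_SECTORS  8
#define SIM_PAGE_SIZE (SECTOR_SIZE * PAGE_SECTORS)

/* Hypothetical stand-in for one page of persistent memory. */
struct sim_pmem_page {
    unsigned char data[SIM_PAGE_SIZE];
    bool bad[PAGE_SECTORS];     /* true: sector has a media error */
};

/* Fast path (stands in for the dax copy): -EIO if the range touches a bad sector. */
int sim_dax_write(struct sim_pmem_page *pg, size_t off, const void *buf, size_t len)
{
    assert(off + len <= SIM_PAGE_SIZE);
    if (len == 0)
        return 0;
    for (size_t s = off / SECTOR_SIZE; s <= (off + len - 1) / SECTOR_SIZE; s++)
        if (pg->bad[s])
            return -EIO;
    memcpy(pg->data + off, buf, len);
    return 0;
}

/* Slow path (stands in for a whole-page write through the driver):
 * overwrites the page and clears every recorded error. */
void sim_driver_write_page(struct sim_pmem_page *pg, const unsigned char *page)
{
    memcpy(pg->data, page, SIM_PAGE_SIZE);
    memset(pg->bad, 0, sizeof(pg->bad));
}

/* Try the fast path; on -EIO, rebuild the page sector by sector and retry
 * through the slow path. Unreadable sectors are zero-filled. */
int sim_write_with_fallback(struct sim_pmem_page *pg, size_t off,
                            const void *buf, size_t len)
{
    unsigned char tmp[SIM_PAGE_SIZE];
    int ret = sim_dax_write(pg, off, buf, len);

    if (ret != -EIO)
        return ret;

    for (size_t s = 0; s < PAGE_SECTORS; s++) {
        unsigned char *dst = tmp + s * SECTOR_SIZE;
        if (pg->bad[s])
            memset(dst, 0, SECTOR_SIZE);                            /* lost data */
        else
            memcpy(dst, pg->data + s * SECTOR_SIZE, SECTOR_SIZE);   /* keep data */
    }
    memcpy(tmp + off, buf, len);        /* apply the caller's partial write */
    sim_driver_write_page(pg, tmp);
    return 0;
}
```

The point of the sketch is that the error handling stays inside the write
loop: callers never choose between a "dax" and a "direct" path, and the slow
path runs only for pages that actually hit an error.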

[2]
Only switch to slow O_DIRECT in the presence of errors, like you wanted. But I
still hate that you overload O_DIRECT with error semantics that do not exist
today; see above.

Thanks
Boaz