Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io
From: "Verma, Vishal L" <vishal.l.verma@intel.com>
Date: 2016-05-02 18:52:02
Also in: linux-block, linux-fsdevel, linux-mm, linux-xfs, lkml, nvdimm
On Mon, 2016-05-02 at 19:03 +0300, Boaz Harrosh wrote:
> On 05/02/2016 06:51 PM, Vishal Verma wrote:
>> On Mon, 2016-05-02 at 18:41 +0300, Boaz Harrosh wrote:
>>> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>>>> All IO in a dax filesystem used to go through dax_do_io, which cannot handle media errors, and thus cannot provide a recovery path that can send a write through the driver to clear errors.
>>>>
>>>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO path for DAX filesystems, use the same direct_IO path for both DAX and direct_io iocbs, but use the flags to identify when we are in O_DIRECT mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional direct_IO path instead of DAX.

>>> Really? What is your thinking here? What about all the current users of O_DIRECT? You have just made them 4 times slower and "less concurrent*" than "buffered io" users, since the direct_IO path will queue an IO request and all. (And if it is not so slow, then why do we need dax_do_io at all? [Rhetorical])
>>>
>>> I hate it that you overload the semantics of a known and expected O_DIRECT flag for special pmem quirks. This is an incompatible and unrelated overload of the semantics of O_DIRECT.

>> We overloaded O_DIRECT a long time ago when we made DAX piggyback on the same path:
>>
>> static inline bool io_is_direct(struct file *filp)
>> {
>> 	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
>> }

> No, as far as the user is concerned we have not. The O_DIRECT user is still getting all the semantics he wants, i.e. no syncs, no memory cache usage, no copies ... Only with DAX the buffered IO is the same, since with pmem it is faster. Then why not? The basic contract with the user did not break.
>
> The above was just an implementation detail to easily navigate through the Linux vfs IO stack and make the least amount of changes in every FS that wanted to support DAX. (And since dax_do_io is much more like direct_IO than like page-cache IO.)

>> Yes O_DIRECT on a DAX mounted file system will now be slower, but -

>>>> This allows us a recovery path in the form of opening the file with O_DIRECT and writing to it with the usual O_DIRECT semantics (sector alignment restrictions).

>>> I understand that you want a sector aligned IO, right? For the clearing of errors. But I hate it that you forced all O_DIRECT IO to be slow for this. Can you not make dax_do_io handle media errors? At least for the parts of the IO that are aligned. (And your recovery path application above can use only aligned IO to make sure.)
>>>
>>> Please look for another solution. Even a special IOCTL_DAX_CLEAR_ERROR

>> - see all the versions of this series prior to this one, where we try to do a fallback...

> And? So now all O_DIRECT APPs go 4 times slower. I will have a look, but if it is really so bad then please consider an IOCTL or syscall. Or a special O_DAX_ERRORS flag ...
I'm curious where the 4x slower comes from. The O_DIRECT path still involves no page-cache copies, nor does it go through request queues (since pmem is a bio-based driver). The only overhead is that of submitting a bio - and while I agree it is more overhead than dax_do_io, 4x seems a bit high.
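For concreteness, the recovery path from the patch description (open the file with O_DIRECT and write with sector alignment) could look roughly like the userspace sketch below. This is only an illustration under assumptions: the helper names is_sector_aligned() and rewrite_range_direct() are made up here, the 512-byte sector size is assumed rather than queried from the device, and writing zeroes stands in for whatever known-good data an application would actually restore.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> on Linux */
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: 512-byte logical sectors; a real tool would ask the device. */
#define SECTOR_SIZE 512

/* Return 1 if offset, length and buffer address are all sector aligned. */
int is_sector_aligned(off_t off, size_t len, const void *buf)
{
	return off % SECTOR_SIZE == 0 &&
	       len > 0 && len % SECTOR_SIZE == 0 &&
	       (uintptr_t)buf % SECTOR_SIZE == 0;
}

/*
 * Hypothetical recovery helper: overwrite [off, off + len) of `path`
 * with zeroes using O_DIRECT, so the write is submitted to the driver.
 * Returns 0 on success, -1 on error with errno set.
 */
int rewrite_range_direct(const char *path, off_t off, size_t len)
{
	void *buf = NULL;
	int fd, err, saved, ret = -1;
	ssize_t n;

	if (len == 0 || len % SECTOR_SIZE != 0) {
		errno = EINVAL;
		return -1;
	}

	/* O_DIRECT needs an aligned user buffer, not just aligned offsets. */
	err = posix_memalign(&buf, SECTOR_SIZE, len);
	if (err != 0) {
		errno = err;
		return -1;
	}
	memset(buf, 0, len);

	if (!is_sector_aligned(off, len, buf)) {
		free(buf);
		errno = EINVAL;
		return -1;
	}

	fd = open(path, O_WRONLY | O_DIRECT);
	if (fd >= 0) {
		n = pwrite(fd, buf, len, off);
		if (n == (ssize_t)len)
			ret = 0;
		else if (n >= 0)
			errno = EIO;	/* short write: report as an I/O error */
		saved = errno;
		close(fd);
		errno = saved;
	}
	free(buf);
	return ret;
}
```

An unaligned request is rejected before the file is even opened, which mirrors the "usual O_DIRECT semantics" the patch text refers to. Note that some filesystems (tmpfs, for example) may refuse O_DIRECT at open time.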
> Please do not trash all the O_DIRECT users, they are the more important clients, like DBs and VMs.
Shouldn't they be using mmaps and dax faults? I was under the impression that the dax_do_io path is a nice-to-have, but anyone who wants to use DAX will want the mmap/fault path, not the IO path. This is just making the IO path 'more correct' by allowing it a way to deal with errors.
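As a rough illustration of the mmap/fault path meant here, a userspace store through a shared mapping might look like the sketch below. The function name store_via_mmap() is hypothetical, and nothing in the code is DAX-specific: on a DAX mount the mapping would be backed directly by pmem, while on an ordinary filesystem the same calls go through the page cache.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy `len` bytes from `src` to the start of the
 * file at `path` through a MAP_SHARED mapping, then flush with msync().
 * The file must already be at least `len` bytes long; touching pages
 * beyond EOF would raise SIGBUS.
 * Returns 0 on success, -1 on error with errno set.
 */
int store_via_mmap(const char *path, const void *src, size_t len)
{
	void *map;
	int fd, ret;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping stays valid after close() */
	if (map == MAP_FAILED)
		return -1;

	/* The store itself is a plain memcpy on the calling thread. */
	memcpy(map, src, len);

	/* Ask the kernel to make the stores durable before unmapping. */
	ret = msync(map, len, MS_SYNC);
	munmap(map, len);
	return ret;
}
```

The copy runs in the caller's context with no bio submitted by the application, which is why this path is usually seen as the main one for DAX users.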
> Thanks
> Boaz

>>> [*"less concurrent" because of the queuing done in bdev. Note how pmem is not even multi-queue, and even if it was it will be much slower than DAX because of the code depth and all the locks and task switches done in the block layer. In DAX the final memcpy is done directly on the user-mode thread]
>>>
>>> Thanks
>>> Boaz
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs