Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()

From: NeilBrown <hidden>
Date: 2017-01-27 06:03:24
Also in: linux-fsdevel

On Thu, Jan 26 2017, Theodore Ts'o wrote:

On Fri, Jan 27, 2017 at 09:19:10AM +1100, NeilBrown wrote:

quoted

I don't think it has.
The original topic was about graceful handling of recoverable IO errors.
The question was framed as being about retrying fsync() if it reported an
error, but this was based on a misunderstanding.  fsync() doesn't report
an error for recoverable errors.  It hangs.
So the original topic is really about gracefully handling IO operations
which currently can hang indefinitely.

Well, the problem is that it is up to the device driver to decide when
an error is recoverable or not.  This might include waiting X minutes,
and then deciding that the fibre channel connection isn't coming back,
and then turning it into an unrecoverable error.  Or for other
devices, the timeout might be much smaller.

Which is fine --- I think that's where the decision ought to live, and
if users want to tune a different timeout before the driver stops
waiting, that should be between the system administrator and the
device driver /sys tuning knob.
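
To illustrate what inspecting such a knob might look like from userspace, here is a minimal sketch. The helper name read_timeout_attr is hypothetical, and the attribute path is supplied by the caller; for SCSI disks one such attribute is /sys/block/<dev>/device/timeout (in seconds), but other drivers expose different knobs or none at all.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: read one non-negative integer from a
 * sysfs-style attribute file, e.g. a driver's timeout in seconds.
 * Returns the value, or a negative errno value on failure.
 */
static long read_timeout_attr(const char *attr_path)
{
    FILE *f = fopen(attr_path, "r");
    if (!f)
        return -errno;

    long secs;
    int matched = fscanf(f, "%ld", &secs);
    fclose(f);

    /* Anything that is not a single non-negative number is rejected. */
    if (matched != 1 || secs < 0)
        return -EINVAL;
    return secs;
}
```

A caller would pass the attribute path for the device in question and treat a negative return as "no such knob, or not readable".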

Completely agree.  Whether a particular condition should be treated as
recoverable or unrecoverable is a question that driver authors and
sysadmins could reasonably provide input to.
But once that decision has been made, the application must accept the
decision.  EIO means unrecoverable.  There is never any point retrying.
Recoverable manifests as a hang, awaiting recovery.
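
From the application side, that rule could be sketched roughly as follows. The helper name write_durably is hypothetical; the only point it illustrates is that a failed fsync() is reported to the caller once and never retried.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and flush them.
 * Returns 0 on success, or a negative errno value on failure.
 */
static int write_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal, not an I/O failure */
            int err = errno;
            close(fd);
            return -err;
        }
        p += n;
        left -= (size_t)n;
    }

    /*
     * Following the argument above: a recoverable condition shows up
     * as fsync() blocking, so any error returned here (EIO included)
     * is treated as final and fsync() is not called again.
     */
    if (fsync(fd) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }

    if (close(fd) < 0)
        return -errno;
    return 0;
}
```

What the caller does with a negative return (abort, fail over to another device, report upward) is a policy choice outside this sketch.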

I recently noticed that PG_error is effectively meaningless for write
errors.  filemap_fdatawait_range() can clear it, and the return value is
often ignored. AS_EIO is the really meaningful flag for write errors,
and it is per-file, not per-page.

quoted

When combined with O_DIRECT, it effectively means "no retries".  For
block devices and files backed by block devices,
REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be
reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't
help.

Absolutely no retries?  Even TCP retries in the case of iSCSI?  I
don't think turning every TCP packet drop into EWOULDBLOCK would make
sense under any circumstances.  What might make sense is to have a
"short timeout" where it's up to the block device to decide what
"short timeout" means.

The implemented semantics of REQ_FAILFAST_* are to disable retries on
certain types of failure.  That is what I was meaning to refer to.
There are retries at many levels in the protocol stack, from the
collision detection retries at the data-link layer, to packet-level and
connection level and command level.  Some have predefined timeouts and
should be left alone.  Others have no timeouts and need to be disabled.
There are probably others in the middle.
I was looking for a semantic that could be implemented on top of current
interfaces, which means working with the REQ_FAILFAST_* semantic.

EWOULDBLOCK is also a little misleading, because even if the I/O
request is submitted immediately to the block device and immediately
serviced and returned, the I/O request would still be "blocking".
Maybe ETIMEDOUT instead?

Maybe - I won't argue.

quoted

And aio_write() isn't non-blocking for O_DIRECT already because .... oh,
it doesn't even try.  Is there something intrinsically hard about async
O_DIRECT writes, or is it just that no-one has written acceptable code
yet?

AIO/DIO writes can indeed be non-blocking, if the file system doesn't
need to do any metadata operations.  So if the file is preallocated,
you should be able to issue an async DIO write without losing the CPU.

Yes, I see that now.  I misread some of the code.
Thanks.

NeilBrown

quoted

A truly async O_DIRECT aio_write() combined with a working io_cancel()
would probably be sufficient.  The block layer doesn't provide any way
to cancel a bio though, so that would need to be wired up.

Kent Overstreet worked up io_cancel for AIO/DIO writes when he was at
Google.  As I recall the patchset did get posted a few times, but it
never ended up getting accepted for upstream adoption.

We even had some very rough code that would propagate the cancellation
request to the hard drive, for those hard drives that had a facility
for accepting a cancellation request for an I/O which was queued via
NCQ but which hadn't executed yet.  It sort-of worked, but it never
hit a state where it could be published before the project was
abandoned.

						- Ted

