Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()

From: Jan Kara <jack@suse.cz>
Date: 2017-01-26 09:25:42
Also in: linux-fsdevel

On Thu 26-01-17 11:36:35, NeilBrown wrote:

On Wed, Jan 25 2017, Theodore Ts'o wrote:

On Tue, Jan 24, 2017 at 03:34:04AM +0000, Trond Myklebust wrote:

The reason why I'm thinking open() is because it has to be a contract
between a specific application and the kernel. If the application
doesn't open the file with the O_TIMEOUT flag, then it shouldn't see
nasty non-POSIX timeout errors, even if there is another process that
is using that flag on the same file.

The only place where that is difficult to manage is when the file is
mmap()ed (no file descriptor), so you'd presumably have to disallow
mixing mmap and O_TIMEOUT.

Well, technically there *is* a file descriptor when you do an mmap.
You can close the fd after you call mmap(), but the mmap bumps the
refcount on the struct file while the memory map is active.
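
A minimal sketch of that point, assuming a Linux/POSIX system. The helper
name mmap_outlives_fd() and the temporary file path are made up for
illustration. It maps a file, closes the descriptor straight away, and then
dirties and syncs a page through the mapping alone:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if a shared mapping stays usable after
 * the file descriptor it was created from has been closed, -1 otherwise.
 */
static int mmap_outlives_fd(void)
{
    char path[] = "/tmp/mmap-demo-XXXXXX";   /* illustrative path only */
    const char *msg = "dirtied via mmap";
    size_t len = 4096;
    int fd = mkstemp(path);
    char *p;
    int ok;

    if (fd < 0)
        return -1;
    unlink(path);                            /* file lives until last reference goes */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                               /* the mapping keeps the file open */
    if (p == MAP_FAILED)
        return -1;

    strcpy(p, msg);                          /* dirty a page with no fd in hand */
    ok = (msync(p, len, MS_SYNC) == 0) && (strcmp(p, msg) == 0);
    munmap(p, len);
    return ok ? 0 : -1;
}
```

Any error from writing back that page has no descriptor left to be reported
on, which is the awkward case being discussed here.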

I would argue though that at least for buffered writes, the timeout
has to be a property of the underlying inode, and if there is an attempt
to set a timeout on an inode that already has a timeout set to some
other non-zero value, the "set timeout" operation should fail with a
"timeout already set" error.  That's because we really don't want to have to
keep track, on a per-page basis, which struct file was responsible for
dirtying a page --- and what if it is dirtied by two different file
descriptors?

You seem to have a very different idea to the one that is forming in my
mind.  In my vision, once the data has entered the page cache, it
doesn't matter at all where it came from.  It will remain in the page
cache, as a dirty page, until it is successfully written or until an
unrecoverable error occurs.  There are no timeouts once the data is in
the page cache.

Heh, this has somehow drifted away from the original topic of handling IO
errors :)

Actually, I'm leaning away from timeouts in general.  I'm not against
them, but not entirely sure they are useful.

To be more specific, I imagine a new open flag "O_IO_NDELAY".  It is a
bit like O_NDELAY, but it explicitly affects IO, never the actual open()
call, and it is explicitly allowed on regular files and block devices.

When combined with O_DIRECT, it effectively means "no retries".  For
block devices and files backed by block devices,
REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be
reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't
help.
Non-block-device filesystems would behave differently.  e.g. NFS would
probably use an RPC_TASK_SOFT call instead of the normal 'hard' call.

When used without O_DIRECT:
 - read would trigger read-ahead much as it does now (which can do
   nothing if there are resource issues) and would only return data
   if it was already in the cache.

There was a patch set which did this [1]. Not on a per-fd basis but rather
on a per-IO basis. Andrew blocked it because he was convinced that mincore()
is a good enough interface for this.
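
For reference, a rough sketch of what the mincore() approach looks like from
userspace, assuming Linux. The helper name range_is_cached() is made up for
illustration. It maps the range without touching it and asks the kernel
which pages are resident:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if every page backing [off, off + len)
 * of fd is resident in memory, 0 if at least one is not, -1 on error.
 * fd must be open for reading.
 */
static int range_is_cached(int fd, off_t off, size_t len)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    off_t start = off - (off % (off_t)psz);  /* mmap offset must be page aligned */
    size_t span = len + (size_t)(off - start);
    size_t pages = (span + psz - 1) / psz;
    unsigned char *vec;
    void *map;
    int resident = -1;

    if (len == 0)
        return -1;

    /* Mapping alone does not fault anything in, so residency is unchanged. */
    map = mmap(NULL, span, PROT_READ, MAP_SHARED, fd, start);
    if (map == MAP_FAILED)
        return -1;

    vec = malloc(pages);
    if (vec != NULL && mincore(map, span, vec) == 0) {
        resident = 1;
        for (size_t i = 0; i < pages; i++)
            if (!(vec[i] & 1))               /* low bit set means resident */
                resident = 0;
    }

    free(vec);
    munmap(map, span);
    return resident;
}
```

The answer is only a snapshot: a page can be evicted between the check and a
following read(), so the read can still block. That race is what a per-IO
flag would avoid.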

 - write would try to allocate a page, tell the filesystem that it
   is dirty so that journal space is reserved or whatever is needed,
   and would tell the dirty_pages rate-limiting that another page was
   dirty.  If the rate-limiting reported that we cannot dirty a page
   without waiting, or if any other needed resources were not available,
   then the write would fail (-EWOULDBLOCK).
 - fsync would just fail if there were any dirty pages.  It might also
   do the equivalent of sync_file_range(SYNC_FILE_RANGE_WRITE) without
   any *WAIT* flags. (alternately, fsync could remain unchanged, and
   sync_file_range() could gain a SYNC_FILE_RANGE_TEST flag).
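
For readers unfamiliar with the call mentioned in the fsync item, here is a
small sketch of sync_file_range(SYNC_FILE_RANGE_WRITE) as it exists today
on Linux. The helper name write_and_kick() is made up for illustration. The
proposed O_IO_NDELAY and SYNC_FILE_RANGE_TEST flags do not exist, so they are
not used:

```c
#define _GNU_SOURCE              /* sync_file_range() is Linux-specific */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes at offset off, then ask the kernel
 * to start writeback of whatever was written, without waiting for it to
 * complete.  Returns 0 on success, -1 with errno set on failure.
 */
static int write_and_kick(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(fd, buf, len, off);

    if (n < 0)
        return -1;

    /*
     * No SYNC_FILE_RANGE_WAIT_BEFORE / SYNC_FILE_RANGE_WAIT_AFTER here, so
     * this only initiates writeback.  It reports nothing about whether the
     * data reached stable storage, and it may still block if the request
     * queue is full.
     */
    return sync_file_range(fd, off, (off_t)n, SYNC_FILE_RANGE_WRITE);
}
```

A later fsync() or a sync_file_range() call with the WAIT flags is still
needed to learn whether the writeback succeeded.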


With O_DIRECT there would be a delay, but it would be limited and there
would be no retry.  There is not currently any way to impose a specific
delay on REQ_FAILFAST* requests.
Without O_DIRECT, there could be no significant delay, though code might
have to wait for a mutex or similar.
There are a few places that a timeout could usefully be inserted, but
I'm not sure that would be better than just having the app try again in
a little while - it would have to be prepared for that anyway.

I would like O_DIRECT|O_IO_NDELAY for mdadm so we could safely work with
devices that block when no paths are available.

For O_DIRECT writes, there are database people who want to do non-blocking
AIO writes. The problem they want to solve is different, though, and rather
similar to the one patch set [1] is trying to solve for buffered reads:
they want AIO writes to be really non-blocking so that they can do IO
submission directly from the computation thread, without the cost of
offloading it to a different process which normally does the IO.

Now you need something different for mdadm but interfaces should probably
be consistent...

That being said, I suspect that for many applications, the timeout is
going to be *much* more interesting for O_DIRECT writes, and there we
can certainly have different timeouts on a per-fd basis.  This is
especially true for cases where the timeout is implemented in the storage
device, using multi-media extensions, and where the timeout might be
measured in milliseconds (e.g., no point reading a video frame if it's
been delayed too long).  That being said, the block layer would need to
know about this as well, since the timeout needs to be relative to
when the read(2) system call is issued, not to when it is finally
submitted to the storage device.

Yes. If a deadline could be added to "struct bio", and honoured by
drivers, then that would make a timeout much more interesting for
O_DIRECT.

Timeouts are nice but IMO a lot of work and I suspect you'd really need a
dedicated "real-time" IO scheduler for this.

								Honza

[1] https://lwn.net/Articles/636955/

-- 
Jan Kara [off-list ref]
SUSE Labs, CR
