Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
From: Jan Kara <jack@suse.cz>
Date: 2017-01-26 09:25:42
Also in:
linux-fsdevel
On Thu 26-01-17 11:36:35, NeilBrown wrote:
> On Wed, Jan 25 2017, Theodore Ts'o wrote:
> > On Tue, Jan 24, 2017 at 03:34:04AM +0000, Trond Myklebust wrote:
> > > The reason why I'm thinking open() is because it has to be a contract between a specific application and the kernel. If the application doesn't open the file with the O_TIMEOUT flag, then it shouldn't see nasty non-POSIX timeout errors, even if there is another process that is using that flag on the same file. The only place where that is difficult to manage is when the file is mmap()ed (no file descriptor), so you'd presumably have to disallow mixing mmap and O_TIMEOUT.
> >
> > Well, technically there *is* a file descriptor when you do an mmap. You can close the fd after you call mmap(), but the mmap bumps the refcount on the struct file while the memory map is active.
> >
> > I would argue though that at least for buffered writes, the timeout has to be a property of the underlying inode, and if there is an attempt to set a timeout on an inode that already has a timeout set to some other non-zero value, the "set timeout" operation should fail with a "timeout already set". That's because we really don't want to have to keep track, on a per-page basis, which struct file was responsible for dirtying a page. And what if it is dirtied by two different file descriptors?
>
> You seem to have a very different idea to the one that is forming in my mind. In my vision, once the data has entered the page cache, it doesn't matter at all where it came from. It will remain in the page cache, as a dirty page, until it is successfully written or until an unrecoverable error occurs. There are no timeouts once the data is in the page cache.
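
For readers who want to check the mmap point from user space: the sketch below writes through a shared mapping after the descriptor has been closed, then uses msync() to push the data out and collect any write-back error. The function name and the file path in the usage note are hypothetical, chosen only for illustration; error handling is kept minimal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: store `msg` in the file at `path` through a shared
 * mapping, after the fd used to create the mapping has been closed.
 * Returns 0 on success, -1 on any failure.
 */
int write_via_mapping_after_close(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    assert(len > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping keeps its own reference to the open file */
    if (map == MAP_FAILED)
        return -1;

    memcpy(map, msg, len);             /* dirties the page cache, no fd open */
    int rc = msync(map, len, MS_SYNC); /* synchronous write-back of the range */
    munmap(map, len);
    return rc == 0 ? 0 : -1;
}
```

Calling `write_via_mapping_after_close("/tmp/demo.bin", "hello")` and then reading the file back should show the data, even though no descriptor was open when the page was dirtied.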
Heh, this has somehow drifted away from the original topic of handling IO errors :)
> Actually, I'm leaning away from timeouts in general. I'm not against them, but not entirely sure they are useful.
>
> To be more specific, I imagine a new open flag "O_IO_NDELAY". It is a bit like O_NDELAY, but it explicitly affects IO, never the actual open() call, and it is explicitly allowed on regular files and block devices.
>
> When combined with O_DIRECT, it effectively means "no retries". For block devices and files backed by block devices, REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't help. Non-block-device filesystems would behave differently. E.g. NFS would probably use a RPC_TASK_SOFT call instead of the normal 'hard' call.
>
> When used without O_DIRECT:
> - read would trigger read-ahead much as it does now (which can do nothing if there are resource issues) and would only return data if it was already in the cache.
There was a patch set which did this [1]. Not on a per-fd basis but rather on a per-IO basis. Andrew blocked it because he was convinced that mincore() is a good enough interface for this.
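
As an illustration of the mincore() approach mentioned here, the following sketch maps a file and counts how many of its pages the kernel reports as resident. It is only a sketch: the helper name is hypothetical, and the answer is a snapshot that can be stale by the time the caller issues a read, which is part of why it is a weaker interface than a per-IO "don't block" flag.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: return how many pages in the first `len` bytes of
 * the file at `path` are resident in memory, or -1 on error.
 */
long resident_pages(const char *path, size_t len)
{
    assert(len > 0);
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + psz - 1) / psz;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    long count = -1;
    unsigned char *vec = malloc(npages);
    if (vec != NULL && mincore(map, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1; /* low bit set means the page is resident */
    }
    free(vec);
    munmap(map, len);
    return count;
}
```

An application could call this before a buffered read and hand the read to a worker thread when the count is lower than the number of pages it needs.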
> - write would try to allocate a page, tell the filesystem that it is dirty so that journal space is reserved or whatever is needed, and would tell the dirty_pages rate-limiting that another page was dirty. If the rate-limiting reported that we cannot dirty a page without waiting, or if any other needed resources were not available, then the write would fail (-EWOULDBLOCK).
> - fsync would just fail if there were any dirty pages. It might also do the equivalent of sync_file_range(SYNC_FILE_RANGE_WRITE) without any *WAIT* flags. (Alternately, fsync could remain unchanged, and sync_file_range() could gain a SYNC_FILE_RANGE_TEST flag.)
>
> With O_DIRECT there would be a delay, but it would be limited and there would be no retry. There is not currently any way to impose a specific delay on REQ_FAILFAST* requests. Without O_DIRECT, there could be no significant delay, though code might have to wait for a mutex or similar.
>
> There are a few places that a timeout could usefully be inserted, but I'm not sure that would be better than just having the app try again in a little while - it would have to be prepared for that anyway.
>
> I would like O_DIRECT|O_IO_NDELAY for mdadm so we could safely work with devices that block when no paths are available.
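
The existing call that the fsync bullet refers to can be sketched as follows. This uses only what sync_file_range() offers today: it starts write-back without any of the *WAIT* flags and returns. Neither O_IO_NDELAY nor SYNC_FILE_RANGE_TEST exists, so nothing here reports "still dirty"; the helper names are hypothetical. Note also that, per the man page, even SYNC_FILE_RANGE_WRITE may block if the request queue is full.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: start write-back of all dirty pages of `fd`
 * without waiting for completion. Returns 0 or a negative errno value.
 */
int start_writeback_nowait(int fd)
{
    assert(fd >= 0);
    /* offset 0 with nbytes 0 means "from the start through end of file" */
    if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
        return -errno;
    return 0;
}

/*
 * Hypothetical helper: the blocking counterpart. fsync() waits for the
 * write-back and is where an I/O error would be reported.
 */
int finish_writeback(int fd)
{
    assert(fd >= 0);
    return fsync(fd) == 0 ? 0 : -errno;
}
```

An application would call start_writeback_nowait() after a batch of writes, carry on with other work, and call finish_writeback() only at the point where it needs a durability guarantee.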
For O_DIRECT writes, there are database people who want to do non-blocking AIO writes. Their problem is different, though, and rather similar to the one patch set [1] tries to solve for buffered reads: they want AIO writes to be truly non-blocking so they can submit IO directly from the computation thread, without the cost of offloading to a different process that normally does the IO. You need something different for mdadm, but the interfaces should probably be consistent...
> > That being said, I suspect that for many applications, the timeout is going to be *much* more interesting for O_DIRECT writes, and there we can certainly have different timeouts on a per-fd basis. This is especially true for cases where the timeout is implemented in the storage device, using multi-media extensions, and where the timeout might be measured in milliseconds (e.g., no point reading a video frame if it's been delayed too long). That being said, the block layer would need to know about this as well, since the timeout needs to be relative to when the read(2) system call is issued, not to when it is finally submitted to the storage device.
>
> Yes. If a deadline could be added to "struct bio", and honoured by drivers, then that would make a timeout much more interesting for O_DIRECT.
Timeouts are nice but IMO a lot of work and I suspect you'd really need a dedicated "real-time" IO scheduler for this.

								Honza

[1] https://lwn.net/Articles/636955/
--
Jan Kara [off-list ref]
SUSE Labs, CR