Re: [LSF/MM TOPIC] do we really need PG_error at all?

From: NeilBrown <hidden>
Date: 2017-02-27 00:27:05
Also in: linux-fsdevel, linux-mm, linux-scsi, lkml

On Sun, Feb 26 2017, James Bottomley wrote:

On Mon, 2017-02-27 at 08:03 +1100, NeilBrown wrote:

quoted

On Sun, Feb 26 2017, James Bottomley wrote:

quoted

[added linux-scsi and linux-block because this is part of our error
handling as well]
On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:

quoted

Proposing this as a LSF/MM TOPIC, but it may turn out to be me 
just not understanding the semantics here.

As I was looking into -ENOSPC handling in cephfs, I noticed that
PG_error is only ever tested in one place [1] 
__filemap_fdatawait_range, which does this:

	if (TestClearPageError(page))
		ret = -EIO;

This error code will override any AS_* error that was set in the
mapping. Which makes me wonder...why don't we just set this error 
in the mapping and not bother with a per-page flag? Could we
potentially free up a page flag by eliminating this?

Note that currently the AS_* codes are only set for write errors 
not for reads and we have no mapping error handling at all for swap
pages, but I'm sure this is fixable.

How is a read error different from a failure to set PG_uptodate?
Does PG_error suppress retries?

We don't do any retries in the code above the block layer (or at least
we shouldn't).

I was wondering about what would/should happen if a read request was
re-issued for some reason.  Should the error flag on the page cause an
immediate failure, or should it try again.
If read-ahead sees a read-error on some future page, is it necessary to
record the error so subsequent read-aheads don't notice the page is
missing and repeatedly try to re-load it?
When the application eventually gets to the faulty page, should a read
be tried then, or is the read-ahead failure permanent?

quoted

From the I/O layer point of view we take great pains to try to 
pinpoint the error exactly to the sector.  We reflect this up by 
setting the PG_error flag on the page where the error occurred.  If 
we only set the error on the mapping, we lose that granularity, 
because the mapping is mostly at the file level (or VMA level for
anon pages).

Are you saying that the IO layer finds the page in the bi_io_vec and
explicitly sets PG_error,

I didn't say anything about the mechanism.  I think the function you're
looking for is fs/mpage.c:mpage_end_io().  layers below block indicate
the position in the request.  Block maps the position to bio and the
bio completion maps to page.  So the actual granularity seen in the
upper layer depends on how the page to bio mapping is done.

If the block layer is just returning the status at a per-bio level (which
makes perfect sense), then this has nothing directly to do with the
PG_error flag.

The page cache needs to do something with bi_error, but it isn't
immediately clear that it needs to set PG_error.

quoted

 rather than just passing an error indication to bi_end_io ??  That
would seem to be wrong as the page may not be in the page cache.

Usually pages in the mpage_end_io path are pinned, I think.

quoted

 So I guess I misunderstand you.

quoted

So I think the question for filesystem people from us would be do 
you care about this accuracy?  If it's OK just to know an error
occurred somewhere in this file, then perhaps we don't need it.

I had always assumed that a bio would either succeed or fail, and 
that no finer granularity could be available.

It does ... but a bio can be as small as a single page.

quoted

I think the question here is: Do filesystems need the pagecache to
record which pages have seen an IO error?

It's not just filesystems.  The partition code uses PageError() ... the
metadata code might as well (those are things with no mapping).  I'm
not saying we can't remove PG_error; I am saying it's not going to be
quite as simple as using the AS_ flags.

The partition code could use PageUptodate().
mpage_end_io() calls page_endio() on each page, and on read error that
calls:

			ClearPageUptodate(page);
			SetPageError(page);

are both of these necessary?

fs/buffer.c can use several bios to read a single page.
If any one returns an error, PG_error is set.  When all of them have
completed, if PG_error is clear, PG_uptodate is then set.
This is an opportunistic use of PG_error, rather than an essential use.
It could be "fixed", and would need to be fixed if we were to deprecate
use of PG_error for read errors.
There are probably other usages like this.

Thanks,
NeilBrown

James

quoted

I think that for write errors, there is no value in recording
block-oriented error status - only file-oriented status.
For read errors, it might if help to avoid indefinite read retries, 
but I don't know the code well enough to be sure if this is an issue.

NeilBrown

quoted

James

quoted

The main argument I could see for keeping it is that removing it 
might subtly change the behavior of sync_file_range if you have 
tasks syncing different ranges in a file concurrently. I'm not 
sure if that would break any guarantees though.

Even if we do need it, I think we might need some cleanup here 
anyway. A lot of readpage operations end up setting that flag 
when they hit an error. Isn't it wrong to return an error on 
fsync, just because we had a read error somewhere in the file in 
a range that was never dirtied?

--
[1]: there is another place in f2fs, but it's more or less 
equivalent to the call site in __filemap_fdatawait_range.

Attachments

signature.asc [application/pgp-signature] 832 bytes

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help