Re: Too many ENOSPC errors

From: Trond Myklebust <hidden>
Date: 2023-06-12 19:53:19

On Mon, 2023-06-12 at 15:17 -0400, Jeff Layton wrote:

On Mon, 2023-06-12 at 13:49 -0400, Chris Perl wrote:

On Mon, Jun 12, 2023 at 1:30 PM Jeff Layton [off-list ref] wrote:

On Mon, 2023-06-12 at 11:58 -0400, Jeff Layton wrote:

Got it: I think I see what's happening. filemap_sample_wb_err just
calls errseq_sample, which does this:

errseq_t errseq_sample(errseq_t *eseq)
{
        errseq_t old = READ_ONCE(*eseq);

        /* If nobody has seen this error yet, then we can be the first. */
        if (!(old & ERRSEQ_SEEN))
                old = 0;
        return old;
}

Because no one has seen that error yet (ERRSEQ_SEEN is clear), the
write ends up being the first to see it and it gets back a 0, even
though the error happened before the sample.

The above behavior is what we want for the sample that we do at open()
time, but not what's needed for this use-case. We need a new helper
that samples the value regardless of whether it has already been seen:

errseq_t errseq_peek(errseq_t *eseq)
{
        return READ_ONCE(*eseq);
}

...but we'll also need to fix up errseq_check to handle differences
in the SEEN bit.

I'll see if I can spin up a patch for that. Stay tuned.

This may not be fixable with the way that NFS is trying to use
errseq_t.

The fundamental problem is that we need to mark the errseq_t in the
mapping as SEEN when we sample it, to ensure that a later error is
recorded and not ignored.

But...if the error hasn't been reported yet and we mark it SEEN here,
and then a later error doesn't occur, then a later open won't have its
errseq_t set to 0, and that unseen error could be lost.
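
To make the competing needs above concrete, here is a minimal
single-threaded userspace model of the errseq_t logic. It is a sketch
under stated assumptions, not the kernel code: there is no
READ_ONCE/cmpxchg, the bit layout is assumed to be a 12-bit errno field
followed by the SEEN flag and the counter, and the demo_* functions
(including the "sample and mark SEEN" step in the last one) are
hypothetical names made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint32_t errseq_t;

/* Assumed layout: errno in the low 12 bits, then SEEN, then a counter. */
#define MAX_ERRNO      4095u
#define ERRSEQ_SHIFT   12
#define ERRSEQ_SEEN    (1u << ERRSEQ_SHIFT)
#define ERRSEQ_CTR_INC (1u << (ERRSEQ_SHIFT + 1))

/* Record an error; err is a negative errno such as -ENOSPC. */
static void errseq_set(errseq_t *eseq, int err)
{
        errseq_t old = *eseq;
        errseq_t next = (old & ~(MAX_ERRNO | ERRSEQ_SEEN)) | (errseq_t)(-err);

        /* The counter only advances if someone has looked at the old value. */
        if (old & ERRSEQ_SEEN)
                next += ERRSEQ_CTR_INC;
        *eseq = next;
}

/* Same logic as the errseq_sample quoted earlier in the thread. */
static errseq_t errseq_sample(const errseq_t *eseq)
{
        errseq_t old = *eseq;

        if (!(old & ERRSEQ_SEEN))
                old = 0;
        return old;
}

/* The proposed helper: return the raw value, seen or not. */
static errseq_t errseq_peek(const errseq_t *eseq)
{
        return *eseq;
}

/* Report the stored error if the value changed since the sample. */
static int errseq_check(const errseq_t *eseq, errseq_t since)
{
        errseq_t cur = *eseq;

        return cur == since ? 0 : -(int)(cur & MAX_ERRNO);
}

/* An unseen error predates the sample, yet the later check reports it. */
int demo_stale_error(void)
{
        errseq_t wb_err = 0;

        errseq_set(&wb_err, -ENOSPC);
        errseq_t since = errseq_sample(&wb_err);
        assert(since == 0);                     /* unseen, so sample is 0 */
        return errseq_check(&wb_err, since);
}

/* Peeking without marking SEEN: a repeat of the same error changes nothing. */
int demo_peek_misses_new_error(void)
{
        errseq_t wb_err = 0;

        errseq_set(&wb_err, -ENOSPC);
        errseq_t since = errseq_peek(&wb_err);
        errseq_set(&wb_err, -ENOSPC);           /* new failure, same errno */
        return errseq_check(&wb_err, since);
}

/* Marking SEEN at sample time: a later opener never learns of the error. */
int demo_seen_hides_error_from_open(void)
{
        errseq_t wb_err = 0;

        errseq_set(&wb_err, -ENOSPC);
        wb_err |= ERRSEQ_SEEN;                  /* hypothetical sample-and-mark */
        errseq_t opener = errseq_sample(&wb_err);
        assert(opener != 0);                    /* open() no longer starts at 0 */
        return errseq_check(&wb_err, opener);
}
```

In this model demo_stale_error() returns -ENOSPC while the other two
return 0, which is the conflict described above: either the pre-existing
error leaks into the new window, or a genuinely new (or still
unreported) error goes missing.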

It's a bit of a pity: as originally envisioned, the errseq_t mechanism
would provide for this sort of use case, but we added this patch not
long after the original code went in, and it changed those semantics:

    b4678df184b3 errseq: Always report a writeback error once

I don't see a good way to do this using the current errseq_t
mechanism, given these competing needs. I'll keep thinking about it
though. Maybe we could add some sort of store and forward mechanism for
fsync on NFS? That could get rather complex though.

Can/should it be marked SEEN when the initial close(2) from tee(1)
reports the error?

No. Most software doesn't check for errors on close(), and for good
reason: there's no requirement that any data be written back before
close() returns. A successful return is meaningless.

It turns out that NFSv4 (usually) writes back the data before a close
returns, but you don't want to rely on that.

Part of the reason I had originally asked about `nfs_file_flush'
(i.e. what close(2) calls) using `file_check_and_advance_wb_err'
instead of `filemap_check_wb_err' was because I was drawn to comparing
`nfs_file_flush' against `nfs_file_fsync' as it seems like in the 3.10
based EL7 kernels, the former used to delegate to the latter (by way of
`vfs_fsync') and so they had consistent behavior, whereas now they do
not.

I think the problem is in some of the changes to write that have come
into play since then. They tried to use errseq_t to track errors over a
small window, but the underlying infrastructure is not quite suited for
that at the moment.

I think we can get there though by carving another flag bit out of the
counter in the errseq_t. I'm working on a patch for that now.

The current NFS client code tries to do its best to match the
description in the manpages for how errors are reported: we try to
report them exactly once, either in write() or fsync().
We do still return errors on close(), but that kind of opportunistic
error return makes sure to use filemap_check_wb_err() so that we don't
break the write() + fsync() documented semantics.
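
From the application side, the write() + fsync() semantics described
above suggest checking both calls and treating close() as advisory
only. The following is a minimal sketch of that pattern; write_durably
is a hypothetical helper name, not something from the NFS client or
from libc.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to fd and force them to stable
 * storage. Returns 0 on success or a negative errno.
 */
int write_durably(int fd, const void *buf, size_t len)
{
        const char *p = buf;

        assert(buf != NULL || len == 0);

        while (len > 0) {
                ssize_t n = write(fd, p, len);

                if (n < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted, retry */
                        return -errno;          /* e.g. -ENOSPC from write() */
                }
                p += n;
                len -= (size_t)n;
        }

        /* Writeback errors that write() did not return surface here. */
        if (fsync(fd) < 0)
                return -errno;
        return 0;
}
```

A caller would still close() the descriptor afterwards and may log a
failure there, but the decision about whether the data is safe rests on
the return value of write_durably, that is, on write() and fsync().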

The issue of picking up errors using errseq_sample() before even any
I/O has been attempted has been raised before, but AFAIK, the current
behaviour does actually match the promises made in the manpages, and it
matches what can happen with other filesystems.
I don't want to special case the NFS client, because that just leads to
people getting confused as to whether or not it will work correctly
with applications such as postgresql.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com
