Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

From: Vivek Goyal <vgoyal@redhat.com>
Date: 2021-01-04 15:53:19
Also in: linux-fsdevel, lkml

On Mon, Dec 28, 2020 at 05:51:06PM +0200, Amir Goldstein wrote:

On Mon, Dec 28, 2020 at 3:25 PM Jeff Layton [off-list ref] wrote:

quoted

On Fri, 2020-12-25 at 08:50 +0200, Amir Goldstein wrote:

quoted

On Thu, Dec 24, 2020 at 2:13 PM Matthew Wilcox [off-list ref] wrote:

quoted

On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:

quoted

In current master, syncfs() on any file by any container user will
result in full syncfs() of the upperfs, which is very bad for container
isolation. This has been partly fixed by Chengguang Xu [1] and I expect
his work will be merged soon. Overlayfs still does not do the writeback
and syncfs() in overlay still waits for all upper fs writeback to complete,
but at least syncfs() in overlay only kicks writeback for upper fs files
dirtied by this overlay.

[1] https://lore.kernel.org/linux-unionfs/CAJfpegsbb4iTxW8ZyuRFVNc63zg7Ku7vzpSNuzHASYZH-d5wWA@mail.gmail.com/ (local)

Sharing the same SEEN flag among thousands of containers is also
far from ideal, because effectively this means that any given workload
in any single container has very little chance of observing the SEEN flag.

Perhaps you misunderstand how errseq works.  If each container samples
the errseq at startup, then they will all see any error which occurs
during their lifespan

Meant to say "...very little chance of NOT observing the SEEN flag",
but We are not in disagreement.
My argument against sharing the SEEN flag refers to Vivek's patch of
stacked errseq_sample()/errseq_check_and_advance() which does NOT
sample errseq at overlayfs mount time. That is why my next sentence is:
"I do agree with Matthew that overlayfs should sample errseq...".

quoted

(and possibly an error which occurred before they started up).

Right. And this is where the discussion of splitting the SEEN flag started.
Some of us want to treat overlayfs mount time as a true epoc for errseq.
The new container didn't write any files yet, so it should not care about
writeback errors from the past.

I agree that it may not be very critical, but as I wrote before, I think we
should do our best to try and isolate container workloads.

quoted

To this end, I do agree with Matthew that overlayfs should sample errseq
and the best patchset to implement it so far IMO is Jeff's patchset [2].
This patch set was written to cater only "volatile" overlayfs mount, but
there is no reason not to use the same mechanism for regular overlay
mount. The only difference being that "volatile" overlay only checks for
error since mount on syncfs() (because "volatile" overlay does NOT
syncfs upper fs) and regular overlay checks and advances the overlay's
errseq sample on syncfs (and does syncfs upper fs).

Matthew, I hope that my explanation of the use case and Jeff's answer
is sufficient to understand why the split of the SEEN flag is needed.

[2] https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlayton@kernel.org/ (local)

No, it still feels weird and wrong.

All right. Considering your reservations, I think perhaps the split of the
SEEN flag can wait for a later time after more discussions and maybe
not as suitable for stable as we thought.

I think that for stable, it would be sufficient to adapt Surgun's original
syncfs for volatile mount patch [1] to cover the non-volatile case:
on mout:
- errseq_sample() upper fs
- on volatile mount, errseq_check() upper fs and fail mount on un-SEEN error
on syncfs:
- errseq_check() for volatile mount
- errseq_check_and_advance() for non-volatile mount
- errseq_set() overlay sb on upper fs error

Now errseq_set() is not only a hack around __sync_filesystem ignoring
return value of ->sync_fs(). It is really needed for per-overlay SEEN
error isolation in the non-volatile case.

Unless I am missing something, I think we do not strictly need Vivek's
1/3 patch [2] for stable, but not sure.

Sargun,

Do you agree with the above proposal?
Will you make it into a patch?

Vivek, Jefff,

Do you agree that overlay syncfs observing writeback errors that predate
overlay mount time is an issue that can be deferred (maybe forever)?

That's very application dependent.

To be clear, the main thing you'll lose with the method above is the
ability to see an unseen error on a newly opened fd, if there was an
overlayfs mount using the same upper sb before your open occurred.

IOW, consider two overlayfs mounts using the same upper layer sb:

ovlfs1                          ovlfs2
----------------------------------------------------------------------
mount
open fd1
write to fd1
<writeback fails>
                                mount (upper errseq_t SEEN flag marked)
open fd2
syncfs(fd2)
syncfs(fd1)


On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
calls. The first one has a sample from before the error occurred, and
the second one has a sample of 0, due to the fact that the error was
unseen at open time.

On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
return an error and syncfs(fd2) will not. If we split the SEEN flag into
two, then we can ensure that they both still get an error in this
situation.

Either I am not following or we are not talking about the same solution.
IMO it is perfectly fine the ovlfs2 mount consumes the unseen error
on upper fs, because there is no conceptual difference between an overlay
mount and any application that does syncfs().

I think currently mount() does not consume writeback errors on upper
filesystem. So if mount() of overlayfs notices an unseen error on
upper, and fails mount (without consuming upper sb error), then it should
be fine. We discussed that in past as well. You proposed it.

This will mean that user will need to call syncfs() on upper to clear
unseen error and then retry overlay mount. And this will not necessarily
need splitting SEEN flag. 

This requires calling syncfs() on upper in case of unseen error. And
Sargun wanted to avoid syncfs() on upper completely.

Vivek

However, in the situation that you describe, open fd2 is NOT supposed
to sample upper fs errseq at all, it is supposed to sample ovlfs1's
sb->s_wb_err.

syncfs(fd2) reports an error because ovl_check_sync(ofs) sees that
ofs->upper_errseq sample is older than upper fs errseq.

If ovlfs1 is volatile, syncfs(fd1), returns an error for the same reason,
because we never advance ofs->upper_errseq.

If ovlfs1 is non-volatile, the first syncfs(fd2) calls ovl_check_sync(ofs)
that observes the upper fs error which is newer than ofs->upper_errseq
and advances ofs->upper_errseq.
But it also calls errseq_set(&sb->s_wb_err) (a.k.a. "the hack"),
This will happen regardless of upper fs SEEN flag set by ovlfs2 mount.

The second syncfs(fd1) will also return an error because vfs had sampled
an old errseq value of ovlfs1 sb->s_wb_err, before the call to errseq_set().

What am I missing?

Thanks,
Amir.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help