Re: Too many ENOSPC errors

From: Jeff Layton <jlayton@kernel.org>
Date: 2023-06-08 20:50:31
Subsystem: filesystems (vfs and infrastructure), nfs, sunrpc, and lockd clients, the rest · Maintainers: Alexander Viro, Christian Brauner, Trond Myklebust, Anna Schumaker, Linus Torvalds

On Thu, 2023-06-08 at 13:05 -0400, Chris Perl wrote:

Hi everyone,

I'm working with several Red Hat derived systems and have noticed an
issue with ENOSPC and NFS that I'm looking for some guidance on.

First let me describe the testing setup, and then I'll share my
results from an EL7 based system (kernel 3.10.0-1160.90.1.el7), an EL8
based system (kernel 4.18.0-425.19.2.el8_7), an EL8 based system
patched with commit e6005436f6cc9ed13288f936903f0151e5543485 (kernel
4.18.0-425.19.2.el8_7 plus that commit), and finally an EL8 based
system but with an upstream 6.1 kernel.

Assume I have a 20M quota on my current working directory which is an
NFS export from one of the major enterprise vendors.

The testing looks like the following:

# rm -f file1
# touch file1
# dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
20+0 records in
20+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# rm -f file2
# tee -a file1 <<< abc
abc

On an EL7 based system, running the above works just as shown. I.e.
you create file1, then use all the quota with file2, attempt to write
to file1 which fails with ENOSPC (as expected), remove file2 (which
frees up the quota), and then attempt to write to file1 again which
succeeds.

However, on a stock EL8 based system, I instead get the following
surprising behavior:

# rm -f file1
# touch file1
# dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
20+0 records in
20+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# rm -f file2
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# tee -a file1 <<< abc
abc
tee: file1: No space left on device

I.e. Even after freeing the space by removing file2, writing to file1
continues to fail with ENOSPC forever (I've only shown 2 iterations
above) [1]. No amount of waiting will cause it to go away. But, we've
found that running sync(1) on the file will fix it (the sync itself
will complain with ENOSPC, but then subsequent tee invocations
succeed).

I thought that perhaps the issue was the fact that kernel
4.18.0-425.19.2.el8_7 was missing commit
e6005436f6cc9ed13288f936903f0151e5543485 (which adds some ENOSPC
handling to `nfs_file_write'), so we patched the kernel with that
patch and tested again. It's worth saying that with this patch, the
behavior of our 4.18 kernel and the 6.1 kernel are consistent when
running this test, but I feel like there might still be a bug here.

What I get now is:

# rm -f file1
# touch file1
# dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
20+0 records in
20+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# rm -f file2
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# tee -a file1 <<< abc
abc

I.e. The first attempt to write to the file after freeing the quota
fails with ENOSPC, but subsequent attempts succeed. Note that this is
different from the original behavior on our EL7 based system as shown
above where as soon as the quota is freed up, there are no more ENOSPC
errors.

I'm no expert, but below I'm including some digging I did in case it's
helpful for understanding the situation more fully without needing to
reproduce yourselves. If it's not helpful and just distracting,
apologies in advance!

From strace'ing and systemtap'ing I noticed that the first call to
`tee' (after the quota is used up by file2) returns the ENOSPC in
response to close(2) (i.e. via `nfs_file_flush') and the second call

That is (unfortunately) expected behavior. I've argued (mostly
unsuccessfully) for years that we shouldn't return writeback errors in
the close() codepath.

No program should rely on looking for those. The only "legit" error on
close() is -EBADF.

to `tee' (after the quota has been freed) returns the ENOSPC in
response to the write(2) (i.e. via `nfs_file_write', and then clears
the error via the changes we introduced with commit
e6005436f6cc9ed13288f936903f0151e5543485).
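
For anyone who wants to repeat that observation without strace, here is a rough sketch of a probe that does roughly what one `tee -a' invocation does to the file and records separately whether write(2) or close(2) reported the error. The struct and function names are hypothetical, and a short write is not retried since this is only a probe.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical result record for one append attempt. */
struct append_result {
	int write_errno;	/* errno from write(2), 0 if it succeeded */
	int close_errno;	/* errno from close(2), 0 if it succeeded */
};

/*
 * Open path for append, write len bytes, close it, and note which
 * call (if any) failed. Returns 0 if the file could be opened, or a
 * negative errno value if open(2) itself failed.
 */
static int probe_append(const char *path, const char *data, size_t len,
			struct append_result *res)
{
	int fd;

	res->write_errno = 0;
	res->close_errno = 0;

	fd = open(path, O_WRONLY | O_APPEND);
	if (fd < 0)
		return -errno;

	if (write(fd, data, len) < 0)
		res->write_errno = errno;

	if (close(fd) < 0)
		res->close_errno = errno;

	return 0;
}
```

Run against file1 at each step of the test above, the two fields would show directly which call returned ENOSPC.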

Looking at nfs_file_write, it's already tracking errors itself during
the write. Does this patch fix that? Note that I've not tested this --
YMMV!
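
To illustrate what "tracking errors itself" means, here is a deliberately simplified userspace model of the sample-then-check pattern. It is not the kernel's errseq_t (it has no "seen" flag, for one), and every name in it is hypothetical. The point is only that an error recorded before the sample is not reported by the check, while one recorded after it is.

```c
#include <errno.h>

/* Toy stand-in for the mapping's writeback error state. */
struct wb_state {
	unsigned int seq;	/* bumped each time an error is recorded */
	int err;		/* most recent error, as a negative errno */
};

/* Record a writeback failure (err is a negative errno). */
static void wb_record_error(struct wb_state *wb, int err)
{
	wb->err = err;
	wb->seq++;
}

/* Take a cursor marking "now". */
static unsigned int wb_sample(const struct wb_state *wb)
{
	return wb->seq;
}

/* Report an error only if one was recorded after the cursor was taken. */
static int wb_check_since(const struct wb_state *wb, unsigned int since)
{
	return wb->seq != since ? wb->err : 0;
}

/*
 * Model of one write call: sample first, do the I/O (io_error is 0 or
 * a negative errno simulating a failure during this write), then
 * report only what happened after the sample.
 */
static int toy_write(struct wb_state *wb, int io_error)
{
	unsigned int since = wb_sample(wb);

	if (io_error)
		wb_record_error(wb, io_error);

	return wb_check_since(wb, since);
}
```

In this model a stale -ENOSPC recorded before toy_write() starts does not make the write fail, whereas anything that checks against an older cursor would still see it.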

----------------------8<------------------------

[RFC PATCH] nfs: ignore the error from generic_write_sync

In the write codepath, we're only interested in writeback errors that
occur after the point where the write has started. It's possible though
that there were previous errors stored in the mapping before the write
ever began, in which case generic_write_sync will return an error.

We already track errors over the part we're interested in, so we can
safely discard errors from generic_write_sync.

Reported-by: Chris Perl <redacted>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfs/file.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index f0edf5a36237..3ca1ffb1245e 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -673,10 +673,14 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 					iocb->ki_pos - written,
 					iocb->ki_pos - 1);
 	}
-	result = generic_write_sync(iocb, written);
-	if (result < 0)
-		return result;
 
+	/*
+	 * For a write, we're only interested in errors that occur
+	 * after the point where we sample the wb_error. Ignore
+	 * errors from generic_write_sync, which may have occurred
+	 * before that point.
+	 */
+	generic_write_sync(iocb, written);
 out:
 	/* Return error values */
 	error = filemap_check_wb_err(file->f_mapping, since);

-- 
2.40.1
