Re: [PATCH] xfs: fix incorrect log_flushed on fsync

From: Amir Goldstein <amir73il@gmail.com>
Date: 2017-09-18 18:00:30
Also in: linux-fsdevel, linux-xfs

On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
[off-list ref] wrote:

On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:

quoted

On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein [off-list ref] wrote:

quoted

When calling into _xfs_log_force{,_lsn}() with a pointer
to log_flushed variable, log_flushed will be set to 1 if:
1. xlog_sync() is called to flush the active log buffer
AND/OR
2. xlog_wait() is called to wait on a syncing log buffers

xfs_file_fsync() checks the value of log_flushed after
_xfs_log_force_lsn() call to optimize away an explicit
PREFLUSH request to the data block device after writing
out all the file's pages to disk.

This optimization is incorrect in the following sequence of events:

 Task A                    Task B
 -------------------------------------------------------
 xfs_file_fsync()
   _xfs_log_force_lsn()
     xlog_sync()
        [submit PREFLUSH]
                           xfs_file_fsync()
                             file_write_and_wait_range()
                               [submit WRITE X]
                               [endio  WRITE X]
                             _xfs_log_force_lsn()
                               xlog_wait()
        [endio  PREFLUSH]

The write X is not guarantied to be on persistent storage
when PREFLUSH request in completed, because write A was submitted
after the PREFLUSH request, but xfs_file_fsync() of task A will
be notified of log_flushed=1 and will skip explicit flush.

If the system crashes after fsync of task A, write X may not be
present on disk after reboot.

This bug was discovered and demonstrated using Josef Bacik's
dm-log-writes target, which can be used to record block io operations
and then replay a subset of these operations onto the target device.
The test goes something like this:
- Use fsx to execute ops of a file and record ops on log device
- Every now and then fsync the file, store md5 of file and mark
  the location in the log
- Then replay log onto device for each mark, mount fs and compare
  md5 of file to stored value

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <redacted>
Cc: <redacted>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Christoph, Dave,

It's hard to believe, but I think the reported bug has been around
since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
not try to test old kernels.

Forgot to tag commit message with:
Fixes: f538d4da8d52 ("[XFS] write barrier support")

Maybe the tag could be added when applying to recent stables,
so distros and older downstream stables can see the tag.

The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
if possible data loss bug should also be disclosed in some distros forum?
I bet some users would care more about the latter than the former.
Coincidentally, both data loss and security bugs fix the same commit..

Yes the the patch ought to get sent on to stable w/ fixes tag.  One
would hope that the distros will pick up the stable fixes from there.


Greg, for your consideration, please add
 Fixes: f538d4da8d52 ("[XFS] write barrier support")
If not pushed yet.

That said, it's been in the kernel for 12 years without widespread
complaints about corruption, so I'm not sure this warrants public
disclosure via CVE/Phoronix vs. just fixing it.

I'm not sure either.
My intuition tells me that the chances of hitting the data loss bug
given a power failure are not slim, but the chances of users knowing
about the data loss are slim.
Meaning, the chances of users complaining:
"I swear, I did fsync, lost power, and after boot data was not there" are slim
and the chances of developers believing the report on hear say are
even slimmer.

Oh well. I was just wondering...

Amir.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help