Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency

From: Vijay Chidambaram <hidden>
Date: 2017-08-16 19:07:27
Also in: linux-btrfs, linux-fsdevel, linux-xfs

Hi Josef,

Thank you for the detailed reply -- I think it provides several
pointers for our future work. It sounds like we have a similar vision
for where we want this to go, though we may disagree about how to
implement this :) This is exciting!

I agree that we should be building off existing work if it is a good
option. We might end up using log-writes, but for now we see several
problems:

- The log-writes code is not documented well. As you have mentioned,
at this point, only you know how it works, and we are not seeing much
adoption of log-writes by other developers either.

- I don't think our requirements exactly match what log-writes
provides. For example, at some point we want to introduce checkpoints
so that we can correlate a crash state with file-system state at the
time of crash. We also want to add functionality to guide creation of
random crash states (see below). This might require changing
log-writes significantly. I don't know if that would be a good idea.

Regarding random crashes, there is a lot of complexity there that
log-writes couldn't handle without significant changes. For example,
just randomly generating crash states and testing each state is
unlikely to catch bugs. We need a more nuanced way of doing this. We
plan to add a lot of functionality to CrashMonkey to (a) let the user
guide crash-state generation (b) focus on "interesting" states (by
re-ordering or dropping metadata). All of this will likely require
adding more sophistication to the kernel module. I don't think we want
to take log-writes and add a lot of extra functionality.

Regarding logging writes, I think there is a difference in approach
between log-writes and CrashMonkey. We don't really care about the
completion order since the device may anyway re-order the writes after
that point. Thus, the set of crash states generated by CrashMonkey is
bound only by FUA and FLUSH flags. It sounds as if log-writes focuses
on a more restricted set of crash states.
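
As a rough illustration of the point above (a sketch only, not
CrashMonkey code): if the write log is cut at the crash point, every
write up to the last FLUSH/FUA is treated as durable, and any subset of
the writes after it may or may not have reached the disk. The log
layout (dicts with "id" and "flags") and the function names are
assumptions made for this example. Treating FUA as a full barrier is
also a simplification.

```python
from itertools import combinations

BARRIER_FLAGS = {"FLUSH", "FUA"}

def split_epochs(log):
    """Split a write log into closed epochs plus trailing in-flight writes.

    A write carrying FLUSH or FUA closes the current epoch (simplified model).
    """
    epochs, current = [], []
    for write in log:
        current.append(write)
        if write["flags"] & BARRIER_FLAGS:
            epochs.append(current)
            current = []
    return epochs, current

def crash_states(log):
    """Yield each crash state as a list of write ids that reached the disk."""
    epochs, in_flight = split_epochs(log)
    durable = [w["id"] for epoch in epochs for w in epoch]
    # Any subset of the in-flight writes may have been persisted.
    for k in range(len(in_flight) + 1):
        for subset in combinations(in_flight, k):
            yield durable + [w["id"] for w in subset]

if __name__ == "__main__":
    example_log = [
        {"id": 1, "flags": set()},
        {"id": 2, "flags": {"FLUSH"}},
        {"id": 3, "flags": set()},
        {"id": 4, "flags": set()},
    ]
    for state in crash_states(example_log):
        print(state)
```

The number of states grows exponentially with the number of in-flight
writes, which is why exhaustive enumeration does not scale and some
guidance toward "interesting" states is needed.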

CrashMonkey works with the 4.4 kernel, and we will try to keep up
with changes to the kernel that break CrashMonkey. CrashMonkey is
useless without the user-space component, so users will need to
compile some code anyway. I do not believe it will matter much whether
it is in-tree or not, as long as it compiles with the latest kernel.

Regarding discard, multi-device support, and application-level crash
consistency, this is on our road-map too! Our current priority is to
build enough scaffolding to reproduce a known crash-consistency bug
(such as the delayed allocation bug of ext4), and then go on and try
to find new bugs in newer file systems like btrfs.

Adding CrashMonkey into the kernel is not a priority at this point (I
don't think CrashMonkey is useful enough at this point to do so). When
CrashMonkey becomes useful enough to do so, we will perhaps add the
device_wrapper as a DM target to enable adoption.

Our hope currently is that developers like Ari will try out
CrashMonkey in its current form, which will guide us as to what
functionality to add to CrashMonkey to find bugs more effectively.

Thanks,
Vijay

On Wed, Aug 16, 2017 at 8:06 AM, Josef Bacik [off-list ref] wrote:

On Tue, Aug 15, 2017 at 08:44:16PM -0500, Vijay Chidambaram wrote:

quoted

Hi Amir,

I neglected to mention this earlier: CrashMonkey does not require
recompiling the kernel (it is a stand-alone kernel module), and has
been tested with the kernel 4.4. It should work with future kernel
versions as long as there are no changes to the bio structure.

As it is, I believe CrashMonkey is compatible with the current kernel.
It certainly provides functionality beyond log-writes (the ability to
replay a subset of writes between FLUSH/FUA), and we intend to add
more functionality in the future.

Right now, CrashMonkey does not do random sampling among possible
crash states -- it will simply test a given number of unique states.
Thus, right now I don't think it is very effective in finding
crash-consistency bugs. But the entire infrastructure to profile a
workload, construct crash states, and test them with fsck is present.

I'd be grateful if you could try it and give us feedback on what would make
testing easier/more useful for you. As I mentioned before, this is a
work-in-progress, so we are happy to incorporate feedback.

Sorry I was travelling yesterday so I couldn't give this my full attention.
Everything you guys do is already accomplished with dm-log-writes.  If you look
at the example scripts I've provided

https://github.com/josefbacik/log-writes/blob/master/replay-individual-faster.sh
https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh

The first initiates the replay, and points at the second script to run after
each entry is replayed.  The whole point of this stuff was to make it as
flexible as possible.  The way we use it is to replay, create a snapshot of the
replay, mount, unmount, fsck, delete the snapshot and carry on to the next
position in the log.

There is nothing keeping us from generating random crash points, this has been
something on my list of things to do forever.  All that would be required would
be to hold the entries between flush/fua events in memory, and then replay them
in whatever order you deemed fit.  That's the only functionality missing from my
replay-log stuff that CrashMonkey has.
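
A minimal sketch of that idea, in Python rather than in replay-log
itself: buffer the entries seen between flush/fua events, shuffle each
buffered run, and emit the barrier entry in its original place. The
entry layout (dicts with "id" and "barrier") and the function name are
hypothetical and do not reflect the real log format.

```python
import random

def random_replay_order(entries, seed=None):
    """Return entries with each run between barrier entries shuffled.

    Barrier entries (flush/fua) keep their position relative to the runs
    around them, so no write moves across a barrier.
    """
    rng = random.Random(seed)  # seeded so a failing order can be reproduced
    ordered, pending = [], []
    for entry in entries:
        if entry["barrier"]:
            rng.shuffle(pending)
            ordered.extend(pending)
            ordered.append(entry)
            pending = []
        else:
            pending.append(entry)
    # Entries after the last barrier are shuffled as well.
    rng.shuffle(pending)
    ordered.extend(pending)
    return ordered

if __name__ == "__main__":
    log = [{"id": i, "barrier": i == 2} for i in range(5)]
    print([e["id"] for e in random_replay_order(log, seed=1)])
```

Recording the seed alongside a failing run makes the same replay order
reproducible later.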

The other part of this is getting user space applications to do more thorough
checking of consistency that it expects, which I implemented here

https://github.com/josefbacik/fstests/commit/70d41e17164b2afc9a3f2ae532f084bf64cb4a07

fsx will randomly do operations to a file, and every time it fsync()'s it saves
its state and marks the log.  Then we can go back and replay the log to the
mark and md5sum the file to make sure it matches the saved state.  This
infrastructure was meant to be as simple as possible so the possibilities for
crash consistency testing were endless.  One of the next areas we plan to use
this in Facebook is just for application consistency, so we can replay the fs
and verify the application works in whatever state the fs is at any given point.
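
The mark-and-verify flow described above can be modelled in a few lines
of Python. This is an in-memory toy, not the fsx patch or the
dm-log-writes format: the class MarkedLog, its methods, and the
fixed-size zero-filled "device" image are all assumptions made for
illustration, and writes are assumed to stay within the image size.

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Hex md5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()

class MarkedLog:
    """Toy write log that records a checksum at each fsync mark."""

    def __init__(self):
        self.entries = []  # (offset, data) writes in logged order
        self.marks = {}    # mark name -> (log position, expected md5)

    def write(self, offset, data):
        self.entries.append((offset, data))

    def fsync_mark(self, name, file_contents):
        # Save the expected state and remember where in the log it applies.
        self.marks[name] = (len(self.entries), md5_of(file_contents))

    def replay_to(self, name, size):
        """Replay the log up to a mark and compare against the saved md5."""
        position, expected = self.marks[name]
        image = bytearray(size)
        for offset, data in self.entries[:position]:
            image[offset:offset + len(data)] = data
        return md5_of(bytes(image)) == expected

if __name__ == "__main__":
    log = MarkedLog()
    contents = bytearray(8)
    log.write(0, b"abcd")
    contents[0:4] = b"abcd"
    log.fsync_mark("mark1", bytes(contents))
    log.write(4, b"efgh")  # written after the mark, must not affect the check
    print(log.replay_to("mark1", 8))
```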

I looked at your code and you are logging entries at submit time, not completion
time.  The reason I do those crazy acrobatics is because we have had bugs in
previous kernels where we were not waiting for io completion of important
metadata before writing out the super block, so logging only at completion
allows us to catch that class of problems.

The other thing CrashMonkey is missing is DISCARD support.  We fuck up discard
support constantly, and being able to replay discards to make sure we're not
discarding important data is very important.

I'm not trying to shit on your project, obviously it's a good idea, that's why I
did it years ago ;).  The community is going to use what is easiest to use, and
modprobe dm-log-writes is a lot easier than compiling and insmod'ing an out of
tree driver.  Also your driver won't work on upstream kernels because of the way
the bio flags were changed recently, which is why we prefer using upstream
solutions.

If you guys want to get this stuff used then it would be better at this point to
build on top of what we already have.  Just off the top of my head we need

1) Random replay support for replay-log.  This is probably a day or two worth of
work for a student.

2) Documentation, because right now I'm the only one who knows how this works.

3) My patches need to actually be pushed into upstream fstests.  This would be
the largest win because then all the fs developers would be running the tests
by default.

4) Multi-device support.  One thing that would be good to have and is a dream of
mine is to connect multiple devices to one log, so we can do things like verify
mdraid or btrfs's raid consistency.  We could do super evil things like
only replay one device, or replay alternating writes on each device.  This would
be a larger project but would be super helpful.

Thanks,

Josef
