Re: About the md-bitmap behavior
From: Qu Wenruo <hidden>
Date: 2022-06-23 00:53:37
Also in:
linux-block
On 2022/6/23 07:00, Song Liu wrote:
On Wed, Jun 22, 2022 at 3:33 PM NeilBrown [off-list ref] wrote:
On Wed, 22 Jun 2022, Qu Wenruo wrote:
On 2022/6/22 10:15, Doug Ledford wrote:
On Mon, 2022-06-20 at 10:56 +0100, Wols Lists wrote:
On 20/06/2022 08:56, Qu Wenruo wrote:
The write-hole has been addressed with journaling already, and this will be adding a new and not-needed feature - not saying it wouldn't be nice to have, but do we need another way to skin this cat?

I'm talking about the BTRFS RAID56, not md-raid RAID56, which is a completely different thing. Here I'm just trying to understand how the md-bitmap works, so that I can do a proper bitmap for btrfs RAID56.

Ah. Okay. Neil Brown is likely to be the best help here as I believe he wrote a lot of the code, although I don't think he's much involved with md-raid any more.

I can't speak to how it is today, but I know it was *designed* to be sync flush of the dirty bit setting, then lazy, async write out of the clear bits. But, yes, in order for the design to be reliable, you must flush out the dirty bits before you put writes in flight.

Thank you very much for confirming my concern. So maybe it's me not checking the md-bitmap code carefully enough to see the full picture.
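The ordering described above (set dirty bits synchronously before any data write, clear them lazily afterwards) can be sketched as a toy model. This is only an illustration of the design as stated in the thread, not the md-bitmap code: the class and method names are hypothetical, and the "disk" is just a Python list.

```python
class WriteIntentBitmap:
    """Toy write-intent bitmap: one bit per region (hypothetical model, not md code)."""

    def __init__(self, nr_regions):
        self.in_memory = [False] * nr_regions  # bits set by writers
        self.on_disk = [False] * nr_regions    # bits that survived a flush
        self.pending = [0] * nr_regions        # in-flight writes per region
        self.log = []                          # event order, for inspection

    def start_write(self, region):
        # Only touches the in-memory copy; nothing is persistent yet.
        self.pending[region] += 1
        self.in_memory[region] = True

    def flush_dirty(self):
        # Synchronous step: persist every newly set bit before data goes out.
        for region, dirty in enumerate(self.in_memory):
            if dirty and not self.on_disk[region]:
                self.on_disk[region] = True
                self.log.append(("bitmap-set", region))

    def submit_data_write(self, region):
        # Refuse to put a write in flight until its dirty bit is on disk.
        if not self.on_disk[region]:
            raise RuntimeError("dirty bit not flushed yet")
        self.log.append(("data-write", region))

    def end_write(self, region):
        # The on-disk bit is deliberately left set; clearing is lazy.
        self.pending[region] -= 1
        if self.pending[region] == 0:
            self.in_memory[region] = False

    def lazy_clear(self):
        # Asynchronous in a real system: clear bits with no writes in flight.
        for region in range(len(self.on_disk)):
            if self.on_disk[region] and not self.in_memory[region]:
                self.on_disk[region] = False
                self.log.append(("bitmap-clear", region))
```

A stale set bit only costs an unnecessary resync of that region after a crash, whereas a missing set bit would hide a possibly inconsistent stripe. That asymmetry is why the set path is synchronous and the clear path can be lazy.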
One thing I'm not sure about though, is that MD RAID5/6 uses fixed stripes. I thought btrfs, since it was an allocation filesystem, didn't have to use full stripes? Am I wrong about that?

Unfortunately, we only do allocation at the RAID56 chunk level. Inside a RAID56 chunk, the underlying devices still have to follow the regular RAID56 full-stripe scheme. Thus the btrfs RAID56 is still the same regular RAID56 inside one btrfs RAID56 chunk, but without bitmap/journal.
Because it would seem that if your data isn't necessarily in full stripes, then a bitmap might not work so well since it just marks a range of full stripes as "possibly dirty, we were writing to them, do a parity resync to make sure".

The resync part is where btrfs shines: the extra csum (for the untouched part) and metadata COW ensure that we only see the old, untouched data, and with the extra csum we can safely rebuild the full stripe. Thus as long as no device is missing, a write-intent-bitmap is enough to address the write hole in btrfs (at least for COW protected data and all metadata).
In any case, Wols is right, probably want to ping Neil on this. Might need to ping him directly though. Not sure he'll see it just on the list.

Adding Neil into this thread. Any clue on the existing md_bitmap_startwrite() behavior?

md_bitmap_startwrite() is used to tell the bitmap code that the raid module is about to start writing at a location. This may result in md_bitmap_file_set_bit() being called to set a bit in the in-memory copy of the bitmap, and to mark that page of the bitmap as BITMAP_PAGE_DIRTY.

Before raid actually submits the writes to the device it will call md_bitmap_unplug() which will submit the writes and wait for them to complete.

There is a comment at the top of md/raid5.c titled "BITMAP UNPLUGGING" which says a few things about how raid5 ensures things happen in the right order.

However I don't think any sort of bitmap can solve the write-hole problem for RAID5 - even in btrfs.

The problem is that if the host crashes while the array is degraded and while some write requests were in-flight, then you might have lost data. i.e. to update a block you must write both that block and the parity block. If you actually wrote neither or both, everything is fine. If you wrote one but not the other then you CANNOT recover the data that was on the missing device (there must be a missing device as the array is degraded). Even having checksums of everything is not enough to recover that missing block.

You must either:
 1/ have a safe duplicate of the blocks being written, so they can be recovered and re-written after a crash. This is what journalling does. Or
 2/ Only write to locations which don't contain valid data. i.e. always write full stripes to locations which are unused on each device. This way you cannot lose existing data. Worst case: that whole stripe is ignored. This is how I would handle RAID5 in a copy-on-write filesystem.

Thanks Neil for explaining this. I was about to say the same idea, but couldn't phrase it well.
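The degraded-array case Neil describes can be worked through with plain XOR parity. This is a minimal sketch, assuming a three-device RAID5 stripe (two data blocks plus parity) where the device holding d1 is missing; the block contents and helper names are made up for illustration.

```python
def xor_blocks(*blocks):
    """XOR equally sized byte blocks together (RAID5-style parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def recover_d1(d0_on_disk, parity_on_disk):
    """With the d1 device missing, d1 can only be rebuilt as d0 XOR parity."""
    return xor_blocks(d0_on_disk, parity_on_disk)


# Stripe contents before the update.
d0_old = b"AAAA"
d1 = b"BBBB"  # lives on the missing device, never touched by the write
parity_old = xor_blocks(d0_old, d1)

# The update changes d0 only, so both d0 and parity must be rewritten.
d0_new = b"CCCC"
parity_new = xor_blocks(d0_new, d1)

# Three possible on-disk states after a crash:
neither = recover_d1(d0_old, parity_old)  # no write landed
both = recover_d1(d0_new, parity_new)     # both writes landed
torn = recover_d1(d0_new, parity_old)     # d0 landed, parity did not

assert neither == d1
assert both == d1
assert torn != d1  # d1 is lost even though it was never written to
```

A checksum of d1 would detect that the torn result is wrong, but it carries no information for rebuilding the correct block, which is the point of the journal or full-stripe-to-unused-space options listed above.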
md raid5 suffers from write hole because the mapping from array-LBA to component-LBA is fixed.
In fact, inside one btrfs RAID56 chunk, it's the same fixed logical->physical mapping. Thus we still have the problem.
As a result, we have to update the data in place. btrfs already has file-to-LBA mapping, so it shouldn't be too expensive to make btrfs free of the write hole (no need to maintain an extra mapping or add journaling).
Unfortunately, btrfs is not that flexible yet.

In fact, btrfs just does its mapping at a much smaller granularity.

So in btrfs we have the following mapping scheme:

                             1G      2G      3G      4G
Btrfs logical address space: | RAID1 | RAID5 | EMPTY | ...

Logical address range [1G, 2G) is mapped using RAID1, using some physical ranges from 2 devices in the pool.

Logical address range [2G, 3G) is mapped using RAID5, using some physical ranges from several devices in the pool.

Logical address range [3G, 4G) is not mapped; reading/writing that range would directly lead to -EIO.

By this, you can see, btrfs is not as flexible as you think.

Yes, we have file -> logical address mapping, but inside each mapped logical address range, everything is still a fixed mapping.

If we really want to make the extent allocator (which currently works at the logical address level, not caring about the underlying mapping at all) avoid partial stripe writes, it's a lot of cross-layer work.

In fact, Johannes is working on an extra layer of mapping for RAID56; with that it may be possible to do extra mapping to avoid partial writes. But that requires a lot of work, and may not even work for metadata.

Thus I'm still exploring the tried-and-true methods like write-intent-bitmap and journal for btrfs RAID56.

Thanks,
Qu
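The chunk-level mapping described above can be sketched as a lookup table from logical ranges to profiles. This is only a toy model of the example layout in this mail: the Chunk type, the CHUNKS table and lookup_chunk() are hypothetical names, not btrfs data structures.

```python
import errno
from dataclasses import dataclass

GiB = 1 << 30


@dataclass(frozen=True)
class Chunk:
    start: int    # first logical byte covered by the chunk
    length: int   # size of the logical range
    profile: str  # how the range is mapped onto devices


# The example layout from the mail: [1G, 2G) RAID1, [2G, 3G) RAID5, [3G, 4G) unmapped.
CHUNKS = [
    Chunk(start=1 * GiB, length=1 * GiB, profile="RAID1"),
    Chunk(start=2 * GiB, length=1 * GiB, profile="RAID5"),
]


def lookup_chunk(logical):
    """Return the chunk covering a logical address, or fail with EIO if unmapped."""
    for chunk in CHUNKS:
        if chunk.start <= logical < chunk.start + chunk.length:
            return chunk
    raise OSError(errno.EIO, "logical address is not mapped by any chunk")
```

The extent allocator only ever sees logical addresses; which stripe and device a byte lands on is decided entirely inside the chunk that covers it, which is why avoiding partial stripe writes needs cross-layer work.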
Thanks,
Song
However, I see you wrote:

Thus as long as no device is missing, a write-intent-bitmap is enough to address the write hole in btrfs (at least for COW protected data and all metadata).

That doesn't make sense. If no device is missing, then there is no write hole. If no device is missing, all you need to do is recalculate the parity blocks on any stripe that was recently written. In md we use the write-intent-bitmap. In btrfs I would expect that you would already have some way of knowing where recent writes happened, so you can validate the various checksums. That should be sufficient to recalculate the parity. I'd be very surprised if btrfs doesn't already do this.

So I'm somewhat confused as to what your real goal is.

NeilBrown
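For the non-degraded case, "recalculate the parity blocks on any stripe that was recently written" can be sketched as below. This is a minimal illustration assuming single XOR parity and one bitmap bit per stripe; the stripe layout (a dict with "data" and "parity") and the function names are hypothetical.

```python
def xor_all(blocks):
    """XOR a list of equally sized byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def resync_dirty_stripes(stripes, dirty_bits):
    """Recompute parity for every stripe whose bit is set.

    All data devices are assumed present, so the data blocks are taken as
    the truth and only the parity is rewritten where it disagrees.
    Returns the indices of the stripes that actually needed repair.
    """
    repaired = []
    for index in sorted(dirty_bits):
        stripe = stripes[index]
        expected = xor_all(stripe["data"])
        if stripe["parity"] != expected:
            stripe["parity"] = expected
            repaired.append(index)
    return repaired
```

Only stripes flagged in the bitmap are scanned, which is what keeps the post-crash resync short compared with checking the whole array.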