Re: [PATCH RFC 5/5] ext4: Add fallocate2() support

From: Kirill Tkhai <hidden>
Date: 2020-03-02 11:08:20
Also in: linux-fsdevel, lkml

On 28.02.2020 18:35, Andreas Dilger wrote:

On Feb 27, 2020, at 5:24 AM, Kirill Tkhai [off-list ref] wrote:

quoted

On 27.02.2020 00:51, Andreas Dilger wrote:

quoted

On Feb 26, 2020, at 1:05 PM, Kirill Tkhai [off-list ref] wrote:

quoted

Why? There are two contradictory actions that a filesystem can't do at the same time:

1) place files at a distance from each other to minimize the number of extents
on possible future growth;
2) place small files in the same big block of the block device.

At initial allocation time you never know which file will stop growing at some
point in the future, i.e. which file is suitable for compaction. This knowledge
becomes available some time later.  Say, if a file has not been changed for a
month, it is suitable for compaction with other files like it.

If you can determine at allocation time which files won't grow in the future,
don't be afraid, just share your algorithm here.

Very few files grow after they are initially written/closed.  Those that
do are almost always opened with O_APPEND (e.g. log files).  It would be
reasonable to have O_APPEND cause the filesystem to reserve blocks (in
memory at least, maybe some small amount on disk like 1/4 of the current
file size) for the file to grow after it is closed.  We might use the
same heuristic for directories that grow long after initial creation.

1) Let's look at a real example. I created a new ext4 filesystem and ran the test below:
https://gist.github.com/tkhai/afd8458c0a3cc082a1230370c7d89c99

Two files are written here: one is 4KB, the other is 1MB-4KB.

$ filefrag -e test1.tmp test2.tmp
Filesystem type is: ef53
File size of test1.tmp is 4096 (1 block of 4096 bytes)
ext:     logical_offset:        physical_offset: length:   expected: flags:
  0:        0..       0:      33793..     33793:      1:             last,eof
test1.tmp: 1 extent found
File size of test2.tmp is 1044480 (255 blocks of 4096 bytes)
ext:     logical_offset:        physical_offset: length:   expected: flags:
  0:        0..     254:      33536..     33790:    255:             last,eof
test2.tmp: 1 extent found

The alignment of blocks in the filesystem is much easier to see if you use
"filefrag -e -x ..." to print the values in hex.  In this case, 33536 = 0x8300
so it is properly aligned on disk IMHO.

quoted

$debugfs:  testb 33791
Block 33791 not in use

test2.tmp starts at 131MB. If the discard granularity is 1MB, the placement of
test1.tmp prevents us from discarding the next 1MB block.
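
To make the placement concrete, here is a small sketch that maps the two extents from the filefrag output onto 1MB discard units. It assumes 4096-byte blocks and a 1MB discard granularity, as in the example; the helper name `discard_units` is made up for illustration.

```python
BLOCK_SIZE = 4096                 # filesystem block size from the filefrag output
DISCARD_GRANULARITY = 1 << 20     # assumed 1MB discard granularity
BLOCKS_PER_UNIT = DISCARD_GRANULARITY // BLOCK_SIZE   # 256 blocks per unit


def discard_units(first_block, last_block):
    """Return the indices of the discard units touched by an extent (hypothetical helper)."""
    return list(range(first_block // BLOCKS_PER_UNIT,
                      last_block // BLOCKS_PER_UNIT + 1))


# Physical extents copied from the filefrag output above.
extents = {
    "test1.tmp": (33793, 33793),
    "test2.tmp": (33536, 33790),
}

for name, (first, last) in extents.items():
    aligned = first % BLOCKS_PER_UNIT == 0
    print(f"{name}: start {first:#x}, units {discard_units(first, last)}, "
          f"unit-aligned start: {aligned}")
```

Under these assumptions test2.tmp sits entirely in unit 131 and test1.tmp in unit 132, so exactly 1MB of data keeps two 1MB units busy.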

For most filesystem uses, aligning the almost 1MB file on a 1MB boundary
is good.  That allows a full-stripe read/write for RAID, and is more
likely to align with the erase block for flash.  If it were to be allocated
after the 4KB block, then it may be that each 1MB-aligned read/write of a
large file would need to read/write two unaligned chunks per syscall.

quoted

2) Another example. Let's write two files: 1MB-4KB and 1MB+4KB:

# filefrag -e test3.tmp test4.tmp
Filesystem type is: ef53
File size of test3.tmp is 1052672 (257 blocks of 4096 bytes)
ext:     logical_offset:        physical_offset: length:   expected: flags:
  0:        0..     256:      35840..     36096:    257:             last,eof
test3.tmp: 1 extent found
File size of test4.tmp is 1044480 (255 blocks of 4096 bytes)
ext:     logical_offset:        physical_offset: length:   expected: flags:
  0:        0..     254:      35072..     35326:    255:             last,eof
test4.tmp: 1 extent found

Here again, "filefrag -e -x" would be helpful.  35840 = 0x8c00, and
35072 = 0x8900, so IMHO they are allocated properly for most uses.
Packing all files together sequentially on disk is what FAT did and
it always got very fragmented in the end.

quoted

They are not placed sequentially, and this is where fragmentation starts.

After both the tests:
$ df -h
/dev/loop0      2.0G   11M  1.8G   1% /root/mnt

The filesystem is almost empty, and all of the last block groups are free. E.g.,

Group 15: (Blocks 491520-524287) csum 0x3ef5 [INODE_UNINIT, ITABLE_ZEROED]
 Block bitmap at 272 (bg #0 + 272), csum 0xd52c1f66
 Inode bitmap at 288 (bg #0 + 288), csum 0x00000000
 Inode table at 7969-8480 (bg #0 + 7969)
 32768 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
 Free blocks: 491520-524287
 Free inodes: 122881-131072

but the two files are not packed together.

So, the ext4 block allocator does not work well for my workload. It does not
even know anything about the discard granularity of the underlying block device.
Does it? I assume no fs knows. Should I tell it?

You can tune the alignment of allocations via s_raid_stripe_width and
s_raid_stride in the ext4 superblock.  I believe these are also set by mke2fs
by libdisk, but I don't know if it takes flash erase block geometry into account.
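
As a rough illustration of the tuning mentioned above, the sketch below derives `stride`/`stripe_width` values (the `-E` extended options accepted by mke2fs and tune2fs) from a discard granularity in bytes. The sysfs file `/sys/block/<dev>/queue/discard_granularity` is where the kernel exports that value; the helper names are hypothetical, and whether matching the stride to the discard granularity gives the desired packing is an assumption, not something established in this thread.

```python
from pathlib import Path


def read_discard_granularity(dev):
    """Read the discard granularity in bytes from sysfs; None if unavailable."""
    path = Path("/sys/block") / dev / "queue" / "discard_granularity"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def stride_options(granularity, block_size=4096):
    """Build an -E option string for mke2fs/tune2fs from a granularity in bytes."""
    if granularity <= 0 or granularity % block_size:
        raise ValueError("granularity must be a positive multiple of the block size")
    blocks = granularity // block_size
    return f"stride={blocks},stripe_width={blocks}"


# The 1MB value from the discussion; read_discard_granularity("sda") could
# supply a real one ("sda" is only an example device name).
granularity = 1 << 20
print("mke2fs -E", stride_options(granularity))
```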

quoted

The main exception there is VM images, because they are not really "files"
in the normal sense, but containers aggregating a lot of different files,
each created with patterns that are not visible to the VM host.  In that
case, it would be better to have the VM host tell the filesystem that the
IO pattern is "random" and not try to optimize until the VM is cold.

quoted

In Virtuozzo we tried to compact ext4 with existing kernel interface:

https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c

But it does not work well in many situations, and the main problem is that
allocating blocks in a desired place is not possible. The block allocator
can't behave excellently for everything.

If this interface is bad, can you suggest another interface to let the block
allocator know the behavior expected from it in this specific case?

In ext4 there is already the "group" allocator, which combines multiple
small files together into a single preallocation group, so that the IO
to disk is large/contiguous.  The theory is that files written at the
same time will have similar lifespans, but that isn't always true.

If the files are large and still being written, the allocator will reserve
additional blocks (default 8MB I think) on the expectation that it will
continue to write until it is closed.

I think (correct me if I'm wrong) that your issue is with defragmenting
small files to free up contiguous space in the filesystem?  I think that
once the free space is cleared of small files, defragmenting large files is
easily done.  Anything with more than 8-16MB extents will max out most
storage anyway (seek rate * IO size).
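
The "seek rate * IO size" estimate can be worked through with a simple model in which every extent costs one seek plus a sequential transfer. The 10ms seek time and 150MB/s bandwidth are assumed figures for a hypothetical spinning disk, not numbers from this thread. With 100 seeks per second and 8MB extents the product is 800MB/s, well above that bandwidth, so seeks stop being the bottleneck.

```python
def effective_throughput(extent_mb, seek_ms=10.0, bandwidth_mb_s=150.0):
    """MB/s achieved when each extent costs one seek plus a sequential transfer."""
    transfer_s = extent_mb / bandwidth_mb_s
    return extent_mb / (seek_ms / 1000.0 + transfer_s)


# Compare small extents with the 8-16MB range mentioned above.
for extent_mb in (0.0625, 1, 8, 16):
    rate = effective_throughput(extent_mb)
    print(f"{extent_mb:>7} MB extents: {rate:6.1f} MB/s "
          f"({100 * rate / 150.0:.0f}% of sequential bandwidth)")
```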

My issue is mostly with files < 1MB, because the underlying device's discard
granularity is 1MB. The result of fragmentation is that the total size of the
occupied 1MB blocks of the device is 1.5 times bigger than the size of the data
actually written (say, as reported by df -h). And this is the problem.
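
As an illustration of how this ratio arises, the sketch below computes it for the four test files shown earlier, again assuming 4096-byte blocks and 1MB discard units. The function names are hypothetical, and this four-file toy case is not the workload behind the 1.5x figure.

```python
BLOCKS_PER_UNIT = (1 << 20) // 4096   # 256 blocks of 4096 bytes per assumed 1MB unit

# (first_block, last_block) of the four test files from the filefrag output.
EXTENTS = [
    (33793, 33793),   # test1.tmp
    (33536, 33790),   # test2.tmp
    (35840, 36096),   # test3.tmp
    (35072, 35326),   # test4.tmp
]


def occupied_units(extents):
    """Set of discard units that contain at least one allocated block."""
    units = set()
    for first, last in extents:
        units.update(range(first // BLOCKS_PER_UNIT, last // BLOCKS_PER_UNIT + 1))
    return units


def occupancy_ratio(extents):
    """Blocks pinned in used units divided by blocks actually written."""
    written = sum(last - first + 1 for first, last in extents)
    pinned = len(occupied_units(extents)) * BLOCKS_PER_UNIT
    return pinned / written


print(sorted(occupied_units(EXTENTS)))
print(round(occupancy_ratio(EXTENTS), 2))
```

Here 768 written blocks (3MB) pin five 1MB units, a ratio of about 1.67.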


Sure, and the group allocator will aggregate writes << prealloc size of
8MB by default.  If it is 1MB that doesn't qualify for group prealloc.
I think under 64KB does qualify for aggregation and unaligned writes.

quoted

In that case, an interesting userspace interface would be an array of
inode numbers (64-bit please) that should be packed together densely in
the order they are provided (maybe a flag for that).  That allows the
filesystem the freedom to find the physical blocks for the allocation,
while userspace can tell which files are related to each other.

So, this interface is 3-in-1:

1) finds a placement for the inodes' extents;

The target allocation size would be sum(size of inodes), which should
be relatively small in your case.

quoted

2) assigns this space to some temporary donor inode;

Maybe yes, or just reserves that space from being allocated by anyone.

quoted

3) calls ext4_move_extents() for each of them.

... using the target space that was reserved earlier

quoted

Do I understand you right?

Correct.  That is my "5 minutes thinking about an interface for grouping
small files together without exposing kernel internals" proposal for this.
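
The three steps discussed above can be sketched as a userspace toy model. This is not kernel code and not an existing interface: the `Inode` class, the boolean free-block bitmap and the first-fit search are all assumptions made purely to illustrate "find space, reserve it, move the files into it in the given order".

```python
from dataclasses import dataclass


@dataclass
class Inode:
    ino: int          # inode number (64-bit in the proposed interface)
    blocks: list      # physical block numbers currently backing the file


def find_free_run(bitmap, length):
    """Step 1: first-fit search for `length` contiguous free blocks (False = free)."""
    run_start, run_len = 0, 0
    for block, used in enumerate(bitmap):
        if used:
            run_start, run_len = block + 1, 0
        else:
            run_len += 1
            if run_len == length:
                return run_start
    return None


def pack_inodes(bitmap, inodes):
    """Pack the given inodes densely, in the order provided (toy model)."""
    total = sum(len(inode.blocks) for inode in inodes)
    start = find_free_run(bitmap, total)
    if start is None:
        return False
    # Step 2: reserve the target range so nobody else allocates from it.
    for block in range(start, start + total):
        bitmap[block] = True
    # Step 3: move each file into the reserved range and free its old blocks.
    cursor = start
    for inode in inodes:
        for old in inode.blocks:
            bitmap[old] = False
        inode.blocks = list(range(cursor, cursor + len(inode.blocks)))
        cursor += len(inode.blocks)
    return True


# Two small files scattered over a 32-block toy device.
bitmap = [False] * 32
a, b = Inode(12, [1]), Inode(13, [5, 6])
for block in a.blocks + b.blocks:
    bitmap[block] = True
print(pack_inodes(bitmap, [a, b]), a.blocks, b.blocks)
```

In this model userspace only says which files belong together and in what order; the choice of physical blocks stays with the allocator.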

OK. I'll think about a prototype and then publish it to the mailing list.

quoted

If so, then IMO it's good to start with two inodes, because otherwise one may
have to code a very complicated algorithm for placing many inodes, which may
require a lot of memory. Is this OK?

Well, if the files are small then it won't be a lot of memory.  Even so,
the kernel would only need to copy a few MB at a time in order to get
any decent performance, so I don't think that is a huge problem to have
several MB of dirty data in flight.

I don't mean in-flight IO, but the memory for all the file placement logic.
Userspace may build a multi-step algorithm which is hidden from the kernel:
pack two files together, then decrease the number of extents of some third
file, then pack something else.

Also, files from different directories should be packed together,
but it does not look good for the kernel to look up files' directories
by inode (our interface is about 64-bit inode numbers, right?).

To me it does not look good for the kernel to iterate over all files and look
for a placement for a specific file, since this is just excess work for the
kernel. Usually both files are chosen by userspace, and userspace does not
want to move more than one of them at a time.

quoted

Can we introduce a flag marking an inode as unmovable?

There are very few flags left in the ext4_inode->i_flags for use.
You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
also have other semantics.  The EXT4_NOTAIL_FL is for not merging the
tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
so we might consider it a generic "do not merge" flag if set?

quoted

Can this interface use knowledge about the underlying device's discard granularity?

As I wrote above, ext4+mballoc has a very good appreciation for alignment.
That was written for RAID storage devices, but it doesn't matter what
the reason is.  It isn't clear if flash discard alignment is easily
used (it may not be a power-of-two value or similar), but wouldn't be
harmful to try.

quoted

In my answer to Dave, I proposed making fallocate() take i_write_hint into
account. Could you please comment on what you think about that too?
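
For reference, here is a minimal sketch of how userspace sets the inode write-life hint today, using the F_SET_RW_HINT fcntl. The constants are taken from the Linux UAPI header <linux/fcntl.h> and defined locally so the sketch does not depend on the Python version. How fallocate() would consume the hint is the open proposal and is not shown; `set_write_hint` is a hypothetical helper name.

```python
import fcntl
import struct
import tempfile

# Values from the Linux UAPI header <linux/fcntl.h>.
F_LINUX_SPECIFIC_BASE = 1024
F_GET_RW_HINT = F_LINUX_SPECIFIC_BASE + 11
F_SET_RW_HINT = F_LINUX_SPECIFIC_BASE + 12
RWH_WRITE_LIFE_SHORT = 2
RWH_WRITE_LIFE_EXTREME = 5


def set_write_hint(fd, hint):
    """Set the inode write-life hint; return the value read back, or None if unsupported."""
    try:
        # The kernel expects a pointer to a 64-bit value for both commands.
        fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("Q", hint))
        raw = fcntl.fcntl(fd, F_GET_RW_HINT, struct.pack("Q", 0))
    except OSError:
        return None
    return struct.unpack("Q", raw)[0]


# Mark a scratch file as holding long-lived data, if the kernel allows it.
with tempfile.NamedTemporaryFile() as tmp:
    print(set_write_hint(tmp.fileno(), RWH_WRITE_LIFE_EXTREME))
```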

I'm not against that.  How the two interact would need to be documented
first and discussed to see if that makes sense, and then implemented.

Thanks,
Kirill
