Re: How capacious and well-indexed are ext4, xfs and btrfs directories?

From: "Darrick J. Wong" <djwong@kernel.org>
Date: 2021-05-25 04:21:40
Also in: linux-btrfs, linux-fsdevel, linux-xfs

On Sat, May 23, 2021 at 10:51:02PM -0700, Josh Triplett wrote:

On Thu, May 20, 2021 at 11:13:28PM -0600, Andreas Dilger wrote:

On May 17, 2021, at 9:06 AM, David Howells [off-list ref] wrote:

With filesystems like ext4, xfs and btrfs, what are the limits on directory
capacity, and how well are they indexed?

The reason I ask is that inside of cachefiles, I insert fanout directories
inside index directories to divide up the space for ext2 to cope with the
limits on directory sizes and that it did linear searches (IIRC).

For some applications, I need to be able to cache over 1M entries (render
farm) and even a kernel tree has over 100k.

What I'd like to do is remove the fanout directories, so that for each logical
"volume"[*] I have a single directory with all the files in it.  But that
means sticking massive amounts of entries into a single directory and hoping
it (a) isn't too slow and (b) doesn't hit the capacity limit.

Ext4 can comfortably handle ~12M entries in a single directory, if the
filenames are not too long (e.g. 32 bytes or so).  With the "large_dir"
feature (since 4.13, but not enabled by default) a single directory can
hold around 4B entries, basically all the inodes of a filesystem.
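As a rough cross-check of those figures, here is a back-of-envelope sketch of the htree arithmetic. It assumes 4 KiB blocks, 8-byte index entries, a 32-byte header in the root index block and an 8-byte header in other index blocks, no metadata checksums, and leaf blocks that are on average half full. The function name and the fill factor are illustrative choices, not anything taken from the ext4 code, and real directories will differ.

```python
def htree_capacity(block_size=4096, name_len=32, levels=2, fill=0.5):
    """Estimate how many entries an ext4 htree directory can index.

    levels=2 models the default layout; levels=3 models large_dir.
    fill is an assumed average leaf-block occupancy (hypothetical value).
    """
    dx_entry = 8                                   # 4-byte hash + 4-byte block number
    root_fanout = (block_size - 32) // dx_entry    # root also holds "." / ".." and index info
    node_fanout = (block_size - 8) // dx_entry     # other index blocks hold a small header
    dirent = 8 + (name_len + 3) // 4 * 4           # 8-byte header + name padded to 4 bytes
    per_leaf = block_size // dirent                # entries in one completely full leaf block
    leaf_blocks = root_fanout * node_fanout ** (levels - 1)
    return int(leaf_blocks * per_leaf * fill)


if __name__ == "__main__":
    print("two index levels:  ", htree_capacity(levels=2))
    print("three index levels:", htree_capacity(levels=3))
```

With these assumptions the two-level case comes out at roughly 13 million entries, the same order as the ~12M quoted above, and the three-level case comes out at several billion, at which point the 32-bit inode count becomes the practical limit.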

ext4 definitely seems to be able to handle it. I've seen bottlenecks in
other parts of the storage stack, though.

With a normal NVMe drive, a dm-crypt volume containing ext4, and discard
enabled (on both ext4 and dm-crypt), I've seen rm -r of a directory with
a few million entries (each pointing to a ~4-8k file) take the better
part of an hour, almost all of it system time in iowait. Also makes any
other concurrent disk writes hang, even a simple "touch x". Turning off
discard speeds it up by several orders of magnitude.

Synchronous discard is slow, even on NVMe.

Background discard (aka fstrim in a cron job) isn't quite as bad, at
least in the sense of amortizing a bunch of clearing over an entire week
of not issuing discards. :P

--D
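To see whether a given system is using synchronous discard at the filesystem level, one option is to look for the "discard" mount option in /proc/mounts. The following is a minimal sketch; the helper name mounts_with_discard is made up for illustration, and it only matches the plain "discard" option, so variants such as btrfs's "discard=async" and the dm-crypt side of the setup are not covered.

```python
def mounts_with_discard(mounts_text):
    """Return (mountpoint, fstype) pairs whose options include plain 'discard'.

    mounts_text is the content of /proc/mounts, where each line reads:
    device mountpoint fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        if "discard" in options.split(","):
            found.append((mountpoint, fstype))
    return found


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for mountpoint, fstype in mounts_with_discard(f.read()):
            print(f"{mountpoint} ({fstype}) is mounted with discard")
```

If a filesystem shows up here, remounting it without that option and running fstrim periodically is the batched alternative described above.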

(I don't know if this is a known issue or not, so here are the details
just in case it isn't. Also, if this is already fixed in a newer kernel,
my apologies for the outdated report.)

$ uname -a
Linux s 5.10.0-6-amd64 #1 SMP Debian 5.10.28-1 (2021-04-09) x86_64 GNU/Linux

Reproducer (doesn't take *as* long but still long enough to demonstrate
the issue):
$ mkdir testdir
$ time python3 -c 'for i in range(1000000): open(f"testdir/{i}", "wb").write(b"test data")'
$ time rm -r testdir
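For anyone who wants to compare runs with and without discard, the same reproducer can be written as a small script that times the create and remove phases separately. This is only a sketch of the steps above; the function name and the command-line handling are assumptions for illustration, and the default count matches the one-liner.

```python
import os
import shutil
import sys
import time


def time_create_and_remove(parent, count):
    """Create `count` small files in parent/testdir, then remove the tree.

    Returns (create_seconds, remove_seconds) measured with a monotonic clock.
    """
    testdir = os.path.join(parent, "testdir")
    os.mkdir(testdir)

    start = time.monotonic()
    for i in range(count):
        with open(os.path.join(testdir, str(i)), "wb") as f:
            f.write(b"test data")
    created = time.monotonic()

    shutil.rmtree(testdir)  # equivalent of "rm -r testdir"
    removed = time.monotonic()

    return created - start, removed - created


if __name__ == "__main__":
    parent = sys.argv[1] if len(sys.argv) > 1 else "."
    count = int(sys.argv[2]) if len(sys.argv) > 2 else 1000000
    create_s, remove_s = time_create_and_remove(parent, count)
    print(f"create: {create_s:.1f}s  remove: {remove_s:.1f}s")
```

Run it once on a filesystem mounted with discard and once without; the remove time is the number of interest.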

dmesg details:

INFO: task rm:379934 blocked for more than 120 seconds.
      Not tainted 5.10.0-6-amd64 #1 Debian 5.10.28-1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:rm              state:D stack:    0 pid:379934 ppid:379461 flags:0x00004000
Call Trace:
 __schedule+0x282/0x870
 schedule+0x46/0xb0
 wait_transaction_locked+0x8a/0xd0 [jbd2]
 ? add_wait_queue_exclusive+0x70/0x70
 add_transaction_credits+0xd6/0x2a0 [jbd2]
 start_this_handle+0xfb/0x520 [jbd2]
 ? jbd2__journal_start+0x8d/0x1e0 [jbd2]
 ? kmem_cache_alloc+0xed/0x1f0
 jbd2__journal_start+0xf7/0x1e0 [jbd2]
 __ext4_journal_start_sb+0xf3/0x110 [ext4]
 ext4_evict_inode+0x24c/0x630 [ext4]
 evict+0xd1/0x1a0
 do_unlinkat+0x1db/0x2f0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f088f0c3b87
RSP: 002b:00007ffc8d3a27a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
RAX: ffffffffffffffda RBX: 000055ffee46de70 RCX: 00007f088f0c3b87
RDX: 0000000000000000 RSI: 000055ffee46df78 RDI: 0000000000000004
RBP: 000055ffece9daa0 R08: 0000000000000100 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffc8d3a2980 R14: 00007ffc8d3a2980 R15: 0000000000000002
INFO: task touch:379982 blocked for more than 120 seconds.
      Not tainted 5.10.0-6-amd64 #1 Debian 5.10.28-1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:touch           state:D stack:    0 pid:379982 ppid:379969 flags:0x00000000
Call Trace:
 __schedule+0x282/0x870
 schedule+0x46/0xb0
 wait_transaction_locked+0x8a/0xd0 [jbd2]
 ? add_wait_queue_exclusive+0x70/0x70
 add_transaction_credits+0xd6/0x2a0 [jbd2]
 ? xas_load+0x5/0x70
 ? find_get_entry+0xd1/0x170
 start_this_handle+0xfb/0x520 [jbd2]
 ? jbd2__journal_start+0x8d/0x1e0 [jbd2]
 ? kmem_cache_alloc+0xed/0x1f0
 jbd2__journal_start+0xf7/0x1e0 [jbd2]
 __ext4_journal_start_sb+0xf3/0x110 [ext4]
 __ext4_new_inode+0x721/0x1670 [ext4]
 ext4_create+0x106/0x1b0 [ext4]
 path_openat+0xde1/0x1080
 do_filp_open+0x88/0x130
 ? getname_flags.part.0+0x29/0x1a0
 ? __check_object_size+0x136/0x150
 do_sys_openat2+0x97/0x150
 __x64_sys_openat+0x54/0x90
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb2afb8fbe7
RSP: 002b:00007ffee3e287b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007ffee3e28a68 RCX: 00007fb2afb8fbe7
RDX: 0000000000000941 RSI: 00007ffee3e2a340 RDI: 00000000ffffff9c
RBP: 00007ffee3e2a340 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
R13: 00007ffee3e2a340 R14: 0000000000000000 R15: 0000000000000000
