Re: After unlinking a large file on ext4, the process stalls for a long time
From: Lukáš Czerner <hidden>
Date: 2014-07-17 10:40:16
Also in:
linux-fsdevel
On Thu, 17 Jul 2014, Mason wrote:
Date: Thu, 17 Jul 2014 12:30:34 +0200 From: Mason <redacted> To: Andreas Dilger <redacted> Cc: Ext4 Developers List <redacted>, linux-fsdevel [off-list ref] Subject: Re: After unlinking a large file on ext4, the process stalls for a long time Hello, Andreas Dilger wrote:quoted
Mason wrote:quoted
The use case is - allocate a large file - stick a file system on it - store stuff (typically video files) inside this "private" FS - when the user decides he doesn't need it anymore, unmount and unlink (I also have a resize operation in there, but I wanted to get the basics before taking the hard stuff head on.) So, in the limit, we don't store anything at all: just create and immediately delete. This was my test.I would agree that LVM is the real solution that you want to use. It is specifically designed for this, and has much less overhead than a filesystem on a loopback device on a file on another filesystem. The amount of space overhead is tuneable, but typically the volumes are allocated in multiples of 4MB chunks.I'll take a look at LVM. (But, at this point, it's too late to change the architecture of the system.)quoted
That said, I think you've found some kind of strange performance problem, and it is worthwhile to figure this out.quoted
quoted
quoted
/tmp # time ./foo /mnt/hdd/xxx 5 posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms] unlink(filename): 0 [0 ms] 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k 0inputs+0outputs (0major+168minor)pagefaults 0swaps /tmp # time ./foo /mnt/hdd/xxx 10 posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms] unlink(filename): 0 [0 ms] 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k 0inputs+0outputs (0major+168minor)pagefaults 0swaps /tmp # time ./foo /mnt/hdd/xxx 100 posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms] unlink(filename): 0 [0 ms] 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k 0inputs+0outputs (0major+168minor)pagefaults 0swaps /tmp # time ./foo /mnt/hdd/xxx 300 posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms] unlink(filename): 0 [0 ms] 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k 0inputs+0outputs (0major+168minor)pagefaults 0swapsPreliminary info: The partition was created/mounted with $ mkfs.ext4 -m 0 -i 1024000 -L ZOZO -O ^has_journal,^huge_file /dev/sda1 $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime (mount is busybox, in case it matters) mke2fs 1.42.10 (18-May-2014) /dev/sda1 contains a ext4 file system labelled 'ZOZO' last mounted on /mnt/hdd on Wed Jul 16 15:40:40 2014 Proceed anyway? (y,n) y Creating filesystem with 104857600 4k blocks and 460800 inodes Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000 Allocating group tables: done Writing inode tables: done Writing superblocks and filesystem accounting information: done /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1) /* No support for xattr in this kernel */ # dumpe2fs -h /dev/sda1 dumpe2fs 1.42.10 (18-May-2014) Filesystem volume name: ZOZO Last mounted on: <not available> Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options: user_xattr acl Filesystem state: not clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 460800 Block count: 104857600 Reserved block count: 0 Free blocks: 104803944 Free inodes: 460789 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 999 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 144 Inode blocks per group: 9 Flex block group size: 16 Filesystem created: Thu Jul 17 11:14:27 2014 Last mount time: Thu Jul 17 11:14:29 2014 Last write time: Thu Jul 17 11:14:29 2014 Mount count: 1 Maximum mount count: -1 Last checked: Thu Jul 17 11:14:27 2014 Check interval: 0 (<none>) Lifetime writes: 4883 kB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group unknown) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Default directory hash: half_md4 Directory Hash Seed: 157f2107-76fc-417b-9a07-491951c873b7quoted
Firstly, have you tried using "fallocate()" directly, instead of posix_fallocate()? It may be (depending on your userspace) that posix_fallocate() is writing zeroes to the file instead of using the fallocate() syscall, and the kernel is busy cleaning up all of the dirty pages when the file is unlinked. You could try using strace to see what system calls are actually being used.Unfortunately, I'm using a prehistoric version of glibc (2.8) that doesn't support the fallocate wrapper (imported in 2.10). I'm 70% sure that posix_fallocate() is not actually writing zeros to the file, because when I tested it on ext2, creating a 300-GB file took hours, literally (approx. 3 hours). The same operation on ext4 takes a few seconds. (Although, now that I think of it, it could be working asynchronously, or defer some operation, that I eventually have to pay for on deletion.) # time strace -tt -T ./foo /mnt/hdd/xxx 300 2> strace.out posix_fallocate(fd, 0, size_in_GiB << 30): 0 [414 ms] unlink(filename): 0 [1 ms] 12:23:27.218838 open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000486> 12:23:27.220121 clock_gettime(CLOCK_MONOTONIC, {79879, 926227018}) = 0 <0.000105> 12:23:27.221029 SYS_4320() = 0 <0.412013> 12:23:27.633673 clock_gettime(CLOCK_MONOTONIC, {79880, 339646593}) = 0 <0.000104> 12:23:27.634657 fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000116> 12:23:27.636187 ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000146> 12:23:27.637509 old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77248000 <0.000143> 12:23:27.638306 write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000237> 12:23:27.639496 clock_gettime(CLOCK_MONOTONIC, {79880, 345448452}) = 0 <0.000102> 12:23:27.640168 unlink("/mnt/hdd/xxx") = 0 <0.000231> 12:23:27.641174 clock_gettime(CLOCK_MONOTONIC, {79880, 347202581}) = 0 <0.000100> 12:23:27.641984 write(1, "unlink(filename): 0 [1 ms]\n", 27) = 27 <0.000157> 12:23:27.643056 exit_group(0) = ? 0.02user 111.51system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 864maxresident)k 0inputs+0outputs (0major+459minor)pagefaults 0swaps
So it really does not seem to be stalling in fallocate, nor unlink. Can you add close() before unlink, just to be sure what's happening there ? Thanks! -Lukas
AFAICT, SYS_4320() is fallocate. /* * Linux o32 style syscalls are in the range from 4000 to 4999. */ #define __NR_Linux 4000 #define __NR_fallocate (__NR_Linux + 320) Where is the process stalling? That is a mystery. Seems it's stuck in exit_group(), waiting for the kernel to clean up on its behalf? Maybe I need ftrace, or something to profile the kernel?quoted
Secondly, where is the process actually stuck? From your output above, the unlink() call takes no measurable time before returning, so I don't see where it is actually stuck. Again, running your test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which syscall is actually taking so much time to complete. I don't think it is unlink().See above, the process is stalled, but I don't know where!