Re: heavy xfsaild I/O blocking process exit
From: Momtchil Momtchev <hidden>
Date: 2021-09-09 14:26:06
I just found the problem - it is indeed the speculative allocation. The process is leaking file descriptors (Node.js raises the limit itself so leaks tend to remain hidden) - and when it is killed it starts going through tens even hundreds of thousands of descriptors. Thanks for all the tips. On 08/09/2021 23:27, Dave Chinner wrote:
On Wed, Sep 08, 2021 at 10:15:59AM +0200, Momtchil Momtchev wrote:quoted
Hello, I have a puzzling problem with XFS on Debian 10. I am running number-crunching driven by Node.js - I have a process that creates about 2 million 1MB to 5MB files per day with an about 24h lifespan (weather forecasting). The file system is obviously heavily fragmented. I have absolutely no problems when running this in cruise mode, but every time I decide to stop that process, especially when it has been running for a fewWhat does "stop that process" mean? You kill it, or do you run a stop command that tells the process to do a controlled shutdown?quoted
weeks or months, the process will become a zombie (freeing all its user memory and file descriptors) and then xfsaild/kworker will continue flushing the log for about 30-45 minutes before the process really quits.The xfsaild is not flushing the log. It's doing metadata writeback. If it is constantly busy, it means the log has run out of space and something else wants log space. That something else will block until the log space has been freed up by metadata writeback....quoted
It will keep its binds to network ports (which is my main problem) but the system will remain responsive and usable. The I/O pattern is several seconds of random reading then a second or two of sequential writing.That would be expected from final close on lots of dirty inodes or finalising unlinks on final close. But that won't stop anything else from functioning.quoted
The kernel functions that are running in the zombie process context are mainly xfs_btree_lookup, xfs_log_commit_cil, xfs_next_bit, xfs_buf_find_isra.26Full profiles (e.g. from perf top -U -p <pid>) would be useful here, but this sounds very much like extent removal on final close. This will be removing either speculative preallocation beyond EOF or the workload has open but unlinked files and the unlink is being done at process exit. Either way, if the files are fragmented into millions of extents, this could take minutes per file being closed. But with only 1-5MB files, that shouldn't be occurring...quoted
xfsaild is spending time in radix_tree_next_chunk, xfs_inode_buf_verifyxfsaild should never be doing radix tree lookups - it only works on internal in-memory filessytem objects that it has direct references to. IOWs, I really need to see the actual profile outputs to determine what it is doing... xfs_inode_buf_verify() is expected if it is writing back dirty inode clusters. Which it will be, but at only 2 million files a day I wouldn't expect that to show up in profiles at all. It doesn't really even show up in profiles even at half a million inodes per _second_quoted
kworker is in xfs_reclaim_inode, radix_tree_next_chunkWhich kworker is that? Likely background inode reclaim, but that doesn't limit anything - it just indicates there are inodes available to be reclaimed.quoted
This is on (standard up-to date Debian 10): Linux version 4.19.0-16-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.181-1 (2021-03-19) xfs_progs 4.20.0-1 File system is RAID-0, 2x2TB disks with LVM over md (512k chunks)Are these SSDs or HDDs? I'll assume HDD at this point.quoted
meta-data=/dev/mapper/vg0-home isize=512 agcount=32, agsize=29849728 blks = sectsz=4096 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=0 data = bsize=4096 blocks=955191296, imaxpct=5 = sunit=128 swidth=256 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=466402, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0Ok, so you have a 1.7GB log. If those are HDDs, then you could have hundreds of thousands of dirty inodes tracked in the log, and metadata writeback has been falling behind for days because the log can be filled much faster than it can be drained. Assuming 200 write IOPS, 30 minutes would be 360,000 writes, which pretty much matches up with having half a million dirty inodes in the log and the process exiting needing to run a bunch of transactions that need a chunk of log space to make progress and having to wait on inode writeback to free up log space...quoted
MemTotal: 32800968 kB MemFree: 759308 kB MemAvailable: 27941208 kB Buffers: 43900 kB Cached: 26504332 kB SwapCached: 7560 kB Active: 16101380 kB Inactive: 11488252 kB Active(anon): 813424 kB Inactive(anon): 228180 kB Active(file): 15287956 kB Inactive(file): 11260072 kBSo all your memory is in the page cache.quoted
Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 16777212 kB SwapFree: 16715524 kB Dirty: 2228 kBAnd almost all the page cache is clean.quoted
Writeback: 0 kB AnonPages: 1034280 kB Mapped: 89660 kB Shmem: 188 kB Slab: 1508868 kB SReclaimable: 1097804 kB SUnreclaim: 411064 kBAnd that's enough slab cache to hold half a million cached, dirty inodes... More information required. Cheers, Dave.
-- Momtchil Momtchev [off-list ref]