Re: getdents - ext4 vs btrfs performance
From: Ted Ts'o <tytso@mit.edu>
Date: 2012-03-18 20:56:58
Also in:
linux-btrfs, linux-fsdevel, lkml
On Thu, Mar 15, 2012 at 11:42:24AM +0100, Jacek Luczak wrote:
That was not an SVN server. It was a build host holding checkouts of SVN projects. The many files/dirs case is common for VCSes, and SVN is not the only one that would be affected here.
Well, with SVN it's 2x or 3x the number of files in the checked out source code directory, right? So if a particular source tree has 2,000 files in a source directory, then SVN might have at most 6,000 files, and if you assume each directory entry is 64 bytes, we're still talking about 375k. Do you have more files than that in a directory in practice with SVN? And if so why?
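The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the 64-byte entry size and the 3x multiplier are the assumptions stated in the text, and the helper name dir_bytes is made up for illustration. It ignores block padding and htree index overhead.

```python
def dir_bytes(n_entries: int, bytes_per_entry: int) -> int:
    """Rough size of a directory's entry data (no padding, no index blocks)."""
    return n_entries * bytes_per_entry


source_files = 2_000   # files in one source directory, from the text
svn_multiplier = 3     # upper bound from the text ("2x or 3x")
entry_size = 64        # assumed average directory entry size, from the text

total = dir_bytes(source_files * svn_multiplier, entry_size)
kib = total / 1024
print(f"{total} bytes = {kib} KiB")  # 384000 bytes = 375.0 KiB
```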
AFAIR git.kernel.org was also suffering from the getdents().
git.kernel.org was suffering from a different problem, which was that the git.kernel.org administrators didn't feel like automatically doing a "git gc" on all of the repositories, and a lot of people were just doing "git push" and not bothering to gc their repositories. Since git.kernel.org users don't have shell access any more, the git.kernel.org administrators have to be doing automatic git gc's. By default git is supposed to automatically do a gc when there are more than 6700 loose object files (which are distributed across 256 first-level directories, so in practice a .git/objects/XX directory shouldn't have more than 30 objects in it, with each directory entry taking 48 bytes). The problem, I believe, is that "git push" commands weren't checking the gc.auto limit, and so that's why git.kernel.org had in the past suffered from large directories. This is arguably a git bug, though, and as I mentioned, since none of us have shell access to git.kernel.org, this has to be handled automatically now....
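As a rough check of the per-directory numbers, here is a small sketch. It assumes loose objects spread evenly over the 256 fan-out directories (reasonable for hash-named objects, but still an assumption) and uses the 48-byte entry size quoted in the text. The constant and function names are made up for illustration.

```python
import math

GC_AUTO_DEFAULT = 6700  # default gc.auto threshold, from the text
FANOUT_DIRS = 256       # .git/objects/00 .. .git/objects/ff
ENTRY_BYTES = 48        # assumed directory entry size, from the text


def loose_objects_per_dir(threshold: int = GC_AUTO_DEFAULT,
                          fanout: int = FANOUT_DIRS) -> int:
    """Objects per fan-out directory at the threshold, assuming an even spread."""
    return math.ceil(threshold / fanout)


per_dir = loose_objects_per_dir()
per_dir_bytes = per_dir * ENTRY_BYTES
print(per_dir, "objects,", per_dir_bytes, "bytes of directory entries")
```

With an even spread this stays under the "no more than 30 objects" figure, and the entry data fits comfortably in a single directory block.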
Same applies to commercial products that are heavily stuffed with many files/dirs, e.g. ClearCase or Synergy.
How many files in a directory do we commonly see with these systems? I'm not familiar with them, and so I don't have a good feel for what typical directory sizes tend to be.
The medium size you are referring to would most probably fit into 256k, and this could be enough for 90% of cases. Large production systems running on ext4 need backups, thus those would benefit the most here.
Yeah, 256k or 512k is probably the best. Alternatively, the backup
programs could simply be taught to sort the directory entries by inode
number, and if that's not enough, to grab the initial block numbers
using FIEMAP and then sort by block number. Of course, all of this
optimization may or may not actually give us as much of a return as we
think, given that the disk is probably seeking from other workloads
happening in parallel anyway (another reason why I am suspicious that
timing the tar command may not be an accurate way of measuring actual
performance when you have other tasks accessing the file system in
parallel with the backup).
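A minimal sketch of the first suggestion, sorting directory entries by inode number before touching the files, might look like the following. It is not taken from any existing backup program; the function names are hypothetical, and the FIEMAP-based refinement is not covered here. It relies only on os.scandir() and DirEntry.inode() from the Python standard library.

```python
import os


def entries_by_inode(path):
    """Return the names in `path`, ordered by inode number rather than readdir order."""
    with os.scandir(path) as it:
        # DirEntry.inode() is normally filled in from the directory entry
        # itself, so building the sort key needs no extra stat() per file.
        pairs = [(entry.inode(), entry.name) for entry in it]
    pairs.sort()
    return [name for _, name in pairs]


def total_size_in_inode_order(path):
    """lstat() every entry in inode order and return the summed sizes."""
    total = 0
    for name in entries_by_inode(path):
        st = os.lstat(os.path.join(path, name))
        total += st.st_size
    return total
```

A backup tool would replace the lstat() call with whatever per-file work it does (open, read, archive). Whether this helps in practice depends on the caveat above: concurrent workloads may keep the disk seeking regardless.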
- Ted