Thread (9 messages) 9 messages, 3 authors, 2019-08-27

Re: [PATCH] ext4: attempt to shrink directory on dentry removal

From: Andreas Dilger <hidden>
Date: 2019-08-27 20:53:53

On Aug 26, 2019, at 3:46 PM, harshad shirwadkar [off-list ref] wrote:
I see, this is an interesting approach. I think we can do this without
modifying on-disk format and by bearing the cost of 2 extra reads per
merge. Whenever dentry deletion results in a dirent block that is
_sufficiently_ empty, we can look at its parent dx block and and find
its neighbors that can be merged. We can be aggressive and do this for
every dentry deletion or we could do this whenever the current dirent
block is half empty or something.
The point of storing the "leaf block fullness" into the high bits of
dx_entry->block is that this information would always be available
during htree traversal, even if the leaf blocks are not in memory.
That would allow the unlink code to make a decision in advance whether
shrinking is going to happen (e.g. leaf block plus one neighbour are
under 50% full).

The overhead would be to keep the "leaf block fullness" value updated
enough to be useful.  In this regard, a lower-resolution field (maybe
1 or 2 bits is enough, representing { unset, <= 25% full }, or { unset,
<= 20% full, <= 40% full, > 40% full }.  Anything above this threshold
isn't interesting to consider merging, so there is no need to track it.
By this method we end up reading up to 2 extra blocks (one previous
and one next) that are not going to be merged. That's the trade-off we
have to make in order to avoid any changes to on-disk structure (If we
modify the on-disk structure and store the fullness in the dx block,
we would read only the blocks that need to be merged).
Right.  As Ted wrote, we could do the shrinking optimistically if the
dx_entry field is "unset", but the blocks are in memory.  However, in
when the leaf block is first read into memory, we could also update the
dx_entry fullness when it is sanity checked, so that we don't need to
keep checking whether the leaf block is empty or not.  The dx_entry
update could be done only in memory and written opportunistically to
disk if the htree index block is modified.  At worst, if the dx_entry
fullness is incorrect the merge could fail and the block is not freed.
The same logic can also be applied to intermediate dx nodes as well.
After every merge operation, we'll have a set of blocks that need to
be freed. Once we know what blocks we can free, we can use Ted's idea
of swapping them with the last block, one by one.

Since merging approach also requires a way to free up directory
blocks, I think we could first get a patch in that can free up
directory blocks by swapping with the last block. Once we have that
then we could implement merging.
Yes, definitely.  I'm not suggesting that we don't need the ability
to free leaf/index blocks first.  I just wanted to point out that we
will not see any blocks being freed until the directory is almost
completely (99%) empty.  For testing purposes you could use 1KB blocks
and 250-character names (which means at most 3 entries per block), and
we would start seeing blocks being freed once the directory was below
about 30% full, but this does not represent a common use case.

Cheers, Andreas
On Sun, Aug 25, 2019 at 10:07 PM Andreas Dilger [off-list ref] wrote:
quoted
On Aug 21, 2019, at 12:27 PM, Harshad Shirwadkar [off-list ref] wrote:
quoted
On every dentry deletion, this patch traverses directory inode in
reverse direction and frees up empty dirent blocks until it finds a
non-empty dirent block. We leverage the fact that we never clear
dentry name when we delete dentrys by merging them with the previous
one. So, even though the dirent block has only fake dentry which spans
across the entire block, we can use the name in this dead entry to
perform dx lookup and find intermediate dx node blocks as well as
offset inside these blocks.

One high-level limitation with this implementation is that it is unlikely
to remove any directory blocks until the directory is almost entirely
empty since "rm -r" will return entries in hash order, which does not
match the order that the leaf blocks are allocated in the file.  Even
worse, if files in the directory are not deleted in hash order, no leaf
block will be completely empty until about 99% of the files have been
deleted - assume 24-byte filenames in 4096-byte blocks means up to 128
entries per block, typically 3/4 full, or 1/96 entries will be left in
each block before it becomes empty.

One option that was discussed in the past is to use the high 4 bits
of dx_entry->block (i.e. the opposite of dx_get_block()) to store the
"fullness" of each block (in 1/16th of a block, or 256-byte increments
for 4096-byte blocks) and try to merge entries into an adjacent block
if it becomes mostly empty (e.g. if the current block plus the neighbour
are below 50% full).  That allows removing blocks much earlier as the
directory shrinks, rather than waiting until each block is completely
empty.  A fullness of "0" would mean "unset", since we don't set it
yet, and once this feature is available there would never be a block
that is entirely empty.
quoted
As of now, we only support non-indexed directories and indexed
directories with no intermediate dx nodes. This technique can also be
used to remove intermediate dx nodes. But it needs a little more
interesting logic to make that happen since we don't store directory
entry name in intermediate nodes.

Ran kvm-xfstests smoke test-suite and verified that there are no
failures. Also, verified that when all the files are deleted in a
directory, directory shrinks to either 4k for non-indexed directories
or 8k for indexed directories with 1 level.

This patch is an improvement over previous patch that I sent out
earlier this month. So, if this patch looks alright, then we can drop
the other shrinking patch.
quoted
Signed-off-by: Harshad Shirwadkar <redacted>

---
This patch supersedes the other directory shrinking patch sent in Aug
2019 ("ext4: shrink directory when last block is empty").

fs/ext4/namei.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 176 insertions(+)



+static inline bool is_empty_dirent_block(struct inode *dir,
+                                      struct buffer_head *bh)
+{
+     struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data;
+     int     csum_size = 0;
+
+     if (ext4_has_metadata_csum(dir->i_sb))
+             csum_size = sizeof(struct ext4_dir_entry_tail);
+
+     return ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize) ==
+                     dir->i_sb->s_blocksize - csum_size && de->inode == 0;
+}
This may not always detect empty directory blocks properly, because
ext4_generic_delete_entry() will only merge deleted entries with the
previous entry.  It at least appears possible that if entries are not
deleted in the proper order (e.g. in reverse of the order they are
listed in the directory) there may be multiple empty entries in a block,
and the above check will fail.

Instead, this checks should walk all entries in a block and return false
if any one of them has a non-zero de->inode.  In the common case there
may be only a single entry, or the first entry will be used, so it
should be fairly quick to decide that the block cannot be removed.

Another option is to change ext4_generic_delete_entry() to also try
to merge with the immediately following entry to ensure that an empty
block always has rec_len of the full blocksize.  However, I think this
is probably not a worthwhile effort since it would be better to support
removing blocks that are partly empty rather than entirely empty.
quoted
@@ -2510,6 +2684,8 @@ static int ext4_delete_entry(handle_t *handle,
     if (unlikely(err))
             goto out;

+     ext4_try_dir_shrink(handle, dir);
+
     return 0;
I think it would be inefficient to try shrinking the directory after
_every_ directory entry is removed.  Instead, there should be some
way to determine here if ext4_generic_delete_entry() removed the last
entry from the directory block, and only shrink in that case.

Cheers, Andreas




Cheers, Andreas




Attachments

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help