Re: [PATCH V2 7/7] ext4: get discard out of jbd2 commit kthread contex

From: Wang Jianchao <hidden>
Date: 2021-05-28 03:07:00
Also in: linux-fsdevel


On 2021/5/28 4:18 AM, Andreas Dilger wrote:

On May 26, 2021, at 2:44 AM, Wang Jianchao [off-list ref] wrote:

quoted

Right now, discard is issued and waited to be completed in jbd2
commit kthread context after the logs are committed. When large
amount of files are deleted and discard is flooding, jbd2 commit
kthread can be blocked for long time. Then all of the metadata
operations can be blocked to wait the log space.

One case is the page fault path with read mm->mmap_sem held, which
wants to update the file time but has to wait for the log space.
When other threads in the task wants to do mmap, then write mmap_sem
is blocked. Finally all of the following read mmap_sem requirements
are blocked, even the ps command which need to read the /proc/pid/
-cmdline. Our monitor service which needs to read /proc/pid/cmdline
used to be blocked for 5 mins.

This patch frees the blocks back to buddy after commit and then do
discard in a async kworker context in fstrim fashion, namely,
- mark blocks to be discarded as used if they have not been allocated
- do discard
- mark them free
After this, jbd2 commit kthread won't be blocked any more by discard
and we won't get NOSPC even if the discard is slow or throttled.

I definitely agree that sharing the existing fstrim functionality makes
the most sense here.  Some comments inline on the implementation.

quoted

Link: https://marc.info/?l=linux-kernel&m=162143690731901&w=2
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Wang Jianchao <redacted>
---
fs/ext4/ext4.h    |   2 +
fs/ext4/mballoc.c | 162 +++++++++++++++++++++++++++++++++---------------------
fs/ext4/mballoc.h |   3 -
3 files changed, 101 insertions(+), 66 deletions(-)

@@ -3024,30 +3039,77 @@ static inline int ext4_issue_discard(struct super_block *sb,

		return sb_issue_discard(sb, discard_block, count, GFP_NOFS, 0);
}

-static void ext4_free_data_in_buddy(struct super_block *sb,
-				    struct ext4_free_data *entry)
+static void ext4_discard_work(struct work_struct *work)
{
+	struct ext4_sb_info *sbi = container_of(work,
+			struct ext4_sb_info, s_discard_work);
+	struct super_block *sb = sbi->s_sb;
+	ext4_group_t ngroups = ext4_get_groups_count(sb);
+	struct ext4_group_info *grp;
+	struct ext4_free_data *fd, *nfd;
	struct ext4_buddy e4b;
-	struct ext4_group_info *db;
-	int err, count = 0, count2 = 0;
+	int i, err;
+
+	for (i = 0; i < ngroups; i++) {
+		grp = ext4_get_group_info(sb, i);
+		if (RB_EMPTY_ROOT(&grp->bb_discard_root))
+			continue;

For large filesystems there may be millions of block groups, so it
seems inefficient to scan all of the groups each time the work queue

Yes it seems to be. At the moment I cooked the patch, I thought kwork is
running on background, it should not be a big deal.

is run.  Having a list of block group numbers, or bitmap/rbtree/xarray
of the group numbers that need to be trimmed may be more efficient?

Maybe we can use a bitmap to record the bgs that need to be trimed

Best regards
Jianchao

Most of the complexity in the rest of the patch goes away if the trim
tracking is only done on a whole-group basis (min/max or just a single
bit per group).

Cheers, Andreas

quoted

-	mb_debug(sb, "gonna free %u blocks in group %u (0x%p):",

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help