Thread: 143 messages, 5 authors, 2021-05-27

Re: [PATCH 40/45] xfs: convert CIL to unordered per cpu lists

From: Dave Chinner <david@fromorbit.com>
Date: 2021-03-12 02:18:59

On Wed, Mar 10, 2021 at 05:15:05PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:38PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <redacted>
> > 
> > So that we can remove the cil_lock which is a global serialisation
> > point. We've already got ordering sorted, so all we need to do is
> > treat the CIL list like the busy extent list and reconstruct it
> > before the push starts.

....

> > @@ -530,7 +511,6 @@ xlog_cil_insert_items(
> >  	 * the transaction commit.
> >  	 */
> >  	order = atomic_inc_return(&ctx->order_id);
> > -	spin_lock(&cil->xc_cil_lock);
> >  	list_for_each_entry(lip, &tp->t_items, li_trans) {
> >  
> >  		/* Skip items which aren't dirty in this transaction. */
> > @@ -540,10 +520,26 @@ xlog_cil_insert_items(
> >  		lip->li_order_id = order;
> >  		if (!list_empty(&lip->li_cil))
> >  			continue;
> > -		list_add(&lip->li_cil, &cil->xc_cil);
> > +		list_add(&lip->li_cil, &cilpcp->log_items);
> 
> Ok, so if I understand this correctly -- every time a transaction
> commits, it marks every dirty log item with a monotonically increasing
> counter.  If the log item isn't already on another CPU's CIL list, it
> gets added to the current CPU's CIL list...

Correct.
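
The insert step confirmed above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions, not the kernel code: the struct layouts, NR_CPUS, the on_list flag and the cil_insert_items() name are all hypothetical stand-ins, the "skip items which aren't dirty" check is left out, and the model is single-threaded apart from the atomic counter.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 4                     /* hypothetical fixed CPU count */

struct log_item {
	struct log_item *next;        /* link on a per-CPU list */
	int on_list;                  /* stands in for !list_empty(&lip->li_cil) */
	int order_id;                 /* stamp of the most recent commit */
};

struct cil_pcp {
	struct log_item *log_items;   /* head of this CPU's unordered list */
};

static atomic_int ctx_order_id;          /* models ctx->order_id */
static struct cil_pcp cil_pcp[NR_CPUS];  /* models the per-CPU CIL structures */

/*
 * Model of the insert step for one committing transaction: take one new
 * order ID, stamp every item with it, and add items that are not yet on
 * any list to the committing CPU's list. Returns the order ID used.
 */
static int cil_insert_items(int cpu, struct log_item **items, size_t n)
{
	/* atomic_fetch_add returns the old value; +1 mimics atomic_inc_return */
	int order = atomic_fetch_add(&ctx_order_id, 1) + 1;

	for (size_t i = 0; i < n; i++) {
		struct log_item *lip = items[i];

		lip->order_id = order;    /* a relog refreshes the stamp */
		if (lip->on_list)
			continue;         /* item stays on the list it is already on */
		lip->next = cil_pcp[cpu].log_items;
		cil_pcp[cpu].log_items = lip;
		lip->on_list = 1;
	}
	return order;
}
```

In this model a relogged item keeps its place on whichever list first received it and only its order ID changes, which is why the per-CPU lists end up unordered.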

> > +	}
> > +	put_cpu_ptr(cilpcp);
> > +
> > +	/*
> > +	 * If we've overrun the reservation, dump the tx details before we move
> > +	 * the log items. Shutdown is imminent...
> > +	 */
> > +	tp->t_ticket->t_curr_res -= ctx_res + len;
> > +	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
> > +		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
> > +		xfs_warn(log->l_mp,
> > +			 "  log items: %d bytes (iov hdrs: %d bytes)",
> > +			 len, iovhdr_res);
> > +		xfs_warn(log->l_mp, "  split region headers: %d bytes",
> > +			 split_res);
> > +		xfs_warn(log->l_mp, "  ctx ticket: %d bytes", ctx_res);
> > +		xlog_print_trans(tp);
> >  	}
> >  
> > -	spin_unlock(&cil->xc_cil_lock);
> >  
> >  	if (tp->t_ticket->t_curr_res < 0)
> >  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> > @@ -806,6 +802,7 @@ xlog_cil_push_work(
> >  	bool			commit_iclog_sync = false;
> >  	int			cpu;
> >  	struct xlog_cil_pcp	*cilpcp;
> > +	LIST_HEAD		(log_items);
> >  
> >  	new_ctx = xlog_cil_ctx_alloc();
> >  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> > @@ -822,6 +819,9 @@ xlog_cil_push_work(
> >  			list_splice_init(&cilpcp->busy_extents,
> >  					&ctx->busy_extents);
> >  		}
> > +		if (!list_empty(&cilpcp->log_items)) {
> > +			list_splice_init(&cilpcp->log_items, &log_items);
> 
> ...and then at CIL push time, we splice each per-CPU list into a big
> list, sort the dirty log items by counter number, and process them.

Yup, that's pretty much it. I'm replacing insert time ordering with
push-time ordering to get rid of the serialisation overhead of
insert time ordering.
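
A rough userspace model of that push-time ordering is sketched below. It is an illustration only: the singly linked list, NR_CPUS and the cil_gather()/cil_sort() names are hypothetical, the hand-written merge sort merely stands in for whatever list sort the push code uses, and sorting in ascending order ID (oldest commit first) is an assumption.

```c
#include <stddef.h>

#define NR_CPUS 4   /* hypothetical fixed CPU count */

struct log_item {
	struct log_item *next;
	int order_id;       /* stamp written at commit time */
};

/*
 * Splice every per-CPU list onto one list and empty the per-CPU heads.
 * Walking to the tail is O(n) here; splicing a doubly linked kernel
 * list would not need the walk.
 */
static struct log_item *cil_gather(struct log_item *pcp_heads[NR_CPUS])
{
	struct log_item *all = NULL;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		struct log_item *lip = pcp_heads[cpu];

		if (!lip)
			continue;
		while (lip->next)
			lip = lip->next;
		lip->next = all;
		all = pcp_heads[cpu];
		pcp_heads[cpu] = NULL;
	}
	return all;
}

/* Merge two lists that are each already sorted by ascending order ID. */
static struct log_item *cil_merge(struct log_item *a, struct log_item *b)
{
	struct log_item head = { NULL, 0 }, *tail = &head;

	while (a && b) {
		/* "<=" keeps items with equal IDs in their existing order */
		if (a->order_id <= b->order_id) {
			tail->next = a;
			a = a->next;
		} else {
			tail->next = b;
			b = b->next;
		}
		tail = tail->next;
	}
	tail->next = a ? a : b;
	return head.next;
}

/* Top-down merge sort of the gathered list by order ID. */
static struct log_item *cil_sort(struct log_item *list)
{
	struct log_item *slow = list, *fast, *second;

	if (!list || !list->next)
		return list;
	fast = list->next;
	while (fast && fast->next) {    /* slow stops at the midpoint */
		slow = slow->next;
		fast = fast->next->next;
	}
	second = slow->next;
	slow->next = NULL;
	return cil_merge(cil_sort(list), cil_sort(second));
}
```

Calling cil_sort(cil_gather(heads)) leaves every per-CPU head empty and returns one list in commit order, so the sorting cost is paid once per push instead of under a shared lock on every commit.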

> The first thought I had was that it's a darn shame that _insert_items
> can't steal a log item from another CPU's CIL list, because you could
> then mergesort the per-CPU CIL lists into @log_items.  Unfortunately, I
> don't think there's a safe way to steal items from a per-CPU list
> without involving locks.

Yeah, it needs locks because we then have to serialise local inserts
with remote removals. It can be done fairly easily - I just need to
replace the "order ID" field with the CPU ID of the list it is on.

The problem is that relogging happens a lot, so in some workloads we
might be bouncing a set of commonly accessed log items around CPUs
frequently. That said, I'm not sure this would end up a huge
problem, but it still needs a mergesort to be performed in the push
code...
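
For illustration, the stealing variant might look something like the userspace sketch below. Everything in it is an assumption rather than anything from the patch set: the per-list spinlock built on C11 atomic_flag, the NO_CPU marker, and the cil_pcp_init()/cil_insert_steal() names are hypothetical. It also assumes a given item is only handled by one committing transaction at a time.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 4      /* hypothetical fixed CPU count */
#define NO_CPU  (-1)   /* item is not on any list yet */

struct log_item {
	struct log_item *prev, *next;
	int cpu;               /* list the item is on, in place of an order ID */
};

struct cil_pcp {
	atomic_flag lock;      /* serialises local inserts with remote removals */
	struct log_item head;  /* sentinel of a circular doubly linked list */
};

static struct cil_pcp cil_pcp[NR_CPUS];

static void pcp_lock(struct cil_pcp *p)
{
	while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
		;   /* spin until the flag was previously clear */
}

static void pcp_unlock(struct cil_pcp *p)
{
	atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

static void cil_pcp_init(void)
{
	for (int i = 0; i < NR_CPUS; i++) {
		atomic_flag_clear(&cil_pcp[i].lock);
		cil_pcp[i].head.prev = cil_pcp[i].head.next = &cil_pcp[i].head;
	}
}

/* Put @lip at the tail of @cpu's list, removing it from its old list first. */
static void cil_insert_steal(int cpu, struct log_item *lip)
{
	struct cil_pcp *dst = &cil_pcp[cpu];

	if (lip->cpu != NO_CPU) {
		struct cil_pcp *src = &cil_pcp[lip->cpu];

		pcp_lock(src);              /* possibly a remote CPU's lock */
		lip->prev->next = lip->next;
		lip->next->prev = lip->prev;
		pcp_unlock(src);
	}

	pcp_lock(dst);                      /* only one lock is held at a time */
	lip->prev = dst->head.prev;
	lip->next = &dst->head;
	dst->head.prev->next = lip;
	dst->head.prev = lip;
	lip->cpu = cpu;
	pcp_unlock(dst);
}
```

In this model every relog moves the item to the tail of the committing CPU's list, so each list stays in local commit order, at the price of taking a remote lock whenever an item bounces between CPUs.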

> The second thought I had was that we have the xfs_pwork mechanism for
> launching a bunch of worker threads.  A pwork workqueue is (probably)
> too costly when the item list is short or there aren't that many CPUs,
> but once list_sort starts getting painful, would it be faster to launch
> a bunch of threads in push_work to sort each per-CPU list and then merge
> sort them into the final list?

Not sure, because now you have N work threads competing with the
userspace workload for CPU to do maybe 10ms of work. The scheduling
latency when the system is CPU bound is likely to introduce more
latency than you save by spreading the work out....

I've largely put these sorts of questions aside because optimising
this code further can be done later. The code as it stands doubles
the throughput of the commit path and I don't think that further
optimisation is immediately necessary. Ensuring that the splitting
and recombining of the lists still results in correctly ordered log
items is more important right now, and I think it does that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com