Thread (51 messages) 51 messages, 8 authors, 2008-09-04

Re: [PATCH 3/5] sched: make double-lock-balance fair

From: Nick Piggin <hidden>
Date: 2008-08-27 11:53:52
Also in: lkml

On Wednesday 27 August 2008 21:41, Gregory Haskins wrote:
Nick Piggin wrote:
quoted
On Tuesday 26 August 2008 22:23, Gregory Haskins wrote:
quoted
Nick Piggin wrote:
quoted
On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
quoted
double_lock balance() currently favors logically lower cpus since they
often do not have to release their own lock to acquire a second lock.
The result is that logically higher cpus can get starved when there is
a lot of pressure on the RQs.  This can result in higher latencies on
higher cpu-ids.

This patch makes the algorithm more fair by forcing all paths to have
to release both locks before acquiring them again.  Since callsites to
double_lock_balance already consider it a potential
preemption/reschedule point, they have the proper logic to recheck for
atomicity violations.

Signed-off-by: Gregory Haskins <redacted>
---

 kernel/sched.c |   17 +++++------------
 1 files changed, 5 insertions(+), 12 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 6e0bde6..b7326cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2790,23 +2790,16 @@ static int double_lock_balance(struct rq
*this_rq, struct rq *busiest) __acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	int ret = 0;
-
 	if (unlikely(!irqs_disabled())) {
 		/* printk() doesn't work good under rq->lock */
 		spin_unlock(&this_rq->lock);
 		BUG_ON(1);
 	}
-	if (unlikely(!spin_trylock(&busiest->lock))) {
-		if (busiest < this_rq) {
-			spin_unlock(&this_rq->lock);
-			spin_lock(&busiest->lock);
-			spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
-			ret = 1;
-		} else
-			spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
-	}
-	return ret;
+
+	spin_unlock(&this_rq->lock);
+	double_rq_lock(this_rq, busiest);
Rather than adding the extra atomic operation, can't you just put this
into the unlikely spin_trylock failure path rather than the unfair
logic there?
The trick is that we *must* first release this_rq before proceeding or
the new proposal doesn't work as intended.  This patch effectively
breaks up the this_rq->lock critical section evenly across all CPUs as
if it hit the case common for higher cpus.
I don't exactly see why my proposal would introduce any more latency,
because we only trylock while holding the existing lock -- this is will
only ever add a small ~constant time to the critical section, regardless
of whether it is a high or low CPU runqueue.
Its because we are trying to create a break in the critical section of
this_rq->lock, not improve the acquisition of busiest->lock.  So whether
you do spin_lock or spin_trylock on busiest does not matter.  Busiest
will not be contended in the case that I am concerned with.  If you use
my example below: rq[N] will not be contended because cpuN is blocked on
rq[0] after already having released rq[N].  So its the contention
against this_rq that is the problem.

Or am I missing your point completely?
Well my point is just that after my change, there may be some windows
of "unfair" behaviour where one CPU gets to go ahead early, but AFAIKS
now there is no systemic bias against higher numbered CPUs (except the
small issue of taking lowered numbered locks first which is also present
in any solution).

As such, I would actually like to see my proposal implemented in the
!lowlatency case as well... unless my reasoning has a hole in it?

But if you are _also_ wanting to eliminate the long lock hold time and
strictly improve fairness for lowlatency kernels, then I agree that your
patch goes much further than mine, so I don't object to putting that
under CONFIG_PREEMPT.

quoted
quoted
This modification decreased
latency by over 800% (went from > 400us to < 50us) on cpus 6 and 7 in my
8-way box namely because they were not forced to wait for all the other
lower cores to finish, but rather completions of double_lock_balance
were handled in true FIFO w.r.t. to other calls to
double_lock_balance().  It has to do with the positioning within your
FIFO ticket locks (though even if ticket locks are not present on a
given architecture we should still see an improvement.)

When a low cpu wants to double lock, it tends to hold this_rq and gets
in line for busiest_rq with no bearing on how long it held this_rq.
Therefore the following scenario can occur:

cpu 0                     cpu N
----------------------------------
rq[0] locked
..
..
..
                               double_lock(N, 0)
                               rq[N] released
                               blocked on rq[0]
..
..
..
..
double_lock(0, N)
rq[N] locked
double_lock returns
..
..
..
..
rq[0] released         rq[0] locked
                                  double_lock returns
                                  ...
                                  ...
                                  ...

---------------------------------

So double lock acquisition favors the lower cpus unfairly.  They will
always win, even if they were not first.  Now with the combination of my
patch plus your ticket locks, entry into the double lock becomes FIFO
because the "blocked on rq[0]" would have inserted it in the
time-ordered head of rq[0].
Right, but I don't think it is particularly wrong to allow a given
CPU to double_lock_balance ahead of another guy if we're already holding
the lock.
Its not "wrong".  Its just a latency source ;)
OK, sure.

quoted
 _So long as_ the lock we are trying to acquire is uncontended,
and we don't introduce this skewed unfairness due to lower CPUs being
allowed to hold their lower lock while higher CPUs have to release their
lock and first queue on the lower.

The difference is that with my patch, there is a small window where the
guy who asks for the double lock first will go through second. I don't
think this really adds a fundamental amount of latency, and the
performance benefit should not be ignored.

Linux's traditional and I suppose much largest user base does not require
realtime or really strict fairness, so IMO it is always questionable to
make changes like this.
Please take a look at the v2 series that I sent out yesterday.  I have
now predicated this on CONFIG_PREEMPT, per your comments.
Those patches look pretty OK with me now.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help