Thread (51 messages) 51 messages, 8 authors, 2008-09-04

Re: [PATCH 2/5] sched: pull only one task during NEWIDLE balancing to limit critical section

From: Gregory Haskins <hidden>
Date: 2008-08-27 11:53:28
Also in: lkml

Nick Piggin wrote:
On Tuesday 26 August 2008 21:36, Gregory Haskins wrote:
  
quoted
Nick Piggin wrote:
    
quoted
On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
      
quoted
git-id c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092 attempted to limit
newidle critical section length by stopping after at least one task
was moved.  Further investigation has shown that there are other
paths nested further inside the algorithm which still remain that allow
long latencies to occur with newidle balancing.  This patch applies
the same technique inside balance_tasks() to limit the duration of
this optional balancing operation.

Signed-off-by: Gregory Haskins <redacted>
CC: Nick Piggin <redacted>
        
Hmm, this (andc4acb2c0669c5c5c9b28e9d02a34b5c67edf7092) still could
increase the amount of work to do significantly for workloads where
the CPU is going idle and pulling tasks over frequently. I don't
really like either of them too much.
      
I had a feeling you may object to this patch based on your comments on
the first one.  Thats why I CC'd you so you wouldnt think I was trying
to sneak something past ;)
    
Appreciated.


  
quoted
quoted
Maybe increasing the limit would effectively amortize most of the
problem (say, limit to move 16 tasks at most).
      
The problem I was seeing was that even moving 2 was too many in the
ftraces traces I looked at.  I think the idea of making a variable limit
(set via a sysctl, etc) here is a good one, but I would recommend we
have the default be "1" for CONFIG_PREEMPT (or at least
CONFIG_PREEMPT_RT) based on what I know right now.   I know last time
you objected to any kind of special cases for the preemptible kernels,
but I think this is a good compromise.  Would this be acceptable?
    
Well I _prefer_ not to have a special case for preemptible kernels, but
we already have similar arbitrary kind of changes like in tlb flushing,
so...

I understand and accept there are some places where fundamentally you
have to trade latency for throughput, so at some point we have to have a
config and/or sysctl for that.

I'm surprised 2 is too much but 1 is OK. Seems pretty fragile to me.
Its not that 1 is magically "ok".  Its simply that newidle balancing
hurts latency, and 1 is the minimum to pull to reasonably reduce the
critical section.  I already check if we NEEDS_RESCHED before taking the
rq->lock in newidle, so waiting for one task to pull is the first
opportunity I have to end the section as quickly as possible.  It would
be nice if I could just keep going if  I could detect whether there was 
not any real contention.  Let me  give this angle some more thought.
 Are
you just running insane tests that load up the runqueues heaps and tests
latency? -rt users will have to understand that some algorithms scale
linearly or so with the number of a particular resource allocated, so
they aren't going to get a constant low latency under arbitrary
conditions.

FWIW, if you haven't already, then for -rt you might want to look at a
more advanced data structure than simple run ordered list for moving tasks
from one rq to the other. A simple one I was looking at is a time ordered
list to pull the most cache cold tasks (and thus we can stop searching
when we encounter the first cache hot task, in situations where it is
appropriate, etc).
  
Im not sure I follow your point, but if I do note that the RT scheduler
uses a completely different load balancer (that is priority ordered).

Anyway... yeah I'm OK with this if it is under a config option.
  
Cool..  See v2 ;)

Thanks Nick,
-Greg

Attachments

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help