Thread (61 messages) 61 messages, 7 authors, 2011-04-20

Re: [PATCH 05/10] block: remove per-queue plugging

From: Dave Chinner <david@fromorbit.com>
Date: 2011-04-12 13:31:28
Also in: dm-devel, lkml

On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
On 2011-04-12 14:41, Dave Chinner wrote:
quoted
On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
quoted
On 2011-04-12 14:22, Dave Chinner wrote:
quoted
On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
quoted
On 2011-04-12 03:12, hch@infradead.org wrote:
quoted
On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
quoted
Great, once you do that and XFS kills the blk_flush_plug() calls too,
then we can remove that export and make it internal only.
Linus pulled the tree, so they are gone now.  Btw, there's still some
bits in the area that confuse me:
Great!
quoted
 - what's the point of the queue_sync_plugs?  It has a lot of comment
   that seem to pre-data the onstack plugging, but except for that
   it's trivial wrapper around blk_flush_plug, with an argument
   that is not used.
There's really no point to it anymore. It's existance was due to the
older revision that had to track write requests for serializaing around
a barrier. I'll kill it, since we don't do that anymore.
quoted
 - is there a good reason for the existance of __blk_flush_plug?  You'd
   get one additional instruction in the inlined version of
   blk_flush_plug when opencoding, but avoid the need for chained
   function calls.
 - Why is having a plug in blk_flush_plug marked unlikely?  Note that
   unlikely is the static branch prediction hint to mark the case
   extremly unlikely and is even used for hot/cold partitioning.  But
   when we call it we usually check beforehand if we actually have
   plugs, so it's actually likely to happen.
The existance and out-of-line is for the scheduler() hook. It should be
an unlikely event to schedule with a plug held, normally the plug should
have been explicitly unplugged before that happens.
Though if it does, haven't you just added a significant amount of
depth to the worst case stack usage? I'm seeing this sort of thing
from io_schedule():

        Depth    Size   Location    (40 entries)
        -----    ----   --------
  0)     4256      16   mempool_alloc_slab+0x15/0x20
  1)     4240     144   mempool_alloc+0x63/0x160
  2)     4096      16   scsi_sg_alloc+0x4c/0x60
  3)     4080     112   __sg_alloc_table+0x66/0x140
  4)     3968      32   scsi_init_sgtable+0x33/0x90
  5)     3936      48   scsi_init_io+0x31/0xc0
  6)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
  7)     3856     112   sd_prep_fn+0x150/0xa90
  8)     3744      48   blk_peek_request+0x6a/0x1f0
  9)     3696      96   scsi_request_fn+0x60/0x510
 10)     3600      32   __blk_run_queue+0x57/0x100
 11)     3568      80   flush_plug_list+0x133/0x1d0
 12)     3488      32   __blk_flush_plug+0x24/0x50
 13)     3456      32   io_schedule+0x79/0x80

(This is from a page fault on ext3 that is doing page cache
readahead and blocking on a locked buffer.)

I've seen traces where mempool_alloc_slab enters direct reclaim
which adds another 1.5k of stack usage to this path. So I'm
extremely concerned that you've just reduced the stack available to
every thread by at least 2.5k of space...
Yeah, that does not look great. If this turns out to be problematic, we
can turn the queue runs from the unlikely case into out-of-line from
kblockd.

But this really isn't that new, you could enter the IO dispatch path
when doing IO already (when submitting it). So we better be able to
handle that.
The problem I see is that IO is submitted when there's plenty of
stack available whould have previously been fine. However now it
hits the plug, and then later on after the thread consumes a lot
more stack it, say, waits for a completion. We then schedule, it
unplugs the queue and we add the IO stack to a place where there
isn't much space available.

So effectively we are moving the places where stack is consumed
about, and it's complete unpredictable where that stack is going to
land now.
Isn't that example fairly contrived?
I don't think so. e.g. in the XFS allocation path we do btree block
readahead, then go do the real work. The real work can end up with a
deeper stack before blocking on locks or completions unrelated to
the readahead, leading to schedule() being called and an unplug
being issued at that point.  You might think it contrived, but if
you can't provide a guarantee that it can't happen then I have to
assume it will happen.

My concern is that we're already under stack space stress in the
writeback path, so anything that has the potential to increase it
significantly is a major worry from my point of view...
If we ended up doing the IO
dispatch before, then the only difference now is the stack usage of
schedule() itself. Apart from that, as far as I can tell, there should
not be much difference.
There's a difference between IO submission and IO dispatch. IO
submission is submit_bio thru to the plug; IO dispatch is from the
plug down to the disk. If they happen at the same place, there's no
problem. If IO dispatch is moved to schedule() via a plug....
quoted
quoted
If it's a problem from the schedule()/io_schedule() path, then
lets ensure that those are truly unlikely events so we can punt
them to kblockd.
Rather than wait for an explosion to be reported before doing this,
why not just punt unplugs to kblockd unconditionally?
Supposedly it's faster to do it inline rather than punt the dispatch.
But that may actually not be true, if you have multiple plugs going (and
thus multiple contenders for the queue lock on dispatch). So lets play
it safe and punt to kblockd, we can always revisit this later.
It's always best to play it safe when it comes to other peoples
data....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help