Thread: 85 messages, 8 authors, 2007-09-02

Re: Distributed storage.

From: Jens Axboe <hidden>
Date: 2007-08-13 09:12:35
Also in: linux-fsdevel, netdev

On Mon, Aug 13 2007, Daniel Phillips wrote:
On Monday 13 August 2007 00:28, Jens Axboe wrote:
On Sun, Aug 12 2007, Daniel Phillips wrote:
Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can
derive efficiently from BIO_POOL_IDX() provided the bio was
allocated in the standard way.
That would only be feasible, if we ruled that any bio in the system
must originate from the standard pools.
Not at all.
This leaves a little bit of clean up to do for bios not allocated
from a standard pool.
Please suggest how to do such a cleanup.
Easy, use the BIO_POOL bits to know the bi_max_size, the same as for a 
bio from the standard pool.  Just put the power of two size in the bits 
and map that number to the standard pool arrangement with a table 
lookup.
So reserve a bit that tells you how to interpret the (now) 3 remaining
bits. Doesn't sound very pretty, does it?
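
The lookup being argued over can be sketched in user space. This is a minimal model, not kernel code: it assumes the pool index occupies the top 4 bits of the flags word (which matches the "3 remaining bits" remark above), and it assumes the standard pools hold 1, 4, 16, 64, 128 and 256 vecs. The names POOL_BITS, pool_max_vecs, flags_with_pool_idx() and max_vecs_from_flags() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumption: the pool index lives in the top POOL_BITS bits of the flags word. */
#define POOL_BITS   4
#define POOL_OFFSET (sizeof(unsigned long) * CHAR_BIT - POOL_BITS)
#define NR_POOLS    6

/* Assumption: vec counts of the standard pools, indexed by pool number. */
static const unsigned short pool_max_vecs[NR_POOLS] = { 1, 4, 16, 64, 128, 256 };

/* Store a pool index in the top bits, keeping the low flag bits intact. */
unsigned long flags_with_pool_idx(unsigned long flags, unsigned long idx)
{
    unsigned long low_mask = (1UL << POOL_OFFSET) - 1;

    return (flags & low_mask) | (idx << POOL_OFFSET);
}

/* Recover the pool index from the flags word. */
unsigned long pool_idx(unsigned long flags)
{
    return flags >> POOL_OFFSET;
}

/* Table lookup: pool index to maximum number of vecs for that pool. */
unsigned int max_vecs_from_flags(unsigned long flags)
{
    unsigned long idx = pool_idx(flags);

    assert(idx < NR_POOLS); /* indices beyond the table are not modelled here */
    return pool_max_vecs[idx];
}
```

With these assumptions, a flags word carrying index 3 maps to 64 vecs. A bio that does not come from a standard pool has no natural index, which is where the extra interpretation bit would come in.
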
On the other hand, vm writeout deadlock ranks smack dab at the top
of the list, so that is where the patching effort must go for the
foreseeable future.  Without bio throttling, the ddsnap load can go
to 24 MB for struct bio alone.  That definitely moves the needle.
In short, we save 3,200 times more memory by putting decent
throttling in place than by saving an int in struct bio.
Then fix the damn vm writeout. I always thought it was silly to
depend on the block layer for any sort of throttling. If it's not a
system wide problem, then throttle the io count in the
make_request_fn handler of that problematic driver.
It is a system wide problem.  Every block device needs throttling, 
otherwise queues expand without limit.  Currently, block devices that 
use the standard request library get a slipshod form of throttling for 
free in the form of limiting in-flight request structs.  Because the 
amount of IO carried by a single request can vary by two orders of 
magnitude, the system behavior of this approach is far from 
predictable.
Is it? Consider just 10 standard sata disks. The next kernel revision
will have sg chaining support, so that allows 32MiB per request. Even if
we disregard reads (not so interesting in this discussion) and just look
at potentially pinned dirty data in a single queue, that number comes to
4GiB PER disk. Or 40GiB for 10 disks. Ouch.
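
The arithmetic behind those figures can be checked with a small sketch. The per-queue request count is not stated in the mail; 128 is an assumption here, chosen because 128 requests of 32MiB each gives exactly the quoted 4GiB. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 128 requests per queue; the mail does not state this number. */
enum { ASSUMED_REQUESTS_PER_QUEUE = 128 };

/* Worst-case pinned data in one queue, in MiB. */
uint64_t pinned_mib_per_queue(uint64_t requests, uint64_t mib_per_request)
{
    assert(requests > 0 && mib_per_request > 0);
    return requests * mib_per_request;
}

/* Worst-case pinned data across all disks, in GiB (one queue per disk). */
uint64_t pinned_gib_total(uint64_t disks, uint64_t requests,
                          uint64_t mib_per_request)
{
    return disks * pinned_mib_per_queue(requests, mib_per_request) / 1024;
}
```

Calling pinned_mib_per_queue(ASSUMED_REQUESTS_PER_QUEUE, 32) yields 4096 MiB, i.e. 4GiB per disk, and pinned_gib_total(10, ASSUMED_REQUESTS_PER_QUEUE, 32) yields 40GiB for the ten disks.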

So I still think that this throttling needs to happen elsewhere, you
cannot rely on the block layer throttling globally or for a single device.
It just doesn't make sense.
You did not comment on the one about putting the bio destructor in
the ->endio handler, which looks dead simple.  The majority of
cases just use the default endio handler and the default
destructor.  Of the remaining cases, where a specialized destructor
is needed, typically a specialized endio handler is too, so
combining is free.  There are few if any cases where a new
specialized endio handler would need to be written.
We could do that without too much work, I agree.
OK, we got one and another is close to cracking, enough of that.
No we did not, I already failed this one in the next mail.
As far as code stability goes, current kernels are horribly
unstable in a variety of contexts because of memory deadlock and
slowdowns related to the attempt to fix the problem via dirty
memory limits.  Accurate throttling of bio traffic is one of the
two key requirements to fix this instability, the other is
accurate writeout path reserve management, which is only partially
addressed by BIO_POOL.
Which, as written above and stated many times over the years on lkml,
is not a block layer issue imho.
Whoever stated that was wrong, but this should be no surprise.  There 
have been many wrong things said about this particular bug over the 
years.  The one thing that remains constant is, Linux continues to 
deadlock under a variety of loads both with and without network 
involvement, making it effectively useless as a storage platform.

These deadlocks are, first and foremost, block layer deficiencies.  Even 
the network becomes part of the problem only because it lies in the 
block IO path.
The block layer has NEVER guaranteed throttling, so it can - by
definition - not be a block layer deficiency.

-- 
Jens Axboe