Thread (85 messages) 85 messages, 8 authors, 2007-09-02

Re: [1/1] Block device throttling [Re: Distributed storage.]

From: Daniel Phillips <hidden>
Date: 2007-08-28 21:08:36

On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips (phillips@phunq.net) wrote:
quoted
quoted
We do not care about one cpu being able to increase its counter
higher than the limit, such inaccuracy (maximum bios in flight
thus can be more than limit, difference is equal to the number of
CPUs - 1) is a price for removing atomic operation. I thought I
pointed it in the original description, but might forget, that if
it will be an issue, that atomic operations can be introduced
there. Any uber-precise measurements in the case when we are
close to the edge will not give us any benefit at all, since were
are already in the grey area.
This is not just inaccurate, it is suicide.  Keep leaking throttle
counts and eventually all of them will be gone.  No more IO
on that block device!
First, because number of increased and decreased operations are the
same, so it will dance around limit in both directions.
No.  Please go and read it the description of the race again.  A count
gets irretrievably lost because the write operation of the first
decrement is overwritten by the second. Data gets lost.  Atomic 
operations exist to prevent that sort of thing.  You either need to use 
them or have a deep understanding of SMP read and write ordering in 
order to preserve data integrity by some equivalent algorithm.
Let's solve problems in order of their appearence. If bio structure
will be allowed to grow, then the whole patches can be done better.
How about like the patch below.  This throttles any block driver by
implementing a throttle metric method so that each block driver can
keep track of its own resource consumption in units of its choosing.
As an (important) example, it implements a simple metric for device
mapper devices.  Other block devices will work as before, because
they do not define any metric.  Short, sweet and untested, which is
why I have not posted it until now.

This patch originally kept its accounting info in backing_dev_info,
however that structure seems to be in some and it is just a part of
struct queue anyway, so I lifted the throttle accounting up into
struct queue.  We should be able to report on the efficacy of this
patch in terms of deadlock prevention pretty soon.
--- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
+		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(&q->available, need);
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/drivers/md/dm.c	2007-08-24 12:14:23.000000000 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	atomic_set(&md->queue->available, md->queue->capacity = 1000);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/fs/bio.c	2007-08-24 12:10:41.000000000 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
 		bytes_done = bio->bi_size;
 	}
 
-	bio->bi_size -= bytes_done;
+	if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+		struct request_queue *q = bio->queue;
+		atomic_add(&q->available, bio->bi_reserved);
+		bio->bi_reserved = 0; /* just in case */
+		wake_up(&q->throttle_wait);
+	}
 	bio->bi_sector += (bytes_done >> 9);
 
 	if (bio->bi_end_io)
--- 2.6.22.clean/include/linux/bio.h	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/include/linux/bio.h	2007-08-24 11:53:51.000000000 -0700
@@ -109,6 +109,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
 
+	struct request_queue	*queue;		/* for throttling */
+	unsigned		bi_reserved;	/* throttle metric */
+
 	void			*bi_private;
 
 	bio_destructor_t	*bi_destructor;	/* destructor */
--- 2.6.22.clean/include/linux/blkdev.h	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/include/linux/blkdev.h	2007-08-24 12:04:14.000000000 -0700
@@ -395,6 +395,10 @@ struct request_queue
 	struct work_struct	unplug_work;
 
 	struct backing_dev_info	backing_dev_info;
+	unsigned (*metric)(struct bio *bio);	/* bio throttle metric */
+	wait_queue_head_t	throttle_wait;
+	atomic_t		available;
+	unsigned		capacity;
 
 	/*
 	 * The queue owner gets to use this for whatever they like.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help