Re: RAID1 round robin read support in md?

From: Roberto Spadim <hidden>
Date: 2011-12-05 12:28:54

Check this old thread (kernel 2.6.37, I think):
http://www.spadim.com.br/raid1/
http://www.issociate.de/board/post/507493/raid1_new_read_balance,_first_test,_some_doubt,_can_anyone_help?.html


But you will not get a very big gain in read speed.
raid1 is for multi-threaded workloads;
raid10 far is for fewer threads with more sequential reads.
Check which is better for you.

The code linked at the top of this email implements a round robin. If
you switch devices on every read, you will have a slow md device;
switching only after 10, 100, or 1000 reads is better (I think the
problem is CPU use or something else... this must be checked with
iostat and other statistics tools).
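
A minimal sketch of that idea in Python, just to show the policy (this
is not the kernel patch from the links above; the class name and the
reads_per_switch parameter are made up for illustration):

```python
class RoundRobinReadBalancer:
    """Pick a mirror for each read, moving to the next mirror only
    after `reads_per_switch` reads (hypothetical tuning knob)."""

    def __init__(self, n_mirrors, reads_per_switch=100):
        if n_mirrors < 1 or reads_per_switch < 1:
            raise ValueError("n_mirrors and reads_per_switch must be >= 1")
        self.n_mirrors = n_mirrors
        self.reads_per_switch = reads_per_switch
        self.current = 0           # mirror currently serving reads
        self.reads_on_current = 0  # reads sent to it since the last switch

    def next_mirror(self):
        # Rotate only once the current mirror has served its quota.
        if self.reads_on_current >= self.reads_per_switch:
            self.current = (self.current + 1) % self.n_mirrors
            self.reads_on_current = 0
        self.reads_on_current += 1
        return self.current


balancer = RoundRobinReadBalancer(n_mirrors=2, reads_per_switch=3)
print([balancer.next_mirror() for _ in range(8)])
```

With reads_per_switch=1 this degenerates to switching on every read,
which is the slow case described above.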

2011/12/5 Doug Dumitru [off-list ref]

What you are seeing is very SSD specific.

With rotating media, it is very important to intentionally stay on one
disk even if it leaves other mirrors quiet.  Rotating disks do "in the
drive" read-ahead and take advantage of the heads being on the correct
track, so streaming straight-line reads are efficient.

With SSDs in an array, things are very different.  Drives don't really
read ahead at all (actually they do, but this is more of a side effect
of error correction than performance tuning, and the lengths are
short).  If your application is spitting out 4MB read requests, they
get cut into 512K (1024 sector) bio calls, and sent to a single drive
if they are linear.  Because the code is optimized for HDDs, future
linear calls should go to the same drive because an HDD is very likely
to have at least some of the read sectors in the read-ahead cache.
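
As a rough illustration of the arithmetic (a Python sketch, not the
block-layer code; the function name is hypothetical and 512-byte
sectors are assumed), a 4MB request cut at 1024 sectors yields 8
pieces of 512K each:

```python
SECTOR_SIZE = 512        # bytes per sector (assumed)
MAX_BIO_SECTORS = 1024   # 512K per bio, as described above


def split_into_bios(start_sector, length_bytes, max_sectors=MAX_BIO_SECTORS):
    """Return (start_sector, n_sectors) pieces covering the request."""
    if length_bytes % SECTOR_SIZE:
        raise ValueError("length must be a whole number of sectors")
    remaining = length_bytes // SECTOR_SIZE
    sector = start_sector
    pieces = []
    while remaining > 0:
        n = min(remaining, max_sectors)
        pieces.append((sector, n))
        sector += n
        remaining -= n
    return pieces


bios = split_into_bios(0, 4 * 1024 * 1024)
print(len(bios))   # 8
print(bios[:2])    # [(0, 1024), (1024, 1024)]
```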

A different algorithm for SSDs would be better, but one concern is
that this might slow down short read requests in a multi-threaded
environment.  Actually managing a mix intelligently is probably best
started with a Google literature search for SSD scheduling papers.  I
suspect that UCSD's super-computing department might have done some
work in this area.

With the same data available from two drives, for low thread count
applications, it might be better to actually cut up the inbound
requests into even smaller chunks, and send them in parallel to the
drives.  A quick test on a Crucial C300 shows the following transfer
rates at different block sizes.

Block size  Transfer rate
512K        319 MB/sec
256K        299 MB/sec
128K        298 MB/sec
64K         287 MB/sec
32K         275 MB/sec

This is with a single 'dd' process and 'iflag=direct' bypassing linux
read-ahead and buffer caching.  The test was only a second long or so,
so the noise could be quite high.  Also, C300s may behave very
differently with this workload than other drives, so you have to test
each type of disk.
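
To put those numbers in proportion, here is a small Python calculation
of each rate relative to the 512K figure. The last line is only an
idealized upper bound that assumes two mirrors scale perfectly in
parallel; that assumption is untested here.

```python
# Measured single-drive rates from the quick C300 test above (MB/sec),
# keyed by block size in KB.
rates_mb_s = {512: 319, 256: 299, 128: 298, 64: 287, 32: 275}

baseline = rates_mb_s[512]
relative = {size: round(100 * rate / baseline, 1)
            for size, rate in rates_mb_s.items()}

for size, pct in relative.items():
    print(f"{size}K: {pct}% of the 512K rate")

# ASSUMPTION: perfect scaling across two mirrors (not measured).
ideal_two_mirrors_32k = 2 * rates_mb_s[32]
print(f"ideal two-mirror rate at 32K: {ideal_two_mirrors_32k} MB/sec")
```

Even at 32K a single drive keeps about 86% of its 512K rate, which is
why splitting across mirrors looks attractive on paper.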

What this implies is that if the md raid-1 layer "were to be" SSD
aware, it should consider cutting up long requests and keeping all
drives busy.  The logic would be something like:

* If any request is >= 32K, split it into 'n' parts, and issue them
in parallel.
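
That rule could be sketched in Python roughly as follows. This is only
an illustration of the splitting logic, not md code: the function name,
the 512-byte sector size, and sending short requests to mirror 0 are
all assumptions made for the example.

```python
SECTOR_SIZE = 512              # bytes per sector (assumed)
MIN_SPLIT_BYTES = 32 * 1024    # the 32K threshold from the rule above


def split_across_mirrors(offset, length, n_mirrors,
                         min_split=MIN_SPLIT_BYTES):
    """Return [(mirror_index, offset, length), ...] for one read."""
    if n_mirrors < 1:
        raise ValueError("need at least one mirror")
    if offset % SECTOR_SIZE or length % SECTOR_SIZE or length <= 0:
        raise ValueError("offset and length must be sector aligned")
    # Short requests are not worth splitting; a real implementation
    # would pick the least busy mirror here instead of mirror 0.
    if length < min_split or n_mirrors == 1:
        return [(0, offset, length)]
    base, extra = divmod(length // SECTOR_SIZE, n_mirrors)
    parts = []
    pos = offset
    for mirror in range(n_mirrors):
        # Spread leftover sectors over the first `extra` mirrors.
        n_sectors = base + (1 if mirror < extra else 0)
        if n_sectors == 0:
            continue
        parts.append((mirror, pos, n_sectors * SECTOR_SIZE))
        pos += n_sectors * SECTOR_SIZE
    return parts


print(split_across_mirrors(0, 512 * 1024, n_mirrors=2))
# [(0, 0, 262144), (1, 262144, 262144)]
```

The parts stay sector aligned and contiguous, so the results can be
reassembled in order once all mirrors complete.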

This would be best implemented "down low" in the md stack.
Unfortunately, the queuing where requests are collated happens below
md completely (I think), so there is no easy point to insert this.

The idea of round-robin scheduling the requests is probably a little
off-base.  The important part is, with SSDs, to cut up the requests
into smaller sizes, and push them in parallel.  A round-robin might
trick the scheduler into this sometimes, but is probably only an
edge-case solution.

This same logic applies to raid-0, raid-5/6, and raid-10 arrays.  With
HDDs it is often more efficient to keep the stripe size large so that
individual in-drive read-ahead is exploited.  With SSDs, smaller
stripes are often better (at least on reads) because they tend to keep
all of the drives busy.
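
A small Python sketch of why the stripe size matters, assuming a
simplified raid-0 layout where chunk i lives on drive i mod n (the
function name is hypothetical, and real layouts for raid-5/6 and
raid-10 differ):

```python
def drives_touched(offset, length, chunk_size, n_drives):
    """Return the set of drive indexes a read of `length` bytes at
    `offset` touches, under the simplified layout described above."""
    if length <= 0 or chunk_size <= 0 or n_drives < 1:
        raise ValueError("length, chunk_size and n_drives must be positive")
    first_chunk = offset // chunk_size
    last_chunk = (offset + length - 1) // chunk_size
    return {chunk % n_drives for chunk in range(first_chunk, last_chunk + 1)}


KB = 1024
# The same 128K read on a 4-drive array with two different chunk sizes.
print(drives_touched(0, 128 * KB, chunk_size=512 * KB, n_drives=4))  # {0}
print(drives_touched(0, 128 * KB, chunk_size=32 * KB, n_drives=4))   # {0, 1, 2, 3}
```

With the large chunk the whole read lands on one drive; with the small
chunk all four drives get a piece of it.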

Now it is important to note that this discussion is 100% about reads.
SSD writes are a much more complicated animal.

--
Doug Dumitru
EasyCo LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Roberto Spadim
Spadim Technology / SPAEmpresarial
