Re: O_DIRECT to md raid 6 is slow

From: Stan Hoeppner <hidden>
Date: 2012-08-20 04:44:25

I'm copying Dave C. as he apparently misunderstood the behavior of
md/RAID6 as well.  My statement was based largely on Dave's information.
See [1] below.

On 8/19/2012 7:01 PM, NeilBrown wrote:

On Sun, 19 Aug 2012 18:34:28 -0500 Stan Hoeppner [off-list ref]
wrote:

Since we are trying to set the record straight....

Thank you for finally jumping in, Neil.  I had hoped to see your
authoritative information sooner.

md/RAID6 must read all data devices (i.e. not parity devices) which it is not
going to write to, in an RMW cycle (which the code actually calls RCW -
reconstruct-write).

md/RAID5 uses an alternate mechanism when the number of data blocks that need
to be written is less than half the number of data blocks in a stripe.  In
this alternate mechanism (which the code calls RMW - read-modify-write),
md/RAID5 reads all the blocks that it is about to write to, plus the parity
block.  It then computes the new parity and writes it out along with the new
data.
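
As a rough illustration of the two strategies described above, here is a
minimal Python sketch. It is based only on the description in this message,
not on the md source, and it assumes simple XOR parity as in RAID5. All
function names (xor_blocks, rcw_reads, rmw_reads, raid5_plan) are
hypothetical. It counts the blocks each strategy would read and checks that
both strategies arrive at the same parity block.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks column by column (single XOR parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rcw_reads(n_data, n_write):
    """Reconstruct-write: read every data block that is NOT being written."""
    return n_data - n_write

def rmw_reads(n_write):
    """Read-modify-write: read the blocks being written, plus the parity block."""
    return n_write + 1

def raid5_plan(n_data, n_write):
    """Hypothetical chooser following the rule quoted above: RMW when fewer
    than half of the stripe's data blocks are being written, otherwise RCW."""
    if n_write < n_data / 2:
        return "RMW", rmw_reads(n_write)
    return "RCW", rcw_reads(n_data, n_write)

# Toy stripe with four data blocks; block 1 is overwritten.
old = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"]
old_parity = xor_blocks(old)
new_block = b"\x77\x88"
new = [old[0], new_block, old[2], old[3]]

parity_rcw = xor_blocks(new)                              # recompute from all data
parity_rmw = xor_blocks([old_parity, old[1], new_block])  # patch the old parity
assert parity_rcw == parity_rmw

print(raid5_plan(4, 1))   # one of four data blocks written
print(raid5_plan(4, 3))   # three of four data blocks written
```

In both cases only the changed data and the recomputed parity need to be
written back; the strategies differ only in what has to be read first.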

[1] The only thing that's not clear at this point is if md/RAID6 also
always writes back all chunks during RMW, or only the chunk that has
changed.

Do you seriously imagine anyone would write code to write out data which it
is known has not changed?  Sad. :-)

From a performance standpoint, absolutely not.  Though I wouldn't be
surprised if there are a few parity RAID implementations out there that
do always write a full stripe for other reasons, such as catching media
defects as early as possible, i.e. those occasions where bits in a
sector may read just fine but can't be re-magnetized.  I'm not
championing such an idea, merely stating that others may use this method
for this or other reasons.


[1]
On 6/25/2012 9:30 PM, Dave Chinner wrote:

You can't, simple as that. The maximum supported is 256k. As it is,
a default chunk size of 512k is probably harmful to most workloads -
large chunk sizes mean that just about every write will trigger a
RMW cycle in the RAID because it is pretty much impossible to issue
full stripe writes. Writeback doesn't do any alignment of IO (the
generic page cache writeback path is the problem here), so we will
almost always be doing unaligned IO to the RAID, and there will be
little opportunity for sequential IOs to merge and form full stripe
writes (24 disks @ 512k each on RAID6 is an 11MB full stripe write).

IOWs, every time you do a small isolated write, the MD RAID volume
will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
Given that most workloads are not doing lots and lots of large
sequential writes this is, IMO, a pretty bad default given typical
RAID5/6 volume configurations we see....
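
The figures in the message quoted above follow from the array geometry, and
they can be compared with what Neil's description implies for a single-chunk
update. The Python sketch below is only arithmetic under stated assumptions:
it counts I/O in whole chunks (a simplification; md's real I/O granularity
may be smaller), assumes a RAID6 reconstruct-write reads every data chunk it
is not writing, and assumes it writes back only the changed chunk plus the
two parity chunks. The function names are hypothetical.

```python
def stripe_geometry(n_disks, chunk_kib, n_parity):
    """Return (data KiB, total KiB) covered by one full stripe."""
    n_data = n_disks - n_parity
    return n_data * chunk_kib, n_disks * chunk_kib

def rcw_single_chunk_io(n_disks, chunk_kib, n_parity):
    """Return (KiB read, KiB written) when one data chunk is updated via
    reconstruct-write, counted in whole chunks (simplifying assumption)."""
    n_data = n_disks - n_parity
    read_kib = (n_data - 1) * chunk_kib      # all data chunks not being written
    write_kib = (1 + n_parity) * chunk_kib   # changed chunk plus parity chunks
    return read_kib, write_kib

data_kib, total_kib = stripe_geometry(24, 512, 2)
print(data_kib / 1024, total_kib / 1024)     # full-stripe data and total, in MiB

read_kib, write_kib = rcw_single_chunk_io(24, 512, 2)
print(read_kib / 1024, write_kib / 1024)     # single-chunk update, in MiB
```

For 24 disks with 512k chunks this reproduces the 11MB of data per stripe
(12MB including parity). Under the assumptions above, a single-chunk update
reads about 10.5MB but writes only 1.5MB rather than the whole stripe.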


-- 
Stan
