Thread: 11 messages, 5 authors, 2012-08-16

Re: O_DIRECT to md raid 6 is slow

From: Andy Lutomirski <luto@amacapital.net>
Date: 2012-08-15 22:11:08
Also in: lkml

On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner [off-list ref] wrote:
On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
[off-list ref] wrote:
On 15/08/2012 01:49, Andy Lutomirski wrote:
If I do:
# dd if=/dev/zero of=/dev/md0p1 bs=8M
[...]
It looks like md isn't recognizing that I'm writing whole stripes when
I'm in O_DIRECT mode.

I see your md device is partitioned. Is the partition itself stripe-aligned?

Crud.

md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
      11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
[6/6] [UUUUUU]

IIUC this means that I/O should be aligned on 2MB boundaries (512k
chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
(i.e. 1MB) boundary.
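
The alignment arithmetic above can be checked with a short sketch. It assumes 512-byte logical sectors, and the function names are illustrative only; they are not part of md, mdadm, or gdisk.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def full_stripe_bytes(chunk_bytes, total_disks, parity_disks=2):
    """Data bytes in one full stripe; RAID6 spends two chunks per stripe on parity."""
    return chunk_bytes * (total_disks - parity_disks)


def is_stripe_aligned(start_sector, stripe_bytes, sector_bytes=SECTOR_BYTES):
    """True if a partition starting at start_sector begins on a full-stripe boundary."""
    return (start_sector * sector_bytes) % stripe_bytes == 0


# Values from the /proc/mdstat output above: 6 disks, 512k chunk, RAID6.
stripe = full_stripe_bytes(512 * 1024, total_disks=6)
print(stripe // (1024 * 1024), "MiB of data per full stripe")
print("start at sector 2048 aligned:", is_stripe_aligned(2048, stripe))
print("start at sector 4096 aligned:", is_stripe_aligned(4096, stripe))
```

With these numbers a partition starting at sector 2048 sits halfway into a stripe, while a start at sector 4096 (or any multiple of it) would land on a stripe boundary.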

It's time to blow away the array and start over.  You're already
misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
except for a handful of niche all-streaming workloads with little or no
rewrite, such as video surveillance or DVR workloads.

Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
Deleting a single file changes only a few bytes of directory metadata.
With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
modify the directory block in question, calculate parity, then write out
3MB of data to rust.  So you consume 6MB of bandwidth to write less than
a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
a few bytes of metadata.  Yes, insane.
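
The bandwidth figures above follow from a simple cost model, sketched here. It only reproduces the arithmetic in this message (read the whole stripe across all drives, then write it all back); it is not a measurement, and md may take a cheaper path for a small write. The function name is hypothetical.

```python
MIB = 1024 * 1024


def full_stripe_rmw_bytes(chunk_bytes, total_disks):
    """Bytes moved if a small write reads and rewrites one chunk on every drive."""
    stripe_on_disk = chunk_bytes * total_disks  # data chunks plus parity chunks
    return 2 * stripe_on_disk                   # one full read plus one full write


for disks in (6, 12):
    moved = full_stripe_rmw_bytes(512 * 1024, disks)
    print(f"{disks} drives, 512KiB chunk: {moved // MIB} MiB moved per small write")
```

Under this model the cost scales linearly with both chunk size and drive count, which is why a smaller chunk reduces the penalty for small metadata updates.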

Grr.  I thought the bad old days of filesystem and related defaults
sucking were over.  cryptsetup aligns sanely these days, xfs is
sensible, etc.  wtf?  <rant>Why is there no sensible filesystem for
huge disks?  zfs can't cp --reflink and has all kinds of source
availability and licensing issues, xfs can't dedupe at all, and btrfs
isn't nearly stable enough.</rant>

Anyhow, I'll try the patch from Wu Fengguang.  There's still a bug here...

--Andy