Thread: 130 messages, 15 authors, 2013-04-17

Re: RAID performance - 5x SSD RAID5 - effects of stripe cache sizing

From: Adam Goryachev <hidden>
Date: 2013-03-05 15:53:29

On 05/03/13 20:30, Stan Hoeppner wrote:
On 3/4/2013 10:26 AM, Adam Goryachev wrote:
Whatever value you choose, make it permanent by adding this entry to
root's crontab:

@reboot		/bin/echo 32768 > /sys/block/md0/md/stripe_cache_size
Already added to /etc/rc.local along with the config to set the deadline
scheduler for each of the RAID drives.
You should be using noop for SSD, not deadline.  noop may improve your
FIO throughput, and real workload, even further.
OK, done now...
Also, did you verify with a reboot that stripe_cache_size is actually
being set correctly at startup?  If it's not working as assumed you'll
be losing several hundred MB/s of write throughput at the next reboot.
Something this critical should always be tested and verified.
Will do, thanks for the nudge...
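One way to do that check after a reboot is a small script that reads the sysfs value back and compares it with the intended setting. This is only a sketch: the function name check_stripe_cache is made up for illustration, the path is the md0 one from the crontab line above, and the expected value is whatever you settled on.

```python
from pathlib import Path

# Path taken from the crontab line in this thread; adjust for another md device.
SYSFS_PATH = "/sys/block/md0/md/stripe_cache_size"


def check_stripe_cache(expected, path=SYSFS_PATH):
    """Return (ok, actual), where actual is the value currently in sysfs."""
    actual = int(Path(path).read_text().strip())
    return actual == expected, actual
```

Calling check_stripe_cache(32768) on the storage server after boot would return (False, 256) if the startup entry silently failed and the kernel default was still in place.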
stripe_cache_size = 4096
   READ: io=131072MB, aggrb=2504MB/s, minb=2564MB/s, maxb=2564MB/s, mint=52348msec, maxt=52348msec
  WRITE: io=131072MB, aggrb=1590MB/s, minb=1628MB/s, maxb=1628MB/s, mint=82455msec, maxt=82455msec
Wow, we're up to 1.6 GB/s data throughput, 2 GB/s total md device
throughput.  That's 407MB/s per SSD.  This is much more in line with what
one would expect from a RAID5 using 5 large, fast SandForce SSDs.  This
is 80% of the single drive streaming write throughput of this SSD model,
as tested by Anandtech, Tom's, and others.
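The arithmetic behind those figures can be sketched as follows. It assumes full-stripe writes on a 5-drive RAID5 (4 data chunks plus 1 parity chunk per stripe, load spread evenly) and uses the 1628MB/s figure from the fio WRITE line; the function name is hypothetical.

```python
def raid5_write_breakdown(data_mb_s, n_drives):
    """Split a RAID5 full-stripe write rate into total and per-drive rates.

    Assumes every stripe carries (n_drives - 1) data chunks plus one parity
    chunk, and that all drives see the same share of the load.
    """
    data_drives = n_drives - 1
    total = data_mb_s * n_drives / data_drives
    per_drive = total / n_drives
    return total, per_drive


total, per_drive = raid5_write_breakdown(1628, 5)
print(f"total {total:.0f} MB/s, per drive {per_drive:.0f} MB/s")
# prints: total 2035 MB/s, per drive 407 MB/s
```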

I'm a bit surprised we're achieving 2 GB/s parity write throughput with
the single threaded RAID5 driver on one core.  Those 3.3GHz Ivy Bridge
cores are stouter than I thought.  Disabling HT probably helped a bit
here.  I'm anxious to see the top output file for this run (if you made
one--you should for each and every FIO run).  Surely we're close to
peaking the core here.
I'll run some more tests on the box soon, and make sure to collect the
top outputs for each run. Will email the lot when done. (See below why
there will be some delay).
stripe_cache_size = 8192
   READ: io=131072MB, aggrb=2487MB/s, minb=2547MB/s, maxb=2547MB/s, mint=52697msec, maxt=52697msec
  WRITE: io=131072MB, aggrb=1521MB/s, minb=1557MB/s, maxb=1557MB/s, mint=86188msec, maxt=86188msec
Interesting.  4096/8192 are both higher by ~300MB/s compared to the
previous 1292MB/s you posted for 8192.  Some other workload must have
been active during the previous run, or something else has changed.
Every run I took in this email was actually done twice, and I used the
larger result in the email (since we are trying to compare max
performance). However, I'm pretty sure the two runs were very similar in
results (less than 6MB/s difference).... I thought that maybe I should
have averaged the results, or run more tests, but really, I'm not that
seriously benchmarking to sell the stuff, I just need to know which one
worked best...
stripe_cache_size = 16384
   READ: io=131072MB, aggrb=2494MB/s, minb=2554MB/s, maxb=2554MB/s, mint=52556msec, maxt=52556msec
  WRITE: io=131072MB, aggrb=1368MB/s, minb=1401MB/s, maxb=1401MB/s, mint=95779msec, maxt=95779msec
stripe_cache_size = 32768
   READ: io=131072MB, aggrb=2489MB/s, minb=2549MB/s, maxb=2549MB/s, mint=52661msec, maxt=52661msec
  WRITE: io=131072MB, aggrb=1138MB/s, minb=1165MB/s, maxb=1165MB/s, mint=115209msec, maxt=115209msec
This is why you test, and test, and test when tuning for performance.
4096 seems to be your sweet spot.
Yep, I ran those tests a lot more times (4096, 8192 and 16384) to try
and see if it was an anomaly, or some other strange effect...
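With this many runs, picking the winner by eye gets tedious. A rough sketch of pulling aggrb out of the fio summary lines and keeping the best write result per stripe_cache_size might look like this. The helper names are made up, the regex only targets the summary format quoted in this thread, and it assumes fio's KB/MB/GB are 1024-based.

```python
import re

# Matches the end-of-run summary lines as they appear in this thread.
LINE_RE = re.compile(r"^\s*(READ|WRITE): .*?aggrb=([\d.]+)(KB|MB|GB)/s")
UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1, "GB": 1024}  # assumed 1024-based


def parse_aggrb(line):
    """Return (direction, aggregate bandwidth in MB/s), or None if no match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    direction, value, unit = m.groups()
    return direction, float(value) * UNIT_TO_MB[unit]


def best_setting(results, direction="WRITE"):
    """results maps stripe_cache_size -> list of fio summary lines."""
    scores = {}
    for size, lines in results.items():
        for line in lines:
            parsed = parse_aggrb(line)
            if parsed and parsed[0] == direction:
                # Keep the larger result of repeated runs, as done in this email.
                scores[size] = max(scores.get(size, 0.0), parsed[1])
    return max(scores, key=scores.get), scores


runs = {
    4096: ["  WRITE: io=131072MB, aggrb=1590MB/s, minb=1628MB/s"],
    8192: ["  WRITE: io=131072MB, aggrb=1521MB/s, minb=1557MB/s"],
    16384: ["  WRITE: io=131072MB, aggrb=1368MB/s, minb=1401MB/s"],
    32768: ["  WRITE: io=131072MB, aggrb=1138MB/s, minb=1165MB/s"],
}
best, scores = best_setting(runs)
print(best, scores[best])
```

Taking the maximum per setting mirrors the "use the larger result" approach described above; averaging would be a one-line change if the runs ever start to diverge.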
quoted
(let me know if you want the full fio output....)
No, the summary is fine.  What's more valuable is to have the top
output file for each run so I can see what's going on.  At 2 GB/s of
throughput your interrupt rate should be pretty high, and I'd like to
see the IRQ spread across the cores, as well as the RAID5 thread load,
among other things.  I haven't yet looked at the file you sent, but I'm
guessing it doesn't include this 1.6GB/s run.  I'm really interested in
seeing that one, and the ones for 16384 and 32768.  WRT the latter two,
I'm curious whether the much larger tables are causing excessive CPU
burn, which may in turn be what lowers throughput.
OK, will prepare and send soon...
quoted
This seems to show that DRBD did not slow things down at all... I don't
I noticed.
quoted
remember exactly when I did the previous fio tests with drbd connected,
but perhaps I've made changes to the drbd config since then and/or
upgraded from the debian stable drbd to 8.3.15
Maybe it wasn't actively syncing when you made these FIO runs.
It was "in sync" prior to running the tests, and remained in sync during
the tests... However, with the newer 8.3.15 I've adjusted the config so
that if the secondary falls behind, it will drop out of sync, and catch
up when it can. There is no way the secondary can be writing at
1.6GB/sec over a 1Gbps ethernet, to a RAID10 of 4 x 2TB HDDs....
quoted
Let's re-run the above tests with DRBD stopped:
...
stripe_cache_size = 4096
   READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52465msec, maxt=52465msec
  WRITE: io=131072MB, aggrb=1596MB/s, minb=1634MB/s, maxb=1634MB/s, mint=82126msec, maxt=82126msec
stripe_cache_size = 8192
   READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52469msec, maxt=52469msec
  WRITE: io=131072MB, aggrb=1514MB/s, minb=1550MB/s, maxb=1550MB/s, mint=86565msec, maxt=86565msec
...

Numbers are identical.  Either DRBD wasn't actually copying anything
during the previous FIO run, its nice level changed, its
configuration/behavior changed with the new version, or something.
Whatever the reason, it appears to be putting no load on the array.
Very surprising indeed... I'll still keep DRBD disconnected during the
day until I get a better handle on what is going on here.... I would
have expected *some* impact....
quoted
So, it looks like the ideal value is actually smaller (4096) although
there is not much difference between 8192 and 4096. It seems strange
that a larger cache size will actually reduce performance... I'll change
It's not strange at all, but expected.  As a table gets larger it takes
more CPU cycles to manage it and more memory bandwidth; your cache miss
rate increases, etc.  At a certain point this overhead becomes
detrimental instead of beneficial.  In your case the benefit of a larger
cache table outweighs the overhead and yields increased performance up to
an 80MB table size.  At 160MB and above the size of the table creates more
overhead than performance benefit.
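For reference, the table sizes quoted here follow from the md stripe cache formula: page size x number of member devices x stripe_cache_size. A minimal sketch, assuming the usual 4096-byte page on x86 (the function name is illustrative only):

```python
PAGE_SIZE = 4096  # bytes; assumed, typical for x86 Linux


def stripe_cache_mib(stripe_cache_size, n_devices, page_size=PAGE_SIZE):
    """Approximate RAM used by the md stripe cache, in MiB."""
    return stripe_cache_size * page_size * n_devices / (1024 * 1024)


for size in (256, 4096, 8192, 16384, 32768):
    print(size, stripe_cache_mib(size, 5), "MiB")
```

For the 5-drive array this gives 80 MiB at 4096 and 160 MiB at 8192, matching the figures above, and 640 MiB at 32768.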

This is what system testing/tuning is all about.
Of course, I suppose I assumed cache table management had zero cost
(CPU/memory bandwidth) but at these speeds, it would be quite a big
factor...
quoted
to 4096 for the time being, unless you think "real world" performance
might be better with 8192?
These FIO runs are hitting your IO subsystem much harder than your real
workloads ever will.  Stick with 4096.
Very true... 1.6GB/s is more than 8 x 1Gbps ethernet can carry (about
1GB/s in total), and 8 x 1Gbps is the maximum that all the machines can
push at the same time... and that is only write performance, read
performance is even higher.
quoted
Here are the results of re-running fio using the previous config (with
drbd connected with the stripe_cache_size = 8192):
   READ: io=4096MB, aggrb=2244MB/s, minb=2298MB/s, maxb=2298MB/s, mint=1825msec, maxt=1825msec
  WRITE: io=4096MB, aggrb=494903KB/s, minb=506780KB/s, maxb=506780KB/s, mint=8475msec, maxt=8475msec
Perhaps the old fio test just isn't as well suited to the way drbd
handles things. Though the issue would be what sort of data the real
users are doing, because if that matches the old fio test or the new fio
test, it makes a big difference.
The significantly lower throughput of the "old" FIO job has *nothing* to
do with DRBD.  It has everything to do with the parameters of the job
file.  I thought I explained the differences previously.  If not, here
you go:
Thanks :)
quoted
So, it looks like it is the stripe_cache_size that is affecting
performance, and that DRBD makes no difference whether it is connected
or not. Possibly removing it completely would increase performance
somewhat, but since I actually do need it, and that is somewhat
destructive, I won't try that :)
I'd do more investigating of this.  DRBD can't put zero load on the
array if it's doing work.  Given it's a read only workload, it's
possible the increased stripe cache is allowing full throttle writes
while doing 100MB/s of reads, without writes being impacted.  You'll
need to look deeper into the md statistics and/or monitor iostat, etc,
during runs with DRBD active and actually moving data.
Yes, will check this out more carefully before I will re-enable DRBD
during the day....
FIO runs on Windows:  http://www.bluestop.org/fio/
Will check into that, it will be the ultimate end-to-end test.... Also,
Yes, it will.  As long as you're running at least 16-32 threads per TS
client to overcome TCP/iSCSI over GbE latency, and the lack of AIO on
Windows.  And you can't simply reuse the same job file.  The docs tell
you which engine, and other settings, to use for Windows.
Well, I used mostly the same fio file... just changed the engine, and
size of the test down to 1GB (so the test would finish more quickly)
quoted
Hmmm, good point, I realised I could try and upgrade to the x64 windows
2003, but I think I'd prefer to just move up to 2008 x64 (or 2012)...
For now, I'll just keep using my hacky 4GB RAM drive for the pagefile...
Or violate BCP and run two TS instances per Xen, or even four, with the
appropriate number of users per each.  KSM will consolidate all the
Windows and user application read only files (DLLs, exes, etc), yielding
much more free real memory than with a single Windows TS instance.
AFAIK Windows has no memory merging so you can't over commit memory
other than with the page file, which is horribly less efficient than KSM.
BCP = Best Computing Practise ?
KSM = Kernel SamePage Merging ? (Had to ask wikipedia for this one)...

I'm not sure xen supports this currently.... However, in addition to
either saving RAM / spending more CPU managing this, there is also the
licensing consideration of purchasing more windows server licenses.
Overall, probably better spent on newer versions/upgrading...
quoted
I meant I hadn't crossed off as many items from my list of things to
do... Not that I hadn't improved performance significantly :)
I know, was just poking you in the ribs. ;)
Ouch :)
To find out how much of the 732MB/s write throughput increase is due to
buffering 512 stripes instead of 16, simply change it back to 256,
re-run my FIO job file, and subtract the write result from 1292MB/s.
So, running your FIO job file with the original 256 gives a write speed
of 950MB/s and the previous FIO file gives 509MB/s. So it would seem the
increase in stripe_cache_size from 256 to 4096 gives an increase in your
FIO job from 950MB/s to 1634MB/s which is a significant speed boost. I
72 percent increase with this synthetic workload, by simply increasing
the stripe cache.  Not bad eh?  This job doesn't present an accurate
picture of real world performance though, as most synthetic tests don't.

Get DRBD a hump'n and your LVM snapshot(s) in place, all the normal
server side load, then fire up the 32 thread FIO test on each TS VM to
simulate users (I could probably knock out this job file if you like).
Then monitor the array throughput with iostat or similar.  This would be
about as close to peak real world load as you can get.
Interestingly I noted that fio can run in server/client mode, so in
theory I should be able to run a central job to instruct all the other
machines to start testing at the same time.... I'll work on this soon...
quoted
must wonder why we have a default of 256 when this can make such a
significant performance improvement?  A value of 4096 with a 5 drive raid
array is only 80MB of cache, I suspect very few users with a 5 drive
RAID array would be concerned about losing 80MB of RAM, and a 2 drive
RAID array would only use 32MB ...
The stripe cache has nothing to do with device count, but hardware
throughput.  Did you happen to notice what occurred when you increased
cache size past your 4096 sweet spot to 32768?  Throughput dropped by
~500MB/s, almost 1/3rd.  Likewise, for the slow rust array whose sweet
spot is 512, making the default 4096 will decrease his throughput, and
eat 80MB RAM for nothing.  Defaults are chosen to work best with the
lowest common denominator hardware, not the Ferrari.
Oh yeah, I forgot about HDD's :) However, I would have thought the cache
would be even more effective when the CPU/memory is so much faster than
the storage medium.... Oh well, that is somebody else's performance
testing/tuning job to work out, I've got enough on my plate right now :)


Thanks to the tip about running fio on windows, I think I've now come
full circle.... Today I had numerous complaints from users that their
Outlook froze/etc, and in some cases the TS couldn't copy a file from
the DC to its local C: (iSCSI). The cause was the DC was logging events
with event ID 2020 which is "The server was unable to allocate from the
system paged pool because the pool was empty". Supposedly the solution
to this is tuning two random numbers in the registry, not much is said
what the consequences of this are, nor about how to calculate the
correct value. However, I think I've worked it out... first, let's look
at the fio results.

Running fio on one of the TS (win2003) against its local C: (xen ->
iSCSI -> etc) gives this result:
READ: io=16384MB, aggrb=239547KB/s, minb=239547KB/s, maxb=239547KB/s, mint=70037msec, maxt=0msec
WRITE: io=16384MB, aggrb=53669KB/s, minb=53669KB/s, maxb=53669KB/s, mint=312601msec, maxt=0msec
To me, the read performance is as good as it can get (239MB/s looks like
2 x 1Gbps ethernet performance)...
The write performance might be a touch slow, but 53MB/s should be more
than enough to keep the users happy. I can come back to this later,
would be nice to see this closer to 200MB/s...
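As a rough check on the "2 x 1Gbps" reading of these numbers, the conversion can be sketched like this. It assumes fio's KB is 1024 bytes and takes 1 Gbps as 125 MB/s of raw line rate, ignoring TCP/iSCSI overhead; the helper names are made up.

```python
GBE_MB_S = 125.0  # 1 Gbps = 125 MB/s raw line rate, protocol overhead ignored


def kb_s_to_mb_s(kb_s):
    """Convert fio's KB/s (assumed 1 KB = 1024 bytes) to decimal MB/s."""
    return kb_s * 1024 / 1_000_000


def gbe_links(mb_s):
    """How many fully saturated 1 Gbps links a given rate corresponds to."""
    return mb_s / GBE_MB_S


for label, kb_s in (("read", 239547), ("write", 53669)):
    mb_s = kb_s_to_mb_s(kb_s)
    print(f"{label}: {mb_s:.0f} MB/s = {gbe_links(mb_s):.2f} x 1Gbps")
```

Under those assumptions the read figure comes out just under two saturated links, and the write figure at a bit under half of one.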

Running the same fio test on the same TS (win2003) against a SMB share
from the DC (SMB -> Win2000 -> Xen -> iSCSI -> etc)
READ: io=16384MB, aggrb=14818KB/s, minb=14818KB/s, maxb=14818KB/s, mint=1132181msec, maxt=0msec
WRITE: io=16384MB, aggrb=8039KB/s, minb=8039KB/s, maxb=8039KB/s, mint=2086815msec, maxt=0msec
This is pretty shockingly slow, and seems to clearly indicate why the
users are so upset... 14MB/s read and 8MB/s write, it's a wonder they
haven't formed a mob and lynched me yet!

However, the truly useful information is that during the read portion of
the test, the DC has a CPU load of 100% (no variation, just pegged at
100%), while during the write portion it fluctuates between 80% and 100%.

This could also indicate why the pool was empty, if the CPU is so busy,
it doesn't have time to clean the pool, and so it runs out... One of the
registry entries was to start cleaning the pool sooner (default 80%
suggested to reduce down to 60% or even 40%).

So, I tried again to re-configure windows to support multiprocessor, but
that was another clear failure. (You can change the value/driver in
windows easily, but on reboot it fails to find the HDD, so BSoD or
usually just hangs). Supposedly this can be changed with a "install on
top", but I'll need to take a copy and test that out remotely.
Especially being the DC I am not very comfortable with that.

Next option is to take another shot at upgrading to Win2003, which should
solve the multiprocessor issue, as well as provide much better support
for virtualisation. Though again, it's a major upgrade and could just
introduce a whole bunch of other problems....

Anyway, I've tried to tune a few basic things:
Remove some old devices from Device Manager on the DC
Uninstall some applications/drivers
Disable old unused services (backup software)
Extended the data drive from 279GB to 300GB (it was 90% full, now 84% full)
Adjusted registry entry to try and allocate additional memory to the pool
Increased xen memory allocation for the DC VM from 4096MB to 4200MB. I
suspect xen was keeping some of this memory for its own overhead, and I
want the VM to get a full 4GB.

I just need to restart the san, to check it is picking up the right
settings on boot, and then put everything back online, and I'm done for
another night....

I'll come back to the benchmarking as soon as I get this DC CPU issue
resolved.

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au