Thread (130 messages) 130 messages, 15 authors, 2013-04-17

Re: RAID performance

From: Adam Goryachev <hidden>
Date: 2013-02-10 04:40:23

Stan Hoeppner [off-list ref] wrote:
On 2/8/2013 12:44 PM, Adam Goryachev wrote:
On 09/02/13 04:10, Stan Hoeppner wrote:
From the switch stats, ports 5 to 8 are the bonded ports on the storage
server (iSCSI traffic):

Int  PacketsRX  ErrorsRX  BroadcastRX  PacketsTX  ErrorsTX  BroadcastTX
5    734007958  0         110          120729310  0         0
6    733085348  0         114          54059704   0         0
7    734264296  0         113          45917956   0         0
8    732964685  0         102          95655835   0         0
I'm glad I asked you for this information.  This clearly shows that the
server is performing LACP round robin fanning nearly perfectly.  It also
shows that the bulk of the traffic coming from the W2K DC, which
apparently hosts the Windows shares for TS users, is being pumped to the
storage server over port 5, the first port in the switch's bonding
group.  The switch is doing adaptive load balancing with transmission
instead of round robin.  This is the default behavior of many switches
and is fine.
Is there some method to fix this on the switch? I have configured the
switch so that those 4 ports are a single LAG, which I assumed meant the
switch would be smart enough to load balance properly... Guess I never
checked that side of it though...
After thinking this through more thoroughly, I realize your IO server
may be doing broadcast aggregation and not round robin.  However, in
either case this is bad, as it will cause out of order packets or
duplicate packets.  Both of these are wrong for your network
architecture and will cause problems.  RR will cause TCP packets to be
reassembled out of sequence, causing extra overhead at the receiver, and
possibly errors if not reassembled in correct order.  Broadcast will
cause duplicate packets to arrive at the receiver, which must discard
them.  Both flood the receiver's switch port.
They were definitely RR before.
The NIC ports on the IO server need to be configured as 802.3ad Dynamic
if using the Linux bonding driver.  If you're using the Intel driver
LACP it should be set to this as well, though the name may be
different.

Once you do, you will find (as I had to relearn, it's been a while) that
any single session is limited by default to a single physical link of
the group.  LACP only gives increased bandwidth across links when
multiple sessions are present.  This is done to preserve proper packet
ordering per session, which is corrupted when fanning packets of a
single session across all links. In the default Dynamic mode, you don't
have the IO server flooding the DC with more packets than it can
handle, because the two hosts will be communicating over the same
link(s), no more, so bandwidth and packet volume are equal between them.

So, you need to disable RR or broadcast, whichever it is currently, on
the IO server, and switch it to Dynamic mode.  This will instantly kill
the flooding problem, stop the switch from sending PAUSE frames to the
IO server, and might eliminate the file/IO errors.  I'm not sure on this
last one, as I've not seen enough information about the errors (or the
actual errors themselves).
OK, so I changed the Linux iSCSI server to 802.3ad mode, and that killed all networking, so I changed the switch config to use LACP, and then it was working again.
I then tested network performance from a single physical machine (just a simple dd if=<iscsi device> of=/dev/null to read a few GB of data). I had some interesting results. Initially, each server individually could read around 120MB/s, so I tried 2 at the same time, and each got 120MB/s, so I tried three at a time, same result. Finally, testing 4 in parallel, two got 120MB/s and the other two got around 60MB/s. Eventually I worked out this:

Server    Switch port
1               6
2               5
3               7
4               7
5               7
6               7
7               7
8               6

So, for some reason, port 8 was never used (unless I physically disconnected ports 5, 6 and 7). Also, a single port was shared by 5 machines, resulting in around 20MB/s for each (when testing all in parallel).
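
The per-machine figures above follow from dividing one link's throughput among the machines hashed onto it. A minimal Python sketch of that arithmetic, assuming roughly 120MB/s usable per 1Gbps port (the single-machine dd figure) and an even split between concurrent readers; the function and constant names are illustrative only:

```python
from collections import Counter

# Server -> switch port, as observed with layer2 hashing (first table above).
PORT_OF_SERVER = {1: 6, 2: 5, 3: 7, 4: 7, 5: 7, 6: 7, 7: 7, 8: 6}

# Assumption: ~120 MB/s usable per 1Gbps port, shared evenly by active readers.
LINK_MB_S = 120.0


def per_server_throughput(port_of_server, link_mb_s=LINK_MB_S):
    """Estimate each server's read rate when all servers read in parallel."""
    readers_per_port = Counter(port_of_server.values())
    return {
        server: link_mb_s / readers_per_port[port]
        for server, port in port_of_server.items()
    }


if __name__ == "__main__":
    for server, rate in sorted(per_server_throughput(PORT_OF_SERVER).items()):
        print(f"server {server}: ~{rate:.0f} MB/s")
```

With this mapping the five machines on port 7 come out at about 24MB/s each, which is in line with the roughly 20MB/s measured.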

I eventually changed xmit_hash_policy on the iSCSI server to 1 (layer3+4) instead of layer2 hashing. This resulted in a minor improvement as follows:
Server    Switch port
1               6
2               5
3               8
4               6
5               6
6               6
7               6
8               7

So now, I still have 5 machines sharing a single port, but the other three get a full port each. I'm not sure why the balancing is so poor... The port number should be the same for all machines (iSCSI), but the IPs are consecutive (x.x.x.31 - x.x.x.38).
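
For reference, the bonding.txt document mentioned later in this message describes the two transmit hash policies with simple formulas: layer2 as (source MAC XOR destination MAC) modulo slave count, and layer3+4 as ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo slave count. A rough Python sketch of those formulas follows. The addresses and ports are hypothetical (the real ones are elided above), only the last MAC byte is used as a simplification, and newer kernels may hash differently. It only models frames the Linux server transmits; the switch picks the link for the opposite direction with its own algorithm.

```python
from ipaddress import IPv4Address


def layer2_hash(src_mac_last_byte, dst_mac_last_byte, slave_count):
    """(source MAC XOR destination MAC) modulo slave count.

    Simplification: only the last byte of each MAC address is used.
    """
    return (src_mac_last_byte ^ dst_mac_last_byte) % slave_count


def layer34_hash(src_ip, dst_ip, src_port, dst_port, slave_count):
    """((src port XOR dst port) XOR ((src IP XOR dst IP) AND 0xffff)) modulo slave count."""
    ip_bits = (int(IPv4Address(src_ip)) ^ int(IPv4Address(dst_ip))) & 0xFFFF
    return ((src_port ^ dst_port) ^ ip_bits) % slave_count


if __name__ == "__main__":
    SERVER_IP = "10.0.0.10"  # hypothetical iSCSI server address
    ISCSI_PORT = 3260        # standard iSCSI target port
    CLIENT_PORT = 50000      # hypothetical ephemeral port on the initiator
    for last_octet in range(31, 39):
        client_ip = f"10.0.0.{last_octet}"
        slave = layer34_hash(SERVER_IP, client_ip, ISCSI_PORT, CLIENT_PORT, 4)
        print(f"{client_ip} -> slave index {slave}")
```

Because the result is a plain modulo of XORed values, several initiators can easily land on the same slave. The initiator side of each iSCSI TCP connection normally uses an ephemeral port, so the port term is not necessarily identical across machines.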

Anyway, so I've configured the DC on machine 2, the three testing servers and two of the TS on the "shared port" machines, and the third TS and DB server onto the remaining machines.

Any suggestions on how to better balance the traffic would be appreciated!!!
That said, disabling the Windows write
caching on the local drives backed by the iSCSI LUNs might fix this as
well.  It should never be left enabled in a configuration such as
yours.
I have now done this across all the Windows servers for all iSCSI drives, but left it enabled for the RAM drive holding the pagefile.
So, traffic seems reasonably well balanced across all four links
The storage server's transmit traffic is well balanced out of the
NICs, but the receive traffic from the switch is imbalanced, almost
3:1 between ports 5 and 7.  This is due to the switch doing ALB, and
helps us diagnose the problem.
The switch doesn't seem to have any setting to configure ALB or RR,
or at least I don't know what I'm looking for.... In any case, I suppose
if both sides of the network have equivalent bandwidth, then it should
be OK....
Let's see, I think you listed the switch model...  yes, GS716T-200

It does stock 802.3ad static and dynamic link aggregation, dynamic by
default it appears, so standard session based streams.  This is what
you want.
I'm assuming that is what I have now, but I didn't do write tests, so I can't be sure the switch will properly balance the traffic back to the server.
Ah, here you go.  It does have port based ingress/egress rate limiting.
So you should be able to slow down the terminal server hosts so no
single one can flood the DC.  Very nice.  I wouldn't have expected this
in this class of switch.
I don't know if I want to do this, as it will also limit SMB, RDP, etc. traffic just as much.... I'll leave it for now, and perhaps come back to it if it is still an issue.
So, you can fix the network performance problem without expending any
money.  You'll just have one TS host and its users bogged down when
someone does a big file copy.  And if you can find a Windows policy to
limit IO per user, you can solve it completely.
I'll look into this later, but this is pretty much acceptable, the main issue is where one machine can impact other machines.
That said, I'd still get two or 4 bonded ports into that DC share
server to speed things up for everyone.
OK, I'll need to think about this one carefully. I wanted all 8 machines to be identical so that we can do live migration of the virtual machines, and so that if physical hardware fails it is easy to reboot a VM on another physical host. If I add specialised hardware, then the VM needs to run on that host. It would still work on another host with reduced performance, which is somewhat acceptable but not preferable, since I might end up trying to fix a hardware failure and a performance issue at the same time, or other random issues related to the reduced performance.
Add the 4 port to the DC if it'll work in the x16 slot, if not use two
of the single port PCIe x1 NICs I mentioned and bond them in 802.3ad
Dynamic mode, same as with the IO server.  Look into Windows TS per
user IO rate limits.  If this capability exists, limit each user to 50MB/s.

And with that, you should have fixed all the network issues.  Combined
with the changes to the IO server, you should be all squared away.
OK, so apparently the motherboard on the physical machines will work fine with the dual or quad ethernet cards.

I'm not sure how this solves the problem though.

1) TS user asks the DC to copy file1 from the shareA to shareA in a different folder
2) TS user asks the DC to copy file1 from the shareA to shareB
3) TS user asks the DC to copy file1 from the shareA to local drive C:

In cases 1 and 2, I assume the DC will not actually send the file content over SMB, it will just do the copy locally, but the DC will read from the SAN at single ethernet speed and write to the SAN at single ethernet speed, since even if the DC uses RR to send the data at 2x1Gbps, the switch is LACP so will forward to the iSCSI server at 1Gbps. Hence, iSCSI is maxed out at 1Gbps... The iSCSI server can potentially still satisfy other servers if LACP is not making them share the same ethernet port. The DC may be able to maintain SMB/RDP traffic if LACP happens to choose the second port, but if LACP puts it on the same port, then the second ethernet port is wasted.

Regardless of what number of network ports are on the physical machines, the SAN will only send/receive at a max of 1G per machine, so the DC is still limited to 1G total iSCSI bandwidth. If I use RR on the DC, then it has 2G write and only 1G read performance, which seems strange.

The more I think about this, the worse it seems to get... It almost seems I should do this:
1) iSCSI uses RR and switch uses LAG (LACP)
2) All physical machines have a dual ethernet and use RR, and the switch uses LAG (LACP)
3) On the iSCSI server, I configure some sort of bandwidth shaping, so that the DC gets 2Gbps, and all other machines get 1Gbps
4) On the physical machines, I configure some sort of bandwidth shaping so that all VM's other than the DC get limited to 1Gbps

This seems like a horrible, disgusting hack, and I would really hate myself for trying to implement it, and I don't know that Linux will be good at limiting speeds this fast including CPU overhead concerns, etc

I'm in a mess here, and not sure any of this makes sense...

How about:
1) Add dual port ethernet to each physical box
2) Use the dual port ethernet in RR to connect to the iSCSI
3) Use the onboard ethernet for the user network
4) Configure the iSCSI server in RR again

This means the TS and random desktops get a full 1Gbps for SMB access, the same as they had when it was a physical machine.
The DC gets full 2Gbps access to the iSCSI server. The iSCSI server might flood the link, but I assume since there is only iSCSI traffic, we don't care.
The TS can also do 2Gbps to the iSCSI server, but again this is OK because the iSCSI server has 4Gbps available.
If a user copies a large file from the DC to a local drive, it floods the 1G user LAN with SMB, which uses only 1Gbps on the iSCSI LAN for the DC, and 1Gbps for the TS on the iSCSI LAN (total 2Gbps on the iSCSI SAN).

To make this work, I need 8 x dual port cards, or in reality, 2 x 4port cards plus 4 x 2port cards (putting 4port cards into the san, and moving existing 2port cards), then I need a 48 port switch to connect everything up, and then I'm finished.

Add SATA card to the SAN, and I'm laughing.... sure, it's a chunk of new hardware, but it just doesn't seem to work right any other way I think about it.

So, purchase list becomes:
2 x 4port ethernet card $450 each
4 x 2port ethernet card $161 each
1 x 48 port switch (any suggestions?) $600
2 x LSI HBA  $780 (total)
Total Cost: $2924
Again, you have all the network hardware you need, so this is
completely unnecessary.  You just need to get what you have
configured correctly.
Everything above should be even more helpful.  My apologies for not
having precise LACP insight in my previous post.  It's been quite a
while and I was rusty, and didn't have time to refresh my knowledge
base before the previous post.
I don't see how LACP will make it better. Well, it will stop the pause frames being sent, but other than that, it seems to limit the bandwidth to even less than 1Gbps. The question was asked if it would be worthwhile to just upgrade to a 10Gbps network for all machines.... I haven't looked at costing on that option, but I assume it is really just the same problem anyway: either speeds are unbalanced if the server has more bandwidth, or speeds are balanced if the server has equal bandwidth (limited balancing with LACP aside).

BTW, reading at www.kernel.org/doc/Documentation/networking/bonding.txt in chapter 12.1.1 I think maybe balance-alb might be a better solution? It sounds like it would at least do a better job at avoiding 5 machines being on the same link .... 
Just in case the budget dollars don't stretch that far, would it be
a reasonable budget option to do this:
Add 1 x 2port ethernet card to the DC machine
Add 7 x 1port ethernet card to the rest of the machines $32 (Intel
Pro 1000GT DT Adapter I 82541PI Low-Profile PCI)
Add 1 x 24port switch $300

Total Cost: $685
If the DC can take a PCIe x4 dual port card, that should work fine with
the reconfiguration I described above.  The rest of the gear in that
$685 is wasted--no gain.  Use part of the remaining balance for the LSI
9207-8i HBA.  That will make a big difference in throughput once you
get alignment and other issues identified and corrected, more than double
your current bandwidth and IOPS, making full time DRBD possible.
I will suggest the HBA anyway, might as well improve that now anyway, and it also adds options for future expansion (up to 8 x SSD's). 

I can't find that exact one, my supplier has suggested the LSI SAS 9211-8i pack for $390 or the LSI MegaRAID SAS 9240-8i pack for $429, is one of these equivalent/comparable?
I'm assuming this would stop sharing SMB/iSCSI on the same ports, and
improve the ability for the TS machines to at least talk to the DC and
know the IO is "in progress" and hence reduce the data loss/failures?
Again this is all unnecessary once you implement the aforementioned
changes.  If the IO errors on the TS machines still occur the cause
isn't in the network setup.  Running CIFS(SMB)/iSCSI on the same port is
done 24x7 by thousands of sites.  This isn't the cause of the TS IO
errors.  Congestion alone shouldn't cause them either, unless a Windows
kernel iSCSI packet timeout is being exceeded or something like that,
which actually seems pretty plausible given the information you've
provided.  I admit I'm not a Windows iSCSI expert.  If that is the case
then it should be solved by the mentioned LACP configuration and two
bonded ports on the DC box.
I suspect a part of all this was caused by the write caching on the windows drives, so hopefully that situation will improve now.

When doing the above dd tests, I noticed one machine would show 2.6GB/s for the second or subsequent reads (i.e., cached) while all the other machines would show consistent read speeds equivalent to uncached speeds. If this one machine had to read a large enough amount of data (more than RAM) then it dropped back to normal expected uncached speeds. I worked out that this was the machine on which I had experimented with installing multipath-tools, so I installed it on all the other machines, and hopefully it will allow improved performance through caching of the iSCSI devices.

I haven't done anything with the partitions as yet, but are you basically suggesting the following:
1) Make sure the primary and secondary storage servers are in sync and running
2) Remove one SSD from the RAID5, delete the partition, clear the superblock/etc
3) Add the same SSD back as /dev/sdx instead of /dev/sdx1
4) Wait for sync
5) Go to 2 with the next SSD etc

This would move everything to the beginning of the disk by a small amount, but not change anything relatively regarding DRBD/LVM/etc .... 

Would I then need to do further tests to see if I need to do something more to move DRBD/LVM to the correct offset to ensure alignment? How would I test if that is needed?
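
One rough way to reason about the alignment question is to check whether the byte offset at which data starts is a whole multiple of the boundary that matters. A minimal Python sketch, assuming 512-byte sectors and using 4 KiB and 64 KiB boundaries purely as example values (the right boundary depends on the SSD and the md chunk size, neither of which is stated here). The start sectors shown are common defaults, not values taken from these disks:

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def is_aligned(start_sector, boundary_bytes, sector_bytes=SECTOR_BYTES):
    """Return True if the start offset is a whole multiple of boundary_bytes."""
    return (start_sector * sector_bytes) % boundary_bytes == 0


if __name__ == "__main__":
    # Hypothetical start sectors: 63 (legacy DOS layout), 2048 (1 MiB), 0 (whole disk).
    for start in (63, 2048, 0):
        for boundary in (4096, 64 * 1024):
            print(f"start sector {start}, boundary {boundary}: "
                  f"aligned={is_aligned(start, boundary)}")
```

The same check would probably need repeating for the data start offset of each layer stacked on top (md, LVM and so on), wherever that layer places metadata ahead of the data.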
Keep us posted.
Will do, I'll have to price up the above options, and get approval for
purchase, and then will take a few days to get it all in place/etc...
Given the temperature under the collar of the client, I'd simply spend
on adding the 2 bonded ports to the DC box, make all of the LACP
changes, and straighten out alignment/etc issues on the SSDs, md stripe
cache, etc.  This will make substantial gains.  Once the client sees the
positive results, then recommend the HBA for even better performance.
Remember, Intel's 520 SSD data shows nearly double the performance using
SATA3 vs SATA2.  Once you have alignment and md tuning squared away,
moving to the LSI should nearly double your block throughput.
I'd prefer to do everything at once, then they will only pay once, and they should see a massive improvement in one jump. Smaller incremental improvements are harder for them to see..... Also, the HBA is not so expensive, I always assumed they were at least double or more in price....

Apologies if the above is 'confused', but I am :)

PS, was going to move one of the dual port cards from the secondary san to the DC machine, but haven't yet since I don't have enough switch ports, and now I'm really unsure whether what I have done will be an improvement anyway. Will find out tomorrow....

Summary of changes (more for my own reference in case I need to undo it tomorrow):
1) disable disk cache on all windows machines
2) san1/2 convert from balance-rr to 802.3ad and add xmit_hash_policy=1
3) change switch LAG from Static to LACP
4) install multipath-tools on all physical machines (no config, just a reboot)

Thanks,
Adam


Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au