Thread: 20 messages, 7 authors, 2010-09-30

Re: RFC: MTU for serving NFS on Infiniband

From: Chuck Lever <chuck.lever@oracle.com>
Date: 2010-08-26 14:59:24
Also in: lkml

On Aug 26, 2010, at 7:40 AM, Marc Aurele La France wrote:
On Tue, 24 Aug 2010, Stephen Hemminger wrote:
quoted
On Tue, 24 Aug 2010 23:20:41 +0100
Ben Hutchings [off-list ref] wrote:
quoted
On Tue, 2010-08-24 at 13:49 -0600, Marc Aurele La France wrote:
quoted
On Tue, 24 Aug 2010, Ben Hutchings wrote:
quoted
On Tue, 2010-08-24 at 09:14 -0600, Marc Aurele La France wrote:
quoted
On Mon, 23 Aug 2010, Stephen Hemminger wrote:
quoted
On Mon, 23 Aug 2010 08:44:37 -0600 (MDT)
Marc Aurele La France [off-list ref] wrote:
quoted
In regrouping for my next tack at this, I noticed that all stack traces go
through ip_append_data().  This would be ip6_append_data() in the IPv6 case.
A _very_ rough draft that would have ip_append_data() temporarily drop down
to a smaller fake MTU follows ...
quoted
quoted
quoted
quoted
quoted
quoted
Why doesn't NFS generate page size fragments?  Does Infiniband or your
device not support this?  Anything that requires higher order allocation
is going to be unstable under load.  Let's fix the cause, not apply a band-aid
solution to the symptom.
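To put a number on "higher order allocation": a linear buffer needs physically contiguous pages, and the allocation order is the power of two that covers it. The sketch below is illustrative arithmetic only. It assumes 4 KiB pages and ignores the headroom and bookkeeping the kernel adds to each buffer, so real orders can be one step higher.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def alloc_order(nbytes, page_size=PAGE_SIZE):
    """Smallest order such that page_size * 2**order >= nbytes."""
    pages = -(-nbytes // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


if __name__ == "__main__":
    # 65520 is the MTU mentioned later in this thread; the others are
    # common Ethernet values included only for comparison.
    for mtu in (1500, 9000, 65520):
        print(f"MTU {mtu:5d}: order-{alloc_order(mtu)} allocation")
```

Under these assumptions a 65520-byte linear payload needs an order-4 allocation (16 contiguous pages), while page-sized fragments never need more than order 0.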
quoted
quoted
quoted
quoted
quoted
From what I can tell, IP fragmentation is done centrally.
quoted
quoted
quoted
quoted
Stephen and I are not talking about IP fragmentation, but about the
ability to append 'fragments' to an skb rather than putting the entire
packet payload in a linear buffer.  See
<http://vger.kernel.org/~davem/skb_data.html>.
quoted
quoted
quoted
Any payload has to either fit in the MTU, or has to be broken up into
MTU-sized (or less) fragments, come hell or high water.  That this is done
centrally is a good thing.
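As a rough illustration of the "MTU-sized (or less)" arithmetic, the sketch below computes IPv4 fragment payload sizes for a given MTU. It assumes a 20-byte IPv4 header with no options and relies on the rule that every fragment except the last carries a multiple of 8 bytes. The function name and the 32 KiB example payload are hypothetical, not taken from the patch under discussion.

```python
IP_HEADER = 20  # assumption: IPv4 header without options


def fragment_sizes(ip_payload_len, mtu, ip_header=IP_HEADER):
    """Return the payload size of each IPv4 fragment for one datagram."""
    # Non-final fragments must carry a multiple of 8 bytes.
    per_frag = (mtu - ip_header) // 8 * 8
    if per_frag <= 0:
        raise ValueError("MTU too small to carry any payload")
    sizes = []
    remaining = ip_payload_len
    while remaining > 0:
        take = min(per_frag, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


if __name__ == "__main__":
    # Hypothetical 32 KiB of data plus the 8-byte UDP header.
    payload = 32768 + 8
    for mtu in (1500, 65520):
        sizes = fragment_sizes(payload, mtu)
        print(f"MTU {mtu}: {len(sizes)} fragment(s), last carries {sizes[-1]} bytes")
```

Passing a smaller "fake" MTU to the same function simply yields more, smaller fragments; nothing else about the datagram changes.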
quoted
quoted
Not necessarily.  Offloading it to hardware, where possible, is usually
a performance win.
ip_append_data() deals with that already.
quoted
quoted
quoted
It is the "(or less)" part that I am working towards here.
quoted
quoted
The inability to allocate large linear buffers is not a good reason to
generate packets smaller than the MTU.
Generating smaller-than-MTU fragments is better than giving up and returning an error in my book.
quoted
If the NFS server is smart enough to generate:
 Header (skb) + one or more pages in fragment list
then IP fragmentation could do fragmentation by allocating
new header skbs (small) and assigning the same pages to
multiple skbs using the page ref count.
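The idea can be modelled in a few lines. This is a toy model, not kernel code: `Page`, `Frag`, `Skb`, and `split_by_reference` are hypothetical names, the header is reused unchanged for simplicity (a real IP fragment header carries its own offset and flags), and "taking a reference" is just incrementing a counter.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    data: bytes
    refcount: int = 1  # the original owner holds one reference


@dataclass
class Frag:
    page: Page
    offset: int
    length: int


@dataclass
class Skb:
    header: bytes
    frags: list = field(default_factory=list)


def split_by_reference(skb, max_payload):
    """Split a paged skb into smaller ones that share the same pages."""
    out = []
    cur = Skb(header=skb.header)
    room = max_payload
    for frag in skb.frags:
        offset, left = frag.offset, frag.length
        while left > 0:
            if room == 0:
                # Current skb is full: start a new small header skb.
                out.append(cur)
                cur = Skb(header=skb.header)
                room = max_payload
            take = min(left, room)
            frag.page.refcount += 1  # share the page, do not copy it
            cur.frags.append(Frag(frag.page, offset, take))
            offset += take
            left -= take
            room -= take
    if cur.frags:
        out.append(cur)
    return out


if __name__ == "__main__":
    pages = [Page(bytes(4096)) for _ in range(2)]
    big = Skb(b"hdr", [Frag(p, 0, 4096) for p in pages])
    parts = split_by_reference(big, 1480)
    print(len(parts), [p.refcount for p in pages])
```

No payload bytes are copied and no large contiguous buffer is needed; only the small per-fragment headers are newly allocated.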
quoted
It obviously isn't working that way.
Point of clarification:  we're talking about the client here, not the server.  But, yes, it doesn't work that way.
quoted
The whole problem is moot because NFS over UDP has known data corruption
issues in the face of packet loss.  The sequence number of the IP fragment
can easily wrap around, causing old data to be grouped with new data, and
the UDP checksum is so weak that the resulting UDP packet will be consumed by the NFS
client and passed to the user application as a corrupted disk block.
quoted
DON'T USE NFS OVER UDP!
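A back-of-the-envelope estimate shows why the wraparound concern grows with link speed. The IPv4 Identification field used to group fragments is 16 bits wide, so it repeats after 65536 datagrams. The sketch below assumes a single sender saturating the link with equal-sized datagrams and consuming IDs sequentially; the datagram sizes and link rates are hypothetical examples, and the 30-second figure is the usual Linux default reassembly timeout, stated here as an assumption about the receiver's configuration.

```python
IP_ID_SPACE = 1 << 16      # IPv4 Identification field is 16 bits
REASSEMBLY_TIMEOUT = 30.0  # assumption: default reassembly timeout, seconds


def seconds_to_wrap(datagram_bytes, link_bits_per_sec):
    """Time for a saturating sender to cycle through every IP ID once."""
    datagrams_per_sec = link_bits_per_sec / (datagram_bytes * 8)
    return IP_ID_SPACE / datagrams_per_sec


if __name__ == "__main__":
    for size, rate in ((8192, 1e9), (8192, 10e9), (32768, 10e9)):
        t = seconds_to_wrap(size, rate)
        risky = t < REASSEMBLY_TIMEOUT
        print(f"{size} B at {rate / 1e9:g} Gbit/s: wrap in {t:.2f} s, "
              f"inside reassembly window: {risky}")
```

Whenever the wrap time is shorter than the reassembly timeout, a fragment left over from a lost datagram can still be waiting when a new datagram reuses its ID, which is the mis-grouping described above.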
Steady now.  There's no need to YELL nor be arrogant.  You and I both know there's a place for NFS over UDP.  That's not changing any time soon.  While I'm aware of the issue you brought up, it is separate from the one at hand in this discussion.

I do want to thank you, however, for reminding me of TCP.  It's something 20/20 hindsight says I should have checked out before starting this thread. Logistically, it'll be a few days before I can do so though.  If that allows me to increase the MTU all the way up to 65520, then this UDP thing will likely remain unresolved.
On advanced cluster-area networks with large MTUs, the ACK packets in TCP will probably kill your performance.  That's one of the main reasons we keep NFS over UDP on life support!  :-)

-- 
chuck[dot]lever[at]oracle[dot]com