Re: RDMA will be reverted

RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-06-28
Re: RDMA will be reverted · Evgeniy Polyakov <hidden> · 2006-06-28
Re: RDMA will be reverted · Tom Tucker <hidden> · 2006-06-28
Re: RDMA will be reverted · Steve Wise <hidden> · 2006-06-28
Re: RDMA will be reverted · Roland Dreier <hidden> · 2006-06-29
Re: RDMA will be reverted · YOSHIFUJI Hideaki / 吉藤英明 <hidden> · 2006-06-29
Re: RDMA will be reverted · Roland Dreier <hidden> · 2006-06-29
Re: RDMA will be reverted · YOSHIFUJI Hideaki / 吉藤英明 <hidden> · 2006-06-29
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-06-29
Re: RDMA will be reverted · Tom Tucker <hidden> · 2006-06-29
Re: RDMA will be reverted · Tom Tucker <hidden> · 2006-06-29
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-06-29
Re: RDMA will be reverted · Tom Tucker <hidden> · 2006-06-29
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-06-29
Re: RDMA will be reverted · Tom Tucker <hidden> · 2006-06-29
Re: RDMA will be reverted · Andi Kleen <hidden> · 2006-06-29
Re: RDMA will be reverted · James Morris <jmorris@namei.org> · 2006-06-29
Re: RDMA will be reverted · Roland Dreier <hidden> · 2006-06-30
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-06-30
Re: RDMA will be reverted · Tom Tucker <hidden> · 2006-06-30
Re: RDMA will be reverted · Andi Kleen <hidden> · 2006-07-01
Re: RDMA will be reverted · Andy Gay <hidden> · 2006-07-04
Re: RDMA will be reverted · Andi Kleen <hidden> · 2006-07-04
Re: RDMA will be reverted · Andy Gay <hidden> · 2006-07-04
Re: RDMA will be reverted · Andi Kleen <hidden> · 2006-07-04
Re: RDMA will be reverted · Andy Gay <hidden> · 2006-07-04
Re: RDMA will be reverted · Andi Kleen <hidden> · 2006-07-05
Re: RDMA will be reverted · Roland Dreier <hidden> · 2006-07-04
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-24
Re: RDMA will be reverted · Andi Kleen <hidden> · 2006-07-24
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-24
Re: RDMA will be reverted · Andi Kleen <hidden> · 2006-07-25
Re: RDMA will be reverted · Rick Jones <hidden> · 2006-07-25
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-25
Re: RDMA will be reverted · Rick Jones <hidden> · 2006-07-25
Re: RDMA will be reverted · Andi Kleen <hidden> · 2006-07-25
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-25
Re: RDMA will be reverted · Rick Jones <hidden> · 2006-07-25
Re: RDMA will be reverted · Andi Kleen <hidden> · 2006-07-25
Re: RDMA will be reverted · Rick Jones <hidden> · 2006-07-25
Re: RDMA will be reverted · Andi Kleen <hidden> · 2006-07-25
Re: RDMA will be reverted · Evgeniy Polyakov <hidden> · 2006-07-25
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-25
Re: RDMA will be reverted · Evgeniy Polyakov <hidden> · 2006-07-25
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-25
Re: RDMA will be reverted · Evgeniy Polyakov <hidden> · 2006-07-25
Re: RDMA will be reverted · Tom Tucker <hidden> · 2006-07-05
Re: RDMA will be reverted · Steve Wise <hidden> · 2006-07-05
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-24
RE: RDMA will be reverted · Caitlin Bestler <hidden> · 2006-07-24
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-24
RE: RDMA will be reverted · Caitlin Bestler <hidden> · 2006-07-24
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-01
Re: RDMA will be reverted · Roland Dreier <hidden> · 2006-07-04
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-05
Re: RDMA will be reverted · Roland Dreier <hidden> · 2006-07-05
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-06
Re: RDMA will be reverted · Tom Tucker <hidden> · 2006-07-06
Re: RDMA will be reverted · Herbert Xu <herbert@gondor.apana.org.au> · 2006-07-06
Re: RDMA will be reverted · Tom Tucker <hidden> · 2006-07-06
Re: RDMA will be reverted · Herbert Xu <herbert@gondor.apana.org.au> · 2006-07-07
Re: RDMA will be reverted · Tom Tucker <hidden> · 2006-07-07
Re: RDMA will be reverted · David Miller <davem@davemloft.net> · 2006-07-07
What is RDMA (was: RDMA will be reverted) · Herbert Xu <herbert@gondor.apana.org.au> · 2006-07-07
Re: What is RDMA (was: RDMA will be reverted) · Steve Wise <hidden> · 2006-07-07
Re: What is RDMA (was: RDMA will be reverted) · Herbert Xu <herbert@gondor.apana.org.au> · 2006-07-11
Re: What is RDMA (was: RDMA will be reverted) · Steve Wise <hidden> · 2006-07-11
Re: What is RDMA · David Miller <davem@davemloft.net> · 2006-07-24
Re: What is RDMA · Rick Jones <hidden> · 2006-07-24
Re: What is RDMA · David Miller <davem@davemloft.net> · 2006-07-24
Re: What is RDMA · Andi Kleen <hidden> · 2006-07-24
Re: RDMA will be reverted · Tom Tucker <hidden> · 2006-07-07

From: Andi Kleen <hidden>
Date: 2006-07-24 23:11:09

> For example, my idea to allow ESTABLISHED TCP socket demux to be done
> before netfilter is flawed.  Connection tracking and NAT can change
> the packet ID and loop it back to us to hit exactly an ESTABLISHED TCP
> socket, therefore we must always hit netfilter first.

Hmm, how does this happen?

I guess either when a connection is masqueraded and an application did a bind()
on a local port that is used by the masquerading engine.  That could be handled
by just disallowing it.

Or when you have a transparent proxy setup with the proxy on the local host.
Perhaps in that case netfilter could be taught to reinject packets
in a way that they hit another ESTABLISHED lookup.

Did I miss a case?

> All the original costs of route, netfilter, TCP socket lookup all
> reappear as we make VJ netchannels fit all the rules of real practical
> systems, eliminating their gains entirely.

At least most of the optimizations from the early demux scheme could
probably be had more simply by adding a fast path to iptables/conntrack/etc.
that checks whether all rules only match SYN etc. packets and, if so, doesn't
walk the full rule set (or, more generally, a fast TCP flag mask check similar
to what TCP does). With that, ESTABLISHED packets would reach TCP with only
relatively small overhead.
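
A rough sketch of such a flag-mask fast path, in plain C rather than real
netfilter code. The rule layout (struct flag_rule) and both function names
are made up for illustration; the only assumption is that each rule matches
when (flags & mask) == value. The set of flag bits that every rule requires
is computed once when the rules are loaded, and a packet missing any of those
bits cannot match any rule, so the walk can be skipped.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* TCP flag bits as laid out in the TCP header flags byte. */
#define TCP_FLAG_FIN 0x01
#define TCP_FLAG_SYN 0x02
#define TCP_FLAG_RST 0x04
#define TCP_FLAG_PSH 0x08
#define TCP_FLAG_ACK 0x10

/* Hypothetical rule: matches when (flags & mask) == value. */
struct flag_rule {
    uint8_t mask;
    uint8_t value;
};

/*
 * Done once at rule-load time: intersect the bits each rule requires
 * to be set.  Any packet that matches some rule has all of these bits.
 */
static uint8_t ruleset_common_required(const struct flag_rule *rules, size_t n)
{
    uint8_t common = 0xff;
    size_t i;

    for (i = 0; i < n; i++)
        common &= (uint8_t)(rules[i].mask & rules[i].value);
    return common;
}

/* Per-packet check: true if no rule can possibly match these flags. */
static bool can_skip_rule_walk(uint8_t common, uint8_t flags)
{
    return (flags & common) != common;
}
```

With a rule set where every rule requires SYN, a plain ACK or ACK|PSH segment
on an ESTABLISHED connection skips the walk after a single mask test, while
SYN segments still go through the full rules. If any rule requires no flag
bits at all, the common mask is zero and nothing is skipped.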

> I will also note in passing that papers on related ideas, such as the
> Exokernel stuff, are very careful to not address the issue of how
> practical 1) their demux engine is and 2) the negative side effects of
> userspace TCP implementations.  For an example of the latter, if you
> have some 1GB JAVA process you do not want to wake that monster up just
> to do some ACK processing or TCP window updates, yet if you don't you
> violate TCP's rules and risk spurious unnecessary retransmits.

I don't quite get why the size of the process matters here - if only
some user space TCP library is called directly then it shouldn't
really matter how big or small the rest of the process is.

Or did you mean migration costs as described below?

But on the other hand full user space TCP seems to me of little gain
compared to a process context implementation.

I somehow like it better to hide these implementation details in 
the kernel.

> Furthermore, the VJ netchannel gains can be partially obtained from
> generic stateless facilities that we are going to get anyways.
> Networking chips supporting multiple MSI-X vectors, chosen by hashing
> the flow ID, can move TCP processing to "end nodes" which are cpu
> threads in this case, by having each such MSI-X vector target a
> different cpu thread.

The problem with the scheme is that to do process context processing
effectively you would need to teach the scheduler to aggressively
migrate on wake up (so that the process ends up on the CPU that 
was selected by the hash function in the NIC).
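
To make the hashing concrete, a toy sketch of how a flow ID could be mapped
to one of N vectors (and so to one CPU thread). The struct, the mixing
constant and the function names are invented for illustration and are not
any real NIC's hash; the only point is that the mapping depends on the
flow alone, so the process has to move to the CPU, not the other way round.

```c
#include <stdint.h>

/* Hypothetical flow ID: the usual TCP/IPv4 4-tuple. */
struct flow_id {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Toy mixing function, standing in for whatever the hardware does. */
static uint32_t flow_hash(const struct flow_id *f)
{
    uint32_t h = f->saddr;

    h = h * 0x9e3779b1u ^ f->daddr;
    h = h * 0x9e3779b1u ^ (((uint32_t)f->sport << 16) | f->dport);
    h ^= h >> 16;
    return h;
}

/* Every packet of a given flow lands on the same vector, hence same CPU. */
static unsigned int flow_to_vector(const struct flow_id *f, unsigned int nvec)
{
    return flow_hash(f) % nvec;
}
```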

But what do you do when you have lots of different connections
with different target CPU hash values or when this would require
you to move multiple compute intensive processes onto a single core?

Without user context TCP, but using softirqs instead, it looks a bit better 
because you can at least use different CPUs to do the ACK processing etc.
and the hash function spreading out connections over your CPUs doesn't harm.

But you still have relatively high cache line transfer costs in handing
over these packets from the softirq CPUs to the final process consumer. I liked
VJ's idea of using arrays-of-something instead of lists for that to avoid
some cache line transfers.  Ok at least it sounds nice in theory - haven't seen any
hard numbers on this scheme compared to a traditional doubly linked list.
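
For reference, roughly what I mean by an array instead of a list: a generic
single-producer/single-consumer ring using C11 atomics. This is not VJ's
actual netchannel layout; the names and the ring size are made up. The
producer (softirq side) only writes head and the consumer (process side)
only writes tail, so there are no next/prev pointers in the packets that
both CPUs would have to dirty.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8u    /* illustrative; must be a power of two */

struct pkt_ring {
    _Atomic unsigned int head;  /* written only by the producer */
    _Atomic unsigned int tail;  /* written only by the consumer */
    void *slot[RING_SIZE];      /* a real layout would keep head and
                                   tail on separate cache lines */
};

/* Producer side: returns false if the ring is full. */
static bool ring_put(struct pkt_ring *r, void *pkt)
{
    unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;
    r->slot[head & (RING_SIZE - 1)] = pkt;
    /* Publish the slot before the consumer can see the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL if the ring is empty. */
static void *ring_get(struct pkt_ring *r)
{
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);
    void *pkt;

    if (head == tail)
        return NULL;
    pkt = r->slot[tail & (RING_SIZE - 1)];
    /* Release the slot back to the producer. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return pkt;
}
```

Whether this actually beats a list in practice is exactly the part that
would need numbers.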

-Andi
