Re: [RFC] net: store port/representative id in metadata_dst

From: Jakub Kicinski <hidden>
Date: 2016-09-23 20:17:34

On Fri, 23 Sep 2016 10:22:59 -0700, Samudrala, Sridhar wrote:

On 9/23/2016 8:29 AM, Jakub Kicinski wrote:

On Fri, 23 Sep 2016 07:23:26 -0700, John Fastabend wrote:

Yep, I like the idea in general. I had a slightly different approach in
mind though. If you look at __dev_queue_xmit() there is a void
accel_priv pointer (gather you found this based on your commit note).
My take was we could extend this a bit so it can be used by the VFR
devices and they could do a dev_queue_xmit_accel(). In this way there is
no need to touch /net/core/{filter, dst, ip_tunnel}.c etc. Maybe the
accel logic needs to be extended to push the priv pointer all the way
through the xmit routine of the target netdev though. This should look
a lot like the macvlan accelerated xmit device path without the
switching logic.

Of course maybe the name would be extended to dev_queue_xmit_extended()
or something.

So the flow on ingress would be,

   1. pkt_received_by_PF_netdev
   2. PF_netdev reads some tag off packet/descriptor and sets correct
      skb->dev field. This is needed so stack "sees" packets from
      correct VF ports.
   3. packet passed up to stack.
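
The ingress steps above can be modelled in plain userspace C. This is only a
sketch under stated assumptions, not driver code: mock_netdev, mock_skb,
mock_pf, pf_rx_lookup() and pf_rx() are hypothetical stand-ins, and the tag
encoding (0 for PF traffic, 1..MAX_VFS for VFs) is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VFS 4

/* Hypothetical, simplified stand-ins for struct net_device / struct sk_buff. */
struct mock_netdev {
	const char *name;
	unsigned long rx_packets;
};

struct mock_skb {
	struct mock_netdev *dev;	/* device the stack "sees" the packet on */
	unsigned int len;
};

struct mock_pf {
	struct mock_netdev pf_dev;
	struct mock_netdev vfr[MAX_VFS];	/* one representor per VF */
};

/*
 * Step 2: map the tag read from the packet/descriptor to a netdev.
 * Assumed encoding: 0 is PF traffic, 1..MAX_VFS is VF (tag - 1).
 * Unknown tags fall back to the PF netdev.
 */
static struct mock_netdev *pf_rx_lookup(struct mock_pf *pf, unsigned int tag)
{
	if (tag >= 1 && tag <= MAX_VFS)
		return &pf->vfr[tag - 1];
	return &pf->pf_dev;
}

/* Steps 1-3: packet arrives on the PF, skb->dev is rewritten, then delivered. */
static struct mock_netdev *pf_rx(struct mock_pf *pf, struct mock_skb *skb,
				 unsigned int tag)
{
	skb->dev = pf_rx_lookup(pf, tag);
	assert(skb->dev != NULL);
	skb->dev->rx_packets++;	/* stands in for passing the skb up the stack */
	return skb->dev;
}
```

The point of the model is that no per-VFR receive queue exists: a single PF
receive path only rewrites the dev pointer before delivery.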

I guess it is a bit "zombie" like on the receive path because the packet
is never actually handled by VF netdev code per se and on egress can
traverse both the VFR and PF netdevs qdiscs. But on the other hand the
VFR netdevs and PF netdevs are all in the same driver. Plus using a
queue per VFR is a bit of a waste as it's not needed and also hardware
may not have any mechanism to push VF traffic onto a rx queue.

On egress,

   1. VFR xmit is called
   2. VFR xmit calls dev_queue_xmit_accel() with some meta-data if needed
      for the lower netdev
   3. lower netdev sends out the packet.

Again we don't need to waste any queues for each VFR and the VFR can be
a LLTX device. In this scheme I think you avoid much of the changes in
your patch and keep it all contained in the driver. Any thoughts?
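
A userspace model of the egress steps, assuming the metadata really could be
carried all the way to the lower device's xmit routine (which the existing
accel pointer does not do today). All names here (mock_xmit_meta, mock_pf,
mock_vfr, pf_xmit(), vfr_xmit()) are hypothetical, and port id 0 is assumed
to mean "not directed at any VF".

```c
#include <assert.h>
#include <stddef.h>

struct mock_skb {
	unsigned int len;
};

/* Hypothetical per-packet metadata handed to the lower (PF) device. */
struct mock_xmit_meta {
	unsigned int port_id;
};

struct mock_pf {
	unsigned int last_port_id;	/* port the last packet was directed to */
	unsigned long tx_packets;
};

struct mock_vfr {
	struct mock_pf *lower;	/* the PF this representor sits on top of */
	unsigned int port_id;
};

/* Step 3: the lower device sends the packet, directed by the metadata. */
static int pf_xmit(struct mock_pf *pf, struct mock_skb *skb,
		   const struct mock_xmit_meta *meta)
{
	(void)skb;
	pf->last_port_id = meta ? meta->port_id : 0;
	pf->tx_packets++;
	return 0;
}

/* Steps 1-2: the VFR owns no queue; it only attaches metadata and hands off. */
static int vfr_xmit(struct mock_vfr *vfr, struct mock_skb *skb)
{
	struct mock_xmit_meta meta = { .port_id = vfr->port_id };

	assert(vfr->lower != NULL);
	return pf_xmit(vfr->lower, skb, &meta);
}
```

In this model the representor keeps no transmit state of its own, which is
what would let it be an LLTX device sharing the PF queues.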

The 'accel' parameter in dev_queue_xmit_accel() is currently only passed
to ndo_select_queue() via netdev_pick_tx() and is used to select the tx
queue. Also, it is not passed all the way to the driver-specific xmit
routine. Doesn't it require changing all the driver xmit routines if we
want to pass this parameter?

Goes without saying that you have a much better understanding of packet
scheduling so please bear with me :)  My target model is that I have
n_cpus x "n_tc/prio" queues on the PF and I want to transmit the
fallback traffic over those same queues.  So no new HW queues are used
for VFRs at all.  This is a reverse of macvlan offload which AFAICT has
"bastard hw queues" which actually TX for a separate software device.

My understanding was that I can rework this model to have software
queues for VFRs (#sw queues == #PF queues + #VFRs) but no extra HW
queues (#hw queues == #PF queues) but then when the driver sees a
packet on sw-only VFR queue it has to pick one of the PF queues (which
one?), lock PF software queue to own it, and only then can it
transmit.  With the dst_metadata there is no need for extra locking or
queue selection.

Yes.  The VFPR netdevs don't have any HW queues associated with them and
we would like to use the PF queues for the xmit.
I was also looking into some way of passing the port id via an skb
parameter to the dev_queue_xmit() call so that the PF xmit routine can do
a directed transmit to a specific VF.
Is skb->cb an option to pass this info?
The dst_metadata approach would work too if it is acceptable.

I don't think we can trust skb->cb to be set to anything meaningful
when the skb is received by the lower device.
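
To illustrate the concern, here is a small userspace model rather than kernel
code. It assumes skb->cb is a scratch area that any layer between the VFR and
the lower device may reuse (in the kernel the qdisc layer keeps its
qdisc_skb_cb there), while a metadata pointer attached to the skb is left
alone. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define CB_SIZE 48	/* size of the control buffer in this model */

/* Hypothetical stand-in for the port id carried in a metadata_dst. */
struct mock_dst_meta {
	unsigned int port_id;
};

struct mock_skb {
	unsigned char cb[CB_SIZE];	/* scratch area, owned by the current layer */
	const struct mock_dst_meta *dst_meta;	/* travels with the skb */
	unsigned int len;
};

/* The VFR records the port id both ways so the two can be compared. */
static void vfr_tag_skb(struct mock_skb *skb, const struct mock_dst_meta *meta)
{
	assert(sizeof(meta->port_id) <= CB_SIZE);
	memcpy(skb->cb, &meta->port_id, sizeof(meta->port_id));
	skb->dst_meta = meta;
}

/* An intermediate layer reuses cb for its own bookkeeping (here: the length). */
static void intermediate_layer(struct mock_skb *skb)
{
	memcpy(skb->cb, &skb->len, sizeof(skb->len));
}

/* What the lower device would read back from each location. */
static unsigned int port_from_cb(const struct mock_skb *skb)
{
	unsigned int port;

	memcpy(&port, skb->cb, sizeof(port));
	return port;
}

static unsigned int port_from_meta(const struct mock_skb *skb)
{
	return skb->dst_meta ? skb->dst_meta->port_id : 0;
}
```

Once intermediate_layer() has run, port_from_cb() returns whatever that layer
stored, while port_from_meta() still returns the port id set by the VFR.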
