Thread (33 messages) 33 messages, 5 authors, 2022-10-11

Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters

From: Wilczynski, Michal <hidden>
Date: 2022-09-23 15:47:11


On 9/23/2022 3:16 PM, Jakub Kicinski wrote:
On Fri, 23 Sep 2022 14:11:08 +0200 Wilczynski, Michal wrote:
quoted
On 9/22/2022 10:29 PM, Jakub Kicinski wrote:
quoted
On Thu, 22 Sep 2022 15:45:55 +0200 Wilczynski, Michal wrote:
quoted
On 9/22/2022 2:50 PM, Jakub Kicinski wrote:
I'm not sure whether this is allowed on mailing list, but I'm attaching
a text file  with an ASCII drawing representing a tree I've send
previously as linear. Hope you'll find this easier to read.
That helps, thanks! So what I was saying was anything under the vport
layer should be configured by the policy local to the owner of the
function.
My main concern is that there are no interfaces to do so. tc-htb in it's 
current state just doesn't
work for us, as I noted before their whole implementation is about 
creating new queues.
https://legacy.netdevconf.info/0x14/pub/papers/44/0x14-paper44-talk-paper.pdf

Also reconfiguration from the VM, would need to be handled by the VF 
driver i.e iavf.
So the solution would get much more complex I guess, since we would need 
to implement
communication between ice-iavf, through virtchnl I guess.

quoted
quoted
quoted
We tried already tc-htb, and it doesn't work for a couple of reasons,
even in this potential hybrid with devlink-rate. One of the problems
with tc-htb offload is that it forces you to allocate a new
queue, it doesn't allow for reassigning an existing queue to another
scheduling node. This is our main use case.

Here's a discussion about tc-htb:
https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/ (local)
This is a problem only for "SR-IOV case" or also for just the PF?
The way tc-htb is coded it's NOT possible to reassign queues from one
scheduling node to the other, this is a generic problem with this
implementation, regardless of SR-IOV or PF. So even if we
wanted to reassign queues only for PF's this wouldn't be possible.
I feel like an example would help. So let's say I do this:

tc qdisc replace dev ens785 root handle 1: htb offload
tc class add dev ens785 parent 1: classid 1:2 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:2 classid 1:3 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:2 classid 1:4 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:3 classid 1:5 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:4 classid 1:6 htb rate 1000 ceil 2000

                    1:    <-- root qdisc
                    |
                   1:2
                   / \
                  /   \
                1:3   1:4
                 |     |
                 |     |
                1:5   1:6
                 |     |
                QID   QID   <---- here we'll have PFIFO qdiscs


At this point I would have two additional queues in the system, and
the kernel would enqueue packets to those new queues according to 'tc
flower' configuration.
TBH I don't know what you mean by "reassign queues from one
scheduling node to the other", sorry I don't know this code well.
Neither the offload nor HTB itself.
So using as an example parts of the drawing I made:
Imagine you have a queue like this:
pci/0000:4b:00.0/queue/103: type queue parent node_200
It has txq_id 103, that is assigned by hardware - it's uniquely id'd by it.


Then I run commands like this:
devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent 
vport_2 tx_share 100Mbps tx_max 500Mbps priority 5
devlink port function rate add pci/0000:4b:00.0/node_custom_2 parent 
node_custom_1
devlink port function rate add pci/0000:4b:00.0/node_custom_3 parent 
node_custom_3

And here I reassign the queue:
devlink port function rate set pci/0000:4b:00.0/queue/103 parent 
node_custom_3

So now queue has completely different parent and the packets from that 
queue are scheduled
using completely different parameters.

My uneducated anticipation of how HTB offload would work is that
queue 0 of the NIC is a catch all for leftovers and all other queues
get assigned leaf nodes.
So enabling an htb offload leaves all the existing queues in place, and 
in case the
kernel can't classify a packet to one of the newly created queues, the 
driver is supposed
to select one of the 'older' queues to do the job. (driver receives 
TC_HTB_LEAF_QUERY_QUEUE
event)
So basically after running
tc qdisc replace dev ens785 root handle 1: htb offload
you don't get any new queues yet, but if you create new classes:
tc class add dev ens785 parent 1: classid 1:2 htb rate 1000 ceil 2000
you'll get an TC_HTB_LEAF_ALLOC_QUEUE event in the driver,
that means you're supposed to allocate new queue in the driver.

This just doesn't work for our case. Also our scheduling tree is rather 
rigid. I can't just remove the whole
subtree just because user decided to enable htb-offload. So as you can 
see in current ultimate
devlink-rate implementation I'm exporting the whole tree, and just allow 
user to modify it. There
are some constraints, like inability to remove nodes with queues, or any 
children really.


quoted
So theoretically we should create a new queue
in a hardware and put it in a privileged position in the scheduling
tree. And I would happily write it this
way, but this is NOT what our customer want. He doesn't want any
extra queues in the system, he just
wants to make existing queues more privileged. And not just PF queues
- he's mostly interested in VF queues.
I'm not sure how to state use case more clearly.
The VF means controlling queue scheduling of another function
via the PF, right? Let's leave that out of the picture for now
so we don't have to worry about "architectural" concerns.
Devlink works more on pci devices, but I guess you can call it PF still.
quoted
quoted
quoted
So either I would have to invent a new offload type (?) for tc, or
completely rewrite and
probably break tc-htb that mellanox implemented.
Also in our use case it's possible to create completely new
branches from the root and
reassigning queues there. This wouldn't be possible with the method
you're proposing.

So existing interface doesn't allow us to do what is required.
For some definition of "what is required" which was not really
disclosed clearly. Or I'm to slow to grasp.
In most basic variant what we want is a way to make hardware queues
more privileged, and modify hierarchy of nodes/queues freely. We
don't want to create new queues, as required by tc-htb
implementation. This is main reason why tc-htb and devlink-rate
hybrid doesn't work for us.
Hm, when you say "privileged" do you mean higher quota or priority?
Maybe the meaning wasn't clear, so queue privilegeness is determined by 
everything really:
it's position in the tree, it's ancestors assigned parameters (tx_share, 
tx_max, priority, weight).
And it's own assigned parameters.

BR,
Michał
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help