Re: [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
From: Ananyev, Konstantin <hidden>
Date: 2021-06-14 11:36:52
On 2021/6/9 18:25, Ananyev, Konstantin wrote:quoted
quoted
quoted
On 4/23/21 12:46 PM, Chengchang Tang wrote:quoted
To use the HW offloads capability (e.g. checksum and TSO) in the Tx direction, the upper-layer users need to call rte_eth_dev_prepare to do some adjustment to the packets before sending them (e.g. processing pseudo headers when Tx checksum offoad enabled). But, the tx_prepare callback of the bond driver is not implemented. Therefore, related offloads can not be used unless the upper layer users process the packet properly in their own application. But it is bad for the transplantability. However, it is difficult to design the tx_prepare callback for bonding driver. Because when a bonded device sends packets, the bonded device allocates the packets to different slave devices based on the real-time link status and bonding mode. That is, it is very difficult for the bonding device to determine which slave device's prepare function should be invoked. In addition, if the link status changes after the packets are prepared, the packets may fail to be sent because packets allocation may change. So, in this patch, the tx_prepare callback of bonding driver is not implemented. Instead, the rte_eth_dev_tx_prepare() will be called for all the fast path packet in mode 0, 1, 2, 4, 5, 6. In this way, all tx_offloads can be processed correctly for all NIC devices in these modes. If tx_prepare is not required in some cases, then slave PMDs tx_prepare pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP. In these cases, the impact on performance will be very limited. It is the responsibility of the slave PMDs to decide when the real tx_prepare needs to be used. The information from dev_config/queue_setup is sufficient for them to make these decisions. Note: The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is because in broadcast mode, a packet needs to be sent by all slave ports. Different PMDs process the packets differently in tx_prepare. As a result, the sent packet may be incorrect. Signed-off-by: Chengchang Tang <tangchengchang@huawei.com> --- drivers/net/bonding/rte_eth_bond.h | 1 - drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++---- 2 files changed, 24 insertions(+), 5 deletions(-)diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h index 874aa91..1e6cc6d 100644 --- a/drivers/net/bonding/rte_eth_bond.h +++ b/drivers/net/bonding/rte_eth_bond.h@@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id, int rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id); - #ifdef __cplusplus } #endifdiff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c index 2e9cea5..84af348 100644 --- a/drivers/net/bonding/rte_eth_bond_pmd.c +++ b/drivers/net/bonding/rte_eth_bond_pmd.c@@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs, /* Send packet burst on each slave device */ for (i = 0; i < num_of_slaves; i++) { if (slave_nb_pkts[i] > 0) { + int nb_prep_pkts; + + nb_prep_pkts = rte_eth_tx_prepare(slaves[i], + bd_tx_q->queue_id, slave_bufs[i], + slave_nb_pkts[i]); +Shouldn't it be called iff queue Tx offloads are not zero? It will allow to decrease performance degradation if no Tx offloads are enabled. Same in all cases below.Regarding this point, it has been discussed in the previous RFC: https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/ According to the TX_OFFLOAD status of the current device, PMDs can determine whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare to NULL, so that the actual tx_prepare processing will be skipped directly in rte_eth_tx_prepare().quoted
quoted
num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id, - slave_bufs[i], slave_nb_pkts[i]); + slave_bufs[i], nb_prep_pkts);In fact it is a problem here and really big problems. Tx prepare may fail and return less packets. Tx prepare of some packet may always fail. If application tries to send packets in a loop until success, it will be a forever loop here. Since application calls Tx burst, it is 100% legal behaviour of the function to return 0 if Tx ring is full. It is not an error indication. However, in the case of Tx prepare it is an error indication.Yes, that sounds like a problem and existing apps might be affected.quoted
quoted
Should we change Tx burst description and enforce callers to check for rte_errno? It sounds like a major change...Agree, rte_errno for tx_burst() is probably a simplest and sanest way, but yes, it is a change in behaviour and apps will need to be updated. Another option for bond PMD - just silently free mbufs for which prepare() fails (and probably update some stats counter). Again it is a change in behaviour, but now just for one PMD, with tx offloads enabled. Also as, I can see some tx_burst() function for that PMD already free packets silently: bond_ethdev_tx_burst_alb(), bond_ethdev_tx_burst_broadcast(). Actually another question - why the patch adds tx_prepare() only to some TX modes but not all? Is that itended?Yes. Currently, I have no ideal to perform tx_prepare() in broadcast mode with limited impact on performance. In broadcast mode, same packets will be send in several devices. In this process, we only update the ref_cnt of mbufs, but no copy of packets. As we know, tx_prepare() may change the data, so it may cause some problem if we perform tx_prepare() several times on the same packet.
You mean tx_prepare() for second dev can void changes made by tx_prepare() for first dev? I suppose in theory it is possible, even if it is probably not the case right now in practise (at least I am not aware about such cases). Actually that's an interesting topic - same can happen even with user implementing multicast on his own (see examples/ipv4_multicast/). I think these new limitations have to be documented clearly (at least). Also probably we need extra changes fo bond device dev_confgiure()/dev_get_info(): to check currently selected mode and based on that allow/reject tx offloads. The question arises (again) how to figure out for which tx offloads dev->tx_prepare() modifies the packet, for which not? Any thoughts here?
quoted
quoted
I agree that if the failure is caused by Tx ring full, it is a legal behaviour. But what about the failure caused by other reasons? At present, it is possible for some PMDs to fail during tx_burst due to other reasons. In this case, repeated tries to send will also fail. I'm not sure if all PMDs need to support the behavior of sending packets in a loop until it succeeds. If not, I think the current problem can be reminded to the user by adding a description to the bonding. If it is necessary, I think the description of tx_burst should also add related instructions, so that the developers of PMDs can better understand how tx_burst should be designed, such as putting all hardware-related constraint checks into tx_prepare. And another prerequisite for the above behavior is that the packets must be prepared (i.e. checked by rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have to use rte_eth_tx_prepare() in more scenarios. What's Ferruh's opinion on this?quoted
[snip] .