Re: [VXLAN] [MLX5] Lost traffic and issues
From: Ian Kumlien <hidden>
Date: 2020-03-04 09:48:06
On Tue, Mar 3, 2020 at 11:23 AM Ian Kumlien [off-list ref] wrote:
On Mon, Mar 2, 2020 at 11:45 PM Ian Kumlien [off-list ref] wrote:
On Mon, Mar 2, 2020 at 8:10 PM Saeed Mahameed [off-list ref] wrote:
[... 8< ...]
What type of mlx5 configuration you have (Native PV virtualization? SRIOV? legacy mode or switchdev mode?)

We have: tap -> bridge -> ovs -> bond (one legged) -switch-fabric-> <other-end>
So a pretty standard openstack setup

Oh, the L3 nodes are also MLX5s (50gbit) and they do report the lag map thing

[ 37.389366] mlx5_core 0000:04:00.0 ens1f0: S-tagged traffic will be dropped while C-tag vlan stripping is enabled
[77126.178520] mlx5_core 0000:04:00.0: modify lag map port 1:2 port 2:2
[77131.485189] mlx5_core 0000:04:00.0 ens1f0: Link down
[77337.033686] mlx5_core 0000:04:00.0 ens1f0: Link up
[77344.338901] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:2
[78098.028670] mlx5_core 0000:04:00.0: modify lag map port 1:2 port 2:2
[78103.479494] mlx5_core 0000:04:00.0 ens1f0: Link down
[78310.028518] mlx5_core 0000:04:00.0 ens1f0: Link up
[78317.797155] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:2
[78504.893590] mlx5_core 0000:04:00.0: modify lag map port 1:2 port 2:2
[78511.277529] mlx5_core 0000:04:00.0 ens1f0: Link down
[78714.526539] mlx5_core 0000:04:00.0 ens1f0: Link up
[78720.422078] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:2
[78720.838063] mlx5_core 0000:04:00.0: modify lag map port 1:2 port 2:2
[78727.226433] mlx5_core 0000:04:00.0 ens1f0: Link down
[78929.575826] mlx5_core 0000:04:00.0 ens1f0: Link up
[78935.422600] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:2
[79330.519516] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:1
[79330.831447] mlx5_core 0000:04:00.0: modify lag map port 1:2 port 2:2
[79336.073520] mlx5_core 0000:04:00.1 ens1f1: Link down
[79336.279519] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:1
[79541.272469] mlx5_core 0000:04:00.1 ens1f1: Link up
[79546.664008] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:2
[82107.461831] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:1
[82113.859238] mlx5_core 0000:04:00.1 ens1f1: Link down
[82320.458475] mlx5_core 0000:04:00.1 ens1f1: Link up
[82327.774289] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:2
[82490.950671] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:1
[82497.307348] mlx5_core 0000:04:00.1 ens1f1: Link down
[82705.956583] mlx5_core 0000:04:00.1 ens1f1: Link up
[82714.055134] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:2
[83100.804620] mlx5_core 0000:04:00.0 ens1f0: Link down
[83100.860943] mlx5_core 0000:04:00.0: modify lag map port 1:2 port 2:2
[83319.953296] mlx5_core 0000:04:00.0 ens1f0: Link up
[83327.984559] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:2
[83924.600444] mlx5_core 0000:04:00.0 ens1f0: Link down
[83924.656321] mlx5_core 0000:04:00.0: modify lag map port 1:2 port 2:2
[84312.648630] mlx5_core 0000:04:00.0 ens1f0: Link up
[84319.571326] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:2
[84946.495374] mlx5_core 0000:04:00.1 ens1f1: Link down
[84946.588637] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:1
[84946.692596] mlx5_core 0000:04:00.0: modify lag map port 1:2 port 2:2
[84949.188628] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:1
[85363.543475] mlx5_core 0000:04:00.1 ens1f1: Link up
[85371.093484] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:2
[624051.460733] mlx5_core 0000:04:00.0: modify lag map port 1:2 port 2:2
[624053.644769] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:1
[624053.674747] mlx5_core 0000:04:00.0: modify lag map port 1:1 port 2:2

Sorry, it's been a long couple of weeks ;)
I made them one-legged, but it doesn't seem to help.

Someone also posted this:
https://marc.info/?l=linux-netdev&m=158330796503347&w=2

While I don't use IPVS, I do use VXLAN, and if checksums are incorrectly tagged the NIC might drop the packets?
The only change that I could think of is the lag multi-path support we added. Roi, can you please take a look at this?

I'm also trying to get a setup working where I could try reverting changes, but so far we've only had this problem with mlx5_core... Also, the intermittent but reliable patterns are really weird... All traffic seems fine, except VXLAN traffic :/

(The problem is that the actual machines that have the issue are in production with 8x V100 nvidia cards... Kinda hard to justify having them "offline" ;))
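
For anyone wanting to quantify the link flaps in the dmesg excerpt above, the "Link down" / "Link up" pairs can be matched per interface to see how long each port stayed down. This is only a minimal sketch, assuming log lines in exactly the format shown in this thread; the function name link_down_durations and the regex are illustrative and not part of any mlx5 tooling.

```python
import re

# Matches lines such as:
#   [77131.485189] mlx5_core 0000:04:00.0 ens1f0: Link down
# The "modify lag map" lines have no interface name and are skipped.
LINK_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+mlx5_core\s+\S+\s+(\S+):\s+Link (up|down)"
)


def link_down_durations(log_text):
    """Return (interface, down_timestamp, seconds_down) for each flap."""
    down_since = {}  # interface -> timestamp of the pending "Link down"
    flaps = []
    for match in LINK_RE.finditer(log_text):
        ts = float(match.group(1))
        iface = match.group(2)
        state = match.group(3)
        if state == "down":
            down_since[iface] = ts
        elif iface in down_since:
            start = down_since.pop(iface)
            flaps.append((iface, start, round(ts - start, 3)))
    return flaps


# Four lines copied from the log in this thread.
sample = """\
[77131.485189] mlx5_core 0000:04:00.0 ens1f0: Link down
[77337.033686] mlx5_core 0000:04:00.0 ens1f0: Link up
[79336.073520] mlx5_core 0000:04:00.1 ens1f1: Link down
[79541.272469] mlx5_core 0000:04:00.1 ens1f1: Link up
"""

for iface, start, seconds in link_down_durations(sample):
    print(f"{iface}: down at {start:.3f} for {seconds:.1f}s")
```

Because the regex is applied with finditer over the whole text, it also works on a log that has been flattened onto one line. Feeding it the full excerpt would show whether the outages on ens1f0 and ens1f1 have a consistent length, which may help correlate them with the lag map changes.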