Thread: 11 messages, 2 authors, 2021-07-09

Re: [PATCH for-rc] IB/cma: Fix false P_Key mismatch messages

From: Haakon Bugge <hidden>
Date: 2021-07-09 16:45:30

On 8 Jul 2021, at 20:52, Jason Gunthorpe [off-list ref] wrote:

On Thu, Jul 08, 2021 at 03:59:25PM +0000, Haakon Bugge wrote:
On 5 Jul 2021, at 18:59, Haakon Bugge [off-list ref] wrote:


On 5 Jul 2021, at 18:26, Jason Gunthorpe [off-list ref] wrote:

On Tue, Jun 29, 2021 at 01:45:35PM +0000, Haakon Bugge wrote:
IMHO it is a bug on the sender side to send GMPs with a pkey that
doesn't exactly match the data path pkey.
The active connector calls ib_addr_get_pkey(). This function
extracts the pkey from byte 8/9 in the device's bcast
address. However, RFC 4391 explicitly states:
pkeys in CM come only from path records that the SM returns, the above
should only be used to feed into a path record query which could then
return back a limited pkey.

Everything thereafter should use the SM's version of the pkey.
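
For reference, the byte 8/9 extraction mentioned above can be sketched in plain userspace C. This is only an illustration, assuming the usual 20-byte IPoIB hardware address layout (4 bytes of flags/QPN followed by the 16-byte GID); bcast_addr_get_pkey() and IPOIB_HW_ADDR_LEN are hypothetical names for this sketch, not the kernel's ib_addr_get_pkey() itself.

```c
#include <stdint.h>

/* Assumed layout: 4 bytes flags/QPN followed by a 16-byte GID. */
#define IPOIB_HW_ADDR_LEN 20

/*
 * Hypothetical helper: read the 16-bit P_Key stored big-endian in
 * bytes 8 and 9 of an IPoIB broadcast hardware address.
 */
static uint16_t bcast_addr_get_pkey(const uint8_t bcast[IPOIB_HW_ADDR_LEN])
{
	return (uint16_t)(((uint16_t)bcast[8] << 8) | bcast[9]);
}
```

As discussed above, a value obtained this way is only suitable as input to a path record query; the P_Key used afterwards should be the one the SM returns.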
Revisiting this. I think I mis-interpreted the scenario that led to
the P_Key mismatch messages.

The CM retrieves the pkey_index that matched the P_Key in the BTH
(cm_get_bth_pkey()) and thereafter calls ib_get_cached_pkey() to get
the P_Key value of the particular pkey_index.

Assume a full-member sends a REQ. In that case, both P_Keys (BTH and
primary path_rec) are full. Further, assume the recipient is only a
limited member. Since full and limited members of the same partition
are eligible to communicate, the P_Key retrieved by
cm_get_bth_pkey() will be the limited one.
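
To make the scenario concrete, here is a minimal userspace sketch of the matching rule as I understand it: bit 15 of a P_Key is the membership bit (1 = full, 0 = limited), the low 15 bits identify the partition, and two P_Keys match when the partition bits are equal and at least one side is a full member. The helper names and the table lookup are hypothetical and are not the CM code; invalid P_Keys (base value 0) are not handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_FULL_MEMBER 0x8000u /* bit 15: 1 = full, 0 = limited */
#define PKEY_BASE_MASK   0x7fffu /* low 15 bits: partition number */

static bool pkey_is_full(uint16_t pkey)
{
	return (pkey & PKEY_FULL_MEMBER) != 0;
}

/* Same partition and at least one full member on either side. */
static bool pkey_may_communicate(uint16_t a, uint16_t b)
{
	return (a & PKEY_BASE_MASK) == (b & PKEY_BASE_MASK) &&
	       (pkey_is_full(a) || pkey_is_full(b));
}

/*
 * Hypothetical lookup: return the index of the first local table entry
 * that matches the incoming BTH P_Key, or -1 if none matches.
 */
static int pkey_table_match(const uint16_t *table, size_t len,
			    uint16_t bth_pkey)
{
	for (size_t i = 0; i < len; i++)
		if (pkey_may_communicate(table[i], bth_pkey))
			return (int)i;
	return -1;
}
```

With a local table holding only the limited entry 0x1234, an incoming BTH P_Key of 0x9234 matches that entry, yet the value stored at the matched index (0x1234) differs from the full P_Key 0x9234 carried in the primary path. That is the mismatch described here.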
It is incorrect for the issuer of the REQ to put a full pkey in the
REQ message when the target is a limited member.
Sorry, I mis-interpreted the spec. I thought the PKey in the Path record should be that of the initiator, not the target's. OK. Will come up with a fix.
On the systems I have access to (running Oracle flavour OpenSM in
our NM2 switches), the behaviour is exactly the opposite of what you
say.
Check with saquery what is happening, if you request a reversible path
from the CM target (limited pkey) to the CM client (full) you should
get the limited pkey or the SM is broken.

If the SM is working then probably something in the stack is using a
reversed src/dest when doing the PR query.

It is not intuitive but the PR query should have SGID as the CM Target
even though it is running on the CM Client.
That is not how it is today. And because of that, all accesses to the PR assume the d{gid,lid} is the remote peer. To fix this, I have to swap dgid/sgid and ib.dlid/ib.slid all over to get this working. That is pervasive; it even includes ipoib. Let me know if that is what you want.


Thxs, Håkon
This is because the REQ is supposed to contain a path that is relative
to the target.

Everything will be the same except for this small detail about
full/limited pkeys.

The client can figure out what to do with its own pkey table locally.
"the P_Key table entry (0x1234) matching incoming BTH.P_Key differs from primary path P_Key (0x9234)"
"The REQ contains a PKey (0x1234) that is not found in this device's
PKey table. Using alternative limited Pkey (0x9234) instead. This is a
client bug"

Jason
  