Re: RXE status in the upstream rping using rxe
From: Zhu Yanjun <zyjzyj2000@gmail.com>
Date: 2021-08-17 02:28:30
On Fri, Aug 6, 2021 at 10:37 AM Olga Kornievskaia [off-list ref] wrote:
On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun [off-list ref] wrote:quoted
On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky [off-list ref] wrote:quoted
On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:quoted
On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun [off-list ref] wrote:quoted
On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky [off-list ref] wrote:quoted
Hi, Can you please help me to understand the RXE status in the upstream? Does we still have crashes/interop issues/e.t.c?I made some developments with the RXE in the upstream, from my usage with latest RXE, I found the following: 1. rdma-core can not work well with latest RDMA git;The latest RDMA git is https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git"Latest" is a relative term, what SHA did you test? Let's focus on fixing RXE before we will continue with new features.Thanks a lot. I agree with you.I believe simple rping still doesn't work linux-to-linux. The last working version (of rping in rxe) was 5.13 I think. I have posted a number of crashes rping encounters (gotta get that working before I can even try NFSoRDMA).
The following are my tests. 1. Modprobe rdma_rxe 2. Modprobe -v -r rdma_rxe 3. Rdma link add rxe 4. Rdma link del rxe 5. Latest rdma-core && latest kernel upstream; 6. Latest kernel < ------rping---- > 5.10.y stable 7. Latest kernel < ------rping---- > 5.11.y stable 8. Latest kernel < ------rping---- > 5.12.y stable 9. Latest kernel < ------rping---- > 5.13.y stable It seems that the latest kernel upstream (5.14-rc6) can rping other stable kernels. Can you make tests again? Zhu Yanjun
Thank you for working on the code. We (NFS community) do test NFSoRDMA every git pull using rxe and siw but lately have been encountering problems.quoted
rdma-core: 313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull request #1038 from selvintxavier/master 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr 327d45e0 tests: Add missing MAC element to args list 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices 8754fb51 bnxt_re/lib: Use separate indices for shadow queue be4d8abf bnxt_re/lib: add a function to initialize software queue kernel rdma: 0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD) RDMA/qedr: Improve error logs for rdma_alloc_tid error return 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc 991c4274dc17 RDMA/hfi1: Fix typo in comments 8d7e415d5561 docs: Fix infiniband uverbs minor number bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure that requests are valid bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on hfi1_devdata->user_refcount with the above kernel and rdma-core, the following messages will appear. " [ 54.214608] rdma_rxe: loaded [ 54.217089] infiniband rxe0: set active [ 54.217101] infiniband rxe0: added enp0s8 [ 167.623200] rdma_rxe: cqe(32768) > max_cqe(32767) [ 167.645590] rdma_rxe: cqe(1) < current # elements in queue (6) [ 167.733297] rdma_rxe: cqe(32768) > max_cqe(32767) [ 169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247 [ 169.074796] rdma_rxe: qp#27 moved to error state [ 169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de [ 169.138889] rdma_rxe: qp#30 moved to error state [ 169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7 [ 169.160601] rdma_rxe: qp#31 moved to error state [ 169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782 [ 169.182170] rdma_rxe: qp#32 moved to error state [ 169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8 [ 169.667850] rdma_rxe: qp#39 moved to error state [ 198.872649] rdma_rxe: cqe(32768) > max_cqe(32767) [ 198.894829] rdma_rxe: cqe(1) < current # elements in queue (6) [ 198.981839] rdma_rxe: cqe(32768) > max_cqe(32767) [ 200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887 [ 200.332086] rdma_rxe: qp#58 moved to error state [ 200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d [ 200.396514] rdma_rxe: qp#61 moved to error state [ 200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40 [ 200.417956] rdma_rxe: qp#62 moved to error state [ 200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24 [ 200.439654] rdma_rxe: qp#63 moved to error state [ 200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8 [ 200.933153] rdma_rxe: qp#70 moved to error state [ 206.880305] rdma_rxe: cqe(32768) > max_cqe(32767) [ 206.904030] rdma_rxe: cqe(1) < current # elements in queue (6) [ 206.991494] rdma_rxe: cqe(32768) > max_cqe(32767) [ 208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d [ 208.360028] rdma_rxe: qp#89 moved to error state [ 208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136 [ 208.425675] rdma_rxe: qp#92 moved to error state [ 208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8 [ 208.447370] rdma_rxe: qp#93 moved to error state [ 208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a [ 208.469550] rdma_rxe: qp#94 moved to error state [ 208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670 [ 208.956731] rdma_rxe: qp#100 moved to error state [ 216.879703] rdma_rxe: cqe(32768) > max_cqe(32767) [ 216.902199] rdma_rxe: cqe(1) < current # elements in queue (6) [ 216.989264] rdma_rxe: cqe(32768) > max_cqe(32767) [ 218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6 [ 218.363808] rdma_rxe: qp#119 moved to error state [ 218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4 [ 218.429513] rdma_rxe: qp#122 moved to error state [ 218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895 [ 218.451481] rdma_rxe: qp#123 moved to error state [ 218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910 [ 218.473908] rdma_rxe: qp#124 moved to error state [ 218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b [ 218.963641] rdma_rxe: qp#130 moved to error state [ 233.855140] rdma_rxe: cqe(32768) > max_cqe(32767) [ 233.877202] rdma_rxe: cqe(1) < current # elements in queue (6) [ 233.963952] rdma_rxe: cqe(32768) > max_cqe(32767) [ 235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2 [ 235.305319] rdma_rxe: qp#149 moved to error state [ 235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8 [ 235.368838] rdma_rxe: qp#152 moved to error state [ 235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d [ 235.390192] rdma_rxe: qp#153 moved to error state [ 235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c [ 235.411374] rdma_rxe: qp#154 moved to error state [ 235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482 [ 235.895828] rdma_rxe: qp#161 moved to error state " Not sure if they are problems. IMO, we should make further investigations. Thanks Zhu Yanjunquoted
Thanks