Thread (12 messages), 4 authors, 2021-08-27

Re: v5.14 RXE driver broken?

From: Bob Pearson <hidden>
Date: 2021-08-26 20:03:18
Also in: linux-block

On 8/26/21 2:03 PM, Bob Pearson wrote:
On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche [off-list ref] wrote:
Hi Bob,

If I run the following test against Linus' master branch then that test
passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
headers to staging"")):

# export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
    runtime    ...  48.849s

The following test fails:

# export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
    runtime  48.849s  ...  15.024s
    +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
    @@ -1,2 +1 @@
     Configured SRP target driver
    -Passed
Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
fix this problem?

And the commit will be merged into linux upstream very soon.
Please let me know Bart, if the rxe driver is still broken I will
definitely punt all the changes for RXE to the next cycle until it can
be fixed.

Jason
Jason, Bart, Zhu

I have succeeded in getting blktests to pass on 5.14. There is a bug in rxe that I had to fix. In
loopback mode, when an RNR NAK is received, rxe asks the requester to start a retry sequence
before the RNR timer fires, so the command is retried immediately regardless of the value of the
timeout. I made a small change that makes the requester wait until either the timer fires or an
ack arrives. In some cases the srp/002 test case in blktests takes a long time to post a receive,
which caused a soft lockup. There is a second issue, not a bug: the number of MRs was too small to
run the test. I increased the limit by a factor of 256, which fixed that.
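
To illustrate the intended behavior, here is a minimal sketch in Python of a requester that holds
off retries after an RNR NAK until the timer fires or an ack arrives. It is only a toy model, not
the rxe code or the patch itself; the class, method, and field names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Requester:
    """Toy model of RNR handling on the requester side (hypothetical names)."""

    waiting_for_rnr: bool = False   # set while an RNR back-off is in effect
    rnr_deadline_ms: int = 0        # time at which the RNR timer fires

    def on_rnr_nak(self, timeout_ms: int, now_ms: int) -> None:
        # Arm the RNR timer and block retries instead of retrying right away.
        self.waiting_for_rnr = True
        self.rnr_deadline_ms = now_ms + timeout_ms

    def on_timer(self, now_ms: int) -> None:
        # The timer firing is one of the two events that ends the back-off.
        if self.waiting_for_rnr and now_ms >= self.rnr_deadline_ms:
            self.waiting_for_rnr = False

    def on_ack(self) -> None:
        # An ack arriving is the other event that ends the back-off.
        self.waiting_for_rnr = False

    def may_send(self) -> bool:
        # The requester may (re)send only when no RNR back-off is pending.
        return not self.waiting_for_rnr
```

In this model the bug corresponds to may_send() returning True immediately after on_rnr_nak(),
which lets the requester retry in a tight loop while the responder has no receive posted.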

My test setup has for-next + 5 recent rxe fix patches applied in addition to the RNR timing one above.

I will submit a patch for the RNR fix.

Bob
Well it's better but not quite done yet.