Thread: 13 messages, 4 authors, 2026-02-27

Re: [PATCH RFC net-next] net/smc: transition to RDMA core CQ pooling

From: "D. Wythe" <alibuda@linux.alibaba.com>
Date: 2026-02-09 07:53:42
Also in: linux-rdma, linux-s390, lkml

On Fri, Feb 06, 2026 at 04:58:23PM +0530, Mahanta Jambigi wrote:

On 02/02/26 3:18 pm, D. Wythe wrote:
The current SMC-R implementation relies on global per-device CQs
and manual polling within tasklets, which introduces severe
scalability bottlenecks due to global lock contention and tasklet
scheduling overhead, resulting in poor performance as concurrency
increases.

Refactor the completion handling to utilize the ib_cqe API and
standard RDMA core CQ pooling. This transition provides several key
advantages:

1. Multi-CQ: Shift from a single shared per-device CQ to multiple
link-specific CQs via the CQ pool. This allows completion processing
to be parallelized across multiple CPU cores, effectively eliminating
the global CQ bottleneck.

2. Leverage DIM: Utilizing the standard CQ pool with IB_POLL_SOFTIRQ
enables Dynamic Interrupt Moderation from the RDMA core, optimizing
interrupt frequency and reducing CPU load under high pressure.

3. O(1) Context Retrieval: Replaces the expensive wr_id based lookup
logic (e.g., smc_wr_tx_find_pending_index) with direct context retrieval
using container_of() on the embedded ib_cqe.

4. Code Simplification: This refactoring results in a reduction of
~150 lines of code. It removes redundant sequence tracking, complex lookup
helpers, and manual CQ management, significantly improving maintainability.
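
For readers unfamiliar with the ib_cqe pattern in point 3, here is a minimal userspace sketch of the idea. The ib_cqe and ib_wc definitions below are simplified stand-ins for the kernel ones, and demo_tx_pend, demo_tx_done and demo_dispatch are hypothetical names chosen for illustration; none of this is taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct ib_cq;	/* opaque in this sketch */
struct ib_wc;

/* Mirrors the shape of the kernel's struct ib_cqe: one callback. */
struct ib_cqe {
	void (*done)(struct ib_cq *cq, struct ib_wc *wc);
};

/* Reduced work completion: only the field this sketch needs. */
struct ib_wc {
	struct ib_cqe *wr_cqe;
};

/* Hypothetical per-send context with the ib_cqe embedded in it. */
struct demo_tx_pend {
	int slot;
	int completed;
	struct ib_cqe cqe;
};

/* Completion handler: recover the context in O(1), no wr_id search. */
void demo_tx_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct demo_tx_pend *pend =
		container_of(wc->wr_cqe, struct demo_tx_pend, cqe);

	(void)cq;
	pend->completed = 1;
}

/* Stand-in for the CQ processing loop: call the handler per completion. */
void demo_dispatch(struct ib_cq *cq, struct ib_wc *wc)
{
	assert(wc->wr_cqe && wc->wr_cqe->done);
	wc->wr_cqe->done(cq, wc);
}
```

A caller would set pend.cqe.done = demo_tx_done, point wc.wr_cqe at &pend.cqe, and pass the completion to demo_dispatch(); the handler then reaches the enclosing demo_tx_pend through pointer arithmetic alone, with no table scan.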

Performance Test: redis-benchmark with max 32 connections per QP
Data format: Requests Per Second (RPS), Percentage in brackets
represents the gain/loss compared to TCP.

| Clients | TCP      | SMC (original)      | SMC (cq_pool)       |
|---------|----------|---------------------|---------------------|
| c = 1   | 24449    | 31172  (+27%)       | 34039  (+39%)       |
| c = 2   | 46420    | 53216  (+14%)       | 64391  (+38%)       |
| c = 16  | 159673   | 83668  (-48%)  <--  | 216947 (+36%)       |
| c = 32  | 164956   | 97631  (-41%)  <--  | 249376 (+51%)       |
| c = 64  | 166322   | 118192 (-29%)  <--  | 249488 (+50%)       |
| c = 128 | 167700   | 121497 (-27%)  <--  | 249480 (+48%)       |
| c = 256 | 175021   | 146109 (-16%)  <--  | 240384 (+37%)       |
| c = 512 | 168987   | 101479 (-40%)  <--  | 226634 (+34%)       |

The results demonstrate that this optimization effectively resolves the
scalability bottleneck, with RPS increasing by over 110% at c=64
compared to the original implementation.
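
The percentages above can be rechecked from the raw RPS columns. The helper below is only a sketch of that arithmetic; gain_pct is a hypothetical name, not part of any benchmark tooling.

```c
#include <assert.h>

/*
 * Relative change of `measured` against `baseline`, in percent.
 * Positive means a gain, negative means a loss.
 */
double gain_pct(double baseline, double measured)
{
	assert(baseline > 0.0);
	return (measured - baseline) / baseline * 100.0;
}
```

For the c = 64 row, gain_pct(118192, 249488) comes out at roughly 111%, which is where the "over 110%" figure comes from. Applied to the TCP column as the baseline, the same function gives the bracketed values in the table to within a percentage point.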

I applied your patch to the latest kernel (6.19-rc8) and saw the
performance results below:

1) In my evaluation, I ran several *uperf* based workloads using a
request/response (RR) pattern, and I observed performance *degradation*
ranging from *4%* to *59%*, depending on the specific read/write sizes
used. For example, with a TCP RR workload using 50 parallel clients
(nprocs=50) sending a 200‑byte request and reading a 1000‑byte response
over a 60‑second run, I measured approximately 59% degradation compared
to SMC‑R original performance.

The only setting I changed was net.smc.smcr_max_conns_per_lgr = 32; all
other parameters were left at their default values. redis-benchmark is
itself a classic Request/Response (RR) workload, so my results contradict
yours. Since I'm unable to reproduce your results, it would be very
helpful if you could share your specific test configuration so that I
can analyze it.

Thanks,
D. Wythe