Re: [PATCH RFC net-next] net/smc: transition to RDMA core CQ pooling
From: "D. Wythe" <alibuda@linux.alibaba.com>
Date: 2026-02-27 09:29:40
Also in: linux-rdma, linux-s390, lkml
On Fri, Feb 27, 2026 at 10:11:38AM +0530, Mahanta Jambigi wrote:
> On 24/02/26 7:49 am, D. Wythe wrote:
>> On Fri, Feb 13, 2026 at 04:53:28PM +0530, Mahanta Jambigi wrote:
>>> On 09/02/26 1:23 pm, D. Wythe wrote:
>>>> On Fri, Feb 06, 2026 at 04:58:23PM +0530, Mahanta Jambigi wrote:
>>>>> On 02/02/26 3:18 pm, D. Wythe wrote:
>>>>>> The current SMC-R implementation relies on global per-device CQs and
>>>>>> manual polling within tasklets, which introduces severe scalability
>>>>>> bottlenecks due to global lock contention and tasklet scheduling
>>>>>> overhead, resulting in poor performance as concurrency increases.
>>>>>>
>>>>>> Refactor the completion handling to utilize the ib_cqe API and
>>>>>> standard RDMA core CQ pooling. This transition provides several key
>>>>>> advantages:
>>>>>>
>>>>>> 1. Multi-CQ: Shift from a single shared per-device CQ to multiple
>>>>>>    link-specific CQs via the CQ pool. This allows completion
>>>>>>    processing to be parallelized across multiple CPU cores,
>>>>>>    effectively eliminating the global CQ bottleneck.
>>>>>>
>>>>>> 2. Leverage DIM: Utilizing the standard CQ pool with IB_POLL_SOFTIRQ
>>>>>>    enables Dynamic Interrupt Moderation from the RDMA core,
>>>>>>    optimizing interrupt frequency and reducing CPU load under high
>>>>>>    pressure.
>>>>>>
>>>>>> 3. O(1) Context Retrieval: Replaces the expensive wr_id based lookup
>>>>>>    logic (e.g., smc_wr_tx_find_pending_index) with direct context
>>>>>>    retrieval using container_of() on the embedded ib_cqe.
>>>>>>
>>>>>> 4. Code Simplification: This refactoring results in a reduction of
>>>>>>    ~150 lines of code. It removes redundant sequence tracking,
>>>>>>    complex lookup helpers, and manual CQ management, significantly
>>>>>>    improving maintainability.
>>>>>>
>>>>>> Performance Test: redis-benchmark with max 32 connections per QP
>>>>>> Data format: Requests Per Second (RPS), Percentage in brackets
>>>>>> represents the gain/loss compared to TCP.
>>>>>>
>>>>>> | Clients | TCP      | SMC (original)      | SMC (cq_pool)       |
>>>>>> |---------|----------|---------------------|---------------------|
>>>>>> | c = 1   | 24449    | 31172  (+27%)       | 34039  (+39%)       |
>>>>>> | c = 2   | 46420    | 53216  (+14%)       | 64391  (+38%)       |
>>>>>> | c = 16  | 159673   | 83668  (-48%) <--   | 216947 (+36%)       |
>>>>>> | c = 32  | 164956   | 97631  (-41%) <--   | 249376 (+51%)       |
>>>>>> | c = 64  | 166322   | 118192 (-29%) <--   | 249488 (+50%)       |
>>>>>> | c = 128 | 167700   | 121497 (-27%) <--   | 249480 (+48%)       |
>>>>>> | c = 256 | 175021   | 146109 (-16%) <--   | 240384 (+37%)       |
>>>>>> | c = 512 | 168987   | 101479 (-40%) <--   | 226634 (+34%)       |
>>>>>>
>>>>>> The results demonstrate that this optimization effectively resolves
>>>>>> the scalability bottleneck, with RPS increasing by over 110% at c=64
>>>>>> compared to the original implementation.
>>>>>
>>>>> I applied your patch to the latest kernel (6.19-rc8) & saw below
>>>>> Performance results:
>>>>>
>>>>> 1) In my evaluation, I ran several *uperf* based workloads using a
>>>>> request/response (RR) pattern, and I observed performance
>>>>> *degradation* ranging from *4%* to *59%*, depending on the specific
>>>>> read/write sizes used. For example, with a TCP RR workload using 50
>>>>> parallel clients (nprocs=50) sending a 200-byte request and reading a
>>>>> 1000-byte response over a 60-second run, I measured approximately 59%
>>>>> degradation compared to SMC-R original performance.
>>>>
>>>> The only setting I changed was net.smc.smcr_max_conns_per_lgr = 32,
>>>> all other parameters were left at their default values.
>>>>
>>>> redis-benchmark is a classic Request/Response (RR) workload, which
>>>> contradicts your test results. Since I'm unable to reproduce your
>>>> results, it would be very helpful if you could share the specific
>>>> test configuration for my analysis.
>>>
>>> I used a simple client-server setup connected via 25 Gb/s
>>> RoCE_Express2 adapters on the same LAN (connection established via
>>> SMC-R v1). After running the commands shown below, I observed a
>>> performance degradation of up to 59%.
>>>
>>> Server: smc_run uperf -s
>>> Client: smc_run uperf -m rr1c-200x1000-50.xml
>>>
>>> cat rr1c-200x1000-50.xml
>>>
>>> <?xml version="1.0"?>
>>> <profile name="TCP_RR">
>>>   <group nprocs="50">
>>>     <transaction iterations="1">
>>>       <flowop type="connect" options="remotehost=server_ip protocol=tcp tcp_nodelay" />
>>>     </transaction>
>>>     <transaction duration="60">
>>>       <flowop type="write" options="size=200"/>
>>>       <flowop type="read" options="size=1000"/>
>>>     </transaction>
>>>     <transaction iterations="1">
>>>       <flowop type="disconnect" />
>>>     </transaction>
>>>   </group>
>>> </profile>
>>
>> Using the exact same XML profile you provided, I tested this on a 25Gb
>> NIC. I observed no degradation. Instead, performance improved
>> significantly:
>>
>> Original: ~1.08 Gb/s
>> Patched:  ~5.1 Gb/s
>>
>> I suspect the 59% drop might be due to connections falling back to
>> TCP. Could you check smcss -a during your test to see if the traffic
>> is actually running over SMC-R?
>
> I have checked this. The connection was successful using *SMCR* Mode
> itself. Also I have confirmed this via 'smcr -d stats' command which
> shows 0 count for TCP fallback.
With fallback ruled out, a 59% drop is quite massive, especially since
I'm seeing a significant improvement on my end. Since I am unable to
reproduce this locally, I would suggest analyzing the CPU consumption
and perf profiles in your environment. With a regression this severe,
the hotspots or differences should be fairly obvious to identify.
> I installed redis-server on the server machine & redis-benchmark on the
> client machine & I was able to establish the SMC-R using below
> commands. If you could help me with the exact commands you used to
> measure the redis-benchmark performance, I can try the same on my
> setup.
>
> Server: smc_run redis-server --port <port_num> --save "" --appendonly no --protected-mode no --bind 0.0.0.0
> Client: smc_run redis-benchmark -h <server_ip> -p <port_num> -n 10000 -c 50 -t ping_inline,ping_bulk -q

Here are the exact commands and scripts I used for the redis-benchmark:

Server: smc_run redis-server --protected-mode no --save
Client: smc_run redis-benchmark -h <server_ip> -n 5000000 -t set --threads 3 -c <conn_num>

D. Wythe