Re: [PATCH v3 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.

From: Kuniyuki Iwashima <kuniyu@google.com>
Date: 2026-05-24 04:03:13
Also in: bpf

On Sat, May 23, 2026 at 2:20 AM [off-list ref] wrote:

quoted

diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
--- a/tools/testing/selftests/bpf/bpf_kfuncs.h
+++ b/tools/testing/selftests/bpf/bpf_kfuncs.h

@@ -64,6 +64,10 @@
 extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
                                 struct bpf_tcp_req_attrs *attrs, int attrs__sz) __ksym;

+struct bpf_sock_ops_kern;
+extern int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops_kern,
+                                      int rcvlowat) __ksym;
+
 void *bpf_cast_to_kern_ctx(void *) __ksym;

 extern void *bpf_rdonly_cast(const void *obj, __u32 btf_id) __ksym __weak;

[ ... ]

quoted

diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h

@@ -74,6 +74,8 @@ #define ETH_P_IPV6          0x86DD

 #define NEXTHDR_TCP          6

+#define TCPHDR_FIN           0x01
+
 #define TCPOPT_NOP           1
 #define TCPOPT_EOL           0
 #define TCPOPT_MSS           2

[ ... ]

quoted

diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c

@@ -0,0 +1,350 @@

[ ... ]

quoted

+             .event = {
+                     { .type = RPC_EVENT_AUTOLOWAT,  .val = 1},
+                     /* The first descriptor is partial. */
+                     {.type = RPC_EVENT_SEND,        .len = 1},
+                     {.type = RPC_EVENT_EPOLL,       .nfds = 0},
+                     {.type = RPC_EVENT_RCVLOWAT,    .rcvlowat = RPC_DESC_SIZE},
+                     /* The first descriptor is available. */
+                     {.type = RPC_EVENT_SEND,        .len = RPC_DESC_SIZE - 1},
+                     {.type = RPC_EVENT_EPOLL,       .nfds = 0},
+                     {.type = RPC_EVENT_RCVLOWAT,    .rcvlowat = RPC_DESC_SIZE + 150 + 100},
+                     /* The first header is ready. */
+                     {.type = RPC_EVENT_SEND,        .len = 100},
+                     {.type = RPC_EVENT_EPOLL,       .nfds = 0},
+                     {.type = RPC_EVENT_RCVLOWAT,    .rcvlowat = RPC_DESC_SIZE + 150 + 100},
+                     /* skb has the first payload and 1 byte of the next descriptor. */
+                     {.type = RPC_EVENT_SEND,        .len = 150 + 1},
+                     {.type = RPC_EVENT_EPOLL,       .nfds = 1},
+                     {.type = RPC_EVENT_RCVLOWAT,    .rcvlowat = RPC_DESC_SIZE + 150 + 100},
+                     /* After reading the first RPC message, SO_RCVLOWAT should be RPC_DESC_SIZE. */
+                     {.type = RPC_EVENT_RECV,        .len = RPC_DESC_SIZE + 150 + 100},
+                     {.type = RPC_EVENT_EPOLL,       .nfds = 0},
+                     {.type = RPC_EVENT_RCVLOWAT,    .rcvlowat = RPC_DESC_SIZE},

Does calling recv() actually trigger the BPF program to update SO_RCVLOWAT?

Since the kernel only invokes BPF_SOCK_OPS_RCVQ_CB when a new skb is added
to the receive queue, and not when data is dequeued via recv(),

Again, this is wrong.

the BPF
program doesn't seem to run here.

Could this cause the test to fail unconditionally because sk_rcvlowat
retains its previous value?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/ (local)
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

quoted

+static void tcp_autolowat_run_rpc_test(struct tcp_autolowat_test_cb *cb,
+                                    struct rpc_test_case *test_case)
+{
+     struct rpc_event *event = test_case->event;
+     char *ptr = test_case->data;
+     struct epoll_event ev;
+     socklen_t optlen;
+     int err, optval;
+     char buf[4096];
+
+     if (tcp_autolowat_build_data(test_case))
+             return;
+
+     while (1) {
+             switch (event->type) {
+             case RPC_EVENT_END:
+                     return;
+             case RPC_EVENT_AUTOLOWAT:
+                     err = setsockopt(cb->child, SOL_BPF, BPF_TCP_AUTOLOWAT,
+                                      &event->val, sizeof(event->val));
+                     if (!ASSERT_OK(err, "setsockopt"))
+                             return;
+                     break;
+             case RPC_EVENT_SEND:
+                     err = send(cb->client, ptr, event->len, 0);
+                     if (!ASSERT_EQ(err, event->len, "send"))
+                             return;
+
+                     ptr += event->len;
+                     break;

Could this lead to race conditions causing flaky tests?

No, it's over loopback.

The test calls send() to transmit data and immediately executes the next
event. Because network transmission over the loopback interface involves
asynchronous softirq processing, send() might return before the packet is
fully processed by the receiving socket.

If the next event is RPC_EVENT_EPOLL with a timeout of 0, could
epoll_wait() prematurely return 0? Similarly, for RPC_EVENT_RCVLOWAT,
could getsockopt() read a stale sk_rcvlowat value before the TCP stack and
BPF program have finished processing the packet?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/ (local)
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

quoted

diff --git a/tools/testing/selftests/bpf/progs/tcp_autolowat.c b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tcp_autolowat.c

@@ -0,0 +1,326 @@

[ ... ]

quoted

+static int tcp_parse_descriptor(struct bpf_sock_ops *skops,
+                             struct tcp_autolowat_cb *cb,
+                             u32 seq, u32 end_seq)
+{
+     struct rpc_descriptor *rpc_desc;
+     u32 rpc_copied_seq;
+     u64 copy_len; /* u32 should work, but not for no_alu32 :/ */
+     u64 rpc_len;
+     int err;
+
+     rpc_copied_seq = cb->rpc_desc_seq + cb->rpc_desc_buff_len;
+
+     if (before(cb->rpc_desc_seq + RPC_DESC_SIZE, end_seq))
+             copy_len = RPC_DESC_SIZE - cb->rpc_desc_buff_len;
+     else
+             copy_len = end_seq - rpc_copied_seq;
+
+     /* Since LLVM commit 324e27e8bad83ca23a3cd276d7e2e729b1b0b8c7,
+      * clang omits the "copy_len == 0" check below, which is necessary
+      * to satisfy the BPF verifier's range check for bpf_skb_load_bytes().
+      */

This isn't a bug, but multi-line comments in the BPF subsystem are
expected to have the opening '/*' on its own line:

        /*
         * Since LLVM commit ...
         * ...
         */

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/ (local)
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

quoted

+static void tcp_set_autolowat(struct bpf_sock_ops_kern *skops_kern,
+                           struct tcp_autolowat_cb *cb,
+                           struct tcp_sock *tp)
+{
+     /* To handle wraparound. */
+     u32 val = 0;
+
+     LOG("Setting rcvlowat: tp->copied_seq: %u, rpc_desc_seq: %u, rpc_end_seq: %u, rpc_desc_buff_len: %u",
+         TP_SEQ(copied_seq), CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
+
+     if (before(tp->copied_seq, cb->rpc_desc_seq))
+             val = cb->rpc_desc_seq - tp->copied_seq;
+     else if (cb->rpc_desc_buff_len != RPC_DESC_SIZE)
+             val = RPC_DESC_SIZE;
+     else
+             val = cb->rpc_end_seq - tp->copied_seq;

Does this code incorrectly set SO_RCVLOWAT to the total combined size of
all unread frames when multiple RPC frames are received in a single skb?

If multiple complete frames arrive in the same skb, the parsing loop in
tcp_do_autolowat() parses all of them and advances cb->rpc_desc_seq to the
end of the last parsed frame. This sets:

val = cb->rpc_desc_seq - tp->copied_seq

which evaluates to the length of all unread frames combined.

If an application sequentially reads one frame at a time, tp->copied_seq
advances and the queue size decreases, but sk_rcvlowat is not updated
because BPF_SOCK_OPS_RCVQ_CB only fires on packet enqueue.

This comment is based on the wrong assumption too :/

The remaining queue size will then be strictly less than sk_rcvlowat,
causing epoll_wait to block and trapping the remaining fully-received
frames in the receive queue.

Could this hang applications that rely on epoll to read one message at a
time?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/ (local)
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help