Thread (6 messages) 6 messages, 3 authors, 2021-02-24

Re: [PATCH net] net: fix race between napi kthread mode and busy poll

From: Wei Wang <hidden>
Date: 2021-02-24 22:30:27

On Wed, Feb 24, 2021 at 1:30 PM Jakub Kicinski [off-list ref] wrote:
On Wed, 24 Feb 2021 21:37:36 +0100 Eric Dumazet wrote:
quoted
On Wed, Feb 24, 2021 at 8:48 PM Jakub Kicinski [off-list ref] wrote:
quoted
On Tue, 23 Feb 2021 15:41:30 -0800 Wei Wang wrote:
quoted
Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
determine if the kthread owns this napi and could call napi->poll() on
it. However, if socket busy poll is enabled, it is possible that the
busy poll thread grabs this SCHED bit (after the previous napi->poll()
invokes napi_complete_done() and clears SCHED bit) and tries to poll
on the same napi.
This patch tries to fix this race by adding a new bit
NAPI_STATE_SCHED_BUSY_POLL in napi->state. This bit gets set in
napi_busy_loop() togther with NAPI_STATE_SCHED, and gets cleared in
napi_complete_done() together with NAPI_STATE_SCHED. This helps
distinguish the ownership of the napi between kthread and the busy poll
thread, and prevents the kthread from polling on the napi when this napi
is still owned by the busy poll thread.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Reported-by: Martin Zaharinov <redacted>
Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Eric Dumazet <redacted>
AFAIU sched bit controls the ownership of the poll_list
I disagree. BUSY POLL never inserted the napi into a list,
because the user thread was polling one napi.

Same for the kthread.
There is no delayed execution in busy_poll. It either got the sched bit
and it knows it, or it didn't.
quoted
wake_up_process() should be good enough.
Well, if that's the direction maybe we should depend on the thread
state more?  IOW pay less attention to SCHED and have
napi_complete_done() set_current_state() if thread is running?

I didn't think that through fully but you can't say "wake_up_process()
should be good enough" and at the same time add another bit proving
it's not enough.
quoted
quoted
Can we pleaseadd a poll_list for the thread and make sure the
thread polls based on the list?
A list ? That would require a spinlock or something ?
Does the softnet list require a spinlock?

Obviously with current code the list would only ever have one napi
instance per thread but I think it's worth the code simplicity.
napi_complete_done() dels from the list / releases that ownership
already.
I think what Jakub proposed here should work. But I have a similar
concern as Eric. I think the kthread belongs to the NAPI instance, and
the kthread only polls on that specific NAPI if threaded mode is
enabled. Adding the NAPI to a list that the kthread polls seems to be
a reverse of logic. And it is unlike the sd->poll_list, where multiple
NAPI instances could be added to that list and get polled. But
functionality-wise, it does seem it will work.
quoted hunk ↗ jump to hunk
quoted
quoted
IMO that's far clearer than defining a forest of ownership state
bits.
Adding a bit seems simpler than adding a list.
In terms of what? LoC?

Just to find out what the LoC is I sketched this out:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..77f09ced9ee4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -348,6 +348,7 @@ struct napi_struct {
        struct hlist_node       napi_hash_node;
        unsigned int            napi_id;
        struct task_struct      *thread;
+       struct list_head        thread_poll_list;
 };

 enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c5967e80132..99ff083232e9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4294,6 +4294,8 @@ static inline void ____napi_schedule(struct softnet_data *sd,
                 */
                thread = READ_ONCE(napi->thread);
                if (thread) {
+                       list_add_tail(&napi->poll_list,
+                                     &napi->thread_poll_list);
                        wake_up_process(thread);
                        return;
                }
@@ -6777,6 +6779,7 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
                return;

        INIT_LIST_HEAD(&napi->poll_list);
+       INIT_LIST_HEAD(&napi->thread_poll_list);
        INIT_HLIST_NODE(&napi->napi_hash_node);
        hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
        napi->timer.function = napi_watchdog;
@@ -6971,8 +6974,7 @@ static int napi_thread_wait(struct napi_struct *napi)
        set_current_state(TASK_INTERRUPTIBLE);

        while (!kthread_should_stop() && !napi_disable_pending(napi)) {
-               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
-                       WARN_ON(!list_empty(&napi->poll_list));
+               if (!list_emtpy(&napi->thread_poll_list)) {
                        __set_current_state(TASK_RUNNING);
                        return 0;
                }
$ git diff --stat
 include/linux/netdevice.h | 1 +
 net/core/dev.c            | 6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)
quoted
quoted
I think with just the right (wrong?) timing this patch will still
not protect against disabling the NAPI.
Maybe, but this patch is solving one issue that was easy to trigger.

disabling the NAPI is handled already.
The thread checks if NAPI is getting disabled, then time passes, then
it checks if it's scheduled. If napi gets disabled in the "time passes"
period thread will think that it got scheduled again.
Not sure if I understand it correctly, when you say "then it checks if
it's scheduled", do you mean the schedule() call in napi_thread_wait()
that re-enters this function? If so, it still checks to make sure
!napi_disable_pending(napi) before it goes to poll on the napi
instance. I think that is sufficient to make sure we don't poll on a
NAPI that is in DISABLE state?

Sure, we can go and make special annotations in all other parts of NAPI
infra, but isn't that an obvious sign of a bad design?


I wanted to add that I have spent quite a bit of time hacking around
the threaded NAPI thing before I had to maintain, and (admittedly my
brain is not very capable but) I had a hard time getting things working
reliably with netpoll, busy polling, disabling etc. IOW I'm not just
claiming that "more bits" is not a good solution on a whim.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help