Re: [PATCH net] net: fix race between napi kthread mode and busy poll
From: Wei Wang <hidden>
Date: 2021-02-24 22:30:27
On Wed, Feb 24, 2021 at 1:30 PM Jakub Kicinski [off-list ref] wrote:
On Wed, 24 Feb 2021 21:37:36 +0100 Eric Dumazet wrote:quoted
On Wed, Feb 24, 2021 at 8:48 PM Jakub Kicinski [off-list ref] wrote:quoted
On Tue, 23 Feb 2021 15:41:30 -0800 Wei Wang wrote:quoted
Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to determine if the kthread owns this napi and could call napi->poll() on it. However, if socket busy poll is enabled, it is possible that the busy poll thread grabs this SCHED bit (after the previous napi->poll() invokes napi_complete_done() and clears SCHED bit) and tries to poll on the same napi. This patch tries to fix this race by adding a new bit NAPI_STATE_SCHED_BUSY_POLL in napi->state. This bit gets set in napi_busy_loop() togther with NAPI_STATE_SCHED, and gets cleared in napi_complete_done() together with NAPI_STATE_SCHED. This helps distinguish the ownership of the napi between kthread and the busy poll thread, and prevents the kthread from polling on the napi when this napi is still owned by the busy poll thread. Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support") Reported-by: Martin Zaharinov <redacted> Suggested-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Eric Dumazet <redacted>AFAIU sched bit controls the ownership of the poll_listI disagree. BUSY POLL never inserted the napi into a list, because the user thread was polling one napi. Same for the kthread.There is no delayed execution in busy_poll. It either got the sched bit and it knows it, or it didn't.quoted
wake_up_process() should be good enough.Well, if that's the direction maybe we should depend on the thread state more? IOW pay less attention to SCHED and have napi_complete_done() set_current_state() if thread is running? I didn't think that through fully but you can't say "wake_up_process() should be good enough" and at the same time add another bit proving it's not enough.quoted
quoted
Can we pleaseadd a poll_list for the thread and make sure the thread polls based on the list?A list ? That would require a spinlock or something ?Does the softnet list require a spinlock? Obviously with current code the list would only ever have one napi instance per thread but I think it's worth the code simplicity. napi_complete_done() dels from the list / releases that ownership already.
I think what Jakub proposed here should work. But I have a similar concern as Eric. I think the kthread belongs to the NAPI instance, and the kthread only polls on that specific NAPI if threaded mode is enabled. Adding the NAPI to a list that the kthread polls seems to be a reverse of logic. And it is unlike the sd->poll_list, where multiple NAPI instances could be added to that list and get polled. But functionality-wise, it does seem it will work.
quoted hunk ↗ jump to hunk
quoted
quoted
IMO that's far clearer than defining a forest of ownership state bits.Adding a bit seems simpler than adding a list.In terms of what? LoC? Just to find out what the LoC is I sketched this out:diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index ddf4cfc12615..77f09ced9ee4 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h@@ -348,6 +348,7 @@ struct napi_struct { struct hlist_node napi_hash_node; unsigned int napi_id; struct task_struct *thread; + struct list_head thread_poll_list; }; enum {diff --git a/net/core/dev.c b/net/core/dev.c index 6c5967e80132..99ff083232e9 100644 --- a/net/core/dev.c +++ b/net/core/dev.c@@ -4294,6 +4294,8 @@ static inline void ____napi_schedule(struct softnet_data *sd, */ thread = READ_ONCE(napi->thread); if (thread) { + list_add_tail(&napi->poll_list, + &napi->thread_poll_list); wake_up_process(thread); return; }@@ -6777,6 +6779,7 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi, return; INIT_LIST_HEAD(&napi->poll_list); + INIT_LIST_HEAD(&napi->thread_poll_list); INIT_HLIST_NODE(&napi->napi_hash_node); hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED); napi->timer.function = napi_watchdog;@@ -6971,8 +6974,7 @@ static int napi_thread_wait(struct napi_struct *napi) set_current_state(TASK_INTERRUPTIBLE); while (!kthread_should_stop() && !napi_disable_pending(napi)) { - if (test_bit(NAPI_STATE_SCHED, &napi->state)) { - WARN_ON(!list_empty(&napi->poll_list)); + if (!list_emtpy(&napi->thread_poll_list)) { __set_current_state(TASK_RUNNING); return 0; }$ git diff --stat include/linux/netdevice.h | 1 + net/core/dev.c | 6 ++++-- 2 files changed, 5 insertions(+), 2 deletions(-)quoted
quoted
I think with just the right (wrong?) timing this patch will still not protect against disabling the NAPI.Maybe, but this patch is solving one issue that was easy to trigger. disabling the NAPI is handled already.The thread checks if NAPI is getting disabled, then time passes, then it checks if it's scheduled. If napi gets disabled in the "time passes" period thread will think that it got scheduled again.
Not sure if I understand it correctly, when you say "then it checks if it's scheduled", do you mean the schedule() call in napi_thread_wait() that re-enters this function? If so, it still checks to make sure !napi_disable_pending(napi) before it goes to poll on the napi instance. I think that is sufficient to make sure we don't poll on a NAPI that is in DISABLE state?
Sure, we can go and make special annotations in all other parts of NAPI infra, but isn't that an obvious sign of a bad design? I wanted to add that I have spent quite a bit of time hacking around the threaded NAPI thing before I had to maintain, and (admittedly my brain is not very capable but) I had a hard time getting things working reliably with netpoll, busy polling, disabling etc. IOW I'm not just claiming that "more bits" is not a good solution on a whim.