Thread (36 messages) 36 messages, 8 authors, 2024-11-18

Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound

From: Yunsheng Lin <hidden>
Date: 2024-10-30 11:31:03
Also in: linux-iommu, linux-mm, lkml

On 2024/10/29 21:58, Toke Høiland-Jørgensen wrote:
Yunsheng Lin [off-list ref] writes:
quoted
quoted
quoted
I would prefer the waiting too if simple waiting fixed the test cases that
Youglong and Haiqing were reporting and I did not look into the rabbit hole
of possible caching in networking.

As mentioned in commit log and [1]:
1. ipv4 packet defragmentation timeout: this seems to cause delay up to 30
   secs, which was reported by Haiqing.
2. skb_defer_free_flush(): this may cause infinite delay if there is no
   triggering for net_rx_action(), which was reported by Yonglong.

For case 1, is it really ok to stall the driver unbound up to 30 secs for the
default setting of defragmentation timeout?

For case 2, it is possible to add timeout for those kind of caching like the
defragmentation timeout too, but as mentioned in [2], it seems to be a normal
thing for this kind of caching in networking:
Both 1 and 2 seem to be cases where the netdev teardown code can just
make sure to kick the respective queues and make sure there's nothing
outstanding (for (1), walk the defrag cache and clear out anything
related to the netdev going away, for (2) make sure to kick
net_rx_action() as part of the teardown).
It would be good to be more specific about the 'kick' here, does it mean
taking the lock and doing one of below action for each cache instance?
1. flush all the cache of each cache instance.
2. scan for the page_pool owned page and do the finegrained flushing.
Depends on the context. The page pool is attached to a device, so it
should be possible to walk the skb frags queue and just remove any skbs
that refer to that netdevice, or something like that.
I am not sure if netdevice is still the same when passing through all sorts
of software netdevice, checking if it is the page_pool owned page seems safer?

The scaning/flushing seems complicated and hard to get it right if it is
depending on internal detail of other subsystem's cache implementation.
As for the lack of net_rx_action(), this is related to the deferred
freeing of skbs, so it seems like just calling skb_defer_free_flush() on
teardown could be an option.
That was my initial thinking about the above case too if we know which percpu
sd to be passed to skb_defer_free_flush() or which cpu to trigger its
net_rx_action().
But it seems hard to tell which cpu napi is running in before napi is disabled,
which means skb_defer_free_flush() might need to be called for every cpu with
softirq disabled, as skb_defer_free_flush() calls napi_consume_skb() with
budget being 1 or call kick_defer_list_purge() for each CPU.
quoted
quoted
quoted
"Eric pointed out/predicted there's no guarantee that applications will
read / close their sockets so a page pool page may be stuck in a socket
(but not leaked) forever."
As for this one, I would put that in the "well, let's see if this
becomes a problem in practice" bucket.
As the commit log in [2], it seems it is already happening.

Those cache are mostly per-cpu and per-socket, and there may be hundreds of
CPUs and thousands of sockets in one system, are you really sure we need
to take the lock of each cache instance, which may be thousands of them,
and do the flushing/scaning of memory used in networking, which may be as
large as '24 GiB' mentioned by Jesper?
Well, as above, the two issues you mentioned are per-netns (or possibly
per-CPU), so those seem to be manageable to do on device teardown if the
wait is really a problem.
As above, I am not sure if it is still the same netns if the skb is passing
through all sorts of software netdevice?
But, well, I'm not sure it is? You seem to be taking it as axiomatic
that the wait in itself is bad. Why? It's just a bit memory being held
on to while it is still in use, and so what?
Actually, I thought about adding some sort of timeout or kicking based on
jakub's waiting patch too.

But after looking at more caching in the networking, waiting and kicking/flushing
seems harder than recording the inflight pages, mainly because kicking/flushing
need very subsystem using page_pool owned page to provide a kicking/flushing
mechanism for it to work, not to mention how much time does it take to do all
the kicking/flushing.

It seems rdma subsystem uses a similar mechanism:
https://lwn.net/Articles/989087/
-Toke

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help