Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound
From: Yunsheng Lin <hidden>
Date: 2024-10-30 11:31:03
Also in:
linux-iommu, linux-mm, lkml
On 2024/10/29 21:58, Toke Høiland-Jørgensen wrote:
Yunsheng Lin [off-list ref] writes:quoted
quoted
quoted
I would prefer the waiting too if simple waiting fixed the test cases that Youglong and Haiqing were reporting and I did not look into the rabbit hole of possible caching in networking. As mentioned in commit log and [1]: 1. ipv4 packet defragmentation timeout: this seems to cause delay up to 30 secs, which was reported by Haiqing. 2. skb_defer_free_flush(): this may cause infinite delay if there is no triggering for net_rx_action(), which was reported by Yonglong. For case 1, is it really ok to stall the driver unbound up to 30 secs for the default setting of defragmentation timeout? For case 2, it is possible to add timeout for those kind of caching like the defragmentation timeout too, but as mentioned in [2], it seems to be a normal thing for this kind of caching in networking:Both 1 and 2 seem to be cases where the netdev teardown code can just make sure to kick the respective queues and make sure there's nothing outstanding (for (1), walk the defrag cache and clear out anything related to the netdev going away, for (2) make sure to kick net_rx_action() as part of the teardown).It would be good to be more specific about the 'kick' here, does it mean taking the lock and doing one of below action for each cache instance? 1. flush all the cache of each cache instance. 2. scan for the page_pool owned page and do the finegrained flushing.Depends on the context. The page pool is attached to a device, so it should be possible to walk the skb frags queue and just remove any skbs that refer to that netdevice, or something like that.
I am not sure if netdevice is still the same when passing through all sorts of software netdevice, checking if it is the page_pool owned page seems safer? The scaning/flushing seems complicated and hard to get it right if it is depending on internal detail of other subsystem's cache implementation.
As for the lack of net_rx_action(), this is related to the deferred freeing of skbs, so it seems like just calling skb_defer_free_flush() on teardown could be an option.
That was my initial thinking about the above case too if we know which percpu sd to be passed to skb_defer_free_flush() or which cpu to trigger its net_rx_action(). But it seems hard to tell which cpu napi is running in before napi is disabled, which means skb_defer_free_flush() might need to be called for every cpu with softirq disabled, as skb_defer_free_flush() calls napi_consume_skb() with budget being 1 or call kick_defer_list_purge() for each CPU.
quoted
quoted
quoted
"Eric pointed out/predicted there's no guarantee that applications will read / close their sockets so a page pool page may be stuck in a socket (but not leaked) forever."As for this one, I would put that in the "well, let's see if this becomes a problem in practice" bucket.As the commit log in [2], it seems it is already happening. Those cache are mostly per-cpu and per-socket, and there may be hundreds of CPUs and thousands of sockets in one system, are you really sure we need to take the lock of each cache instance, which may be thousands of them, and do the flushing/scaning of memory used in networking, which may be as large as '24 GiB' mentioned by Jesper?Well, as above, the two issues you mentioned are per-netns (or possibly per-CPU), so those seem to be manageable to do on device teardown if the wait is really a problem.
As above, I am not sure if it is still the same netns if the skb is passing through all sorts of software netdevice?
But, well, I'm not sure it is? You seem to be taking it as axiomatic that the wait in itself is bad. Why? It's just a bit memory being held on to while it is still in use, and so what?
Actually, I thought about adding some sort of timeout or kicking based on jakub's waiting patch too. But after looking at more caching in the networking, waiting and kicking/flushing seems harder than recording the inflight pages, mainly because kicking/flushing need very subsystem using page_pool owned page to provide a kicking/flushing mechanism for it to work, not to mention how much time does it take to do all the kicking/flushing. It seems rdma subsystem uses a similar mechanism: https://lwn.net/Articles/989087/
-Toke