Re: Kernel oops with 6.14 when enabling TLS
From: Matthew Wilcox <willy@infradead.org>
Date: 2025-03-04 16:53:16
Also in:
linux-block, linux-mm, linux-nvme
On Tue, Mar 04, 2025 at 05:32:32PM +0100, Hannes Reinecke wrote:
On 3/4/25 17:14, Matthew Wilcox wrote:quoted
I thought we'd done all the work needed to get rid of these pointless refcount bumps. Turns out that's only on the block side (eg commit e4cc64657bec). So what does networking need in order to understand that some iovecs do not need to mess with the refcount?The network stack needs to get hold of the page while transmission is ongoing, as there is potentially rather deep queueing involved, requiring several calls to sendmsg() and friends before the page is finally transmitted. And maybe some post-processing (checksums, digests, you name it), too, all of which require the page to be there. It's all so jumbled up ... personally, I would _love_ to do away with __iov_iter_get_pages_alloc(). Allocating a page array? Seriously? And the problem with that is that it's always takes a page(!) reference, completely oblivious to the fact whether you even _can_ take a page reference (eg for tail pages); we've hit this problem several times now (check for sendpage_ok() ...).
Calling get_page() / put_page() on a tail page is fine -- that just redirects to the head page. But calling it on a slab never made any sense; at best it gets you the equivalent of TYPESAFE_BY_RCU -- that is, the object can be freed and reallocated, but the underlying slab will not be reallocated to some other purpose.
But that's not the real issue; real issue is that the page reference is taken down in the very bowels of __iov_iter_get_pages_alloc(), but needs to be undone by the _caller_. Who might (or might not) have an idea that he needs to drop the reference here. That's why there is no straightforward conversion; you need to audit each and every caller and try to find out where the page reference (if any) is dropped. Bah. Can't we (at the very least) leave it to the caller of __iov_iter_get_pages() to get a page reference (he has access to the page array, after all ...)? That would make the interface slightly better, and it'll be far more obvious to the caller what needs to be done.
Right, that's what happened in the block layer. We mark the bio with BIO_PAGE_PINNED if the pincount needs to be dropped. As a transitional period, we had BIO_PAGE_REFFED which indicated that the page refcount needed to be dropped. Perhaps there's something similar that network could be doing.