Re: [RFC PATCH v2 30/48] siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit
From: David Howells <dhowells@redhat.com>
Date: 2023-03-29 15:34:33
Also in:
linux-fsdevel, linux-mm, linux-rdma, lkml
Bernard Metzler [off-list ref] wrote:
quoted
When transmitting data, call down into TCP using a single sendmsg with MSG_SPLICE_PAGES to indicate that content should be spliced rather than performing several sendmsg and sendpage calls to transmit header, data pages and trailer. To make this work, the data is assembled in a bio_vec array and attached to a BVEC-type iterator. The header and trailer (if present) are copied into page fragments that can be freed with put_page().I like it a lot if it still keeps zero copy sendpage() semantics for the cases the driver can make use of data transfers w/o copy. Is 'msg.msg_flags |= MSG_SPLICE_PAGES' doing that magic?
Yes. MSG_SPLICE_PAGES indicates that you want the socket to retain your buffer and pass it directly to the device. Note that it's just a hint, however, pages that are unspliceable (eg. they belong to the slab) will get copied into a page fragment instead. Further, if the device cannot support a vector, then the hint can be ignored and all the data can be copied as normal.
'splicing' suggest just merging pages to me.
'splicing' as in what the splice system call does. Unfortunately, MSG_ZEROCOPY is already a (different) thing.
It would simplify the transmit code path substantially, also getting rid of kmap_local_page()/kunmap_local() sequences for multi-fragment sendmsg()'s.
If the ITER_ITERLIST iterator is accepted, then siw would be able to do mix KVEC and BVEC iterators, e.g. what I did for sunrpc here: https://lore.kernel.org/linux-fsdevel/20230329141354.516864-42-dhowells@redhat.com/T/#u (local) This means that in siw_tx_hdt() where I made it copy data into page fragments using page_frag_memdup() and attach that to a bvec: hdr_len = c_tx->ctrl_len - c_tx->ctrl_sent; h = page_frag_memdup(NULL, hdr, hdr_len, GFP_NOFS, ULONG_MAX); if (!h) goto done; bvec_set_virt(&bvec[0], h, hdr_len); seg = 1; it can just set up a kvec instead. Unfortunately, it's not so easy to get rid of all of the kmap'ing as we need to do some of it to do the hashing. David