Thread (100 messages) 100 messages, 26 authors, 2h ago

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

From: Andy Lutomirski <luto@amacapital.net>
Date: 2026-06-04 15:53:29
Also in: linux-fsdevel, linux-mm, linux-patches, lkml, netdev

On Wed, Jun 3, 2026 at 11:32 PM Willy Tarreau [off-list ref] wrote:
On Mon, Jun 01, 2026 at 05:28:25PM -0700, Andrew Morton wrote:
quoted
On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt [off-list ref] wrote:
quoted
On Mon, 1 Jun 2026 18:33:25 +0100
Al Viro [off-list ref] wrote:
quoted
On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
quoted
TLDR: maybe we could ghet rid of "f_op->splice_read". *That* would be
a big simplification.
FUSE might be interesting - fuse_dev_splice_read() and its ilk.
Communications between the kernel and fuse server at least used to
seriously want that, so that would be one place to look for unhappy
userland...

splice-related logics in fs/fuse/dev.c is interesting; another place
like this is kernel/trace/, but I'm less familiar with that one.

rostedt Cc'd (miklos already had been)
Thanks for the Cc. The tracing ring buffer was specifically made to be used
by splice and the libtracefs has a lot of code to use it as well. As
reading the ring buffer literally swaps out the write portion with a blank
read portion, that portion (sub-buffer) is used to be directly fed into
splice, providing a zero-copy of the trace data from the write of the event
to going into a file.

trace-cmd defaults to using splice to copy the tracing ring buffer directly
into files to avoid as much copying during live recordings as possible.

Whatever changes we make, I would like to make sure there's no regressions
in performance of trace-cmd record.
Well yes, The patchset seems sensible from a quality POV.  But to make
a decision we should first have a decent understanding of its downside
impact.

I haven't seen a description of that impact in the discussion thus far.
And that description is owed, please.

I assume a small number of specialized applications are using
vmsplice() to great effect?  What are those applications?  What is the
impact of this change?
quoted
Once we are armed with that information, is there some middle ground in
which we de-feature vmsplice()?  Fall back to pread/pwrite in the
tricky cases and still permit vmsplicing if the application is
appropriately restrictive in it usage?
I'm using vmsplice() + tee() + splice() in high-performance applications,
load generators to be precise, and soon a cache. This is super convenient
and extremely efficient:

  - vmsplice() is used to prepare a "master" pipe with data to be sent
    over TCP or kTLS
  - then for each request, we do tee() from this master pipe to per-request
    pipes.
  - the per-request pipes are those that are used to deliver the data to
    the socket via splice().

So we effectively use vmsplice(), tee() and splice() here, and for exactly
the reasons they were designed: only play with page refcount and not copy
data. The code is here for the curious:

   https://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haterm.c

and its ancestor is here:

   https://github.com/wtarreau/httpterm/blob/master/httpterm.c

It simply doubles the network bandwidth compared to not using that.
(62 Gbps per core vs 31). I would seriously miss it if I couldn't use
this anymore.
Wait a moment.  This is neat, but it's literally just a benchmark,
right?  I skimmed the code, and it doesn't look like a production
workload, either.  And you manage to get around the awfulness of the
vmsplice API's complete failure to tell you when it's done with a
buffer by ... never actually changing the contents of the buffer.  Do
you have any idea how you would write correct code that uses vmsplice
for sends and then *ever* mutates the data without literally
munmapping (or madvise or something) the data do you can safely mutate
it?
I also have mid-term plans for using vmsplice() to deliver contents from
a cache to sockets as well via splice(). Right now our cache is split into
too small chunks (1kB) to make that useful, but as soon as we can move to
4kB pages, it will make sense. There the same gains are expected, and I
would particularly dislike the idea of no longer being able to implement
zero-copy!
If I'm understanding you correctly, you see (and measured!) a
performance improvement, and you would like to use it in production.

It seems to me that this is an excellent opportunity to remember that
vmsplice gets a performance boost in a highly synthetic situation that
sort of resembles a cache scenario and then to deprecate vmsplice and
build something better!  Or discover that we already have something
better, perhaps :)

https://man7.org/linux/man-pages/man3/io_uring_prep_send_zc.3.html

I see that this can submit a buffer without a syscall (tee + splice is
*two* syscalls!) and that it has directly addressed what I see as the
really big deficiency in vmsplice: "This second notification tells the
application that the memory associated with the send is safe to get
reused."  If I were writing the user code, I would very much want that
notification to be an explicit part of the API instead of making a
wild guess as I think I would need to do with vmsplice.

--Andy
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help