Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andy Lutomirski <luto@amacapital.net>
Date: 2026-06-04 17:26:08
Also in:
linux-fsdevel, linux-mm, linux-patches, lkml, netdev
On Thu, Jun 4, 2026 at 9:09 AM Willy Tarreau [off-list ref] wrote:
On Thu, Jun 04, 2026 at 08:53:15AM -0700, Andy Lutomirski wrote:
On Wed, Jun 3, 2026 at 11:32 PM Willy Tarreau [off-list ref] wrote:
On Mon, Jun 01, 2026 at 05:28:25PM -0700, Andrew Morton wrote:
On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt [off-list ref] wrote:
On Mon, 1 Jun 2026 18:33:25 +0100 Al Viro [off-list ref] wrote:
On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
TLDR: maybe we could get rid of "f_op->splice_read". *That* would be a big simplification.

FUSE might be interesting - fuse_dev_splice_read() and its ilk. Communications between the kernel and fuse server at least used to seriously want that, so that would be one place to look for unhappy userland... splice-related logic in fs/fuse/dev.c is interesting; another place like this is kernel/trace/, but I'm less familiar with that one. rostedt Cc'd (miklos already had been)

Thanks for the Cc. The tracing ring buffer was specifically made to be used by splice and the libtracefs has a lot of code to use it as well. As reading the ring buffer literally swaps out the write portion with a blank read portion, that portion (sub-buffer) is fed directly into splice, providing a zero-copy path for the trace data from the write of the event to going into a file. trace-cmd defaults to using splice to copy the tracing ring buffer directly into files to avoid as much copying during live recordings as possible. Whatever changes we make, I would like to make sure there are no regressions in performance of trace-cmd record.

Well yes, the patchset seems sensible from a quality POV. But to make a decision we should first have a decent understanding of its downside impact. I haven't seen a description of that impact in the discussion thus far. And that description is owed, please. I assume a small number of specialized applications are using vmsplice() to great effect? What are those applications? What is the impact of this change?
Once we are armed with that information, is there some middle ground in which we de-feature vmsplice()? Fall back to pread/pwrite in the tricky cases and still permit vmsplicing if the application is appropriately restrictive in its usage?

I'm using vmsplice() + tee() + splice() in high-performance applications, load generators to be precise, and soon a cache. This is super convenient and extremely efficient:

  - vmsplice() is used to prepare a "master" pipe with data to be sent over TCP or kTLS
  - then for each request, we do tee() from this master pipe to per-request pipes.
  - the per-request pipes are those that are used to deliver the data to the socket via splice().

So we effectively use vmsplice(), tee() and splice() here, and for exactly the reasons they were designed: only play with page refcount and not copy data. The code is here for the curious:

  https://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haterm.c

and its ancestor is here:

  https://github.com/wtarreau/httpterm/blob/master/httpterm.c

It simply doubles the network bandwidth compared to not using that (62 Gbps per core vs 31). I would seriously miss it if I couldn't use this anymore.

Wait a moment. This is neat, but it's literally just a benchmark, right?

No, it's a benchmark *tool*: it's being used to stress production code, which is important and super hard at high loads. You place it after your proxy and you measure the performance of the proxy (which is supposed not to be as capable as the testing tools, otherwise the methodology devolves into testing the testing tools, which is not the point).
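
The vmsplice() + tee() + splice() pattern described above can be sketched roughly as follows. This is a minimal illustration, not the haterm or httpterm code: the helper names fill_master() and send_copy() are hypothetical, error handling is reduced to return codes, and the buffer handed to vmsplice() is assumed never to be modified afterwards.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: attach a pre-made, never-modified pattern to the
 * write end of the "master" pipe. Without SPLICE_F_GIFT the caller keeps
 * ownership of the memory and must not change it while the pipe refers
 * to it.
 */
static ssize_t fill_master(int master_wr, const char *pattern, size_t len)
{
    struct iovec iov = { .iov_base = (void *)pattern, .iov_len = len };

    assert(pattern != NULL && len > 0);
    return vmsplice(master_wr, &iov, 1, 0);
}

/*
 * Hypothetical helper: duplicate up to len bytes from the master pipe
 * into a per-request pipe with tee() (the master pipe is not consumed),
 * then move them to out_fd (a socket in the use case above) with splice().
 * Returns the number of bytes delivered, or -1 on error.
 */
static ssize_t send_copy(int master_rd, int req_pipe[2], int out_fd, size_t len)
{
    ssize_t dup = tee(master_rd, req_pipe[1], len, 0);
    ssize_t sent = 0;

    if (dup <= 0)
        return -1;

    while (sent < dup) {
        ssize_t n = splice(req_pipe[0], NULL, out_fd, NULL,
                           (size_t)(dup - sent), 0);
        if (n <= 0)
            return -1;
        sent += n;
    }
    return sent;
}
```

A caller would create the master pipe once with pipe(), call fill_master() once, and then call send_copy() per request with a fresh or drained per-request pipe. Whether the data is actually shared by reference or copied depends on the kernel and on the destination.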
I skimmed the code, and it doesn't look like a production workload, either. And you manage to get around the awfulness of the vmsplice API's complete failure to tell you when it's done with a buffer by ... never actually changing the contents of the buffer. Do you have any idea how you would write correct code that uses vmsplice for sends and then *ever* mutates the data without literally munmapping (or madvise or something) the data so you can safely mutate it?

I'm not sure what you mean here Andy. I *do not* need to change the data, it's just a pre-made pattern.
What I mean is: this particular pattern seems limited for use in an actual webserver as opposed to a load-tester.
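
To make the buffer-lifetime concern concrete, here is a small sketch that writes to a buffer after vmsplice() has returned and then checks what the pipe reader sees. The function name reader_sees_late_write() is hypothetical. The result is an assumption about kernel behavior rather than a guarantee: on kernels where vmsplice() attaches the user pages by reference, the reader would be expected to see the late write, while an implementation that copies (as a preadv2/pwritev2 wrapper would) shows the original bytes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Returns 1 if the pipe reader observed bytes written AFTER vmsplice()
 * returned (pages shared by reference), 0 if it observed the original
 * bytes (data copied), and -1 on error.
 */
static int reader_sees_late_write(void)
{
    int p[2];
    long page = sysconf(_SC_PAGESIZE);
    void *mem = NULL;
    char *buf;
    struct iovec iov;
    char got = 0;
    int ret = -1;

    if (page <= 0 || pipe(p) != 0)
        return -1;
    if (posix_memalign(&mem, (size_t)page, (size_t)page) != 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    buf = mem;

    /* Touch the page first so it is backed by real memory. */
    memset(buf, 'A', (size_t)page);

    iov.iov_base = buf;
    iov.iov_len = (size_t)page;
    if (vmsplice(p[1], &iov, 1, 0) == (ssize_t)page) {
        /* vmsplice() gives no signal for when this becomes safe. */
        memset(buf, 'B', (size_t)page);
        if (read(p[0], &got, 1) == 1)
            ret = (got == 'B');
    }

    close(p[0]);
    close(p[1]);
    free(mem);
    return ret;
}
```

A return value of 1 means the application has no in-band way to know when the buffer may be reused, which is the API gap raised above.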
Or discover that we already have something better, perhaps :)

  https://man7.org/linux/man-pages/man3/io_uring_prep_send_zc.3.html

io_uring is different. We tried it "the dirty way" in the past, by emulating a poller, and it's not worth it this way. And in order to do it the right way, it needs to be done totally differently, which has impacts all over the stack. The code in the file pointed to above is just for the httpterm testing feature, but the rest is much more complex.
I'm curious how this kludge does:

  https://github.com/amluto/zc_bench

I vibe-coded this up without much care, and I don't have the hardware needed to actually run it in an interesting manner. But, on a Linux VM on an Apple M4, I can push about 130Gbps on a single core over loopback. In theory this will do zerocopy sends (but not over loopback), and I would hope that it runs *faster* than vmsplice + tee.

(I have a fancy workstation that can do a whopping 2.5Gbps. I could probably jury-rig a test over Thunderbolt at higher speeds. I have systems that are not available for this test right now that can do 10Gbps. But someone probably needs 40Gbps or better hardware for a genuinely interesting test.)

-- 
Andy Lutomirski
AMA Capital Management, LLC