Re: [RFC PATCH 11/10] pipe: Add fsync() support [ver #2]
From: Andy Lutomirski <luto@kernel.org>
Date: 2019-11-02 23:15:03
Also in:
keyrings, linux-block, linux-fsdevel, linux-security-module, linux-usb, lkml
On Sat, Nov 2, 2019 at 4:10 PM Linus Torvalds [off-list ref] wrote:
On Sat, Nov 2, 2019 at 4:02 PM Linus Torvalds [off-list ref] wrote:
But I don't think anybody actually _did_ any of that. But that's basically the argument for the three splice operations: write/vmsplice/splice(). Which one you use depends on the lifetime and the source of your data. write() is obviously for the copy case (the source data might not be stable), while splice() is for the "data from another source", and vmsplice() is "data is from stable data in my vm".
Btw, it's really worth noting that "splice()" and friends are from a more happy-go-lucky time when we were experimenting with new interfaces, and in a day and age when people thought that interfaces like "sendpage()" and zero-copy and playing games with the VM was a great thing to do.
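For readers following along, the three ways of feeding a pipe named above can be sketched roughly as below. This is only an illustration under assumptions: the `feed_by_*` helper names are made up for this sketch and do not come from the patch series, and error handling is left to the caller.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Copy case: the kernel copies buf into the pipe, so the caller may
 * reuse or modify buf as soon as write() returns. */
static ssize_t feed_by_write(int pipe_wr, const void *buf, size_t len)
{
    assert(buf != NULL);
    return write(pipe_wr, buf, len);
}

/* "Stable data in my vm": the pipe may end up referencing the caller's
 * pages, so buf should not be modified until the reader has consumed it. */
static ssize_t feed_by_vmsplice(int pipe_wr, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    assert(buf != NULL);
    return vmsplice(pipe_wr, &iov, 1, 0);
}

/* "Data from another source": move up to len bytes from src_fd into the
 * pipe. off must be NULL when src_fd is itself a pipe; for a regular
 * file it may point at the offset to read from. */
static ssize_t feed_by_splice(int pipe_wr, int src_fd, loff_t *off, size_t len)
{
    return splice(src_fd, off, pipe_wr, NULL, len, 0);
}
```

All three return the number of bytes queued, or -1 with errno set. The difference is only in who owns the data afterwards, which is the lifetime question described above.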
I suppose a nicer interface might be:
    madvise(buf, len, MADV_STABILIZE);
(MADV_STABILIZE is an imaginary operation that write protects the memory a la fork() but without the copying part.)
    vmsplice_safer(fd, ...);
Where vmsplice_safer() is like vmsplice, except that it only works on write-protected pages. If you vmsplice_safer() some memory and then write to the memory, the pipe keeps the old copy.
But this can all be done with memfd and splice, too, I think.
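A rough sketch of what the memfd-plus-splice route could look like follows. It rests on assumptions the message does not spell out: that "stable" is achieved by sealing the memfd against further writes (F_SEAL_WRITE and friends), and that the data is placed in the memfd up front. The helper names `make_stable_memfd()` and `splice_memfd_to_pipe()` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put len bytes into a fresh memfd, then seal it
 * so nobody can write, grow or shrink it. Returns the fd, or -1. */
static int make_stable_memfd(const void *data, size_t len)
{
    const char *p = data;
    size_t done = 0;
    int fd;

    assert(data != NULL || len == 0);
    fd = memfd_create("stable-buf", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }
    /* From here on the contents cannot change under the pipe reader. */
    if (fcntl(fd, F_ADD_SEALS,
              F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: splice the first len bytes of memfd into a pipe.
 * Returns the number of bytes moved, or -1 on error. */
static ssize_t splice_memfd_to_pipe(int memfd, int pipe_wr, size_t len)
{
    loff_t off = 0;
    size_t done = 0;

    while (done < len) {
        ssize_t n = splice(memfd, &off, pipe_wr, NULL, len - done, 0);
        if (n < 0)
            return -1;
        if (n == 0)
            break; /* hit end of file early */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Unlike the imaginary MADV_STABILIZE, this needs the data to live in the memfd to begin with, so a buffer that already exists elsewhere costs one copy to get there.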
It turns out that VM games are almost always more expensive than just copying the data in the first place, but hey, people didn't know that, and zero-copy was seen as a big deal. The reality is that almost nobody uses splice and vmsplice at all, and they have been a much bigger headache than they are worth. If I could go back in time and not do them, I would. But there have been a few very special uses that seem to actually like the interfaces. But it's entirely possible that we should kill vmsplice() (likely by just implementing the semantics as "write()") because it's not common enough to have the complexity.
I think this is the right choice. FWIW, the openssl vmsplice() call looks dubious, but I suspect it's okay because it's vmsplicing to a netlink socket, and the kernel code on the other end won't read the data after it returns a response.
--Andy