Re: [PATCH] mingw: handle writes to non-blocking pipe
From: René Scharfe <hidden>
Date: 2022-08-14 15:57:30
Subsystem:
kernel build + files below scripts/ (unless maintained elsewhere), the rest · Maintainers:
Nathan Chancellor, Nicolas Schier, Linus Torvalds
Am 11.08.2022 um 20:20 schrieb Jeff King:
On Thu, Aug 11, 2022 at 07:35:33PM +0200, René Scharfe wrote:quoted
OK, but we can't just split anything up as we like: POSIX wants us to keep writes up to PIPE_BUF atomic. When I read that name I mistakenly thought it was the size of the pipe buffer, but it's a different value. The minimum value according to POSIX is 512 bytes; on Linux it's 4KB. And Windows doesn't seem to define it. Bash's ulimit -p returns 8, which is in units of 512 bytes, so it's 4KB like on Linux. But that's apparently not respected by Windows' write. I just realized that we can interprete the situation slightly differently. Windows has effectively PIPE_BUF = 2^32, which means all writes are atomic. I don't see POSIX specifying a maximum allowed value, so this must be allowed. Which means you cannot rely on write performing partial writes. Makes sense?Hmm, I'm not sure. POSIX does allow writing a single byte if that's all the space there is, but only if the original _request_ was for more than PIPE_BUF. Which I guess matches what you're saying. If the original request is smaller than PIPE_BUF, it does seem to say that EAGAIN is the correct output. But it also says: There is no exception regarding partial writes when O_NONBLOCK is set. With the exception of writing to an empty pipe, this volume of POSIX.1-2017 does not specify exactly when a partial write is performed since that would require specifying internal details of the implementation. Every application should be prepared to handle partial writes when O_NONBLOCK is set and the requested amount is greater than {PIPE_BUF}, just as every application should be prepared to handle partial writes on other kinds of file descriptors. The intent of forcing writing at least one byte if any can be written is to assure that each write makes progress if there is any room in the pipe. If the pipe is empty, {PIPE_BUF} bytes must be written; if not, at least some progress must have been made. So they are clearly aware of the "we need to make forward progress" problem. But how are you supposed to do that if you only have less than PIPE_BUF bytes left to write? And likewise, even if it is technically legal, having a PIPE_BUF of 2^32 seems like a quality-of-implementation issue. And that doesn't really match what poll() is saying. The response from poll() told us we could write without blocking, which implies at least PIPE_BUF bytes available. But clearly it isn't available, since the write fails (with EAGAIN, but that is equivalent to blocking in POSIX's view here).
Turns out that's not the case on Windows: 94f4d01932 (mingw: workaround for hangs when sending STDIN, 2020-02-17) changed the compatibility implementation to 'Make `poll()` always reply "writable" for write end of the pipe.'. An updated version of the test helper confirms it (patch below) by reporting on Windows: chunk size: 1 bytes EAGAIN after 8192 bytes chunk size: 500 bytes EAGAIN after 8000 bytes chunk size: 1000 bytes EAGAIN after 8000 bytes chunk size: 1024 bytes EAGAIN after 8192 bytes chunk size: 100000 bytes partial write of 8192 bytes after 0 bytes EAGAIN after 8192 bytes On Debian I get this instead: chunk size: 1 bytes POLLOUT gone after 61441 bytes EAGAIN after 65536 bytes chunk size: 500 bytes POLLOUT gone after 60500 bytes EAGAIN after 64000 bytes chunk size: 1000 bytes POLLOUT gone after 61000 bytes EAGAIN after 64000 bytes chunk size: 1024 bytes POLLOUT gone after 62464 bytes EAGAIN after 65536 bytes chunk size: 100000 bytes partial write of 65536 bytes after 0 bytes POLLOUT gone after 65536 bytes EAGAIN after 65536 bytes So on Windows the POLLOUT bit is indeed never unset.
Lawyering aside, I think it would be OK to move forward with cutting up writes at least to a size of 512 bytes. Git runs on lots of platforms, and none of the code can make an assumption that PIPE_BUF is larger than that. I.e., we are reducing atomicity provided by Windows, but that's OK. I don't think that solves our problem fully, though. We might need to write 5 bytes, and telling mingw_write() to do so means it's supposed to abide by PIPE_BUF conventions. But again, we are in control of the calling code here. I don't think there's any reason that we care about PIPE_BUF atomicity. Certainly we don't get those atomicity guarantees on the socket side, which is where much of our writing happens, and our code is prepared to handle partial writes of any size. So we could just ignore that guarantee here.quoted
If we accept that, then we need a special write function for non-blocking pipes that chops the data into small enough chunks.I'm not sure what "small enough" we can rely on, though. Really it is the interplay between poll() and write() that we care about here. We would like to know at what point poll() will tell us it's OK to write(). But we don't know what the OS thinks of that.
Based on the output above I think Linux' poll() won't consider a pipe writable that has less than PIPE_BUF (4096) available bytes.
Or maybe we do, since I think poll() is our own compat layer. But it just seems to be based on select(). We do seem to use PeekNamedPipe() there to look on the reading side, but I don't know if there's an equivalent for writing.
This has been elusive so far (see 94f4d01932). Perhaps we should take the advice about PIPE_NOWAIT in the docs serious and use overlapping (asynchronous) writes on Windows instead. This would mean reimplementing the whole pipe_command() with Windows API commands, I imagine.
quoted
Another oddity: t3701 with yesterday's patch finishes fine with -i, but hangs without it (not just when running it via prove). O_oYes, definitely strange. I'd expect weirdness with "-v", for example, because of terminal stuff, but "-i" should have no impact on the running environment.
It's especially grating because test runs also take very looong. Avoiding xwrite() in pump_io_round() on top lets the test suite finish successfully. René --- Makefile | 1 + t/helper/test-nonblock.c | 73 ++++++++++++++++++++++++++++++++++++++++ t/helper/test-tool.c | 1 + t/helper/test-tool.h | 1 + 4 files changed, 76 insertions(+) create mode 100644 t/helper/test-nonblock.c
diff --git a/Makefile b/Makefile
index d9c00cc05d..0bc028ca00 100644
--- a/Makefile
+++ b/Makefile@@ -751,6 +751,7 @@ TEST_BUILTINS_OBJS += test-lazy-init-name-hash.o TEST_BUILTINS_OBJS += test-match-trees.o TEST_BUILTINS_OBJS += test-mergesort.o TEST_BUILTINS_OBJS += test-mktemp.o +TEST_BUILTINS_OBJS += test-nonblock.o TEST_BUILTINS_OBJS += test-oid-array.o TEST_BUILTINS_OBJS += test-oidmap.o TEST_BUILTINS_OBJS += test-oidtree.o
diff --git a/t/helper/test-nonblock.c b/t/helper/test-nonblock.c
new file mode 100644
index 0000000000..c9350ce894
--- /dev/null
+++ b/t/helper/test-nonblock.c@@ -0,0 +1,73 @@ +#include "test-tool.h" +#include "compat/nonblock.h" + +static void fill_pipe(size_t write_len) +{ + void *buf = xcalloc(1, write_len); + int fds[2]; + size_t total_written = 0; + int last = 0; + struct pollfd pfd; + int writable = 1; + + if (pipe(fds)) + die_errno("pipe failed"); + if (enable_nonblock(fds[1])) + die_errno("enable_nonblock failed"); + pfd.fd = fds[1]; + pfd.events = POLLOUT; + + printf("chunk size: %"PRIuMAX" bytes\n", write_len); + for (;;) { + ssize_t written; + if (poll(&pfd, 1, 0) < 0) { + if (errno == EINTR) + continue; + die_errno("poll failed"); + } + if (writable && !(pfd.revents & POLLOUT)) { + writable = 0; + printf(" POLLOUT gone after %"PRIuMAX" bytes\n", + total_written); + } + written = write(fds[1], buf, write_len); + if (written < 0) { + switch (errno) { + case EAGAIN: + printf(" EAGAIN after %"PRIuMAX" bytes\n", + total_written); + break; + case ENOSPC: + printf(" ENOSPC after %"PRIuMAX" bytes\n", + total_written); + break; + default: + printf(" errno %d after %"PRIuMAX" bytes\n", + errno, total_written); + } + break; + } else if (written != write_len) + printf(" partial write of %"PRIuMAX" bytes" + " after %"PRIuMAX" bytes\n", + (uintmax_t)written, total_written); + if (last) + break; + if (written > 0) + total_written += written; + last = !written; + }; + + close(fds[1]); + close(fds[0]); + free(buf); +} + +int cmd__nonblock(int argc, const char **argv) +{ + fill_pipe(1); + fill_pipe(500); + fill_pipe(1000); + fill_pipe(1024); + fill_pipe(100000); + return 0; +}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 318fdbab0c..562d7a9161 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c@@ -45,6 +45,7 @@ static struct test_cmd cmds[] = { { "match-trees", cmd__match_trees }, { "mergesort", cmd__mergesort }, { "mktemp", cmd__mktemp }, + { "nonblock", cmd__nonblock }, { "oid-array", cmd__oid_array }, { "oidmap", cmd__oidmap }, { "oidtree", cmd__oidtree },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index bb79927163..d9006a5298 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h@@ -36,6 +36,7 @@ int cmd__lazy_init_name_hash(int argc, const char **argv); int cmd__match_trees(int argc, const char **argv); int cmd__mergesort(int argc, const char **argv); int cmd__mktemp(int argc, const char **argv); +int cmd__nonblock(int argc, const char **argv); int cmd__oidmap(int argc, const char **argv); int cmd__oidtree(int argc, const char **argv); int cmd__online_cpus(int argc, const char **argv); --
2.37.1.windows.1