Re: [PATCH 1/4] run-command: add pipe_output to run_processes_parallel

From: Calvin Wan <hidden>
Date: 2022-09-26 17:55:28

 * Why are we configuring an API behaviour via a global variable in
   21st century?

I was mimicking how "ungroup" worked, but now that Avar mentions
that pattern was for a quick regression fix, I can fix it to pass it in as a
parameter.

 * The name "task_finished" is mentioned, but it is unclear what it
   is.  Is it one of the parameters to run_process_parallel()?

It is one of the callback functions passed in as a parameter to
run_process_paraller(). I'll go ahead and clarify that.

 * Is the effect of the new feature that task_finished callback is
   called with the output, in addition to the normal output?  I am
   not sure why it is called "pipe".  The task_finished callback may
   be free to fork a child and send the received output from the
   task to that child over the pipe, but that is what a client code
   could do and is inappropriate to base the name of the mechanism,
   isn't it?

The output in task_finished callback, before pipe_output, either
contains part of the output or the entire output of the child process,
since the output is periodically collected into stderr and then reset.
The intention of output I believe is for the caller to be able to add
anything they would like to the end (this can be seen with functions
like fetch_finished() in builtin/fetch.c). My intention with pipe_output
is to guarantee that output contains the entire output of the child
process so task_finished can utilize it.

quoted

@@ -1770,10 +1771,12 @@ int run_processes_parallel(int n,
      int output_timeout = 100;
      int spawn_cap = 4;
      int ungroup = run_processes_parallel_ungroup;
+     int pipe_output = run_processes_parallel_pipe_output;
      struct parallel_processes pp;

      /* unset for the next API user */
      run_processes_parallel_ungroup = 0;
+     run_processes_parallel_pipe_output = 0;

      pp_init(&pp, n, get_next_task, start_failure, task_finished, pp_cb,
              ungroup);
@@ -1800,7 +1803,8 @@ int run_processes_parallel(int n,
                              pp.children[i].state = GIT_CP_WAIT_CLEANUP;
              } else {
                      pp_buffer_stderr(&pp, output_timeout);
-                     pp_output(&pp);
+                     if (!pipe_output)
+                             pp_output(&pp);

So, we do not send the output from the child to the regular output
channel when pipe_output is in effect.  OK.

quoted

              }
              code = pp_collect_finished(&pp);
              if (code) {

And no other code changes?  This is quite different from what I
expected from reading the proposed log message.

Am I correct to say that under this new mode, we no longer flush any
output while the child task is running (due to the change in the
above hunk to omit calls to pp_output() during the run) and instead
keep accumulating in the strbuf, until the child task finishes, at
which time pp_collect_finished() will call task_finished callback.

Even though the callback usually consumes the last piece of the
output since the last pp_output() call made during the normal
execution of the run_processes_parallel() loop, because we omitted
these calls, we have full output from the child task accumulated in
the children[].err strbuf.  We may still not output .err for real,
as we may not be the output_owner, in which case we may only append
to .buffered_output member.

I am puzzled simply because, if the above summary is correct, I do
not see how a word "pipe" have a chance to come into the picture.

Ah I see what you mean here -- your summary is correct. Something
like "buffer_output" would make much more sense.

I can sort of see that in this mode, we would end up buffering the
entire output from each child task into one strbuf each, and can
avoid stalling the child tasks waiting for their turn to see their
output pipes drained.  But is this a reasonable thing to do?  How do
we control the memory consumption to avoid having to spool unbounded
amount of output from child tasks in core, or do we have a good
reason to believe that we do not have to bother?

You are correct that storing unbounded output doesn't seem like a good
idea. One idea I have is to parse output during the periodic collection rather
than waiting till the end. The other idea I had was to add another
"git status --porcelain" option that would only output the necessary
pieces of information so we wouldn't have to bother with worrying about
unbounded output.

Any other thoughts as to how I can workaround this?

Thanks!

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help