Re: [PATCH 3/5] parallel-checkout: add configuration options

From: Matheus Tavares Bernardino <hidden>
Date: 2021-04-02 14:46:01

On Wed, Mar 31, 2021 at 1:33 AM Christian Couder
[off-list ref] wrote:

On Wed, Mar 17, 2021 at 10:12 PM Matheus Tavares
[off-list ref] wrote:

quoted

The above benchmarks show that parallel checkout is most effective on
repositories located on an SSD or over a distributed file system. For
local file systems on spinning disks, and/or older machines, the
parallelism does not always bring a good performance. For this reason,
the default value for checkout.workers is one, a.k.a. sequential
checkout.

I wonder how many people are still using HDD, and how many will still
use them in a few years.

I think having 1 as the default value for checkout.workers might be
good for a while for backward compatibility and stability, while
people who are interested can test parallel checkout on different
environments. But we might want, in a few releases, after some bugs,
if any, have been fixed, to use a default, maybe 10, that will provide
significant speedup for most people, and will justify the added
complexity, especially as your numbers still show a small speedup for
HDD when using 10.

Yeah, I agree that it would be nice to have a better default in the
near future. Unfortunately, on some other HDD machines that I tested,
parallel checkout was slower than the sequential version. So I think
it may not be possible to enable parallelism by default now.

Nevertheless, I was also experimenting with some ideas to auto-detect
if parallelism is efficient in a given environment/repo and
auto-enable it if so. One interesting possibility suggested by Ævar
[1] was to implement this as a git maintenance task, which could
periodically probe the system and tune the checkout settings
appropriately.

[1]: https://lore.kernel.org/git/87y2ixpvos.fsf@evledraar.gmail.com/ (local)

quoted

@@ -23,6 +25,19 @@ enum pc_status parallel_checkout_status(void)
        return parallel_checkout.status;
 }

+#define DEFAULT_THRESHOLD_FOR_PARALLELISM 100

Using a "static const int" might be a bit better.

Ok, I will change that.

quoted

+void get_parallel_checkout_configs(int *num_workers, int *threshold)
+{
+       if (git_config_get_int("checkout.workers", num_workers))
+               *num_workers = 1;

I think it's better when an important default value like this "1" is
made more visible using a "static const int" or a "#define".

Will do.

quoted

@@ -568,7 +581,7 @@ int run_parallel_checkout(struct checkout *state)
        if (parallel_checkout.nr < num_workers)
                num_workers = parallel_checkout.nr;

-       if (num_workers <= 1) {
+       if (num_workers <= 1 || parallel_checkout.nr < threshold) {

Here we check the number of workers...

quoted

                write_items_sequentially(state);
        } else {
                struct pc_worker *workers = setup_workers(state, num_workers);

[...]

quoted

@@ -480,7 +483,8 @@ static int check_updates(struct unpack_trees_options *o,
                }
        }
        stop_progress(&progress);
-       errs |= run_parallel_checkout(&state);
+       if (pc_workers > 1)
+               errs |= run_parallel_checkout(&state, pc_workers, pc_threshold);

...but the number of workers was already checked here.

The re-checking at run_parallel_checkout() is important because
`num_workers` might actually become <= 1 after the above check at
check_updates(). This happens when there aren't enough enqueued
entries for 2+ workers, so we fall back to sequential checkout. It
also kind of works as a safe-mechanism for the case where the
run_parallel_checkout() caller forgot to check if the user actually
wants parallelism before calling the function.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help