Re: [PATCH 3/5] parallel-checkout: add configuration options
From: Matheus Tavares Bernardino <hidden>
Date: 2021-04-02 14:46:01
On Wed, Mar 31, 2021 at 1:33 AM Christian Couder [off-list ref] wrote:
On Wed, Mar 17, 2021 at 10:12 PM Matheus Tavares [off-list ref] wrote:quoted
The above benchmarks show that parallel checkout is most effective on repositories located on an SSD or over a distributed file system. For local file systems on spinning disks, and/or older machines, the parallelism does not always bring a good performance. For this reason, the default value for checkout.workers is one, a.k.a. sequential checkout.I wonder how many people are still using HDD, and how many will still use them in a few years. I think having 1 as the default value for checkout.workers might be good for a while for backward compatibility and stability, while people who are interested can test parallel checkout on different environments. But we might want, in a few releases, after some bugs, if any, have been fixed, to use a default, maybe 10, that will provide significant speedup for most people, and will justify the added complexity, especially as your numbers still show a small speedup for HDD when using 10.
Yeah, I agree that it would be nice to have a better default in the near future. Unfortunately, on some other HDD machines that I tested, parallel checkout was slower than the sequential version. So I think it may not be possible to enable parallelism by default now. Nevertheless, I was also experimenting with some ideas to auto-detect if parallelism is efficient in a given environment/repo and auto-enable it if so. One interesting possibility suggested by Ævar [1] was to implement this as a git maintenance task, which could periodically probe the system and tune the checkout settings appropriately. [1]: https://lore.kernel.org/git/87y2ixpvos.fsf@evledraar.gmail.com/ (local)
quoted
@@ -23,6 +25,19 @@ enum pc_status parallel_checkout_status(void) return parallel_checkout.status; } +#define DEFAULT_THRESHOLD_FOR_PARALLELISM 100Using a "static const int" might be a bit better.
Ok, I will change that.
quoted
+void get_parallel_checkout_configs(int *num_workers, int *threshold) +{ + if (git_config_get_int("checkout.workers", num_workers)) + *num_workers = 1;I think it's better when an important default value like this "1" is made more visible using a "static const int" or a "#define".
Will do.
quoted
@@ -568,7 +581,7 @@ int run_parallel_checkout(struct checkout *state) if (parallel_checkout.nr < num_workers) num_workers = parallel_checkout.nr; - if (num_workers <= 1) { + if (num_workers <= 1 || parallel_checkout.nr < threshold) {Here we check the number of workers...quoted
write_items_sequentially(state); } else { struct pc_worker *workers = setup_workers(state, num_workers);[...]quoted
@@ -480,7 +483,8 @@ static int check_updates(struct unpack_trees_options *o, } } stop_progress(&progress); - errs |= run_parallel_checkout(&state); + if (pc_workers > 1) + errs |= run_parallel_checkout(&state, pc_workers, pc_threshold);
...but the number of workers was already checked here.
The re-checking at run_parallel_checkout() is important because `num_workers` might actually become <= 1 after the above check at check_updates(). This happens when there aren't enough enqueued entries for 2+ workers, so we fall back to sequential checkout. It also kind of works as a safe-mechanism for the case where the run_parallel_checkout() caller forgot to check if the user actually wants parallelism before calling the function.