Re: [PATCH] blame-tree: add library and tests via "test-tool blame-tree"

From: Ævar Arnfjörð Bjarmason <hidden>
Date: 2023-03-07 14:18:57

On Fri, Feb 10 2023, Derrick Stolee wrote:

On 2/5/2023 3:47 PM, Ævar Arnfjörð Bjarmason wrote:
[...]

quoted

+struct blame_tree_options {
+	struct object_id oid;
+	const char *prefix;
+	unsigned int recursive;
+	struct strvec args;
+};
+#define BLAME_TREE_OPTIONS_INIT(...) { \
+	.args = STRVEC_INIT, \
+	__VA_ARGS__ \
+}
+void blame_tree_opts_release(struct blame_tree_options *bto);
+
+struct blame_tree {
+	struct string_list paths;
+	struct rev_info rev;
+};
+#define BLAME_TREE_INIT { \
+	.paths = STRING_LIST_INIT_DUP, \
+	.rev = REV_INFO_INIT, \
+}
+
+void blame_tree_init(struct blame_tree *bt,
+		     const struct blame_tree_options *opts);
+void blame_tree_release(struct blame_tree *);
+
+typedef void (*blame_tree_fn)(const char *path, const struct commit *commit,
+			      void *data);
+int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *data);

However, this API is too leaky. Specifically, it allows passing arbitrary
'args' instead of structured options.

Before I get into what I think needs to change here, let's look at the
initial implementation:

quoted

+int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *cbdata)
+{
+	struct blame_tree_callback_data data;
+
+	data.paths = &bt->paths;
+	data.num_interesting = bt->paths.nr;
+	data.callback = cb;
+	data.callback_data = cbdata;
+
+	bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+	bt->rev.diffopt.format_callback = blame_diff;
+	bt->rev.diffopt.format_callback_data = &data;
+
+	prepare_revision_walk(&bt->rev);
+
+	while (data.num_interesting) {
+		data.commit = get_revision(&bt->rev);
+		if (!data.commit)
+			break;
+
+		if (data.commit->object.flags & BOUNDARY) {
+			diff_tree_oid(the_hash_algo->empty_tree,
+				       &data.commit->object.oid,
+				       "", &bt->rev.diffopt);
+			diff_flush(&bt->rev.diffopt);
+		} else {
+			log_tree_commit(&bt->rev, data.commit);
+		}
+	}
+
+	return 0;
+}

When I think of 'blame-tree', I think of the following output:

  For each path (within a pathspec, recursive or not), output the first
  commit that changed that path according to simplified file history. This
  history walk begins at a single commit, but may be terminated by a
  negative commit range, so the output could indicate that we reached the
  boundary.

With that definition, the most-obvious first implementation would be to
first generate the path list, then run
`git rev-list -n 1 <revisions> -- <path>` for each path in the pathspec.
This could be done by N prepare_revision_walk()/get_revision() executions
within a loop instead of actually executing subcommands.

The implementation in this patch is simultaneously smarter than that basic
approach, but also not smart enough for the best performance. In order to
get the optimal performance, the following parts of my output definition
are critical:

 1. We use simplified file history.
 2. The history walk begins at a single commit.

If either of these conditions are broken, then the current-best algorithm
cannot be used (and even more: our proactive caching mechanism cannot be
used).

The API as currently defined does not allow us to enforce that restriction
because the arbitrary arguments could specify `--full-history`, or worse
`--simplify-merge` (and also `--first-parent`) to modify the history mode
or even `--all` to modify the number of starting commits.

There's already a limitation on --all in this patch, or anything else
that would yield a many<->many rev<->path relationship, e.g. "--all" or
"HEAD HEAD~", that's covered by the "can only blame one tree at a time"
error condition.

For --full-history v.s. --simplify-merge the output on git.git at least
(maybe not on others?) would be the same:

	diff -u <(t/helper/test-tool -C "$PWD/t" blame-tree -- --full-history) <(t/helper/test-tool -C "$PWD/t" blame-tree -- --simplify-merge)

But of course for --first-parent it's not, but more on that below.

quoted hunk ↗ jump to hunk

The version we use in our custom fork has the historical baggage of this
open-ended argument-parsing, but because we have full control of the
callers to the CLI, we have enforced that those arguments are not being
used. While we are not currently reviewing the CLI of a builtin, the API
layer is essentially providing a pass-through of arbitrary revision options.

I am happy to see that the 'struct blame_tree_options' includes a callback
for the output, as that is valuable for both reporting results to stdout
or to collect results into a list for caching purposes, in the future
where we have that ability.

--- Recommended API ---

Using 'struct blame_tree_options' instead of many parameters creates a
good future-proof method prototype, so we can always _add_ options when
explicitly understood as important to callers of one kind or another.

However, to drop the arbitrary 'args' we need to instead make it very
explicit which commits are being used in the history walk.

struct blame_tree_options {
	struct object_id oid;
	const char *prefix;
	unsigned int recursive;
	struct commit *start;
	struct commit_list *uninteresting;
};

This definition of the struct should be enough to demonstrate the behavior
we are describing. However, for the v1 of the API, we may even want to
completely remove the 'uninteresting' choice, and instead focus only on a
single starting position ('start'). If we decide that 'uninteresting' is
valuable, then it can be added again later, hopefully after the primary
use of this command is established.

Again, thinking about the lifetime of this command in our infrastructure,
the commit range behavior was used at one point as a performance improvement
where a cache was given for a specific starting commit B, then we
dynamically computed the blame-tree for the range B..A when given a new
commit A, and merged the two results together. However, this was not always
correct due to complexities around parallel changes. A different caching
mechanism was built into the low-level algorithm itself which resolves
these complications, but also _that cache cannot be used when given range
queries_.

Further, I recommend building the simplest implementation first, since it
is actually closer to the very fast implementation in that it has two
parts:

 1. Compute the list of paths at the tip.
 2. Perform history walks for those paths.

The fast implementation does a single history walk that essentially walks
the simplified history for all of the paths simultaneously, but it is
critical to have that list of paths at the start of that walk. In that way
it's actually easier to adapt the simpler algorithm to the current-best
algorithm than it is to adapt the smart algorithm from this patch to the
current-best algorithm.

All this is to say, that I'd like to see this API start with the smallest
possible surface area and with the simplest implementation, and then I'd
be happy to contribute those algorithms within the API boundary while the
CLI is handled independently.

I hear your concern about leaving this open for optimization, and in
general I'd vehemently agree with it, except for needing to eventually
feed a command-line to setup_revisions().

Ideally the revision API would make what you're describing easy, but the
way it's currently implemented (and changing it would be a much larger
project) someone who'd like to pass structured options in the way you'd
describe will end up having to re-implement bug-for-bug compatible
versions of some subset of the option parsing in revision.c.

Isn't a way to get the best of both worlds to have a small snippet of
code that inspects the "struct rev_info" before & after
setup_revisions(), and which would only implement certain optimizations
if certain known options are provided, but not if any unknown ones are?

That way those who'd like the faster happy path could use that subset of
options, while the general API would allow any revision options. We'd
then error() or BUG() out only if we fail to map our expected paths to
OIDs.

Specifically, I'm thinking of something similar to what
match_stat_with_submodule() in diff-lib.c now does with "orig_flags".

I also think as a practical matter it's acceptable for us to leave the
underlying low-level API open-ended for potential (ab)use, while any
porcelain or plumbing tool for this would document that a given narrow
subset of revision options is what we expect.

That's what we do for format-patch, and e.g. as of my df569c3f31f
(range-diff doc: add a section about output stability, 2018-11-09)
range-diff explicitly documents that feeding it options it won't expect
might result in nonsense output.

I think those are all good ways forward here, and I'd much prefer those
to having to re-implement or pull out subsets of the current option
parsing logic in revision.c. What do you think?

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help