Re: Feature request: provide a persistent IDs on a commit

From: Stephen Finucane <hidden>
Date: 2022-07-29 12:11:33

On Tue, 2022-07-19 at 13:09 +0200, Ævar Arnfjörð Bjarmason wrote:

On Tue, Jul 19 2022, Stephen Finucane wrote:

quoted

On Mon, 2022-07-18 at 20:50 +0200, Ævar Arnfjörð Bjarmason wrote:

quoted

On Mon, Jul 18 2022, Stephen Finucane wrote:

quoted

...to track evolution of a patch through time.

tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
ID' trailer used by Gerrit into git core?

Firstly, apologies in advance if this is the wrong forum to post a feature
request. I help maintain the Patchwork project [1], which a web-based tool that
provides a mechanism to track the state of patches submitted to a mailing list
and make sure stuff doesn't slip through the crack. One of our long-term goals
has been to track the evolution of an individual patch through multiple
revisions. This is surprisingly hard goal because oftentimes there isn't a whole
lot to work with. One can try to guess whether things are the same by inspecting
the metadata of the commit (subject, author, commit message, and the diff
itself) but each of these metadata items are subject to arbitrary changes and
are therefore fallible.

One of the mechanisms I've seen used to address this is the 'Change-ID' trailer
used by Gerrit. For anyone that hasn't seen this, the Gerrit server provides a
git commit hook that you can install locally. When installed, this appends a
'Change-ID' trailer to each and every commit message. In this way, the evolution
of a patch (or a "change", in Gerrit parlance) can be tracked through time since
the Change ID provides an authoritative answer to the question "is this still
the same patch". Unfortunately, there are still some obvious downside to this
approach. Not only does this additional trailer clutter your commit messages but
it's also something the user must install themselves. While Gerrit can insist
that this is installed before pushing a change, this isn't an option for any of
the common forges nor is it something git-send-email supports.

git format-patch+send-email will send your trailers along as-is, how
doesn't it support Change-Id. Does it need some support that any other
made-up trailer doesn't?

It supports sending the trailers, sure. What it doesn't support is insisting you
send this specific trailer (Change-Id). Only Gerrit can do this (server side,
thankfully, which means you don't need to ask all contributors to install this
hook if you want to rely on it for tooling, CI, etc.).

Ah, it's still unclear to me what you're proposing here though. That
send-email always (generates?) or otherwise insists on the trailer, that
it can be configured ot add it?

That send-email have some "pre-send-email" hook? Something else?

 
(Apologies for the delayed response: I was on holiday).

I'm afraid I don't have the correct terminology to describe what I'm suggesting
so I'll show an example instead.

I have configured the 'fuller' pretty formatter locally:

   $ git config format.pretty
   fuller

When I do git log on e.g. the openstack nova repo, I see:

   commit 2709e30956b53be1dca91eec801220f0efbaed93
   Author:     Stephen Finucane [off-list ref]
   AuthorDate: Thu Jul 14 15:43:40 2022 +0100
   Commit:     Stephen Finucane [off-list ref]
   CommitDate: Mon Jul 18 12:30:25 2022 +0100
   
       Fix compatibility with jsonschema 4.x
       
       This changed one of the error messages we depend on [1].
       
       [1] https://github.com/python-jsonschema/jsonschema/commit/641e9b8c
       
       Change-Id: I643ec568ee2eb2ec1a555f813fd2f1acff915afa
       Signed-off-by: Stephen Finucane [off-list ref]

(Side note: What *is the term for the "Author", "AuthorDate", "Commit" and
"CommitDate" fields? Commit header? Commit metadata? Something else?)

My thinking is there are two types of information here: information that relates
to the "commiting" of this change and information that relates to the
"authorship" of the this change. The commit ID, 'Commit' and 'CommitDate' fields
clearly form the commit parts. I'm arguing that it would be good to have an
equivalent to the commit ID field for the authorship-type metadata.
   
   commit 2709e30956b53be1dca91eec801220f0efbaed93
   Author:     Stephen Finucane [off-list ref]
   AuthorDate: Thu Jul 14 15:43:40 2022 +0100
   AuthorID:   I643ec568ee2eb2ec1a555f813fd2f1acff915afa
   Commit:     Stephen Finucane [off-list ref]
   CommitDate: Mon Jul 18 12:30:25 2022 +0100
   
       Fix compatibility with jsonschema 4.x
       
       This changed one of the error messages we depend on [1].
       
       [1] https://github.com/python-jsonschema/jsonschema/commit/641e9b8c
       
       Signed-off-by: Stephen Finucane [off-list ref]

At risk of repeating myself, I think this information would be valuable to allow
me to answer the question "is this the same[*] commit?". During code review,
this would allow me to track the evolution of an individual patch. Once a patch
is merged, it would allow me to track the backporting or cherry-picking of that
patch between branches (in a more reliable fashion than the "cherry picked from"
trailer that one can add with the '-x' flag).

Now I do realize that there will be issues with this. As has been noted
elsewhere in the thread, people do split patches up or merge them together, and
a patch can change so drastically during review that it doesn't resemble the
original patch in any way. However, I'd argue that in both cases the presence of
these persistent IDs would at least leave a breadcrumb trail for either tooling
or humans to follow. Similarly, it is possible for users to mess things up by
resetting or reusing the persistent ID fields, but as has been noted elsewhere
in this thread this is already an issue with the existing Author* fields (which
many users likely don't know about) yet I couldn't imagine anyone wanting to get
rid of these. It's an education thing.

I'd think for projects that care about this they're likely to have a
centralized enough workflow that it can be checked on the remote side,
whether that's some sanity check on the applier's "git am" pipeline, or
a "pre-receive" hook.

Yeah, as above I'm hoping this would form part of the core metadata of a commit
rather than a trailer or something. Tools like Gerrit could of course do
validation on this but that's outside the scope of what I'm looking at.

quoted

I imagine most people working with mailing list based workflows have their own
client side tooling to support this while software forges like GitHub and GitLab
simply don't bother tracking version history between individual commits in a
pull/merge request.

It's far from ideal, but at least GitLab shows a diff on a push to a MR,
including if it's force-pushed. I'm not sure about GitHub.

GitHub does not. Simply piling multiple additional "fix" commits onto the PR
branch results in a less horrible review experience since you can maintain
context, alas at the cost of a rotten git log. We don't need to debate the pros
and cons of the various forges though :)

Yes, I'm only mentioning it because it's worth looking at existing
"solutions" that are in use in the wild, however flawed those may be.

quoted

IMO though, it would be fantastic if third party tools
weren't necessary though. What I suspect we want is a persistent ID (or rather
UUID) that never changes regardless of how many times a patch is cherry-picked,
rebased, or otherwise modified, similar to the Author and AuthorDate fields.
Like Author and AuthorDate, it would be part of the core git commit metadata
rather than something in the commit message like Signed-Off-By or Change-ID.

Has such an idea ever been explored? Is it even possible? Would it be broadly
useful?

This has come up a bunch of times. I think that the thing git itself
should be doing is to lean into the same notion that we use for tracking
renames. I.e. we don't, we analyze history after-the-fact and spot the
renames for you.

Any idea where I'd find previous discussions on this? I did look, and the only
proposal I found was an old one that seemed to suggest including the Change-Id
commit-msg hook with git itself which is not what I'm suggesting here.

At the time I was punting on finding the links, and just working off
vague recollection, and hoping you'd go list spelunking.

But I since recalled some details, I think the most relevant thing is
this discussion about a "git evolve":

    https://lore.kernel.org/git/CAPL8ZivFmHqS2y+WmNR6faRMnuahiqwPVYsV99NiJ1QLHOs9fQ@mail.gmail.com/ (local)

Which I think you'll find useful, especially as mercurial has an
existing implementation. The wider context for that "git evolve" is (I
believe) people at Google who maintain Gerrit trying to "upstream" the
Change-Id.

Now, it hasn't landed in git.git, and it's been a few years, but going
through the details of why it fizzled out will be useful to you, if
you're interested in driving something like this forward.

Yeah to be clear I'm not suggesting tracking anything like this in Git core. My
main request is here is a persistent Author ID field. Commits as they are would
remain the same: we'd just be able to show the evolution of a "change" in
external tooling without the need for separate trailers.

There's also these two proposals from Eric Raymond:

	https://lore.kernel.org/git/20190515191605.21D394703049@snark.thyrsus.com/ (local)

This however, looks more similar to what I'm proposing. If understand this
correctly (I'm still reading the full thread), Eric is proposing allowing two
ways to reference a commit: the hash and a sort of alias. There would still be a
1:1 mapping though, which is explicitly not what I want. I'm also not suggesting
generating this stuff server-side. It should be part of the commit when
initially created, just like Author and AuthorDate.

	https://lore.kernel.org/git/20190521013250.3506B470485F@snark.thyrsus.com/ (local)

Which I'm linking to here not because I think they're viable, as you can
see from my participation in those threads I think what he suggested is
an architectural dead end as far as git is concerned.

But rather because it's conceptually adjacent (you could in principle
use nanosecond timestamps as a poor man's UUID), and much of the
follow-up discussion is about format changes in general, and if/when
those might be viable.

quoted

We have some of that in git already, as git-patch-id, and more recently
git-range-diff. Both are flawed in a bunch of ways, and it's easy to run
into edge cases where they don't spot something that they "should"
have. Where "should" exists in the mind of the user.

That's a fair point and is of course what we (Patchwork) have to do currently.
Patchwork can track relations between individual patches but doesn't attempt to
generate these relations itself. Instead, we rely on third-party tooling. The
PaStA tool was one such example of a tool that could do this [1]. I can't
imagine a tool like Gerrit would ever work without this concept of an
authoritative (and arbitrary) identifier to track a patch's identity through
time, hence its reliance on the Change-Id trailer.

I haven't used Gerrit or Patchwork, so much of this is from ignorance on
that front, but I have spent a lot of time thinking about this in the
context of git in general.

I think as users of git go the git project itself makes very heavy use
of this, i.e. sequences of patches are substantially rewritten, split,
squashed etc. all the time, or even split into two or more sets of
submissions.

Having said all that I can't see how a Change-Id isn't a Bad Idea(TM)
for all the same reasons that pre-git SCMs file formats that track
renames explicitly were a bad idea.

I.e. yes you can come up with cases where that's "better" than what git
does, but they didn't handle splitting/merging files etc.

Similarly what happens when you have 3 patches each with their own
Change-Id and you split them into 4 patches. Is the Change-Id 1=1 or
1=many. I'm suggesting that you'd want a solution that can be many=many.

And also, that those many=many should be dynamically configurable and
inferred after the fact. E.g. range-diff will commits that are similar
enough that two authors with no knowledge of each other independently
came up with.

I touched on the splitting/merging of changes above but just to reiterate, I
don't think this is an issue. I'm using Gerrit for OpenStack-related efforts nad
mailing lists (with Patchwork tracking submissions) elsewhere. Patches are
frequently split and merged as part of a review process and can often be merged
as part of a backport (I've yet to see a patch split up when backporting but it
could happen too). If a patch is split, the original patch retains the 'Change-
ID' as well as 'Author' and 'AuthorDate' fields while the split out patch(es)
get new versions of these. If one or more patches are squashed, you get the
'Change-ID' and 'Author'/'AuthorDate' of the first patch in the series of
squashed patches. In both cases though, some Change-ID persists which means you
can track the evolution of a patch or series through time. These are all
extremely helpful breadcrumbs for reviewers.

Regarding the rename issue, I agree that this isn't something Git should do
either. As you note, it's too hard to do 100% reliably, which would be expected
goal. I'm not looking for 100% reliability here. I just want a better breadcrumb
than e.g. range-diff currently provides.

I think that range-diff is still lacking in a lot of ways, in particular:

 * It matches entire commits (log + diff) on a similarity score, I've
   often wanted a way to "weigh" it, so e.g. a matching hunk would have
   3x the matching score of a matching commit message.

   Now it often "gives up", you can give it a higher --creation-factor,
   but that's "global", so for a large range you'll often start
   including irrelevant things as well.

 * It only does 1=1 attribution, and e.g. currently can't find/represent
   a case where a commit with 3 hunks got split into two commits, with 2
   and 1 hunks, respectively. It'll (usually) show a diff to the new 2
   hunk commit, but the "new" 1 hunk will be shown as new.

   We could continue to drill down and find such "unattributed" hunks.

quoted

Perhaps we could flip this on its head. What would be the _downsides_ of
providing a persistent, arbitrary identifier on a commit similar to Author and
AuthorDate fields? There's obviously some work involved in implementing it but
assuming that was already done, what would break/be worse as a result?

That "Repository formats matter", to borrow a phrase from a classic post
about git[1]. Once you provide a way to do something it will be used,
and when that something has inherent limitations (think SCM rename
tracking) used to the exclusion of others.

You can't provide something like that as an opt-in and "upstream" it
without it inevetably trickling into a lot of areas of Git's UX.

To continue the rename example, now you can just re-arrange your source
tree and not worry about micro-managing it with "git mv" (in the "svn
mv" sense), git will figure it out after the fact.

That's a sinificant UX benefit, we can provide a *much simpler* UX as a
result.

What would be the harm of an optional "rename tracking" header? After
all the heuristic sometimes "fails".

The harm would be that if you really wanted to lean into that (even
optionally) you'd be forced to add that to all sorts of tooling, not
just the cheap convenience that is "git mv" currently.

Likewise everything from "cherry-pick" to "rebase" to "commit" would
inevitably have to learn some way to know about, carry forward and ask
the user about Change-Id's and their preservation. Don't you think so?

These are all valid points. Hopefully my points above regarding the similarity
to the Author and AuthorDate fields helps though.

Otherwise they'd be much too easy to lose track of, and if they only
reason we did all that is because we didn't think enough about the "work
it out after" approach that would be a bad investment of time.

But I may be wrong about all of that, I think one thing that would
really help clarify this & similar proposals is if people pushing it
forward came up with some basic tests for it, i.e. just something like
a:

    series-v1/
    series-v2/

Where those two directories would be the "git format-patch" output (or
whatever) of two versions of a series that Gerrit or Patchwork are now
managing, along with some (plain text?) manual mapping of which things
in v1 correspond to v2.

We could then compare how that manual attribution performs v.s. trying
to find which things match (range-diff) afterwards.

I hope my examples above helped with this, but I can prepare a sample series
(including a sample 'git log' output) if you'd like. Just let me know where
you'd like it sent.

Cheers,
Stephen

1. https://keithp.com/blog/Repository_Formats_Matter/

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help