Re: Missing and omitted objects
From: Philip Oakley <hidden>
Date: 2026-01-26 12:48:21
Had a bit of a think overnight. Some parts probably don't apply to this application, see below.

On 25/01/2026 15:42, Philip Oakley wrote:
On 21/01/2026 11:54, Simon Richter wrote:
> Hi, we're having a bit of a discussion in Debian. The goal is to move towards git based storage for source packages, away from tarballs; ideally we'd like to reuse the upstream git archive as far as possible, so it is easy to check for differences. However, some projects are shipping files that aren't redistributable, or that we want to omit for other reasons (such as vendored dependencies, when there is a perfectly working common version available, and we really really want to make sure these don't get used accidentally).

There was a discussion about allowing objects to be 'redacted' back at Git Merge 2020 (https://git-merge.com/), under [TOPIC 3/17] Obliterate: https://lore.kernel.org/git/5B2FEA46-A12F-4DE7-A184-E8856EF66248@jramsay.com.au/
The discussion is likely still informative.
> The goal here is to allow the recipient of such a bundle to verify that any files received are unmodified, and get a list of paths that were removed (which may be an entire subdirectory). Ideally, they could also continue working on a clone of this and generate commits on top as long as the affected paths aren't touched.

That discussion on redacting objects didn't reach any actionable conclusion that allows objects to be omitted/redacted while keeping the branch-based directed graph flow. I've continued to consider options for deliberately creating 'counterfeit' objects (old name/oid, but new/limited content), which could then be 'verified' through a facsimile object with the same new/limited content but a properly hashed name/oid. I haven't shared any of that with the list.
The idea of eliminating objects by OID, totally, from the repo is not suitable for the use case. It would be an all-or-nothing response, rather than a tailored response.
> The minimal amount of data we'd want to archive is a single commit and its tree and dependencies, plus optionally a signed tag pointing at it if it exists (i.e. the same information we get if we use git-archive, plus the signature on the tag, plus the option to clone from such a snapshot). For the simple case where nothing is removed, this already works well and covers most of the use cases, but, sadly, not all of them.

You could simply branch off a special commit that has all the deletions, plus a 'deletions' diff file (assuming you want to highlight those deletions..). Then leave that branch as a stub with a tag, and remove the old branch name, so that the tag is the thing that retains the special commit in the hierarchy while its parent still sits within the regular git commit graph.
This may still be a useful tailoring, where a separate commit is generated which omits unwanted files/content. This is quite lightweight in terms of repo size because of the inherent de-duplication of common content. It's only the updated trees that need storing.

Philip
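
To illustrate the size argument, here is a minimal sketch in Python that hashes two snapshots the way Git hashes blobs and trees (SHA-1 object format) and counts which objects the redacted snapshot adds. The directory layout, file contents and helper names (`git_oid`, `write_tree`) are made up for the example; only regular files with mode 100644 are modelled.

```python
import hashlib


def git_oid(obj_type: str, payload: bytes) -> str:
    # Git object id (SHA-1 format): hash of "<type> <size>\0" followed by the payload.
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def write_tree(node: dict, store: dict) -> str:
    # Hash a nested dict (name -> bytes for a file, dict for a directory)
    # and record every object id with its type in `store`.
    entries = []
    for name, child in node.items():
        if isinstance(child, dict):
            oid = write_tree(child, store)
            mode = b"40000"
            sort_key = name.encode() + b"/"  # Git sorts directories as "name/"
        else:
            oid = git_oid("blob", child)
            store[oid] = "blob"
            mode = b"100644"
            sort_key = name.encode()
        entry = mode + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
        entries.append((sort_key, entry))
    payload = b"".join(entry for _, entry in sorted(entries))
    oid = git_oid("tree", payload)
    store[oid] = "tree"
    return oid


# Hypothetical upstream snapshot with a vendored dependency.
upstream = {
    "README": b"upstream readme\n",
    "src": {"main.c": b"int main(void) { return 0; }\n"},
    "vendor": {
        "NOTES": b"bundled deps\n",
        "libfoo": {"foo.c": b"/* vendored */\n"},
    },
}

# Same snapshot with vendor/libfoo omitted.
redacted = dict(upstream)
redacted["vendor"] = {k: v for k, v in upstream["vendor"].items() if k != "libfoo"}

before, after = {}, {}
root_before = write_tree(upstream, before)
root_after = write_tree(redacted, after)

# Objects the redacted snapshot needs that upstream did not already provide.
new_objects = {oid: kind for oid, kind in after.items() if oid not in before}
print(new_objects)
```

In this toy layout the only new objects are the rewritten `vendor` tree and the root tree; every blob and the untouched `src` tree are shared with upstream. A real redaction commit would add one commit object on top of that.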
> As a side effect, this could make recovery of a broken repository that is missing objects more robust.

Broken repos are scarce, more often than not being compatibility issues between (*nix) Git and Git-for-Windows (case sensitivity, sizeof(long), character limits, etc.). However redaction and overlarge files still fit into the 'Don't do that' category (expect the unexpected..).

There is also the distinction between the meta-data and content. The former also includes the data that holds together the commit graph's integrity (hash of hashes), plus filenames, directory names and commit texts (point 15 of the Git Merge discussion). Being inside the hash-verified meta-data makes it "hard" to break and create exceptions.

A mechanism for marking leaf objects as 'removed'/abscised/absconded would help here. It's tricky to do that safely for a commit, as it also carries parent information which must be retained.

For a blob (leaf) object, with its free-form text, it is possible to have a fixed-format, fixed-length (hash-specific) counterfeit object, e.g. "Git redact abcd01245.."(*), which would then also exist as a facsimile (i.e. has a true hash oid) object within some authenticated part of the graph, while the counterfeit exists in place of the 'broken' blob object with that self-referential "abcd01245.." oid.

For trees, it becomes necessary to locate a bit of free text in the meta-data to provide the self-reference, and make it appear as either the empty tree or an empty file (blob). The true oid of such a counterfeit tree would likewise need a way of existing within some authenticated part of the wider graph. Perhaps a step too far at this stage of hand waving.
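
The counterfeit/facsimile distinction for blobs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the placeholder text `Git redact <oid>` followed by a newline, the sample file content and the `classify` helper are all hypothetical, and SHA-1 object ids are assumed. Nothing like this exists in Git today.

```python
import hashlib


def blob_oid(content: bytes) -> str:
    # Git blob id (SHA-1 format): hash of "blob <size>\0" followed by the content.
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def placeholder_for(oid: str) -> bytes:
    # Assumed fixed-format redaction text that names the oid it stands in for.
    return f"Git redact {oid}\n".encode()


def classify(stored_under: str, content: bytes) -> str:
    # Decide what a blob found under `stored_under` actually is.
    if blob_oid(content) == stored_under:
        return "genuine"       # content hashes to the name it is stored under
    if content == placeholder_for(stored_under):
        return "redacted"      # counterfeit: self-referential placeholder
    return "corrupt"           # anything else is a real integrity failure


# Hypothetical file that must not be redistributed.
original = b"non-redistributable firmware image\n"
original_oid = blob_oid(original)

counterfeit = placeholder_for(original_oid)   # stored under original_oid
facsimile_oid = blob_oid(counterfeit)         # true oid of the same bytes

print(classify(original_oid, original))       # the untouched upstream blob
print(classify(original_oid, counterfeit))    # the placeholder in its place
print(classify(facsimile_oid, counterfeit))   # the facsimile, properly named
```

The facsimile oid is what would have to be recorded somewhere authenticated (for example in a tree reachable from a signed tag), so that a recipient can tell a deliberate redaction from corruption.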
> Right now, I'd like some feedback whether someone has a better idea, and if such a feature could ever work or if it violates some fundamental design principles.

It's a big ask. Finding one specific feature (just one) that could actually be made to work would provide a toehold for discussion. At least this is a solid desire from within the community's infrastructure..

At present there is no mechanism for assuming that a piece of blob *content* is "correct" but that the oid it is stored under is incorrect / does not match. We already have/had the `--literally` option (of `git hash-object`) for creating arbitrary content, but not its corollary `--use-oid=abcd01245..`; see https://lore.kernel.org/git/20250516045010.GL22242@coredump.intra.peff.net/

Peff cc'd
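
For anyone wanting to experiment, the effect of a hypothetical `--use-oid` can be imitated outside Git by writing a loose object file by hand. The sketch below assumes the SHA-1 loose object layout (zlib-deflated `blob <size>\0<content>` stored under `objects/<first 2 hex digits>/<remaining 38>`); the helper names and the temporary directory are illustrative only, and as far as I know stock Git would report such an object as a hash mismatch.

```python
import hashlib
import os
import tempfile
import zlib


def encode_blob(content: bytes) -> bytes:
    # Uncompressed loose-object body: header plus content.
    return b"blob " + str(len(content)).encode() + b"\0" + content


def write_loose(objects_dir: str, oid: str, content: bytes) -> str:
    # Store `content` under `oid` WITHOUT checking that the two match.
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(zlib.compress(encode_blob(content)))
    return path


def read_loose(objects_dir: str, oid: str):
    # Return (content, matches), where `matches` says whether the stored
    # bytes really hash to the name they were found under.
    with open(os.path.join(objects_dir, oid[:2], oid[2:]), "rb") as fh:
        raw = zlib.decompress(fh.read())
    _header, _, content = raw.partition(b"\0")
    return content, hashlib.sha1(raw).hexdigest() == oid


def demo():
    original = b"file we are not allowed to ship\n"   # hypothetical content
    original_oid = hashlib.sha1(encode_blob(original)).hexdigest()
    stand_in = f"Git redact {original_oid}\n".encode()  # assumed marker text
    with tempfile.TemporaryDirectory() as objects_dir:
        write_loose(objects_dir, original_oid, stand_in)
        return read_loose(objects_dir, original_oid)


if __name__ == "__main__":
    print(demo())
```

Reading the object back returns the stand-in text together with a flag saying the name does not match, which is exactly the situation that has no sanctioned handling at present.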
> Simon

(*) I more wanted "Git redact abcd01245.hexoid Base64oid" to reduce accidental creation of such objects and allow double checking of the oid. But maybe that's too cute. ;-)
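
One possible reading of that footnote, sketched in Python: the marker carries the oid twice, once in hex and once in Base64, and a parser accepts it only if both encodings agree. The exact layout (single spaces, trailing newline) and the names `make_marker` and `parse_marker` are assumptions for illustration, not a proposed format.

```python
import base64

PREFIX = "Git redact "
HEX_DIGITS = "0123456789abcdef"


def make_marker(hex_oid: str) -> bytes:
    # Build the marker text: prefix, hex oid, Base64 of the same raw bytes.
    b64_oid = base64.b64encode(bytes.fromhex(hex_oid)).decode("ascii")
    return f"{PREFIX}{hex_oid} {b64_oid}\n".encode("ascii")


def parse_marker(content: bytes):
    # Return the redacted oid if `content` is a well-formed marker, else None.
    try:
        text = content.decode("ascii")
    except UnicodeDecodeError:
        return None
    if not text.startswith(PREFIX) or not text.endswith("\n"):
        return None
    fields = text[len(PREFIX):-1].split(" ")
    if len(fields) != 2:
        return None
    hex_oid, b64_oid = fields
    # 40 hex digits for SHA-1, 64 for SHA-256.
    if len(hex_oid) not in (40, 64) or any(c not in HEX_DIGITS for c in hex_oid):
        return None
    # The double check: both encodings must describe the same bytes.
    if base64.b64encode(bytes.fromhex(hex_oid)).decode("ascii") != b64_oid:
        return None
    return hex_oid


empty_blob = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # oid of the empty blob
marker = make_marker(empty_blob)
print(parse_marker(marker) == empty_blob)
print(parse_marker(marker.replace(b"e69d", b"e69e", 1)))  # hex no longer matches Base64
```

Because the two encodings must agree, a file that merely happens to start with "Git redact" is very unlikely to be mistaken for a marker, and for a given hash algorithm the marker length is fixed.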