Thread: 3 messages, 2 authors, 2026-01-26

Re: Missing and omitted objects

From: Philip Oakley <hidden>
Date: 2026-01-26 12:48:21

Had a bit of a think overnight. Some parts probably don't apply to this
application, see below.

On 25/01/2026 15:42, Philip Oakley wrote:
On 21/01/2026 11:54, Simon Richter wrote:
quoted
Hi,

we're having a bit of a discussion in Debian.

The goal is to move towards git based storage for source packages, away
from tarballs; ideally we'd like to reuse the upstream git archive as
far as possible, so it is easy to check for differences.

However, some projects are shipping files that aren't redistributable,
or that we want to omit for other reasons (such as vendored
dependencies, when there is a perfectly working common version
available, and we really really want to make sure these don't get used
accidentally).
There was a discussion about allowing objects to be 'redacted' back at
the Git Merge 2020 (https://git-merge.com/),
under [TOPIC 3/17] Obliterate.
https://lore.kernel.org/git/5B2FEA46-A12F-4DE7-A184-E8856EF66248@jramsay.com.au/
The discussion is likely still informative.
quoted
The goal here is to allow the recipient of such a bundle to verify that
any files received are unmodified, and get a list of paths that were
removed (which may be an entire subdirectory). Ideally, they could also
continue working on a clone of this and generate commits on top as long
as the affected paths aren't touched.
That discussion on redacting objects didn't reach any actionable
conclusion that allows objects to be omitted/redacted, while keeping the
branch based directed graph flow. I've continued to consider options for
deliberately creating 'counterfeit' objects (old name/oid, but
new/limited content) which could then be 'verified' through a facsimile
object with the same new/limited content but a properly hashed name/oid.
I haven't shared any of that with the list.
The idea of eliminating objects by OID, totally, from the repo is not
suitable for the use case. It would be an all-or-nothing response,
rather than a tailored response.

quoted
The minimal amount of data we'd want to archive is a single commit and
its tree and dependencies, plus optionally a signed tag pointing at it
if it exists (i.e. the same information we get if we use git-archive,
plus the signature on the tag, plus the option to clone from such a
snapshot). For the simple case where nothing is removed, this already
works well and covers most of the use cases, but, sadly, not all of them.
You could simply branch off a special commit that has all the
deletions, plus a 'deletions' diff file (assuming you want to
highlight those deletions..), then leave that branch as a stub with
a tag, and remove the old branch name, so that the tag is the thing
that retains the special commit in the hierarchy while its parent still
sits within the regular git commit graph.
This may still be a useful tailoring where a separate commit is
generated which omits unwanted files/content. This is quite lightweight
in terms of repo size because of the inherent de-duplication of common
content. It's only the updated trees that need storing.
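
To put a rough shape on the de-duplication point, here is a minimal
Python sketch. It assumes SHA-1 object ids and the usual tree entry
layout (mode, name, NUL, raw oid). The helper names (git_oid,
write_tree) and the file layout are made up for illustration, and
commits are not modelled, only blobs and trees.

```python
import hashlib

def git_oid(kind: str, body: bytes) -> bytes:
    """Raw SHA-1 object id: hash of '<kind> <size>' + NUL + body."""
    return hashlib.sha1(f"{kind} {len(body)}".encode() + b"\0" + body).digest()

def write_tree(node: dict, store: dict) -> bytes:
    """Hash a nested dict (name -> bytes | dict) as git blobs and trees.

    Every object created is recorded in `store` (hex oid -> kind).
    """
    entries = []
    for name, child in node.items():
        if isinstance(child, dict):
            oid, mode = write_tree(child, store), b"40000"
            sort_key = name.encode() + b"/"   # directories sort as "name/"
        else:
            oid, mode = git_oid("blob", child), b"100644"
            store[oid.hex()] = "blob"
            sort_key = name.encode()
        entries.append((sort_key, mode + b" " + name.encode() + b"\0" + oid))
    body = b"".join(entry for _, entry in sorted(entries))
    oid = git_oid("tree", body)
    store[oid.hex()] = "tree"
    return oid

# Illustrative upstream snapshot (hypothetical file names and content).
upstream = {
    "README": b"hello\n",
    "src": {"main.c": b"int main(void) { return 0; }\n"},
    "vendor": {"libfoo": {"foo.c": b"/* vendored copy */\n"}},
}
# Tailored snapshot: the same content with the vendored directory omitted.
tailored = {k: v for k, v in upstream.items() if k != "vendor"}

up_store, tail_store = {}, {}
up_root = write_tree(upstream, up_store)
tail_root = write_tree(tailored, tail_store)

# Objects the tailored snapshot needs that upstream does not already have.
new_objects = set(tail_store) - set(up_store)
print(sorted(tail_store[o] for o in new_objects))
```

In this toy layout the only object the tailored snapshot adds is its
root tree; the README blob, the src tree and its blob are shared with
upstream by oid. A removal deeper in the hierarchy would add one new
tree per directory level on the path to the removed entry.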

Philip
quoted
As a side effect, this could make recovery of a broken repository that
is missing objects more robust.
Broken repos are scarce; more often than not the breakage is a
compatibility issue between (*nix) Git and Git-for-Windows (case
sensitivity, sizeof(long), character limits, etc.). However, redaction
and overlarge files still fit into the 'Don't do that' category (expect
the unexpected..).

There is also the distinction between the meta-data and content. The
former also includes the data that holds together the commit graph's
integrity (hash of hashes) and filenames, directory names and commit
texts (point 15 of the Git Merge discussion). Being inside the
hash-verified meta-data makes it "hard" to break and create exceptions.

A mechanism for marking leaf objects as 'removed'/abscissed/absconded
would help here. It's tricky to do that safely for a commit, as it also
carries parent information which must be retained.

For a blob (leaf) object, with its free form text, it is possible to
have a fixed format, fixed length (hash specific) counterfeit object,
e.g. "Git redact abcd01245.."(*) which would then also exist as a
facsimile (i.e. has a true hash oid) object within some authenticated
part of the graph, while the counterfeit exists in place of the 'broken'
blob object with that self referential "abcd01245.." oid.
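
As a rough illustration of the two oids involved, here is a small
Python sketch. It assumes SHA-1 blob hashing (sha1 over
"blob <size>", a NUL byte, then the content). The function
redaction_placeholder and the exact placeholder layout (including the
trailing newline) are hypothetical, modelled on the "Git redact ..."
example above; nothing like this exists in git today.

```python
import hashlib

def blob_oid(content: bytes) -> str:
    """SHA-1 oid git assigns to a blob: sha1('blob <size>' + NUL + content)."""
    header = f"blob {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def redaction_placeholder(redacted_oid: str) -> bytes:
    """Hypothetical fixed-format stand-in content naming the redacted oid."""
    return f"Git redact {redacted_oid}\n".encode()

original = b"non-redistributable content\n"   # illustrative content only
redacted_oid = blob_oid(original)             # the oid the trees refer to
placeholder = redaction_placeholder(redacted_oid)
facsimile_oid = blob_oid(placeholder)         # the placeholder's true oid

print("redacted :", redacted_oid)
print("facsimile:", facsimile_oid)
print("length   :", len(placeholder))
```

The placeholder has a fixed length for a given hash (11 bytes of
prefix, 40 hex digits and a newline for SHA-1). The counterfeit would
sit under redacted_oid, while the same bytes stored normally get
facsimile_oid.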

For trees, it becomes necessary to locate a bit of free text in the meta
data to provide self reference, and make it appear as either the empty
tree or empty file(blob). The true oid of such a counterfeit tree
likewise would need a way of existing within some authenticated part of
the wider graph. Perhaps a step too far at this stage of hand waving.
quoted
Right now, I'd like some feedback whether someone has a better idea, and
if such a feature could ever work or if it violates some fundamental
design principles.
It's a big ask. Finding one specific feature (just one) that could
actually be made to work would provide a toe hold for discussion.

At least this is a solid desire from within the community's infrastructure..

At present there is no mechanism for assuming that a piece of blob
*content* is "correct" but that the oid it is stored under is incorrect
/ does not match. We already have/had the `--literally` option for
creating arbitrary content, but not its corollary `--use-oid=abcd01245..`.
See
https://lore.kernel.org/git/20250516045010.GL22242@coredump.intra.peff.net/
Peff cc'd
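
A toy sketch of what such a check might look like, in Python. It is
purely illustrative: ToyStore, put_under and check are hypothetical
names, the in-memory dict is not how git stores objects, and
put_under merely stands in for the `--use-oid` idea. SHA-1 blob
hashing and the "Git redact <oid>" placeholder format are assumed.

```python
import hashlib

PREFIX = b"Git redact "  # placeholder format borrowed from the example in this mail

def blob_oid(content: bytes) -> str:
    """SHA-1 blob oid: sha1('blob <size>' + NUL + content)."""
    return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()

class ToyStore:
    """In-memory stand-in for an object database (illustration only)."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        """Store content under its true oid, as a normal write would."""
        oid = blob_oid(content)
        self.objects[oid] = content
        return oid

    def put_under(self, oid: str, content: bytes) -> None:
        """Hypothetical: store content under a caller-chosen oid, unhashed."""
        self.objects[oid] = content

    def check(self, oid: str) -> str:
        content = self.objects.get(oid)
        if content is None:
            return "missing"
        true_oid = blob_oid(content)
        if true_oid == oid:
            return "ok"
        # Mismatch: accept only a well-formed placeholder that names its own
        # slot and is backed by a facsimile stored under its true oid.
        if content == PREFIX + oid.encode() and self.objects.get(true_oid) == content:
            return "redacted"
        return "corrupt"

store = ToyStore()
secret_oid = blob_oid(b"non-redistributable firmware\n")
placeholder = PREFIX + secret_oid.encode()
facsimile_oid = store.put(placeholder)      # facsimile, under its true oid
store.put_under(secret_oid, placeholder)    # counterfeit, under the redacted oid
store.put_under("0" * 40, b"garbage")       # plain corruption, for contrast

for oid in (facsimile_oid, secret_oid, "0" * 40):
    print(oid[:8], store.check(oid))
```

The point of the facsimile lookup is that a mismatching object is
only tolerated when the same bytes are also present under a properly
hashed oid; in a real design that facsimile would additionally need to
be reachable from some authenticated part of the graph, which this
sketch does not model.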
quoted
   Simon
(*) I rather wanted "Git redact abcd01245.hexoid Base64oid" to reduce
accidental creation of such objects and allow double checking of the
oid. But maybe that's too cute.  ;-)
  