Re: git archive generates tar with malformed pax extended attribute
From: Jeff King <hidden>
Date: 2019-05-30 11:55:58
On Wed, May 29, 2019 at 07:54:44PM +0200, René Scharfe wrote:
Am 29.05.19 um 03:17 schrieb Jeff King:quoted
On Wed, May 29, 2019 at 01:34:32AM +0200, René Scharfe wrote:quoted
Parsing trees with symlinks twice is not ideal, but keeps the set structure simple -- a standard oidset suffices.If blobs comes after trees (and they usually do in a pack), you can do it in a single pass by marking the blob as a symlink target, and then when we actually see that blob's contents, marking it as either OK or problematic. And then the finish() step just correlates those with the tree.Good idea. Is that ordering guaranteed? (Stumbling about the "usually" in your first sentence.)
It's not guaranteed. Our implementation of pack-objects does order blobs after trees, but I suspect this could be violated in rare cases with some of the delta-island pack layering stuff. I think it makes sense to be sure that the receiver is correct no matter what, but optimize for this expected case (that's what I tried to do with the .gitmodules checks).
An ordering where dependent objects (like trees) follow the objects they reference would be better suited for these kinds of checks..
But worse for others (e.g., like .gitmodules where it's cheap to identify a candidate blob, but the blob check is involved; there it's much more optimal to see the tree first).
quoted
But here the problem is in the tree, not the blob. So we're not finding suspect blobs, but rather re-checking each tree. And no matter what we do (whether it's visiting the object again, or creating a set or mapping with the object names) is going to be linear there. And a repository with a symlink in the root tree is going to revisit or put in our mapping every single root tree.That's true, potentially it needs remember and/or reprocess all trees, meaning this check may double the run time of fsck in the worst case. Example from the wild: The kernel repo currently has 36 symlinks and 6+ million objects are checked in total, and the symlink check processes 18943 trees_with_symlinks entries there.
That sounds about right. It's basically every version of every tree that has a symlink. Did it make a noticeable difference in timing? Indexing the whole kernel history is already a horribly slow process. :)
quoted
TBH, I'm not sure this fsck check was worth it even without the implementation complexity.Hmm. git status reports such truncated symlinks as changed, so the issue *is* already detectable.
Hmm, yeah. That makes sense, since the filesystem (well, really the syscall API layer) cannot represent the data we are feeding it. -Peff