Thread (20 messages) 20 messages, 5 authors, 2019-06-04

Re: git archive generates tar with malformed pax extended attribute

From: Jeff King <hidden>
Date: 2019-05-30 11:55:58

On Wed, May 29, 2019 at 07:54:44PM +0200, René Scharfe wrote:
Am 29.05.19 um 03:17 schrieb Jeff King:
quoted
On Wed, May 29, 2019 at 01:34:32AM +0200, René Scharfe wrote:
quoted
Parsing trees with symlinks twice is not ideal, but keeps the set
structure simple -- a standard oidset suffices.
If blobs comes after trees (and they usually do in a pack), you can do
it in a single pass by marking the blob as a symlink target, and then
when we actually see that blob's contents, marking it as either OK or
problematic. And then the finish() step just correlates those with the
tree.
Good idea.  Is that ordering guaranteed?  (Stumbling about the "usually"
in your first sentence.)
It's not guaranteed. Our implementation of pack-objects does order blobs
after trees, but I suspect this could be violated in rare cases with
some of the delta-island pack layering stuff.

I think it makes sense to be sure that the receiver is correct no matter
what, but optimize for this expected case (that's what I tried to do
with the .gitmodules checks).
An ordering where dependent objects (like trees) follow the objects they
reference would be better suited for these kinds of checks..
But worse for others (e.g., like .gitmodules where it's cheap to
identify a candidate blob, but the blob check is involved; there it's
much more optimal to see the tree first).
quoted
But here the problem is in the tree, not the blob. So we're not finding
suspect blobs, but rather re-checking each tree. And no matter what we
do (whether it's visiting the object again, or creating a set or mapping
with the object names) is going to be linear there. And a repository
with a symlink in the root tree is going to revisit or put in our
mapping every single root tree.
That's true, potentially it needs remember and/or reprocess all trees,
meaning this check may double the run time of fsck in the worst case.
Example from the wild: The kernel repo currently has 36 symlinks and
6+ million objects are checked in total, and the symlink check processes
18943 trees_with_symlinks entries there.
That sounds about right. It's basically every version of every tree that
has a symlink. Did it make a noticeable difference in timing? Indexing
the whole kernel history is already a horribly slow process. :)
quoted
TBH, I'm not sure this fsck check was worth it even without the
implementation complexity.
Hmm.  git status reports such truncated symlinks as changed, so the
issue *is* already detectable.
Hmm, yeah. That makes sense, since the filesystem (well, really the
syscall API layer) cannot represent the data we are feeding it.

-Peff
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help