Re: [PATCH 2/3] path-walk: fix setup of pending objects

From: Derrick Stolee <hidden>
Date: 2025-08-21 20:33:24

On 8/21/2025 4:01 AM, Patrick Steinhardt wrote:

On Wed, Aug 20, 2025 at 06:39:55PM +0000, Derrick Stolee via GitGitGadget wrote:

quoted

From: Derrick Stolee <redacted>

The previous change established a buggy instance of 'git repack -adf
--path-walk' when there exist paths that are tracked in the index and
that is the only instance of those paths in the history of the
repository. This change fixes that bug.

The core problem here is that the "maybe_interesting" member of 'struct
type_and_oid_list' is not initialized to '1'. This member was added in
6333e7ae0b (path-walk: mark trees and blobs as UNINTERESTING,
2024-12-20) in a way to help when creating packfiles for a small commit
range using the sparse path algorithm (enabled by pack.useSparse=true).

The idea here is that the list is marked as "maybe_interesting" if an
object is added that does not have the UNINITERSTING flag on it. Later,

s/UNINITERSTING/UNINTERESTING/

Thanks!

quoted

this is checked again in case all objects in the list were marked
UNINTERESTING after that point in time. In this case, the algorithm
skips the list as there is no reason to visit it.

This leads to the problem where the "maybe_interesting" member was not
appropriately initialized when the list is created from pending objects.
This is the fix for now.

To help avoid this from happening in the future, a follow-up change will
make initializing lists use a shared method instead of allowing for an
update to this initialization process to miss some existing copies.

Yeah, I wanted to say that this feels quite fragile to me and very easy
to miss. Does this mechanism buy us a lot of performance in the first
place? Because if not we might as well just remove it entirely.

The details for the space savings and moderate time cost are listed in
e5394794a5 (pack-objects: thread the path-based compression, 2025-05-16).
These improvements are on top of those from --name-hash-version=2 by
using a different way to focus delta calculations.

A larger, internal example for this can be seen in this table (based on
testing today):

Mode      |  Size    | Time
----------+----------+------
version 1 |  16.0 GB | 83min
version 2 |   9.9 GB | 77min
path walk |   6.4 GB | 74min

But if the answer is "yes" then adding APIs around it feels like a good
alternative.

Making the code less brittle to changes is good, but also I'm interested
in ways to improve our test infrastructure or adding defensive features.
I mentioned a couple of ideas in an earlier message.

Thanks,
-Stolee

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help