Re: [PATCH 2/3] path-walk: fix setup of pending objects
From: Derrick Stolee <hidden>
Date: 2025-08-21 20:33:24
On 8/21/2025 4:01 AM, Patrick Steinhardt wrote:
On Wed, Aug 20, 2025 at 06:39:55PM +0000, Derrick Stolee via GitGitGadget wrote:quoted
From: Derrick Stolee <redacted> The previous change established a buggy instance of 'git repack -adf --path-walk' when there exist paths that are tracked in the index and that is the only instance of those paths in the history of the repository. This change fixes that bug. The core problem here is that the "maybe_interesting" member of 'struct type_and_oid_list' is not initialized to '1'. This member was added in 6333e7ae0b (path-walk: mark trees and blobs as UNINTERESTING, 2024-12-20) in a way to help when creating packfiles for a small commit range using the sparse path algorithm (enabled by pack.useSparse=true). The idea here is that the list is marked as "maybe_interesting" if an object is added that does not have the UNINITERSTING flag on it. Later,s/UNINITERSTING/UNINTERESTING/
Thanks!
quoted
this is checked again in case all objects in the list were marked UNINTERESTING after that point in time. In this case, the algorithm skips the list as there is no reason to visit it. This leads to the problem where the "maybe_interesting" member was not appropriately initialized when the list is created from pending objects. This is the fix for now. To help avoid this from happening in the future, a follow-up change will make initializing lists use a shared method instead of allowing for an update to this initialization process to miss some existing copies.Yeah, I wanted to say that this feels quite fragile to me and very easy to miss. Does this mechanism buy us a lot of performance in the first place? Because if not we might as well just remove it entirely.
The details for the space savings and moderate time cost are listed in e5394794a5 (pack-objects: thread the path-based compression, 2025-05-16). These improvements are on top of those from --name-hash-version=2 by using a different way to focus delta calculations. A larger, internal example for this can be seen in this table (based on testing today): Mode | Size | Time ----------+----------+------ version 1 | 16.0 GB | 83min version 2 | 9.9 GB | 77min path walk | 6.4 GB | 74min
But if the answer is "yes" then adding APIs around it feels like a good alternative.
Making the code less brittle to changes is good, but also I'm interested in ways to improve our test infrastructure or adding defensive features. I mentioned a couple of ideas in an earlier message. Thanks, -Stolee