Re: [PATCH v2 0/4] [RFC] repack: add --filter=
From: Taylor Blau <hidden>
Date: 2022-02-21 18:09:06
On Mon, Feb 21, 2022 at 03:38:03PM +0000, Robert Coup wrote:
Hi, On Mon, 21 Feb 2022 at 03:11, Taylor Blau [off-list ref] wrote:quoted
we would still be leaving repository corruption on the table, just making it marginally more difficult to achieve.While reviewing John's patch I initially wondered if a better approach might be something like `git repack -a -d --exclude-stdin`, passing a list of specific objects to exclude from the new pack (sourced from rev-list via a filter, etc). To me this seems like a less dangerous approach, but my concern was it doesn't use the existing filter capabilities of pack-objects, and we end up generating and passing around a huge list of oids. And of course any mistakes in the list generation aren't visible until it's too late.
Yeah; I think the most elegant approach would have pack-objects do as much work as possible, and have repack be in charge of coordinating what all the pack-objects invocation(s) have to do.
I also wonder whether there's a race condition if the repository gets updated? If you're moving large objects out in advance, then filtering the remainder there's nothing to stop a new large object being pushed between those two steps and getting dropped.
Yeah, we will want to make sure that we're operating on a consistent view of the repository. If this is all done in-process, it won't be a problem since we'll capture an atomic snapshot of the reference states once. If this is done across multiple processes, we'll need to make sure we're passing around that snapshot where appropriate. See the `--refs-snapshot`-related code in git-repack for when we write a multi-pack bitmap for an example of the latter.
My other idea, which is growing on me, is whether repack could
generate two valid packs: one for the included objects via the filter
(as John's change does now), and one containing the filtered-out
objects. `git repack -a -d --split-filter=<filter>` Then a user could
then move/extract the second packfile to object storage, but there'd
be no way to *accidentally* corrupt the repository by using a bad
option. With this approach the race condition above goes away too.
$ git repack -a -d -q --split-filter=blob:limit=1m
pack-37b7443e3123549a2ddee31f616ae272c51cae90
pack-10789d94fcd99ffe1403b63b167c181a9df493cd
First pack identifier being the objects that match the filter (ie:
commits/trees/blobs <1m), and the second pack identifier being the
objects that are excluded by the filter (blobs >1m).I like this idea quite a bit. We also have a lot of existing tools that would make an implementation fairly lightweight, namely pack-objects' `--stdin-packs` mode. Using that would look something like first having `repack` generate the filtered pack first, remembering its name [1]. After that, we would run `pack-objects` again, this time with `--stdin-packs`, where the positive packs are the ones we're going to delete, and the negative pack(s) is/are the filtered one generated in the last step. The second invocation would leave us with a single pack which represents all of the objects in packs we are about to delete, skipping any objects that are in the filtered pack we just generated. In other words, it would leave the repository with two packs: one with all of the objects that met the filter criteria, and one with all objects that don't meet the filter criteria. A client could then upload the "doesn't meet the filter criteria" pack off elsewhere, and then delete it locally. (I'm assuming this last part in particular is orchestrated by some other command, and we aren't encouraging users to run "rm" inside of .git/objects/pack!)
An astute --i-know-what-im-doing reader could spot that you could just delete the second packfile and achieve the same outcome as the current proposed patch, subject to being confident the race condition hadn't happened to you.
Yeah, and I think this goes to my joking remark in the last paragraph. If we allow users to delete packs at will, all bets are off regardless of how safely we generate those packs. But I think splitting the repository into two packs and _then_ dealing with one of them separately as opposed to deleting objects which don't meet the filter criteria immediately is moving in a safer direction.
Thanks, Rob :)
Thanks, Taylor [1]: Optionally "name_s_", if we passed the `--max-pack-size` option to `git pack-objects`, which we can trigger via a `git repack` option of the same name.