Thread (27 messages) 27 messages, 6 authors, 2022-02-23

Re: [PATCH v2 0/4] [RFC] repack: add --filter=

From: Taylor Blau <hidden>
Date: 2022-02-21 18:09:06

On Mon, Feb 21, 2022 at 03:38:03PM +0000, Robert Coup wrote:
Hi,

On Mon, 21 Feb 2022 at 03:11, Taylor Blau [off-list ref] wrote:
quoted
we would still be leaving repository
corruption on the table, just making it marginally more difficult to
achieve.
While reviewing John's patch I initially wondered if a better approach
might be something like `git repack -a -d --exclude-stdin`, passing a
list of specific objects to exclude from the new pack (sourced from
rev-list via a filter, etc). To me this seems like a less dangerous
approach, but my concern was it doesn't use the existing filter
capabilities of pack-objects, and we end up generating and passing
around a huge list of oids. And of course any mistakes in the list
generation aren't visible until it's too late.
Yeah; I think the most elegant approach would have pack-objects do as
much work as possible, and have repack be in charge of coordinating what
all the pack-objects invocation(s) have to do.
I also wonder whether there's a race condition if the repository gets
updated? If you're moving large objects out in advance, then filtering
the remainder there's nothing to stop a new large object being pushed
between those two steps and getting dropped.
Yeah, we will want to make sure that we're operating on a consistent
view of the repository. If this is all done in-process, it won't be a
problem since we'll capture an atomic snapshot of the reference states
once. If this is done across multiple processes, we'll need to make sure
we're passing around that snapshot where appropriate.

See the `--refs-snapshot`-related code in git-repack for when we write a
multi-pack bitmap for an example of the latter.
My other idea, which is growing on me, is whether repack could
generate two valid packs: one for the included objects via the filter
(as John's change does now), and one containing the filtered-out
objects. `git repack -a -d --split-filter=<filter>` Then a user could
then move/extract the second packfile to object storage, but there'd
be no way to *accidentally* corrupt the repository by using a bad
option. With this approach the race condition above goes away too.

    $ git repack -a -d -q --split-filter=blob:limit=1m
    pack-37b7443e3123549a2ddee31f616ae272c51cae90
    pack-10789d94fcd99ffe1403b63b167c181a9df493cd

First pack identifier being the objects that match the filter (ie:
commits/trees/blobs <1m), and the second pack identifier being the
objects that are excluded by the filter (blobs >1m).
I like this idea quite a bit. We also have a lot of existing tools that
would make an implementation fairly lightweight, namely pack-objects'
`--stdin-packs` mode.

Using that would look something like first having `repack` generate the
filtered pack first, remembering its name [1]. After that, we would run
`pack-objects` again, this time with `--stdin-packs`, where the positive
packs are the ones we're going to delete, and the negative pack(s)
is/are the filtered one generated in the last step.

The second invocation would leave us with a single pack which represents
all of the objects in packs we are about to delete, skipping any objects
that are in the filtered pack we just generated. In other words, it
would leave the repository with two packs: one with all of the objects
that met the filter criteria, and one with all objects that don't meet
the filter criteria.

A client could then upload the "doesn't meet the filter criteria" pack
off elsewhere, and then delete it locally. (I'm assuming this last part
in particular is orchestrated by some other command, and we aren't
encouraging users to run "rm" inside of .git/objects/pack!)
An astute --i-know-what-im-doing reader could spot that you could just
delete the second packfile and achieve the same outcome as the current
proposed patch, subject to being confident the race condition hadn't
happened to you.
Yeah, and I think this goes to my joking remark in the last paragraph.
If we allow users to delete packs at will, all bets are off regardless
of how safely we generate those packs. But I think splitting the
repository into two packs and _then_ dealing with one of them separately
as opposed to deleting objects which don't meet the filter criteria
immediately is moving in a safer direction.
Thanks,
Rob :)
Thanks,
Taylor

[1]: Optionally "name_s_", if we passed the `--max-pack-size` option to
`git pack-objects`, which we can trigger via a `git repack` option of
the same name.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help