Thread (7 messages) 7 messages, 3 authors, 2022-02-26

Re: [PATCH v2 0/4] [RFC] repack: add --filter=

From: John Cai <hidden>
Date: 2022-02-26 16:01:52

Thank you for everyone's feedback. Really appreciate the collaboration!

On 23 Feb 2022, at 14:31, Junio C Hamano wrote:
Christian Couder [off-list ref] writes:
quoted
quoted
For what it's worth, I am fine having a mode of repack which allows us
to remove objects that we know are stored by a promisor remote. But this
series doesn't do that, so users could easily run `git repack -d
--filter=...` and find that they have irrecoverably corrupted their
repository.
In some cases we just know the objects we are removing are stored by a
promisor remote or are replicated on different physical machines or
both, so you should be fine with this.
So, we need to decide if an object we have that is outside the
narrowed filter definition was (and still is, but let's keep the
assumption the whole lazy clone mechanism makes: promisor remotes
will never shed objects that they once served) available at the
promisor remote, but I suspect we have too little information to
reliably do so.  It is OK to assume that objects in existing packs
taken from the promisor remotes and everything reachable from them
(but missing from our object store) will be available to us from
there.  But if we see an object that is outside of _new_ filter spec
(e.g. you fetched with "max 100MB", now you are refiltering with
"max 50MB", narrowing the spec, and you need to decide for an object
that weigh 70MB), can we tell if that can be retrieved from the
promisor or is it unique to our repository until we push it out?  I
am not sure.  For that matter, do we even have a way to compare if
the new filter spec is a subset, a superset, or neither, of the
original filter spec?
let me try to summarize (perhaps over simplify) the main concern folks have
with this feature, so please correct me if I'm wrong!

As a user, if I apply a filter that ends up deleting objects that it turns
out do not exist anywhere else, then I have irrecoverably corrupted my
repository.

Before git allows me to delete objects from my repository, it should be pretty
certain that I have path to recover those objects if I need to.

Is that correct? It seems to me that, put another way, we don't want to give
users too much rope to hang themselves.

I can see why we would want to do this. In this case, there have been a couple
of alternative ideas proposed throughout this thread that I think are viable and
I wanted to get folks thoughts.

1. split pack file - (Rob gave this idea and Taylor provided some more detail on
   how using pack-objects would make it fairly straightforward to implement)

when a user wants to apply a filter that removes objects from their repository,
split the packfile into one containing objects that are filtered out, and
another packfile with objects that remain.

pros: simple to implement
cons: does not address the question "how sure am I that the objects I want to
filter out of my repository exist on a promsior remote?"

2. check the promisor remotes to see if they contain the objects that are about
   to get deleted. Only delete objects that we find on promisor remotes.

pros: provides assurance that I have access to objects I am about to delete from
a promsior remote.
cons: more complex to implement. [*]

Out of these two, I like 2 more for the aforementioned pros.

* I am beginning to look into how fetches work and am still pretty new to the
codebase so I don't know if this is even feasible, but I was thinking perhaps
we could write a function that fetches with a --filter and create a promisor
packfile containing promisor objects (this operaiton would have to somehow
ignore the presence of the actual objects in the repository). Then, we would
have a record of objects we have access to. Then, repack --filter can remove
only the objects contained in this promisor packfile.
quoted
If you are not fine with this because sometimes a user might use it
without knowing, then why are you ok with commands deleting refs not
checking that there isn't a regular repack removing dangling objects?
Sorry, I do not follow this argument.  Your user may do "branch -D"
because the branch deleted is no longer needed, which may mean that
objects only reachable from the deleted branch are no longer needed.
I do not see what repack has anything to do with that.  As long as
the filter spec does not change (in other words, before this series
is applied), the repack that discards objects that are known to be
reachable from objects in packs retrieved from promisor remote, the
objects that are no longer reachable may be removed and that will
not lose objects that we do not know to be retrievable from there
(which is different from objects that we know are unretrievable).
But with filter spec changing after the fact, I am not sure if that
is safe.  IOW, "commands deleting refs" may have been OK without
this series, but this series may be what makes it not OK, no?

Puzzled.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help