Re: [PATCH v2 0/4] [RFC] repack: add --filter=
From: John Cai <hidden>
Date: 2022-02-26 16:01:52
Thank you for everyone's feedback. Really appreciate the collaboration! On 23 Feb 2022, at 14:31, Junio C Hamano wrote:
Christian Couder [off-list ref] writes:quoted
quoted
For what it's worth, I am fine having a mode of repack which allows us to remove objects that we know are stored by a promisor remote. But this series doesn't do that, so users could easily run `git repack -d --filter=...` and find that they have irrecoverably corrupted their repository.In some cases we just know the objects we are removing are stored by a promisor remote or are replicated on different physical machines or both, so you should be fine with this.So, we need to decide if an object we have that is outside the narrowed filter definition was (and still is, but let's keep the assumption the whole lazy clone mechanism makes: promisor remotes will never shed objects that they once served) available at the promisor remote, but I suspect we have too little information to reliably do so. It is OK to assume that objects in existing packs taken from the promisor remotes and everything reachable from them (but missing from our object store) will be available to us from there. But if we see an object that is outside of _new_ filter spec (e.g. you fetched with "max 100MB", now you are refiltering with "max 50MB", narrowing the spec, and you need to decide for an object that weigh 70MB), can we tell if that can be retrieved from the promisor or is it unique to our repository until we push it out? I am not sure. For that matter, do we even have a way to compare if the new filter spec is a subset, a superset, or neither, of the original filter spec?
let me try to summarize (perhaps over simplify) the main concern folks have with this feature, so please correct me if I'm wrong! As a user, if I apply a filter that ends up deleting objects that it turns out do not exist anywhere else, then I have irrecoverably corrupted my repository. Before git allows me to delete objects from my repository, it should be pretty certain that I have path to recover those objects if I need to. Is that correct? It seems to me that, put another way, we don't want to give users too much rope to hang themselves. I can see why we would want to do this. In this case, there have been a couple of alternative ideas proposed throughout this thread that I think are viable and I wanted to get folks thoughts. 1. split pack file - (Rob gave this idea and Taylor provided some more detail on how using pack-objects would make it fairly straightforward to implement) when a user wants to apply a filter that removes objects from their repository, split the packfile into one containing objects that are filtered out, and another packfile with objects that remain. pros: simple to implement cons: does not address the question "how sure am I that the objects I want to filter out of my repository exist on a promsior remote?" 2. check the promisor remotes to see if they contain the objects that are about to get deleted. Only delete objects that we find on promisor remotes. pros: provides assurance that I have access to objects I am about to delete from a promsior remote. cons: more complex to implement. [*] Out of these two, I like 2 more for the aforementioned pros. * I am beginning to look into how fetches work and am still pretty new to the codebase so I don't know if this is even feasible, but I was thinking perhaps we could write a function that fetches with a --filter and create a promisor packfile containing promisor objects (this operaiton would have to somehow ignore the presence of the actual objects in the repository). Then, we would have a record of objects we have access to. Then, repack --filter can remove only the objects contained in this promisor packfile.
quoted
If you are not fine with this because sometimes a user might use it without knowing, then why are you ok with commands deleting refs not checking that there isn't a regular repack removing dangling objects?Sorry, I do not follow this argument. Your user may do "branch -D" because the branch deleted is no longer needed, which may mean that objects only reachable from the deleted branch are no longer needed. I do not see what repack has anything to do with that. As long as the filter spec does not change (in other words, before this series is applied), the repack that discards objects that are known to be reachable from objects in packs retrieved from promisor remote, the objects that are no longer reachable may be removed and that will not lose objects that we do not know to be retrievable from there (which is different from objects that we know are unretrievable). But with filter spec changing after the fact, I am not sure if that is safe. IOW, "commands deleting refs" may have been OK without this series, but this series may be what makes it not OK, no? Puzzled.