Re: [PATCH] [RFC] list-objects-filter: introduce new filter sparse:buffer=<spec>

From: ZheNing Hu <hidden>
Date: 2022-08-26 05:11:37


Derrick Stolee [off-list ref] wrote on Tue, Aug 9, 2022 at 21:37:

On 8/8/2022 12:15 PM, Junio C Hamano wrote:

"ZheNing Hu via GitGitGadget" [off-list ref] writes:

From: ZheNing Hu <redacted>

Although we already have `--filter=sparse:oid=<oid>`, which
can be used to clone a repository with only the objects that match
the filter rules in the file corresponding to <oid> on the git
server, it can only read filter rules that have already been recorded
on the git server.

Was the reason why we have "we limit to an object we already have"
restriction because we didn't want to blindly use a piece of
uncontrolled arbitrary end-user data here?  Just wondering.

One of the ideas here was to limit the opportunity of sending an
arbitrary set of data over the Git protocol and avoid exactly the
scenario you mention.

I see that sparse-checkout uses a "cone" mode to limit the set of data that
is sent, which can improve performance. Could we use this mode here? From a
brief look, "cone" mode seems to ensure that each filter rule we add is a
directory and contains none of the special characters '!', '?', '*', '[',
']'. But if we send the filter rule to the git server, the server cannot
check whether the rule names a directory, because the rule applies to paths
across multiple commits. E.g. in 9e6f67 "test" can be a directory, but in
e5e154e "test" can be a file... I don't know how to solve this problem...
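To make the problem concrete, here is a minimal Python sketch of the two checks described above. It is not git's cone-mode implementation: the function names and the per-commit snapshot table are hypothetical, and the commit labels are just the abbreviated IDs used as examples in this message.

```python
# Characters this message lists as disallowed in a cone-mode rule.
SPECIAL_CHARS = set("!?*[]")

def has_only_plain_characters(rule):
    """Return True if the rule contains none of the special characters."""
    return not (set(rule) & SPECIAL_CHARS)

# Hypothetical per-commit snapshots: path -> "tree" (directory) or "blob" (file).
SNAPSHOTS = {
    "9e6f67": {"test": "tree", "test/a.c": "blob"},
    "e5e154e": {"test": "blob"},
}

def is_directory_in_every_commit(path):
    """Return True only if the path is a directory in every snapshot that has it."""
    kinds = {snap[path] for snap in SNAPSHOTS.values() if path in snap}
    return kinds == {"tree"}
```

The character check needs only the rule text, so a server could run it on any incoming rule. The directory check needs the history: here "test" passes the first check but fails the second, because it is a directory in one commit and a file in the other.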

Another was that it is incredibly expensive to compute the set of
reachable objects within an arbitrary sparse-checkout definition,
since it requires walking trees (bitmaps do not help here). This
is why (to my knowledge) no Git hosting service currently supports
this mechanism at scale. At minimum, using the stored OID would
allow the host to keep track of these pre-defined sets and do some
precomputing of reachable data using bitmaps to keep clones and
fetches reasonable at all.
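As a rough illustration of why a path-scoped filter forces a tree walk, here is a toy Python sketch. The object store, the object IDs, and both function names are made up for illustration; real git trees, packfiles, and bitmaps are far more involved.

```python
# Toy object store: a tree maps entry names to object IDs. Any ID that is
# not a key in TREES is treated as a blob. All IDs are hypothetical.
TREES = {
    "root1": {"src": "src1", "docs": "docs1"},
    "root2": {"src": "src2", "docs": "docs1"},
    "src1": {"main.c": "blob_a"},
    "src2": {"main.c": "blob_b"},
    "docs1": {"README": "blob_c"},
}

def objects_under(root, wanted_dirs):
    """Collect objects reachable from one root tree, descending only into
    the wanted top-level directories."""
    found = {root}
    stack = [oid for name, oid in TREES[root].items() if name in wanted_dirs]
    while stack:
        oid = stack.pop()
        if oid in found:
            continue
        found.add(oid)
        # Blobs have no entry in TREES, so nothing more is pushed for them.
        stack.extend(TREES.get(oid, {}).values())
    return found

def objects_for_filter(roots, wanted_dirs):
    """Union of the per-commit walks; every root tree has to be visited."""
    result = set()
    for root in roots:
        result |= objects_under(root, wanted_dirs)
    return result
```

In this sketch, `objects_for_filter(["root1", "root2"], {"src"})` has to open the root tree of each commit to learn which entries fall inside the filter. A precomputed set of everything reachable from a commit cannot answer that question, because it carries no path information. With a stored OID the set of filters is known in advance, so a host could precompute the result per filter.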

The other side of the issue is that we do not have a good solution
for resolving how to change this filter in the future, in case the
user wants to expand their sparse-checkout definition and update
their partial clone filter.

There used to be a significant issue where a 'git checkout'
would fault in a lot of missing trees because the index needed to
reference the files outside of the sparse-checkout definition. Now
that the sparse index exists, this is less of an impediment, but
it can still cause some pain.

At this moment, I think path-scoped filters have a lot of problems
that need solving before they can be used effectively in the wild.
I would prefer that we solve those problems before making the
feature more complicated. That's a tall ask, since these problems
do not have simple solutions.

I think that if we can make such path-scoped filters work,
we could apply sparse-checkout together with them... Maybe one day users
could run:

git clone --sparse --filter="sparse:buffer=dir" xxx.git

to get a repo with the sparse-checkout already applied...
Needless to say, this is very tempting.

Thanks,
-Stolee

Thanks,
ZheNing Hu
