Thread (23 messages) 23 messages, 5 authors, 2017-05-04

Re: [PATCH 00/10] RFC Partial Clone and Fetch

From: Jonathan Nieder <hidden>
Date: 2017-05-04 18:42:29

Hi again,

Jeff Hostetler wrote:
In my original RFC there were comments/complaints that with
missing blobs we lose the ability to detect corruptions.  My
proposed changes to index-pack and rev-list (and suggestions
for other commands like fsck) just disabled those errors.
Personally, I'm OK with that, but I understand that others
would like to save the ability to distinguish between missing
and corrupted.
I'm also okay with it.  In a partial clone, in the same way as a
missing ref represents a different valid state and thus passes fsck
regardless of how it happened, a missing blob is a valid state and it
is sensible for it to pass fsck.

A person might object that previously a repository that passed "git
fsck" was a repository where "git fast-export --all" would succeed,
and if I omit a blob that is not present on the remote server then
that invariant is gone.  But that problem exists even if we have a
list of missing blobs.  The server could rewind history and garbage
collect, causing attempts on the client to fetch a previously
advertised missing blob to fail.  Or the server can disappear
completely, or it can lose all its data and have to be restored from
an older backup that is missing newer blobs.
Right, only the .pack is sent over the wire.  And that's why I
suggest listing the missing SHAs in it.  I think we need the server
to send a list of them -- whether in individual per-file type-5
records as I suggested, or a single record containing a list of
SHAs+data (which I think I prefer in hindsight).
A list of SHAs+data sounds sensible as a way of conveying additional
per-blob information (such as size).
WRT being able to discover the missing blobs, is that true in
all cases?  I don't think it is for thin-packs -- where the server
only sends stuff you don't (supposedly) already have, right ?
Generate the list of blobs referenced by trees in the pack, when you
are inflating them in git index-pack.  Omit any objects that you
already know about.  The remainder is the set of missing blobs.

One thing this doesn't tell you is if the same missing blob is
available from multiple remotes.  It associates each blob with a
single remote, the first one it was discovered from.
If instead, we have pack-object indicate that it *would have*
sent this blob normally, we don't change any of the semantics
of how pack files are assembled.  This gives the client a
definitive list of what's missing.
If there is additional information the server wants to convey about
the missing blobs, then this makes sense to me --- it has to send it
somewhere, and a separate list outside the pack seems like a good
place to put that.

When there is no additional information beyond "this is a blob I am
omitting", there is nothing the wire protocol needs to convey.  But
you've convinced me that that's usually moot because the blob size
is valuable information.

[...]
The more I think about it, I'd like to NOT put the list in the .idx
file.  If we put it in a separate peer file next to the *.{pack,idx}
we could decorate it with the name of the remote used in the fetch
or clone.
I have no strong opinions about this in either direction.

Since it only affects the local repository format and doesn't affect
the protocol, we can experiment without too much fuss. :)

[...]
I've always wondered if repack, fetch, and friends should build
a meta-idx that merges all of the current .idx files, but that
is a different conversation....
Yes, we've been thinking about a one-index-for-many-packs facility
to deal with the proliferation of packfiles with only one or a few
large objects without having to waste I/O copying them into a
concatenated pack file.

Another thing we're looking into is incorporating something like
Martin Fick's "git exproll" (
http://public-inbox.org/git/1375756727-1275-1-git-send-email-artagnon@gmail.com/)
into "git gc --auto" to more aggressively keep the number of packs
down.
On 5/3/2017 2:27 PM, Jonathan Nieder wrote:
quoted
If we were starting over, would this belong in the tree object?
(Part of the reason I ask is that we have an opportunity to sort
of start over as part of the transition away from SHA-1.)
Yes, putting the size in the tree would be nice.  That does
add 5-8 bytes to every entry in every tree (on top of the
larger hash), but it would be useful.

If we're going there, we might just define the new hash
as the concatenation of the SHA* and the length of the data
hashed.  So instead of a 32-byte SHA256, we'd have a (32 + 8)
byte value.  (Or maybe a (32 + 5) if we want to squeeze it.)
Thanks --- that feedback helps.

It doesn't stop us from having to figure something else out in the
short term, of course.

[...]
quoted
I am worried about the implications of storing this kind of policy
information in the pack file.  There may be useful information along
these lines for a server to advertise, but I don't think it belongs in
the pack file.  Git protocol relies on it being cheap to stream pack
files as is.
Perhaps.  I'm not sure it would be that expensive -- I'm thinking
about it being a server constant rather than a per-user/per-fetch
field.  But maybe we don't need it.  Assuming we can correctly
associate a missing blob with the server/remote that omitted it.
Right.

I think we're heading toward a consensus:

- that the server should (at least if requested?) inform the client of
  the blobs it is omitting and their sizes

- that this would go in the same response as the packfile, but
  out-of-line from the pack

- that it (at least optionally?) gets stored *somewhere* locally
  associated with the pack --- either through an extension to the idx
  file format or in a separate file alongside the pack file and idx
  file.  This doesn't affect the protocol so it's not expensive to
  experiment

Thanks for the thoughtful replies.

Sincerely,
Jonathan
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help