Re: [PATCH 00/10] RFC Partial Clone and Fetch

From: Jeff Hostetler <hidden>
Date: 2017-05-03 16:38:40


On 3/8/2017 1:50 PM, git@jeffhostetler.com wrote:

From: Jeff Hostetler <redacted>


[RFC] Partial Clone and Fetch
=============================
[...]
E. Unresolved Thoughts
======================

*TODO* The server should optionally return (in a side-band?) a list
of the blobs that it omitted from the packfile (and possibly the sizes
or sha1_object_info() data for them) during the fetch-pack/upload-pack
operation.  This would allow the client to distinguish between
invalid SHAs and missing ones.  Size information would allow the
client to
maybe choose between various servers.

Since I first posted this, Jonathan Tan has started a related
discussion on missing blob support.
https://public-inbox.org/git/CAGf8dgK05+f4uX-8+iMFvQd0n2JP6YxJ18ag8uDaEH6qc6SgVQ@mail.gmail.com/T/

I want to respond to both of these threads here.
-------------------------------------------------

Missing-Blob Support
====================

Let me offer up an alternative idea for representing
missing blobs.  This differs from both of our previous
proposals.  (I don't have any code for this new proposal,
I just want to think out loud a bit and see if this is a
direction worth pursuing -- or a complete non-starter.)

Both proposals talk about detecting and adapting to a missing
blob and ways to recover -- when we fail to find a blob.
Comments on the thread asked about:
() being able to detect missing blobs vs corrupt repos
() being unable to detect duplicate blobs
() expense of blob search.

Suppose we store "positive" information about missing blobs?
This would let us know that a blob is intentionally missing
and possibly some meta-data about it.


1. Suppose we update the .pack file format slightly.
    () We use the currently unused value 5 in "enum object_type"
       to mean a "missing-blob".
    () We update git-pack-objects as I did in my RFC, but have it
       create type 5 entries for the blobs that are omitted,
       rather than nothing.
    () Hopefully, the same logic that currently keeps pack-objects
       from sending unnecessary blobs on subsequent fetches can
       also be used to keep it from sending unnecessary missing-blob
       entries.
    () The type 5 missing-blob entry would contain the SHA-1 of the
       blob and some meta-data to be explained later.

2. Make a similar change in the .idx format and git-index-pack
    to include them there.  Then blob lookup operations could
    definitively determine that a blob exists and is just not
    present locally.

3. With this, packfile-based blob-lookup operations can get a
    "missing-blob" result.
    () It should be possible to short-cut searching in other
       packfiles (because we don't have to assume that the blob
       was just misplaced in another packfile).
    () Lookup can still look for the corresponding loose blob
       (in case a previous lookup already "faulted it in").

4. We can then think about dynamically fetching it.
    () Several techniques for this are currently being
       discussed on the mailing list in other threads,
       so I won't go into this here.
    () There has also been debate about whether this should
       yield a loose blob or a new packfile.  I think both
       forms have merit; which one fits depends on whether we
       are limited to asking for a single blob or can make a
       batch request.
    () A dynamically-fetched loose blob is placed in the normal
       loose blob directory hierarchy so that subsequent
       lookups can find it as mentioned above.
    () A dynamically-fetched packfile (with one or more blobs)
       is written to the ODB and then the lookup operation
       completes.
       {} I want to isolate these packfiles from the main
          packfiles, so that they behave like a second-stage
          lookup and don't affect the caching/LRU nature of
          the existing first-stage packfile lookup.
       {} I also don't want the ambiguity of having 2 primary
          packfiles with a blob marked as missing in 1 and
          present in the other.

5. git-repack should be updated to "do the right thing" and
    squash missing-blob entries.

6. And so on.
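
To make steps 2 through 4 a bit more concrete, here is a rough
sketch of the lookup order I have in mind.  This is not git code:
the .idx files are modeled as plain dicts from SHA-1 to object
type, and OBJ_MISSING_BLOB, Lookup, and lookup_blob() are names
made up for illustration.  It assumes that no blob is marked
missing in one primary packfile and present in another.

```python
from enum import Enum

OBJ_BLOB = 3
OBJ_MISSING_BLOB = 5  # hypothetical: the proposed "missing-blob" type


class Lookup(Enum):
    PRESENT = "present"  # blob data is available locally
    MISSING = "missing"  # intentionally omitted; needs a dynamic fetch
    UNKNOWN = "unknown"  # not recorded anywhere: bad SHA-1 or corrupt repo


def lookup_blob(sha, primary_packs, loose, fetched_packs):
    """Classify a blob using positive missing-blob information.

    primary_packs: list of dicts (sha -> object type), one per .idx
    loose:         set of SHA-1s stored as loose objects
    fetched_packs: list of dicts for the isolated, second-stage
                   packfiles written by earlier dynamic fetches
    """
    for idx in primary_packs:
        obj_type = idx.get(sha)
        if obj_type == OBJ_BLOB:
            return Lookup.PRESENT
        if obj_type == OBJ_MISSING_BLOB:
            # Short-cut: the blob is known to be omitted, so the
            # remaining primary packfiles are not searched.  It may
            # still have been faulted in by an earlier lookup.
            if sha in loose or any(sha in p for p in fetched_packs):
                return Lookup.PRESENT
            return Lookup.MISSING
    # No packfile mentions it: fall back to ordinary loose objects.
    return Lookup.PRESENT if sha in loose else Lookup.UNKNOWN
```

Only the MISSING result would trigger a dynamic fetch.  UNKNOWN
keeps today's meaning of an invalid SHA-1 or a damaged repo, which
is the distinction asked about on the thread.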


Missing-Blob Entry Data
=======================

A missing-blob entry needs to contain the SHA-1 value of
the blob (obviously).  Other fields are nice to have, but
are not necessary.  Here are a few fields to consider.

A. The SHA-1 (20 bytes)

B. The raw size of the blob (5? bytes).
    () This is the cleaned size of the file as stored.  The
       server does not (and should not) have any knowledge
       of the smudging that may happen.
    () This may be useful if whatever dynamic-fetch-hook
       wants to customize its behavior, such as individually
       fetching large blobs and batch fetching smaller ones
       from the same server.
    () GVFS found it necessary to create a custom server
       end-point to get blob size data so that "ls -l"
       could show file sizes for non-present virtualized
       files.
    () 5 bytes (uint:40) should be more than enough for this.

C. A server "hint" (20 bytes)
    () Instructions to help the client fetch the blob.
    () If I have multiple remotes configured, a missing-blob
       should be fetched from the same server that created
       the missing-blob entry (since it may be the only
       one that has it).
    () If a blob is very large (and was omitted for this
       reason), the server may want to redirect the client
       to a geographically closer CDN.
    () This is the SHA-1 of a file in the repository that holds
       a hook (or a set of parameters to be used by a hook).
       {} This is a bit of *hand-wave* right now, but the
          idea is that you can use the information here to
          individually fetch a blob or batch fetch a set
          of blobs that have the same hint.
       {} Yes, there are security concerns here, so perhaps
          the hint file should just contain parameters for
          a stock git-fetch-pack or git-fetch-blob-pack or
          curl command (or wrapper script) that "does the
          right thing".
       {} I thought this would be more compact than listing
          detailed fetch data per-blob.  And we don't have
          to define yet another syntax.  For example, we can
          let the SHA-1 point to an administrator-configured
          shell script and be done.
    () We assume that the file this SHA-1 names is present
       locally (not missing).  It might be a pinned, special
       ".git*" file (that we never omit) in HEAD.  Or it might
       be in a branch that all clients are assumed to have.
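
For a sense of scale, fields A, B, and C together come to
20 + 5 + 20 = 45 bytes per missing blob, and a 5-byte size covers
blobs up to 2^40 - 1 bytes (just under 1 TiB).  Below is a rough
sketch of packing and unpacking such an entry, plus grouping
entries by hint for batch fetching.  The field order, the
big-endian size, and all function names are assumptions made for
illustration; none of this is an agreed format.

```python
from collections import defaultdict

SHA1_LEN = 20
SIZE_LEN = 5  # uint:40, as suggested in field B
ENTRY_LEN = SHA1_LEN + SIZE_LEN + SHA1_LEN  # 45 bytes


def encode_entry(blob_sha, raw_size, hint_sha):
    """Pack (SHA-1, raw size, hint SHA-1) into a 45-byte record."""
    if len(blob_sha) != SHA1_LEN or len(hint_sha) != SHA1_LEN:
        raise ValueError("SHA-1 values must be 20 raw bytes")
    if not 0 <= raw_size < 1 << (8 * SIZE_LEN):
        raise ValueError("size does not fit in 5 bytes")
    # Assumption: big-endian size, following the .idx convention.
    return blob_sha + raw_size.to_bytes(SIZE_LEN, "big") + hint_sha


def decode_entry(data):
    """Inverse of encode_entry(); returns (blob_sha, raw_size, hint_sha)."""
    if len(data) != ENTRY_LEN:
        raise ValueError("entry must be exactly 45 bytes")
    blob_sha = data[:SHA1_LEN]
    raw_size = int.from_bytes(data[SHA1_LEN:SHA1_LEN + SIZE_LEN], "big")
    hint_sha = data[SHA1_LEN + SIZE_LEN:]
    return blob_sha, raw_size, hint_sha


def group_by_hint(entries):
    """Map each hint SHA-1 to the blobs that share it, so that a
    fetch hook could request them from the same place in one batch."""
    batches = defaultdict(list)
    for blob_sha, _raw_size, hint_sha in entries:
        batches[hint_sha].append(blob_sha)
    return dict(batches)
```

A fetch hook could also look at the decoded size to decide between
fetching a large blob individually and batching the small ones, as
described under field B.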


Concluding Thoughts
===================

Combining the ideas here with the partial clone/fetch
parameters and the various blob back-filling proposals
gives us the ability to create and work with sparse
repos.
() Filtering can be based upon blob size; this could be
    seen as an alternative solution to LFS for repos with
    large objects.
() Filtering could also be based upon pathnames (such as
    a sparse-checkout filter) and greatly help performance
    on very large repos where developers only work with
    small areas of the tree.
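
As a rough illustration of the two kinds of filter, here is a
sketch of the decision the server might make per blob.  BlobRef
and should_omit() are made-up names, and matching on directory
prefixes is a simplification; real sparse-checkout files use
pattern syntax, which this does not attempt to model.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BlobRef:
    path: str  # pathname of the blob within the tree
    size: int  # raw (cleaned) size in bytes


def should_omit(blob: BlobRef,
                max_size: Optional[int] = None,
                sparse_prefixes: Optional[Iterable[str]] = None) -> bool:
    """Return True if the blob would be left out of the packfile."""
    # Size filter: omit anything larger than the limit.
    if max_size is not None and blob.size > max_size:
        return True
    # Path filter: omit anything outside the requested directories.
    if sparse_prefixes is not None:
        wanted = any(
            blob.path == prefix
            or blob.path.startswith(prefix.rstrip("/") + "/")
            for prefix in sparse_prefixes
        )
        if not wanted:
            return True
    return False
```

With the missing-blob idea above, each blob for which this returns
True would get a type 5 entry rather than being dropped silently.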

Thanks
Jeff
