Re: [PATCH 03/15] cache: add an algo member to struct object_id

From: Ævar Arnfjörð Bjarmason <hidden>
Date: 2021-04-15 08:47:05

On Wed, Apr 14 2021, brian m. carlson wrote:

On 2021-04-13 at 12:12:21, Derrick Stolee wrote:

quoted

On 4/10/2021 11:21 AM, brian m. carlson wrote:

quoted

Now that we're working with multiple hash algorithms in the same repo,
it's best if we label each object ID with its algorithm so we can
determine how to format a given object ID. Add a member called algo to
struct object_id.

Signed-off-by: brian m. carlson <redacted>
---
 hash.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hash.h b/hash.h
index 3fb0c3d400..dafdcb3335 100644
--- a/hash.h
+++ b/hash.h

@@ -181,6 +181,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
 
 struct object_id {
 	unsigned char hash[GIT_MAX_RAWSZ];
+	int algo;
 };

What are the performance implications of adding this single bit
(that actually costs us 4 to 8 bytes, based on alignment)? Later
in the series you add longer hash comparisons, too. These seem
like they will affect performance for existing SHA-1 repos, and
it would be nice to know how much we are paying for this support.

I will do some performance numbers on these patches, but it will likely
be the weekend before I can get to it.  I think this will add 4 bytes on
most platforms, since int is typically 32 bits, and the alignment
requirement would be for the most strictly aligned member, which is the
int, so a 4-byte alignment.  I don't think the alignment requirements
are especially onerous here.

I think if you're doing such a perf test one where we have SHA-1 mode
with SHA-1 length OID v.s. SHA-256 (the current behavior) would be
interesting as well.

It seems like good SHA1s to test that are 5a0cc8aca79 and
13eeedb5d17. Running:

    GIT_PERF_REPEAT_COUNT=10 \
    GIT_SKIP_TESTS="p0001.[3-9] p1450.2" \
    GIT_TEST_OPTS= GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3' \
    ./run 5a0cc8aca79 13eeedb5d17 -- p0001-rev-list.sh p1450-fsck.sh

(I added a fsck --connectivity-only test)

Gives us:

    Test                               5a0cc8aca79         13eeedb5d17            
    ------------------------------------------------------------------------------
    0001.1: rev-list --all             2.46(2.22+0.14)     2.48(2.25+0.14) +0.8%  
    0001.2: rev-list --all --objects   10.79(10.22+0.14)   10.92(10.24+0.20) +1.2%
    1450.1: fsck --connectivity-only   16.61(15.42+0.34)   16.94(15.60+0.32) +2.0%

So at least on my box none of those are outside of the confidence
intervals. This was against my copy of git.git. Perhaps it matters more
under memory pressure.

quoted

I assume that we already checked what happened when GIT_MAX_RAWSZ
increased, but that seemed worth the cost so we could have SHA-256
at all. I find the justification for this interoperability mode to
be less significant, and potentially adding too much of a tax onto
both SHA-1 repos that will never upgrade, and SHA-256 repos that
upgrade all at once (or start as SHA-256).

The entire goal of the interoperability is to let people seamlessly and
transparently move from SHA-1 to SHA-256.  Currently, the only way
people can move a SHA-1 repository to a SHA-256 repository is with
fast-import and fast-export, which loses all digital signatures and tags
to blobs.  This also requires a flag day.

SHA-1 can now be attacked for USD 45,000.  That means it is within the
budget of a dedicated professional and virtually all medium or large
corporations, including even most municipal governments, to create a
SHA-1 collision.

Is that for vanilla SHA-1, or SHA-1DC?

Unfortunately, the way we deal with this is to die, so
as soon as this happens, the repository fails closed.  While an attacker
cannot make use of the collisions to spread malicious objects, because
of the way Git works, they can effectively DoS a repository, which is in
itself a security issue.  Fixing this requires major surgery.

Can you elaborate on this? I believe that we die on any known collision
via the SHA1-DC code, and even if we didn't have that we'd detect the
collision (see [1] for the code) and die while the object is in the
temporary quarantine.

I believe such a request is cheaper to serve than one that doesn't
upload colliding objects, we die earlier (less CPU etc.), and don't add
objects to the store.

So what's the DoS vector?

We need the interoperability code to let people transition their
repositories away from SHA-1, even if it has some performance impact,
because without that most SHA-1 repositories will never transition.
That's what's outlined in the transition plan, and why that approach was
proposed, even though it would be nicer to avoid having to implement it
at all.

There's no question that we need working interop.

The question at least in my mind is why that interop is happening by
annotating every object held in memory with whether they're SHA-1 or
SHA-256, as opposed to having some translation layer earlier in the
chain.

Not all our file or in-memory structures are are like that, e.g. the
commit graph has a header saying "this is a bunch of SHA-1/256", and the
objects that follow are padded to that actual hash size, not the max
size we know about.

My understanding of the transition plan was that we'd e.g. have a
SHA-1<->SHA-256 mapping of objects, which we'd say use to push/pull.

But that by the time I ran say a "git commit" none of that machinery
needed to care that I was interoping with a SHA-1 repo on the other end.
It would just happily have all SHA-256 objects, create new ones, and
only by the time I needed to push them would something kick in to
re-hash them.

I *think* the anwer is just "it's easier on the C-level" and "the
wastage doesn't seem to matter much", which is fair enough.

*Goes and digs in the ML archive*:

    https://lore.kernel.org/git/1399147942-165308-1-git-send-email-sandals@crustytoothpaste.net/#t (local)
    https://lore.kernel.org/git/55016A3A.6010100@alum.mit.edu/ (local)

To answer (some) of that myself:

Digging up some of the initial discussion that seems to be the case, at
that point there was a suggestion of:

    struct object_id {
        unsigned char hash_type;
        union {
            unsigned char sha1[GIT_SHA1_RAWSZ];
            unsigned char sha256[GIT_SHA256_RAWSZ];
        } hash;
    };

To which you replied:
    
    What I think might be more beneficial is to make the hash function
    specific to a particular repository, and therefore you could maintain
    something like this:
    
      struct object_id {
          unsigned char hash[GIT_MAX_RAWSZ];
      };

It wouldn't matter for the memory use to make it a union, but my reading
of the above is that the reason for the current object_id not-a-union
structure might not be valid now that there's a "hash_type" member, no?

I will endeavor to make the performance impact as small as possible, of
course, and ideally there will be none.  I am sensitive to the fact that
people do run absurdly large workloads on Git, as we both know, and I do
want to support that.

All of the above being said I do wonder if for those who worry about
hash size inflating their object store whether a more sensible end-goal
if that's an issue wouldn't be to store abbreviated hashes.

As long as you'd re-hash/inflate the size in the case of collisions
(which would be a more expensive check at the "fetch" boundary) you
could do so safely, and the result would be less memory consumption.

But maybe it's just a non-issue :)

1. https://lore.kernel.org/git/20181113201910.11518-1-avarab@gmail.com/ (local)

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help