Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
From: Jeff King <hidden>
Date: 2023-11-14 22:04:37
On Mon, Nov 13, 2023 at 05:34:42PM -0500, Taylor Blau wrote:
quoted
It might not be too hard to just let in-process callers access the objects we've written. A quick and dirty patch is below, which works with the test you suggested (the test still fails because it finds a conflict, but it gets past the "woah, I can't find that sha1" part).That's a very slick idea, and I think that this series has some legs to stand on now as a result.
I'm glad people seem to like it. ;)
I was mostly throwing it out there as an illustration, and I don't plan
on polishing it up further. But in case somebody else wants to work on
it, here are random extra thoughts on the topic:
- rather than teach packfile.c about bulk checkin, I think it might be
cleaner to let packed_git structs have a function pointer for
accessing the index (and if NULL, naturally we'd open the .idx file
in the usual way). And then bulk checkin could just register the
"fake" packed_git
- the flip side, though, is: would it be weird for other parts of the
code to ever see the fake bulk packed_git in the list? It doesn't
represent a real packfile the way the other ones do. So maybe it is
better to keep it separate
- I suspect this scheme may violate some bounds-checking of the
packfiles. For example, we haven't written the hashfile trailer yet in
the still-open packfile. But I think use_pack() assumes we always
have an extra the_hash_algo->rawsz bytes of padding at the end. I'm
not sure if that would ever cause us to accidentally read past the
end of the file.
- as you mentioned, an oidmap would be an obvious replacement for the
existing unsorted list
- the existing already_written() function faces the same problem. I
don't think anybody cared because we'd usually only bulk checkin a
few huge objects. But with --write-pack, we might write a lot of
such objects, and the linear scan in already_written() makes it
accidentally quadratic.
- speaking of already_written(): it calls repo_has_object_file() to
see if we can elide the write. It probably should be using
freshen_*_object() in the usual way. Again, this is an existing bug
but I suspect nobody noticed because the bulk checkin code path
hardly ever kicks in.
-Peff