tb/merge-tree-write-pack, was Re: What's cooking in git.git (Nov 2023, #04; Thu, 9)
From: Johannes Schindelin <hidden>
Date: 2023-11-15 12:57:18
Hi Junio,

On Fri, 10 Nov 2023, Junio C Hamano wrote:

> Taylor Blau [off-list ref] writes:
>
> > On Thu, Nov 09, 2023 at 02:40:28AM +0900, Junio C Hamano wrote:
> >
> > > * tb/merge-tree-write-pack (2023-10-23) 5 commits...
> >
> > This series received a couple of LGTMs from you and Patrick:
> >
> >   - https://lore.kernel.org/git/xmqqo7go7w63.fsf@gitster.g/#t
> >   - https://lore.kernel.org/git/ZTjKmcV5c_EFuoGo@tanuki/
>
> Yup, I am aware of them.
>
> > Johannes had posted some comments[1] about instead using a temporary
> > object store where objects are written as loose that would extend to
> > git replay....
>
> I was hoping to hear from Johannes saying he agrees with the above. It
> is not strictly required, but is much nicer to have once we hear "let's
> step back a bit---are we going in the right direction?" and it has been
> responded.

When I wrote about `tmp_objdir`, there were a couple of things going on in
my mind:

- First of all, I was hesitant to write this at all because I knew that I
  lack the time to engage meaningfully in any follow-up discussion.

- To be honest, the approach to teach `merge-ort.c` anything about whether
  objects are written loosely or streamed into a pack strikes me as
  somewhat contrary to the goal of separating concerns. The merge
  machinery should not know, in my mind, how the objects are stored.

- A long-standing paradigm in Git is that pack files are not used until
  finalized. Breaking such a paradigm after being in effect for a long
  time, in my experience, is always followed by unwelcome "gifts that keep
  on giving".

- The streaming pack approach struck me as something that would only work
  properly if Git was designed with single-process operations in mind. But
  Git was originally designed around the process-proliferating Unix
  philosophy, and it is deeply ingrained in Git to this day. As such, I do
  not expect the streaming pack approach to generalize to a noteworthy
  fraction of Git operations, and I would love to focus on an approach
  that generalizes better.

- At the Git Contributor Summit, I had talked about my goals, and Elijah
  helpfully pointed out how `--remerge-diff` does it, and I wanted to
  pursue that idea further.

- The scenario I want to address (and that I assumed the patch series
  under discussion tried to address, too) is a very specific, server-side
  scenario where many `merge-tree`/`replay` runs produce _many_ loose
  objects. Quite a fraction of those are produced by processes that run
  into a SIGTERM-enforced timeout, and the `tmp_objdir` approach would
  naturally help: unneeded loose objects would not even be added to the
  primary object database but be reaped with the temporary object
  databases.

- While it may sound as if the sheer number of loose objects is the
  primary problem, an even more pressing issue I need to address is that
  competing processes that try to work on a snapshot of the loose objects
  (which does not exist, you cannot "take a snapshot", all you can do is
  to enumerate the directories sequentially) seem sometimes to process
  loose tree/commit objects that reference other objects that have been
  missed due to racy reads/writes/enumerations. The reason for this is
  that the loose objects produced by `merge-tree`/`replay` are added
  non-transactionally, and concurrent reads are prone to run into racy
  conditions where they only see a part of those objects.

- Even just using `tmp_objdir_migrate()` could help a lot by narrowing the
  window for those racy conditions.

- The number of inodes has been a concern, yes, but not such a pressing
  one that I could afford spending any further thought on the idea to
  reduce them. In any case, a working theory is that this concern would
  already be helped by avoiding the loose objects produced by failing
  merges/rebases (whose results are not used) or by merges/rebases running
  into a timeout.

- Streaming packs, if I understand correctly, do not do deltas. That in
  and of itself can cause file size issues, and light-weight maintenance
  may not even bother to try finding deltas, thereby causing follow-on
  problems.

With all this in mind, I do not think that I can afford to spend brain
cycles on the streaming-pack approach. I do not intend to discourage
anybody from working on that approach, yet I won't encourage anyone,
either.

Ciao,
Johannes
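
To illustrate the `tmp_objdir` idea discussed above, here is a minimal Python sketch of the quarantine-then-migrate pattern. It is not Git's implementation: `write_loose`, `migrate`, and `run_quarantined` are hypothetical names, and the object format is simplified (no zlib compression, no object header). It only shows why objects from a failed or timed-out run would never reach the primary object database.

```python
import hashlib
import os
import shutil
import tempfile


def write_loose(objdir, data):
    """Write `data` as a simplified loose object (hypothetical format:
    raw bytes, named by SHA-1, with a two-character fan-out directory)."""
    oid = hashlib.sha1(data).hexdigest()
    path = os.path.join(objdir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return oid


def migrate(tmpdir, primary):
    """Move every object from the temporary store into the primary one,
    then remove the temporary store."""
    for fanout in sorted(os.listdir(tmpdir)):
        src_dir = os.path.join(tmpdir, fanout)
        dst_dir = os.path.join(primary, fanout)
        os.makedirs(dst_dir, exist_ok=True)
        for name in sorted(os.listdir(src_dir)):
            # Same filesystem, so each move is a rename.
            os.replace(os.path.join(src_dir, name),
                       os.path.join(dst_dir, name))
    shutil.rmtree(tmpdir)


def run_quarantined(primary, produce):
    """Run `produce(tmpdir)`, which writes objects into a temporary store
    and returns their ids. Migrate on success, discard on failure."""
    tmpdir = tempfile.mkdtemp(prefix="tmp_objdir-", dir=primary)
    try:
        oids = produce(tmpdir)
    except BaseException:
        # Assumption: a process killed by SIGTERM would need a signal
        # handler to reach this point; the sketch only covers exceptions.
        shutil.rmtree(tmpdir, ignore_errors=True)
        raise
    migrate(tmpdir, primary)
    return oids
```

A caller would pass a function that performs the merge and writes its results via `write_loose(tmpdir, ...)`. Because `migrate` still moves files one at a time, this narrows the window in which a concurrent reader sees only part of the new objects rather than closing it.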