Thread (224 messages), 7 authors, 2018-04-06

Re: Reduce pack-objects memory footprint?

From: Duy Nguyen <hidden>
Date: 2018-02-28 11:25:03

On Wed, Feb 28, 2018 at 6:11 PM, Jeff King [off-list ref] wrote:
quoted
quoted
The torvalds/linux fork network has ~23 million objects,
so it's probably 7-8 GB of book-keeping. Which is gross, but 64GB in a
server isn't uncommon these days.
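
As a rough check of the figures quoted above, the per-object cost they imply can be worked out directly. This is only arithmetic on the numbers in the message; decimal gigabytes are an assumption, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the quoted figures.
OBJECTS = 23_000_000   # "~23 million objects"
GB = 1000 ** 3         # assumption: decimal gigabytes


def bytes_per_object(total_gb, objects=OBJECTS):
    """Average book-keeping bytes per object for a given total."""
    return total_gb * GB / objects


low = bytes_per_object(7)    # lower end of "7-8 GB"
high = bytes_per_object(8)   # upper end of "7-8 GB"
print(f"roughly {low:.0f} to {high:.0f} bytes of book-keeping per object")
```

So the estimate corresponds to a few hundred bytes of in-memory state per object, which is why tracking every object becomes the dominant cost at this scale.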
I wonder if we could just do book keeping for some but not all objects,
because tracking all objects simply does not scale. Say we have a big pack of
many GBs: could we keep the bottom 80% of it untouched and register the
top 20% (mostly non-blobs, plus some blobs used as delta bases) for
repack? We copy the bottom part to the new pack byte-by-byte, then
pack-objects rebuilds the top part with objects from other sources.
Yes, though I think it would take a fair bit of surgery to do
internally. And some features (like bitmap generation) just wouldn't
work at all.

I suspect you could simulate it, though, by just packing your subset
with pack-objects (feeding it directly without using "--revs") and then
catting the resulting packfiles together with a fixed-up header.
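
A minimal sketch of what "catting packfiles together with a fixed-up header" could look like, assuming version 2 packs in a SHA-1 repository. It relies only on the documented pack layout: a 12-byte header (the signature "PACK", a version, and an object count, all big-endian) and a trailing 20-byte SHA-1 of everything before it. The function names are hypothetical, this is not the "fast pack" code mentioned in this thread, and `make_pack` only builds placeholder data for trying the function out.

```python
import hashlib
import struct

HEADER = struct.Struct(">4sII")   # signature, version, object count
TRAILER_LEN = 20                  # SHA-1 checksum of everything before it


def split_pack(data):
    """Validate one pack and return (object_count, body_bytes)."""
    sig, version, count = HEADER.unpack_from(data, 0)
    if sig != b"PACK" or version != 2:
        raise ValueError("not a version 2 packfile")
    if hashlib.sha1(data[:-TRAILER_LEN]).digest() != data[-TRAILER_LEN:]:
        raise ValueError("pack checksum mismatch")
    return count, data[HEADER.size:-TRAILER_LEN]


def cat_packs(packs):
    """Concatenate pack bodies under one header with a summed object count."""
    total = 0
    bodies = []
    for data in packs:
        count, body = split_pack(data)
        total += count
        bodies.append(body)
    out = HEADER.pack(b"PACK", 2, total) + b"".join(bodies)
    return out + hashlib.sha1(out).digest()


def make_pack(count, body):
    """Illustration helper: wrap arbitrary bytes in a pack header and trailer."""
    out = HEADER.pack(b"PACK", 2, count) + body
    return out + hashlib.sha1(out).digest()


merged = cat_packs([make_pack(2, b"aaaa"), make_pack(3, b"bb")])
```

Offset deltas store their base as a distance back within the same pack, so they stay valid when a whole body is shifted. The sketch does not remove objects that appear in more than one input, and the result still needs a new index (for example from `git index-pack`) before git can use it.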

At one point I played with a "fast pack" that would just cat packfiles
together. My goal was to make cases with 10,000 packs workable by
creating one lousy pack, and then repacking that lousy pack with a
"real" repack. In the end I abandoned it in favor of fixing the
performance problems from trying to make a real pack of 10,000 packs. :)

But I might be able to dig it up if you want to experiment in that
direction.
Naah, it's OK. I'll go in a similar direction, but I'd repack those pack
files too, except the big one. Let's see how it turns out.
quoted
They are 32 bytes per entry, so it should take less than object_entry.
I briefly wondered if we should fall back to external rev-list too,
just to free that memory.

So about 200 MB for those objects (or maybe more for commits). Add 256
MB delta cache on top, it's still a bit far from 1.7G. There's
something I'm still missing.
Are you looking at RSS or heap? Keep in mind that you're mmap-ing what's
probably a 1GB packfile on disk. If you're under memory pressure that
won't all stay resident, but some of it will be counted in RSS.
Interesting. It was RSS.
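
The accounting in this exchange can be written out explicitly. The sketch below uses only the figures quoted above and treats MB and GB as decimal units, which is an assumption; the variable names are made up for illustration.

```python
# Figures quoted in the exchange above.
entry_bytes = 32          # "32 bytes per entry"
estimate_mb = 200         # "about 200 MB for those objects"
delta_cache_mb = 256      # "256 MB delta cache"
observed_rss_mb = 1700    # "1.7G"
packfile_mb = 1000        # "probably a 1GB packfile" (mmap-ed)

# Number of 32-byte entries that a 200 MB estimate corresponds to.
implied_entries = estimate_mb * 1_000_000 // entry_bytes

# RSS left over after the two known heap consumers.
unexplained_mb = observed_rss_mb - estimate_mb - delta_cache_mb

# Part of the remainder that a fully resident pack could not explain.
still_missing_mb = max(0, unexplained_mb - packfile_mb)

print(implied_entries, unexplained_mb, still_missing_mb)
```

Under these assumptions, the mmap-ed pack could cover most of the unexplained part of RSS, which fits the suggestion to compare RSS against heap usage.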
quoted
Pity we can't do the same for 'struct object'. Most of the time we
have a giant .idx file with most hashes. We could look up in both
places: the hash table in object.c, and the idx file, to find an
object. Then those objects that are associated with the .idx file will not
need the "oid" field (needed as the key for the hash table). But I see no
way to make that change.
Yeah, that would be pretty invasive, I think. I also wonder if it would
perform worse due to cache effects.
It should be better because of cache effects, I think. I mean, a hash
map is the least cache-friendly lookup. Moving most objects out of the
hash table shrinks it, which is even nicer for the cache. But we also lose
O(1) lookup when we binary-search the .idx file (after failing to find the
object in the hash table).
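
A minimal sketch of the two-level lookup described above, assuming a sorted list of object IDs stands in for the .idx file. The class and method names are hypothetical and none of this is git's actual object.c code. The fan-out table mirrors the one in the .idx format: entry b holds the number of IDs whose first byte is less than or equal to b, which narrows the binary search.

```python
import bisect
import hashlib


class PackIndex:
    """Stand-in for a .idx file: sorted object IDs plus a fan-out table."""

    def __init__(self, oids):
        self.oids = sorted(oids)
        self.fanout = [0] * 256
        for oid in self.oids:
            self.fanout[oid[0]] += 1
        for b in range(1, 256):
            self.fanout[b] += self.fanout[b - 1]

    def find(self, oid):
        """Return the position of oid in the index, or None."""
        lo = self.fanout[oid[0] - 1] if oid[0] else 0
        hi = self.fanout[oid[0]]
        pos = bisect.bisect_left(self.oids, oid, lo, hi)
        if pos < hi and self.oids[pos] == oid:
            return pos
        return None


class ObjectStore:
    """Hash table for loose objects, position-indexed array for packed ones."""

    def __init__(self, index):
        self.index = index
        self.table = {}                           # oid -> state (stores the key)
        self.by_pos = [None] * len(index.oids)    # state only, no oid stored

    def add(self, oid, state):
        pos = self.index.find(oid)
        if pos is None:
            self.table[oid] = state
        else:
            self.by_pos[pos] = state

    def lookup(self, oid):
        state = self.table.get(oid)       # O(1) hash lookup first
        if state is not None:
            return state
        pos = self.index.find(oid)        # then O(log n) search in the index
        return None if pos is None else self.by_pos[pos]


oids = [hashlib.sha1(bytes([i])).digest() for i in range(50)]
store = ObjectStore(PackIndex(oids[:40]))
for i, oid in enumerate(oids):
    store.add(oid, i)
```

In this toy setup only the 10 objects missing from the index occupy the hash table, which is the shrinking effect mentioned above. Whether that outweighs the extra binary search on the fallback path would have to be measured.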
-- 
Duy