Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
From: Ævar Arnfjörð Bjarmason <hidden>
Date: 2018-05-01 12:36:46
On Tue, May 01 2018, Derrick Stolee wrote:
On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote:quoted
On Tue, May 01 2018, Derrick Stolee wrote:quoted
On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:quoted
Since we show the commit data in the output that's nicely aligned once we sort by object type. The decision to show tags before commits is pretty arbitrary, but it's much less likely that we'll display a tag, so if there is one it makes sense to show it first.Here's a non-arbitrary reason: the object types are ordered topologically (ignoring self-references): tag -> commit, tree, blob commit -> tree tree -> blobThanks. I'll add a patch with that comment to v2.quoted
quoted
@@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, struct object_id *oid, ds.fn = NULL; advise(_("The candidates are:")); - for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds); + for_each_abbrev(ds.hex_pfx, collect_ambiguous, &collect); + QSORT(collect.oid, collect.nr, sort_ambiguous);I was wondering how the old code sorted by SHA even when the ambiguous objects were loaded from different sources (multiple pack-files, loose objects). Turns out that for_each_abbrev() does its own sort after collecting the SHAs and then calls the given function pointer only once per distinct object. This avoids multiple instances of the same object, which may appear multiple times across pack-files. I only ask because now we are doing two sorts. I wonder if it would be more elegant to provide your sorting algorithm to for_each_abbrev() and let it call show_ambiguous_object as before. Another question is if we should use this sort generally for all calls to for_each_abbrev(). The only other case I see is in builtin/revparse.c.When preparing v2 I realized how confusing this was, so I'd added this to the commit message of my WIP re-roll which should explain this: A note on the implementation: I started out with something much simpler which just replaced oid_array_sort() in sha1-array.c with a custom sort function before calling oid_array_for_each_unique(). But then dumbly noticed that it doesn't work because the output function was tangled up with the code added in fad6b9e590 ("for_each_abbrev: drop duplicate objects", 2016-09-26) to ensure we don't display duplicate objects. That's why we're doing two passes here, first we need to sort the list and de-duplicate the objects, then sort them in our custom order, and finally output them without re-sorting them. I suppose we could also make oid_array_for_each_unique() maintain a hashmap of emitted objects, but that would increase its memory profile and wouldn't be worth the complexity for this one-off use-case, oid_array_for_each_unique() is used in many other places.How would sorting in our custom order before de-duplicating fail the de-duplication? We will still pair identical OIDs as consecutive elements and oid_array_for_each_unique only cares about consecutive elements having distinct OIDs, not lex-ordered OIDs.
Because there's no de-duplication without the array first being sorted in oidcmp() order, which oid_array_for_each_unique() checks for and re-sorts if !array->sorted. I.e. its de-duplication is just a state machine where it won't call the callback if the currently processed element has the same SHA1 as the last one.
Perhaps the noise is because we rely on oid_array_sort() to mark the array as sorted inside oid_array_for_each_unique(), but that could be remedied by calling our QSORT() inside for_each_abbrev() and marking the array as sorted before calling oid_array_for_each_unique().
As noted above this won't work, because the function inherently relies on the array being sorted to be able to de-duplicate. Doing this will yield duplicate entries.