Thread (82 messages) 82 messages, 11 authors, 2018-05-11

Re: [PATCH v2 06/11] get_short_oid: sort ambiguous objects by type, then SHA-1

From: Ævar Arnfjörð Bjarmason <hidden>
Date: 2018-05-01 14:10:48
Subsystem: documentation, the rest · Maintainers: Jonathan Corbet, Linus Torvalds

On Tue, May 01 2018, Derrick Stolee wrote:
On 5/1/2018 9:39 AM, Ævar Arnfjörð Bjarmason wrote:
quoted
On Tue, May 01 2018, Derrick Stolee wrote:
quoted
From: Ævar Arnfjörð Bjarmason <redacted>

Here is what I mean by sorting during for_each_abbrev(). This seems to work for
me, so I don't know what the issue is with this one-pass approach.
[...]
+static int sort_ambiguous(const void *a, const void *b)
+{
+	int a_type = oid_object_info(a, NULL);
+	int b_type = oid_object_info(b, NULL);
+	int a_type_sort;
+	int b_type_sort;
+
+	/*
+	 * Sorts by hash within the same object type, just as
+	 * oid_array_for_each_unique() would do.
+	 */
+	if (a_type == b_type)
+		return oidcmp(a, b);
+
+	/*
+	 * Between object types show tags, then commits, and finally
+	 * trees and blobs.
+	 *
+	 * The object_type enum is commit, tree, blob, tag, but we
+	 * want tag, commit, tree blob. Cleverly (perhaps too
+	 * cleverly) do that with modulus, since the enum assigns 1 to
+	 * commit, so tag becomes 0.
+	 */
+	a_type_sort = a_type % 4;
+	b_type_sort = b_type % 4;
+	return a_type_sort > b_type_sort ? 1 : -1;
+}
+
  static int get_short_oid(const char *name, int len, struct object_id *oid,
  			  unsigned flags)
  {
@@ -451,6 +479,9 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, void *cb_data)
  	find_short_object_filename(&ds);
  	find_short_packed_object(&ds);

+	QSORT(collect.oid, collect.nr, sort_ambiguous);
+	collect.sorted = 1;
+
Yes this works. You're right. I wasn't trying to intentionally omit
stuff in my recent 878t93zh60.fsf@evledraar.gmail.com, I'd just written
this code some days ago and forgotten why I did what I was doing (and
this is hard to test for), but it's all coming back to me now.

The actual requirement for oid_array_for_each_unique() working properly
is that you've got to feed it in hash order,
To work properly, duplicate entries must be consecutive. Since
duplicate entries have the same type, our sort satisfies this
condition.
quoted
but my new sort_ambiguous()
still does that (barring any SHA-1 collisions, at which point we have
bigger problems), so two passes aren't needed. So yes, this apporoach
works and is one-pass.

But that's just an implementation detail of the current sort method,
when I wrote this I was initially playing with other sort orders,
e.g. sorting SHAs regardless of type by the mtime of the file I found
them in. With this approach I'd start printing duplicates if I changed
the internals of sort_ambiguous() like that.
That makes sense.
quoted
But I think it's extremely implausible that we'll start sorting things
like that, so I'll just take this method of doing it and add some
comment saying we must hashcmp() the entries in our own sort function
for the de-duplication to work, I don't see us ever changing that.
Sounds good.
Actually I'm having second thoughts about that and thinking I might keep
my original approach (with a better explanation).

A few more lines of code seems worthwhile in order to not break the
assumptions a documented API is making, no matter how briefly, so I set
about documenting this case and supporting it, since
e.g. oid_array_lookup() will completely fail with the hack of setting
the .sorted member, and came up with this:
diff --git a/Documentation/technical/api-oid-array.txt b/Documentation/technical/api-oid-array.txt
index b0c11f868d..ff87260220 100644
--- a/Documentation/technical/api-oid-array.txt
+++ b/Documentation/technical/api-oid-array.txt
@@ -16,6 +16,20 @@ Data Structures
 	the actual data. The `nr` member contains the number of items in
 	the set.  The `alloc` and `sorted` members are used internally,
 	and should not be needed by API callers.
++
+Both the `oid_array_lookup` and `oid_array_for_each_unique` functions
+rely on the array being sorted. For the former it's an absolute
+requirenment that the internal `oid_array_sort` function has been
+called on it, bu for the latter it's enough that the elements are
+ordered in such a way as to guarantee that identical object IDs are
+adjacent in the array.
++
+This is useful e.g. to print output where commits, tags etc. are
+grouped together (barring a hash collision they won't have the same
+object ID), in such cases the `custom_sorted` member can be set to `1`
+before calling `oid_array_for_each_unique`, and it'll skip its own
+sorting. Once it's been set calling e.g. `oid_array_lookup` without it
+being cleared again will cause an internal panic, so use it carefully.

 Functions
 ---------
diff --git a/sha1-array.c b/sha1-array.c
index 466a926aa3..cbae07ff78 100644
--- a/sha1-array.c
+++ b/sha1-array.c
@@ -18,6 +18,7 @@ static void oid_array_sort(struct oid_array *array)
 {
 	QSORT(array->oid, array->nr, void_hashcmp);
 	array->sorted = 1;
+	array->custom_sorted = 0;
 }

 static const unsigned char *sha1_access(size_t index, void *table)
@@ -28,6 +29,13 @@ static const unsigned char *sha1_access(size_t index, void *table)

 int oid_array_lookup(struct oid_array *array, const struct object_id *oid)
 {
+	if (array->custom_sorted)
+		/*
+		 * We could also just clear custom_sorted here, but if
+		 * the caller is custom sorting and then calling this
+		 * that's likely something they'd like to know about.
+		 */
+		BUG("PANIC: Cannot lookup OIDs in arrays with a custom sort!");
 	if (!array->sorted)
 		oid_array_sort(array);
 	return sha1_pos(oid->hash, array->oid, array->nr, sha1_access);
@@ -39,6 +47,7 @@ void oid_array_clear(struct oid_array *array)
 	array->nr = 0;
 	array->alloc = 0;
 	array->sorted = 0;
+	array->custom_sorted = 0;
 }

 int oid_array_for_each_unique(struct oid_array *array,
@@ -47,7 +56,7 @@ int oid_array_for_each_unique(struct oid_array *array,
 {
 	int i;

-	if (!array->sorted)
+	if (!array->sorted && !array->custom_sorted)
 		oid_array_sort(array);

 	for (i = 0; i < array->nr; i++) {
diff --git a/sha1-array.h b/sha1-array.h
index 1e1d24b009..bfa77ba1e4 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -6,6 +6,7 @@ struct oid_array {
 	int nr;
 	int alloc;
 	int sorted;
+	int custom_sorted;
 };

 #define OID_ARRAY_INIT { NULL, 0, 0, 0 }
diff --git a/sha1-name.c b/sha1-name.c
index b81e07adbb..d190800db0 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -490,9 +490,11 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, void *cb_data)
 	find_short_packed_object(&ds);

 	QSORT(collect.oid, collect.nr, sort_ambiguous);
-	collect.sorted = 1;

+	collect.custom_sorted = 1;
 	ret = oid_array_for_each_unique(&collect, fn, cb_data);
+	collect.custom_sorted = 0;
+
 	oid_array_clear(&collect);
 	return ret;
 }
So maybe I should just stop worrying and YOLO it, it just seems wrong to
leave such a fragile setup in place where we set .sorted=1 and some
future refactoring reasonably tries to call oid_array_lookup() on it and
silently fails.

What do you think?
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help