Re: [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
From: Jeff King <hidden>
Date: 2025-01-18 12:43:45
Subsystem:
the rest · Maintainer:
Linus Torvalds
On Sat, Jan 18, 2025 at 07:28:31AM -0500, Jeff King wrote:
It does make me wonder if hash_algo_by_ptr() should just return UNKNOWN for the unsafe variant. Then we'd at least get a loud error from the caller (as opposed to the code before your patch, which is undefined behavior). I dunno.
Doing this:
diff --git a/hash.h b/hash.h
index 68d4292e6d..ad2c919991 100644
--- a/hash.h
+++ b/hash.h@@ -311,7 +311,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p) size_t i; for (i = 0; i < GIT_HASH_NALGOS; i++) { const struct git_hash_algo *algop = &hash_algos[i]; - if (p == algop || (algop->unsafe && p == algop->unsafe)) + if (p == algop) return i; } return GIT_HASH_UNKNOWN;
on top of your series doesn't trigger any issues in the test suite.
Which I think shows that we would never have triggered the UB from the
original implementation, either[1]. Still, I think I prefer your loop to
being one errant function call away from undefined behavior. But maybe
returning UNKNOWN (as above) is a safer bet than losing the inverse
nature of the by_ptr() and by_id() functions, which could otherwise
cause hard-to-find interactions?
-Peff
[1] One thing that still puzzles me. If you further add a BUG() to that
function when we'd return UNKNOWN, you can see that we trigger a few
cases in the test suite where we pass in NULL (e.g., because we're
not in a repo). I think these are already UB with the existing
"p - hash_algos" implementation! We'll return what is effectively
"p" cast to an int.
E.g. if I do this:
diff --git a/hash.h b/hash.h
index 756166ce5e..ff31156416 100644
--- a/hash.h
+++ b/hash.h
@@ -320,6 +320,8 @@ int hash_algo_by_length(int len);
/* Identical, except for a pointer to struct git_hash_algo. */
static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
{
+ if (!p)
+ BUG("null hash, return is %d", (int)(p - hash_algos));
return p - hash_algos;
}
then t1517 shows grep dying with:
BUG: hash.h:324: null hash, return is -1646754892
I think it "works" because the backtrace is:
[...]
#5 0x0000556d436f6b71 in BUG_fl (file=0x556d43790e8b "hash.h", line=324,
fmt=0x556d43790e73 "null hash, return is %d") at usage.c:335
#6 0x0000556d4357c2f9 in hash_algo_by_ptr (p=0x0) at /home/peff/compile/git/hash.h:324
#7 0x0000556d4357c437 in oidclr (oid=0x7ffdf5810ea4, algop=0x0) at /home/peff/compile/git/hash.h:392
#8 0x0000556d435801c7 in prep_exclude (dir=0x7ffdf5811190, istate=0x556d608959c0, base=0x556d6089da50 "nums",
baselen=0) at dir.c:1699
[...]
#16 0x0000556d4344dd4a in cmd_grep (argc=0, argv=0x556d60895ee8, prefix=0x0, repo=0x0) at builtin/grep.c:1257
So we probably write a totally garbage algo field into the
object_id, but nobody ever ends up looking at it. Probably
something we should clean up, but way out of scope for your series.
But I do think it reinforces that returning UNKNOWN is an
improvement; it avoids undefined behavior and anybody who tried to
use it should get a BUG() from calling git_hash_unknown_*()
functions.