Thread (23 messages) 23 messages, 4 authors, 2023-02-01

Re: [PATCH 6/6] hash-object: use fsck for object checks

From: Taylor Blau <hidden>
Date: 2023-01-18 21:34:25

On Wed, Jan 18, 2023 at 03:44:12PM -0500, Jeff King wrote:
This is obviously going to be a user-visible behavior change, and the
test changes earlier in this series show the scope of the impact. But
I'd argue that this is OK:

  - the documentation for hash-object is already vague about which
    checks we might do, saying that --literally will allow "any
    garbage[...] which might not otherwise pass standard object parsing
    or git-fsck checks". So we are already covered under the documented
    behavior.

  - users don't generally run hash-object anyway. There are a lot of
    spots in the tests that needed to be updated because creating
    garbage objects is something that Git's tests disproportionately do.

  - it's hard to imagine anyone thinking the new behavior is worse. Any
    object we reject would be a potential problem down the road for the
    user. And if they really want to create garbage, --literally is
    already the escape hatch they need.
This is the discussion I was pointing out earlier in the series as
evidence for making this behavior the new default without "--literally".

That being said, let me play devil's advocate for a second. Do the new
fsck checks slow anything in hash-object down significantly? If so, then
it's plausible to imagine a hash-object caller who (a) doesn't use
`--literally`, but (b) does care about throughput if they're writing a
large number of objects at once.

I don't know if such a situation exists, or if these new fsck checks
even slow hash-object down enough to care. But I didn't catch a
discussion of this case in your series, so I figured I'd bring it up
here just in case.
  - the resulting messages are much better. For example:

      [before]
      $ echo 'tree 123' | git hash-object -t commit --stdin
      error: bogus commit object 0000000000000000000000000000000000000000
      fatal: corrupt commit

      [after]
      $ echo 'tree 123' | git.compile hash-object -t commit --stdin
      error: object fails fsck: badTreeSha1: invalid 'tree' line format - bad sha1
      fatal: refusing to create malformed object
Much nicer, well done.
quoted hunk ↗ jump to hunk
Signed-off-by: Jeff King <redacted>
---
 object-file.c          | 55 ++++++++++++++++++------------------------
 t/t1007-hash-object.sh | 11 +++++++++
 2 files changed, 34 insertions(+), 32 deletions(-)
diff --git a/object-file.c b/object-file.c
index 80a0cd3b35..5c96384803 100644
--- a/object-file.c
+++ b/object-file.c
@@ -33,6 +33,7 @@
 #include "object-store.h"
 #include "promisor-remote.h"
 #include "submodule.h"
+#include "fsck.h"

 /* The maximum size for an object header. */
 #define MAX_HEADER_LEN 32
@@ -2298,32 +2299,21 @@ int repo_has_object_file(struct repository *r,
 	return repo_has_object_file_with_flags(r, oid, 0);
 }

-static void check_tree(const void *buf, size_t size)
-{
-	struct tree_desc desc;
-	struct name_entry entry;
-
-	init_tree_desc(&desc, buf, size);
-	while (tree_entry(&desc, &entry))
-		/* do nothing
-		 * tree_entry() will die() on malformed entries */
-		;
-}
-
-static void check_commit(const void *buf, size_t size)
-{
-	struct commit c;
-	memset(&c, 0, sizeof(c));
-	if (parse_commit_buffer(the_repository, &c, buf, size, 0))
-		die(_("corrupt commit"));
-}
-
-static void check_tag(const void *buf, size_t size)
-{
-	struct tag t;
-	memset(&t, 0, sizeof(t));
-	if (parse_tag_buffer(the_repository, &t, buf, size))
-		die(_("corrupt tag"));
OK, here we're getting rid of all of the lightweight checks that
hash-object used to implement on its own.
quoted hunk ↗ jump to hunk
+/*
+ * We can't use the normal fsck_error_function() for index_mem(),
+ * because we don't yet have a valid oid for it to report. Instead,
+ * report the minimal fsck error here, and rely on the caller to
+ * give more context.
+ */
+static int hash_format_check_report(struct fsck_options *opts,
+				     const struct object_id *oid,
+				     enum object_type object_type,
+				     enum fsck_msg_type msg_type,
+				     enum fsck_msg_id msg_id,
+				     const char *message)
+{
+	error(_("object fails fsck: %s"), message);
+	return 1;
 }

 static int index_mem(struct index_state *istate,
@@ -2350,12 +2340,13 @@ static int index_mem(struct index_state *istate,
 		}
 	}
 	if (flags & HASH_FORMAT_CHECK) {
-		if (type == OBJ_TREE)
-			check_tree(buf, size);
-		if (type == OBJ_COMMIT)
-			check_commit(buf, size);
-		if (type == OBJ_TAG)
-			check_tag(buf, size);
+		struct fsck_options opts = FSCK_OPTIONS_DEFAULT;
+
+		opts.strict = 1;
+		opts.error_func = hash_format_check_report;
+		if (fsck_buffer(null_oid(), type, buf, size, &opts))
+			die(_("refusing to create malformed object"));
+		fsck_finish(&opts);
 	}
And here's the main part of the change, which is delightfully simple and
appears correct to me.
quoted hunk ↗ jump to hunk
diff --git a/t/t1007-hash-object.sh b/t/t1007-hash-object.sh
index 2d2148d8fa..ac3d173767 100755
--- a/t/t1007-hash-object.sh
+++ b/t/t1007-hash-object.sh
@@ -222,6 +222,17 @@ test_expect_success 'empty filename in tree' '
 	grep "empty filename in tree entry" err
 '

+test_expect_success 'duplicate filename in tree' '
+	hex_oid=$(echo foo | git hash-object --stdin -w) &&
+	bin_oid=$(echo $hex_oid | hex2oct) &&
+	{
+		printf "100644 file\0$bin_oid" &&
+		printf "100644 file\0$bin_oid"
+	} >tree-with-duplicate-filename &&
+	test_must_fail git hash-object -t tree tree-with-duplicate-filename 2>err &&
+	grep "duplicateEntries" err
+'
+
For what it's worth, I think that this is sufficient coverage for the
new fsck checks.

Thanks,
Taylor
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help