Re: [PATCH] sha1-file: remove OBJECT_INFO_SKIP_CACHED

From: Jonathan Tan <hidden>
Date: 2019-12-31 00:39:11

Ooh, I think there's something subtle hiding in this paragraph.

When I first read it, I thought you meant that the repositories are
not self-contained --- that they contain references to the empty tree
but do not fulfill "want"s for them.  I don't believe that's what you
mean, though.

My second reading is the repository genuinely doesn't contain the
empty tree but different Git server implementations handle that
differently.  I tried to reproduce this with

	empty_tree=$(git mktree </dev/null)
	git init --bare x
	git clone --filter=blob:none file://$(pwd)/x y
	cd y
	echo hi >README
	git add README
	git commit -m 'nonempty tree'
	GIT_TRACE=1 git diff-tree "$empty_tree" HEAD

and indeed, it looks like Git serves the empty tree even from
repositories that don't contain it.  By comparison, JGit does not do
the same trick.  So we don't need to refer to a specific repository in
the wild to reproduce this.

All that said, having to fetch this object in the first place is
unexpected.  The question of the promisor remote serving it is only
relevant from the point of view of "why didn't we discover this
sooner?"

Yes, that's true. I'll reword it to emphasize the spurious fetching (and
mention some implementations like JGit not serving those, which
exacerbates the problem).

I think we didn't discover this sooner because not many people directly
enter the empty tree on the command line :-) (but this is a problem that
we should solve, most certainly).

quoted

There are 2 functions: repo_has_object_file() which does not consult
find_cached_object() (which, among other things, knows about the empty
tree); and repo_read_object_file() which does.

Hm, where does this dichotomy come from?  E.g. is the latter a
lower-level function used by the former?

I don't know the reason for the dichotomy - perhaps it was an oversight.
The relevant commits are:

d66b37bb19 ("Add pretend_sha1_file() interface.", 2007-02-05)
Adds a way to pretend that some objects are present by including them in
a cache - empty-tree-pervasiveness is built on top of this later. Only
read_sha1_file() was updated to make use of this; has_sha1_file() and
sha1_object_info() were already present at this time, but were not
updated. (Maybe the latter 2 had no need for pretending?)

346245a1bb ("hard-code the empty tree object", 2008-02-13)
Empty tree pervasiveness built on top of the cache.

c4d9986f5f ("sha1_object_info: examine cached_object store too",
2011-02-07)
sha1_object_info() was updated to use the cache. has_sha1_file() is
never mentioned.

quoted

as an optimization to avoid reading blobs into memory,
parse_object() calls repo_has_object_file() before
repo_read_object_file(). In the case of a regular repository (that is,
not a partial clone), repo_has_object_file() will return false for the
empty tree (thus bypassing the optimization) and repo_read_object_file()
will nevertheless succeed, thus things coincidentally work.

This might be easier to understand if phrased in terms of the
intention behind the code instead of the specific call stacks used.
See f06ab027 for an example of this kind of thing.  For example:

  Applications within and outside of Git benefit from being able to
  assume that every repository contains the empty tree as an object
  (see, for example, commit 9abd46a347 "Make sure the empty tree
  exists when needed in merge-recursive", 2006-12-07).  To this end,
  since 346245a1bb (hard-code the empty tree object, 2008-02-13), Git
  has made the empty tree available in all repositories via
  find_cached_object, which all object access paths can use.

  Object existence checks (has_object_file), however, do not use
  find_cached_object.  <describe reason here>

When I state it this way, it's not obvious why this particular caller
of has_object_file is more relevant than others.  Did I miss some
subtlety?

This particular caller is what caused us to notice this problem.

Indeed.  Can we justify the change even more simply in those terms?

  Object existence checks (has_object_file), however, do not use
  find_cached_object.  <describe reason here>

  This makes the API unnecessarily difficult to reason about.
  Instead, let's consistently view the empty tree as a genuine part of
  the repository, even in has_object_file.  As a side effect, this
  allows us to simplify the common 'has_object_file ||
  find_cached_object' pattern to a more simple existence check.

OK, let me see if I can rewrite it to emphasize the reasoning-about-API
part. I still want to include the user-facing part that caused us to
observe this problem, but I can deemphasize it.

This might be a moot point, but what do you mean by the
"'has_object_file || find_cached_object' pattern"?

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help