Thread (111 messages) 111 messages, 11 authors, 2020-08-05

Re: [PATCH v4 05/15] diff: halt tree-diff early after max_changes

From: Derrick Stolee <hidden>
Date: 2020-08-04 17:31:14

On 8/4/2020 1:00 PM, SZEDER Gábor wrote:
On Tue, Aug 04, 2020 at 12:25:45PM -0400, Derrick Stolee wrote:
quoted
On 8/4/2020 10:47 AM, SZEDER Gábor wrote:
quoted
On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
This counter is basically broken, its value is wrong for over 98% of
commits, and, worse, its value remains 0 for over 85% of commits in
the repositories I usually use to test modified path Bloom filters.
Consequently, a relatively large number of commits modifying more than
512 paths get Bloom filters.
Thanks for finding this! The counter is only really tested in one
place, and that test only considers _file adds_, which is a problem.

If I understand this correctly, the bug is a performance-only bug
(since this is a performance-only feature), but it is an important
one to fix.
Or a performance-only feature in a performance-only feature, because
those additional modified path Bloom filters can improve the runtime
of pathspec-limited revision walks (assuming that the false positive
rate is low enough).
quoted
There is certainly some dark magic happening in this tree-diff logic,
so instead of trying to get an accurate count we should just use the
magic global diff_queued_diff to track the current list of file changes.

Note: diff_queued_diff does not track the directory changes, so it
is an under-count for the total changes to track in the Bloom filter.
This is later corrected by the block that adds these leading directory
changes.
quoted
The makeshift tests in the patch below demonstrate these issues as
most of them fail, most notably those two tests that demonstrate that
modifying existing paths are not counted at all.
I adapted your diff along with ripping out 'num_changes' in favor
of diff_queued_diff.nr. This required modifying some of your expected
values in the test script (losing the leading directories in the
count).

I'll work with Taylor to create a fix, and include proper testing
of the logic here. We'll stick it in the v2 of his max-changed-paths
series [1]. He already has some helpful logging that can help create
tests that ensure this logic is performing as expected.
Don't forget to include a check of the hashmap's size, to make sure.
Yes, thanks for the pointer. That check is currently not in there,
since the code assumes the hashmap's size will match num_changes.
Hopefully, the tests I intend to write around this would have caught
such an omission.
 
FWIW, the patch below does result in the correct count (read: the same
as in my implemenation) for all but 4 commits in those repositories I
use for testing, without adding any memory allocations and extra
strcmp() calls.
...
Having said that, the best (i.e faster and accurate) solution to this
issue is probably:

  - Update the callchain between diff_tree_oid() and the diff callback
    functions to allow the callbacks to break diffing with a non-zero
    error code.
It looks like this part would not be too difficult. The pathchange
callback is called by emit_path() which returns a struct combine_diff_path
pointer. This could return NULL to signal an early termination, but
we need to update all callers of the following methods to handle NULL
responses:

 * emit_path()
 * ll_diff_tree_paths()
 * diff_tree_paths()

Of some interest: diff_tree_paths() returns a struct combine_diff_path
pointer, but no callers seem to consume it.
  - Fill Bloom filters using the approach presented in:

      https://public-inbox.org/git/20200529085038.26008-21-szeder.dev@gmail.com/

    but modify the callbacks to return non-zero when too many paths
    have been processed.
Thanks for the pointer to that specific patch. You do a good job of
describing your thought process, including why you used the callback
approach instead of the diff queue approach. The main reason seemed to
be memory overhead from populating the entire diff queue before
checking the limit.

However, if we are using the diff queue as the short-circuit, then
perhaps that memory overhead isn't as much of a problem?

You admit yourself, that

  This patch implements a more efficient, but more complex, approach:

The logic around matching prefixes definitely seems complex and
hard to test, especially around the file/directory changes with the
sort order problems that have plagued similar prefix checks recently.
I'm not doubting your implementation, just saying that the complexity
is worth considering before jumping to that solution too quickly.

To sum up, I intend to start with a fix that uses the diff queue
count as a limit, then try the callback approach to see if there are
measurable improvements in performance.
  - Drop this counter entirely, as there are no other users.
With the callback approach, "this counter" is both num_changes and
max_changes, since the callback would perform all of the short-circuit
logic.

Thanks,
-Stolee
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help