Thread (6 messages) 6 messages, 3 authors, 2025-08-07

Re: [PATCH] diff: ensure consistent diff behavior with -I<regex> across output formats

From: Lidong Yan <hidden>
Date: 2025-08-03 08:43:04

Possibly related (same subject, not in this thread)

On Sun, Aug 2, 2025 at 06:22PM, Jeff King [off-list ref] wrote:
quoted hunk ↗ jump to hunk
So here's a naive application of the same technique:
diff --git a/diff.c b/diff.c
index 76291e238c..0fe6eb7443 100644
--- a/diff.c
+++ b/diff.c
@@ -6845,8 +6845,28 @@ void diff_flush(struct diff_options *options)
    DIFF_FORMAT_CHECKDIFF)) {
for (i = 0; i < q->nr; i++) {
struct diff_filepair *p = q->queue[i];
- if (check_pair_status(p))
- flush_one_pair(p, options);
+
+ if (!check_pair_status(p))
+ continue;
+
+ if (options->flags.diff_from_contents) {
+ FILE *orig_out = options->file;
+ int orig_changes = options->found_changes;
+ int skip;
+
+ options->file = xfopen("/dev/null", "w");
+ diff_flush_patch(p, options);
+ skip = !options->found_changes;
+
+ fclose(options->file);
+ options->file = orig_out;
+ options->found_changes = orig_changes;
+
+ if (skip)
+ continue;
+ }
+
+ flush_one_pair(p, options);
}
separator++;
}

which works on a trivial example. It affects all of raw, name-only,
name-status, and checkdiff. I know Junio said that --raw should not be
affected, but I'm not sure I agree. Anyway, it should be possible to
split the logic by output type.
I think I could do the same thing in diffcore_ignore(). Like:

+void diffcore_ignore(struct diff_options *o)
+{
+	struct diff_queue_struct *q = &diff_queued_diff;
+	struct diff_queue_struct outq = DIFF_QUEUE_INIT;
+
+	if (!(o->output_format &
+	    (DIFF_FORMAT_NAME |
+	     DIFF_FORMAT_NAME_STATUS)))
+		return;
+
+	for (int i = 0; i < q->nr; i++) {
+		struct diff_filepair *p = q->queue[i];
+		if (ignore_match(p, o))
+			diff_free_filepair(p);
+		else
+			diff_q(&outq, p);
+	}
+
+	free(q->queue);
+	*q = outq;
+}

And ignore_match(p, o) will run xdl_diff() for file pair p. This approach
ensures that the behavior of `git diff --raw` and `git diff --check` remains
unaffected.
I'm not sure if stuff like --stat would need something similar. It's
already doing a content comparison, so presumably it handles it
internally. Maybe stuff like --dirstat would need it, too? In which case
we'd maybe want to annotate each filepair in an initial loop with
whether it's modified at the content-level, and then take that into
account in various code paths.

And of course it's horribly hacky looking. Some refactoring might help.
Certainly it is silly to open /dev/null each time through the loop.
There might also be a better way of checking whether the diff found
anything than the found_changes flag.

So this is really just sketching out the direction, and somebody would
need to figure out the details.
Seems like compute_diffstat() will run xdl_diff() to fill diffstat. Since diff_flush()
calls compute_diffstat() for --stat, --dirstat=lines, --shortstat and --namestat, we
shouldn’t run an extra xdl_diff() for them. I don’t know if --dirstat=files would
need an extra xdl_diff() though.
quoted
* Also, should we internally run diff twice, especially even when
  we are going to show the patch output and are not limited to
  FORMAT_NAME and FORMAT_NAME_STATUS?  Generally, running the real
  diff in any of the diffcore transformatin is a sign of trouble.
The patch above also runs the diff twice for "-Ifoo --name-only -p". But
I think we are kind of stuck there. We want to show all name-only
entries before any content diffs. So either we have to run the content
diff twice, or we have to buffer it to show after we decide whether to
show name-only entries.
I haven’t thought of a good way to avoid running the diff twice either.
Caching the diff content seems quite complicated. Moreover, `git diff -G<regex> -p`
requires running the diff twice: the first diff is used to filter out file pairs
that don’t match, and the second diff outputs the patch.

On Tue, Jul 29, 2025 at 05:28:00PM -0700, Junio C Hamano wrote:
The enthusiasm is appreciated, but the implementation raises two
questions.

* This special cases -I<pattern>, but any option that causes us to
  set the .diff_from_contents flag, not just -I<pattern>, can cause
  the raw blob comparison to be potentially different from what the
  blob contents are compared with various "ignore this class of
  changes" criteria.  Shouldn't "git diff -w --name-status" and the
  like get the same treatment?
Yes, I think I could modify ignore_match() to support -w and other ignore options.
Also, the usual way to compose a log message of this project is to

    - Give an observation on how the current system works in the
      present tense (so no need to say "Currently X is Y", or
      "Previously X was Y" to describe the state before your change;
      just "X is Y" is enough), and discuss what you perceive as a
      problem in it.

    - Propose a solution (optional---often, problem description
      trivially leads to an obvious solution in reader's minds).

    - Give commands to somebody editing the codebase to "make it so",
      instead of saying "This commit does X".

in this order.
Thank you once again for patiently explaining how to write proper log messages.
I’ve made notes and am ready to apply them to future commits.

Thanks,
Lidong

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help