Re: [PATCH 11/15] find multi-byte comment chars in unterminated buffers

From: René Scharfe <hidden>
Date: 2024-03-07 19:42:46

Possibly related (same subject, not in this thread)

2024-03-13 · Re: [PATCH 11/15] find multi-byte comment chars in unterminated buffers · Jeff King <hidden>
2024-03-12 · Re: [PATCH 11/15] find multi-byte comment chars in unterminated buffers · <hidden>
2024-03-12 · Re: [PATCH 11/15] find multi-byte comment chars in unterminated buffers · Jeff King <hidden>
2024-03-08 · Re: [PATCH 11/15] find multi-byte comment chars in unterminated buffers · Phillip Wood <hidden>
2024-03-08 · Re: [PATCH 11/15] find multi-byte comment chars in unterminated buffers · Junio C Hamano <hidden>

Am 07.03.24 um 10:26 schrieb Jeff King:

As with the previous patch, we need to swap out single-byte matching for
something like starts_with() to match all bytes of a multi-byte comment
character. But for cases where the buffer is not NUL-terminated (and we
instead have an explicit size or end pointer), it's not safe to use
starts_with(), as it might walk off the end of the buffer.

Let's introduce a new starts_with_mem() that does the same thing but
also accepts the length of the "haystack" str and makes sure not to walk
past it.

Note that in most cases the existing code did not need a length check at
all, since it was written in a way that knew we had at least one byte
available (and that was all we checked). So I had to read each one to
find the appropriate bounds. The one exception is sequencer.c's
add_commented_lines(), where we can actually get rid of the length
check. Just like starts_with(), our starts_with_mem() handles an empty
haystack variable by not matching (assuming a non-empty prefix).

A few notes on the implementation of starts_with_mem():

  - it would be equally correct to take an "end" pointer (and indeed,
    many of the callers have this and have to subtract to come up with
    the length). I think taking a ptr/size combo is a more usual
    interface for our codebase, though, and has the added benefit that
    the function signature makes it harder to mix up the three
    parameters.

  - we could obviously build starts_with() on top of this by passing
    strlen(str) as the length. But it's possible that starts_with() is a
    relatively hot code path, and it should not pay that penalty (it can
    generally return an answer proportional to the size of the prefix,
    not the whole string).

  - it naively feels like xstrncmpz() should be able to do the same
    thing, but that's not quite true. If you pass the length of the
    haystack buffer, then strncmp() finds that a shorter prefix string
    is "less than" than the haystack, even if the haystack starts with
    the prefix. If you pass the length of the prefix, then you risk
    reading past the end of the haystack if it is shorter than the
    prefix. So I think we really do need a new function.

Yes.  xstrncmpz() compares a NUL-terminated string and a length-limited
string.  If you want to check whether the former is a prefix of the
latter then you need to stop comparing when reaching its NUL, and also
after exhausting the latter.  So you need to take both lengths into
account:

int starts_with_mem(const char *str, size_t len, const char *prefix)
{
	size_t prefixlen = strlen(prefix);
	return prefixlen <= len && !xstrncmpz(prefix, str, prefixlen);
}

Using memcmp() here is equivalent and simpler:

int starts_with_mem(const char *str, size_t len, const char *prefix)
{
	size_t prefixlen = strlen(prefix);
	return prefixlen <= len && !memcmp(str, prefix, prefixlen);
}

And your version below avoids function calls and avoids traversing the
strings beyond their common prefix, of course.

Signed-off-by: Jeff King <redacted>
---
Arguably starts_with() and this new function should both be inlined,
like we do for skip_prefix(), but I think that's out of scope for this
series.

Inlining would allow the compiler to unroll the loop for string
constants.  I doubt it would do that for variables, as in the code
below.

Inlining the strlen()+memcmp() version above might allow the compiler
to push the strlen() call out of a loop.

Would any of that improve performance noticeably?  For the call sites
below I doubt it.  But it would probably increase the object text size.

And it's possible I was simply too dumb to figure out xstrncmpz() here.
I'm waiting for René to show up and tell me how to do it. ;)

Nah, it's not a good fit, as it requires the two strings to have the
same length.

quoted hunk ↗ jump to hunk

IMHO this is the trickiest commit of the whole series, as it would be
easy to get the length computations subtly wrong.

 commit.c    |  3 ++-
 sequencer.c |  4 ++--
 strbuf.c    | 11 +++++++++++
 strbuf.h    |  1 +
 trailer.c   |  4 ++--
 5 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/commit.c b/commit.c
index ef679a0b93..531a666cba 100644
--- a/commit.c
+++ b/commit.c

@@ -1796,7 +1796,8 @@ size_t ignored_log_message_bytes(const char *buf, size_t len)
 		else
 			next_line++;

-		if (buf[bol] == comment_line_char || buf[bol] == '\n') {
+		if (starts_with_mem(buf + bol, cutoff - bol, comment_line_str) ||
+		    buf[bol] == '\n') {
 			/* is this the first of the run of comments? */
 			if (!boc)
 				boc = bol;

diff --git a/sequencer.c b/sequencer.c
index 991a2dbe96..664986e3b2 100644
--- a/sequencer.c
+++ b/sequencer.c

@@ -1840,7 +1840,7 @@ static int is_fixup_flag(enum todo_command command, unsigned flag)
 static void add_commented_lines(struct strbuf *buf, const void *str, size_t len)
 {
 	const char *s = str;
-	while (len > 0 && s[0] == comment_line_char) {
+	while (starts_with_mem(s, len, comment_line_str)) {
 		size_t count;
 		const char *n = memchr(s, '\n', len);
 		if (!n)

@@ -2562,7 +2562,7 @@ static int parse_insn_line(struct repository *r, struct todo_item *item,
 	/* left-trim */
 	bol += strspn(bol, " \t");

-	if (bol == eol || *bol == '\r' || *bol == comment_line_char) {
+	if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) {

If the strspn() call is safe (which it is, as the caller expects the
string to be NUL-terminated) then you could use starts_with() here and
avoid the length calculation.  But that would also match
comment_line_str values that contain LF, which the _mem version does not
and that's better.

Not sure why lines that start with CR are considered comment lines,
though.

quoted hunk ↗ jump to hunk

 		item->command = TODO_COMMENT;
 		item->commit = NULL;
 		item->arg_offset = bol - buf;

diff --git a/strbuf.c b/strbuf.c
index 7c8f582127..291bdc2a65 100644
--- a/strbuf.c
+++ b/strbuf.c

@@ -24,6 +24,17 @@ int istarts_with(const char *str, const char *prefix)
 			return 0;
 }

+int starts_with_mem(const char *str, size_t len, const char *prefix)
+{
+	const char *end = str + len;
+	for (; ; str++, prefix++) {
+		if (!*prefix)
+			return 1;
+		else if (str == end || *str != *prefix)
+			return 0;
+	}
+}

So this checks whether a length-limited string has a prefix given as a
NUL-terminated string.  I'd have called it mem_starts_with() and have
expected starts_with_mem() to check a NUL-terminated string for a
length-limited prefix (think !strncmp(str, prefix, prefixlen)).

quoted hunk ↗ jump to hunk

+
 int skip_to_optional_arg_default(const char *str, const char *prefix,
 				 const char **arg, const char *def)
 {

diff --git a/strbuf.h b/strbuf.h
index 58dddf2777..3156d6ea8c 100644
--- a/strbuf.h
+++ b/strbuf.h

@@ -673,6 +673,7 @@ char *xstrfmt(const char *fmt, ...);

 int starts_with(const char *str, const char *prefix);
 int istarts_with(const char *str, const char *prefix);
+int starts_with_mem(const char *str, size_t len, const char *prefix);

 /*
  * If the string "str" is the same as the string in "prefix", then the "arg"

diff --git a/trailer.c b/trailer.c
index fe18faf6c5..f59c90b4b5 100644
--- a/trailer.c
+++ b/trailer.c

@@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)

 	/* The first paragraph is the title and cannot be trailers */
 	for (s = buf; s < buf + len; s = next_line(s)) {
-		if (s[0] == comment_line_char)
+		if (starts_with_mem(s, buf + len - s, comment_line_str))
 			continue;
 		if (is_blank_line(s))

Another case where starts_with() would be safe to use, as
is_blank_line() expects (and gets) a NUL-terminated string, but it would
allow matching comment_line_str values that contain LF.

quoted hunk ↗ jump to hunk

 			break;

@@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
 		const char **p;
 		ssize_t separator_pos;

-		if (bol[0] == comment_line_char) {
+		if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) {

We're in the same buffer, so the above comment applies here as well.

 			non_trailer_lines += possible_continuation_lines;
 			possible_continuation_lines = 0;
 			continue;

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help