Re: [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation

From: Taylor Blau <hidden>
Date: 2023-03-24 23:09:02

On Tue, Mar 21, 2023 at 01:56:12PM -0400, Jeff King wrote:

I like the idea of centralizing the bounds checks here.

I do think copying lseek() is a little awkward, though. As you note, we
only care about SEEK_SET and SEEK_CUR, and we have to BUG() on anything
else. Which means we have a run-time check, rather than a compile time
check. If we had two functions like:

  void bitmap_index_seek_to(struct bitmap_index *bitmap_git, size_t pos)
  {
	bitmap_git->map_pos = pos;
	if (bitmap_git->map_pos >= bitmap_git->map_size)
	   ...etc...
  }

  void bitmap_index_incr(struct bitmap_index *bitmap_git, size_t offset)
  {
	bitmap_index_seek_to(bitmap_git, st_add(bitmap_git->map_pos, offset));
  }

then the compiler can see all cases directly. I dunno how much it
matters.

Yeah, I think that trying to copy the interface of lseek() was a
mistake, since it ended up creating more awkwardness than anything else.
I like the combination of bitmap_index_seek_to() and _incr(), though
though I called the latter bitmap_index_seek_set() instead.

We've got to be a little reminiscent of lseek() after all ;-).

The other thing that's interesting from a bounds-checking perspective is
that you're checking the seek _after_ you've read an item. Which has two
implications:

  - I don't think we could ever overflow size_t here.

Yep, agreed. Let's drop the st_add() call there entirely, since it's not
doing anything useful and just serving to confuse the reader.

  - The more interesting case is that we are not overflowing size_t, but
    simply walking past bitmap_git->map_size. But again, we are reading
    first and then seeking. So if our seek does go past, then by
    definition we just read garbage bytes, which is undefined behavior.

    For a bounds-check, wouldn't we want it the other way around? To
    make sure we have 4 bytes available, and then if so read them and
    increment the offset?

I thought on first blush that this would be a much larger change, but
actually it's quite small. The EWAH code already checks its reads ahead
of time, so we don't have to worry about those. It's really that
read_be32() and read_u8() do the read-then-seek.

So I think that the change here is limited to making sure that we can
read enough bytes in those functions before actually executing the read.

quoted

+	if (bitmap_git->map_pos >= bitmap_git->map_size)
+		BUG("bitmap position exceeds size (%"PRIuMAX" >= %"PRIuMAX")",
+		    (uintmax_t)bitmap_git->map_pos,
+		    (uintmax_t)bitmap_git->map_size);

This uses ">=", which is good, because it is valid to walk the pointer
to one past the end of the map, which is effectively EOF. But as above,
in that condition the callers should be checking for this EOF state
before reading the bytes.

I think that having both is useful. Checking before you read avoids
undefined behavior. Checking after you seek ensures that we never have a
bitmap_index struct with a bogus map_pos value.

Thanks,
Taylor

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help