Re: [PATCHv2 3/3] block: relax direct io memory alignment
From: Eric Biggers <ebiggers@kernel.org>
Date: 2022-05-19 03:27:38
Also in:
linux-fsdevel
On Wed, May 18, 2022 at 08:25:26PM -0600, Keith Busch wrote:
> On Wed, May 18, 2022 at 07:08:11PM -0700, Eric Biggers wrote:
> > On Wed, May 18, 2022 at 07:59:36PM -0600, Keith Busch wrote:
> > > I'm aware that spanning pages can cause bad splits on the
> > > bi_max_vecs condition, but I believe it's well handled here. Unless
> > > I'm terribly confused, which is certainly possible, I think you may
> > > have missed this part of the patch:
> > >
> > > @@ -1223,6 +1224,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > >  	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
> > >  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> > > +	if (size > 0)
> > > +		size = ALIGN_DOWN(size, queue_logical_block_size(q));
> > >  	if (unlikely(size <= 0))
> > >  		return size ? size : -EFAULT;
> >
> > That makes the total length of each "batch" of pages be a multiple of
> > the logical block size, but individual logical blocks within that
> > batch can still be divided into multiple bvecs in the loop just below
> > it:
>
> I understand that, but the existing code conservatively assumes all
> pages are physically discontiguous and wouldn't have requested more
> pages if it didn't have enough bvecs for each of them:
>
> 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
>
> So with the segment alignment guarantee, and ensured available bvec
> space, the created bio will always be a logical block size multiple.
> If we need to split it later due to some other constraint, we'll only
> split on a logical block size, even if it's in the middle of a bvec.

So the bio ends up with a total length that is a multiple of the logical
block size, but the lengths of the individual bvecs in the bio are *not*
necessarily multiples of the logical block size. That's the problem.

Note, there's also lots of code that assumes that bio_vec::bv_len is a
multiple of 512. That was implied by it being a multiple of the logical
block size. But the DMA alignment can be much lower, like 8 bytes (see
nvme_set_queue_limits()).

- Eric
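
To make the arithmetic concrete, here is a minimal user-space sketch, not kernel code. It assumes a 4096-byte page, a 512-byte logical block size, and a buffer that starts 8 bytes into a page (the DMA alignment mentioned above). The names `struct seg`, `split_by_page()`, `total_len()` and `all_segs_aligned()` are hypothetical stand-ins for illustration only. Like the existing code, `split_by_page()` treats every page as physically discontiguous, so each page touched gets its own segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size */
#define SKETCH_MAX_SEGS  16u   /* arbitrary cap for the sketch */

/* Hypothetical stand-in for the offset/length pair of a bio_vec. */
struct seg {
	size_t offset; /* offset within the page */
	size_t len;    /* bytes covered in that page */
};

/*
 * Split the byte range [start, start + total) at page boundaries,
 * one segment per page touched. Returns the number of segments
 * written, or 0 if they do not fit in max_segs.
 */
static size_t split_by_page(size_t start, size_t total,
			    struct seg *segs, size_t max_segs)
{
	size_t n = 0;

	assert(segs != NULL);
	while (total > 0) {
		size_t off = start % SKETCH_PAGE_SIZE;
		size_t len = SKETCH_PAGE_SIZE - off;

		if (len > total)
			len = total;
		if (n == max_segs)
			return 0;
		segs[n].offset = off;
		segs[n].len = len;
		n++;
		start += len;
		total -= len;
	}
	return n;
}

/* Sum of all segment lengths (the analogue of the bio's total size). */
static size_t total_len(const struct seg *segs, size_t n)
{
	size_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += segs[i].len;
	return sum;
}

/* True only if every individual segment length is a multiple of lbs. */
static bool all_segs_aligned(const struct seg *segs, size_t n, size_t lbs)
{
	for (size_t i = 0; i < n; i++)
		if (segs[i].len % lbs != 0)
			return false;
	return true;
}
```

Calling `split_by_page(8, 8192, segs, SKETCH_MAX_SEGS)` yields three segments of 4088, 4096 and 8 bytes. The total, 8192, is a multiple of 512, yet `all_segs_aligned(segs, 3, 512)` is false because 4088 and 8 are not.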