Thread (7 messages) 7 messages, 3 authors, 2023-02-21

Re: [PATCH v2] md: Use optimal I/O size for last bitmap page

From: Xiao Ni <hidden>
Date: 2023-02-21 05:36:59

On Tue, Feb 21, 2023 at 11:31 AM Jonathan Derrick
[off-list ref] wrote:


On 2/20/2023 7:38 PM, Xiao Ni wrote:
quoted
On Sat, Feb 18, 2023 at 2:36 AM Jonathan Derrick
[off-list ref] wrote:
quoted
From: Jon Derrick <jonathan.derrick@linux.dev>

If the bitmap space has enough room, size the I/O for the last bitmap
page write to the optimal I/O size for the storage device. The expanded
write is checked that it won't overrun the data or metadata.

This change helps increase performance by preventing unnecessary
device-side read-mod-writes due to non-atomic write unit sizes.
Hi Jonathan

Thanks for your patch.

Could you explain more about the how the optimal io size can affect
the device-side read-mod-writes?
The effects of device-side read-mod-writes are a device-specific implementation detail.
This is not something expected to be universal as its an abstracted detail.

However the NVMe spec allows the storage namespace to provide implementation
hints about its underlying media.

Both NPWA and NOWS are used in the NVMe driver, where NOWS provides the
optimal_io hint:

Per NVMe Command Set 1.0b, Figure 97
Namespace Preferred Write Alignment (NPWA): This field indicates the recommended
write alignment in logical blocks for this namespace

Namespace Optimal Write Size (NOWS): This field indicates the size in logical blocks
for optimal write performance for this namespace.

Per 5.8.2.1 Improved I/O examples (non-normative)
When constructing write operations, the host should minimally construct writes
that meet the recommendations of NPWG and NPWA, but may achieve optimal write
performance by constructing writes that meet the recommendation of NOWS.


It then makes sense that an NVMe drive would provide optimal io size hints
that matches its underlying media unit size, such that sub-4k writes might
invoke a device-side read-mod-write whereas 4k writes become atomic.
Thanks very much for the detailed explanation.

If I want to try it myself, it must use nvme disks that have 4k
optimal_io_size, right?
I tried to check in many environments with this command:
cat /sys/block/nvme4n1/queue/optimal_io_size (it shows 0)

So I need to find a machine that is optimal_io_size is 4096, right?

Regards
Xiao
quoted
Regards
Xiao
quoted
Example Intel/Solidigm P5520
Raid10, Chunk-size 64M, bitmap-size 57228 bits

$ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme{0,1,2,3}n1 --assume-clean --bitmap=internal --bitmap-chunk=64M
$ fio --name=test --direct=1 --filename=/dev/md0 --rw=randwrite --bs=4k --runtime=60

Without patch:
  write: IOPS=1676, BW=6708KiB/s (6869kB/s)(393MiB/60001msec); 0 zone resets

With patch:
  write: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(3683MiB/60001msec); 0 zone resets

Biosnoop:
Without patch:
Time        Process        PID     Device      LBA        Size      Lat
1.410377    md0_raid10     6900    nvme0n1   W 16         4096      0.02
1.410387    md0_raid10     6900    nvme2n1   W 16         4096      0.02
1.410374    md0_raid10     6900    nvme3n1   W 16         4096      0.01
1.410381    md0_raid10     6900    nvme1n1   W 16         4096      0.02
1.410411    md0_raid10     6900    nvme1n1   W 115346512  4096      0.01
1.410418    md0_raid10     6900    nvme0n1   W 115346512  4096      0.02
1.410915    md0_raid10     6900    nvme2n1   W 24         3584      0.43
1.410935    md0_raid10     6900    nvme3n1   W 24         3584      0.45
1.411124    md0_raid10     6900    nvme1n1   W 24         3584      0.64
1.411147    md0_raid10     6900    nvme0n1   W 24         3584      0.66
1.411176    md0_raid10     6900    nvme3n1   W 2019022184 4096      0.01
1.411189    md0_raid10     6900    nvme2n1   W 2019022184 4096      0.02

With patch:
Time        Process        PID     Device      LBA        Size      Lat
5.747193    md0_raid10     727     nvme0n1   W 16         4096      0.01
5.747192    md0_raid10     727     nvme1n1   W 16         4096      0.02
5.747195    md0_raid10     727     nvme3n1   W 16         4096      0.01
5.747202    md0_raid10     727     nvme2n1   W 16         4096      0.02
5.747229    md0_raid10     727     nvme3n1   W 1196223704 4096      0.02
5.747224    md0_raid10     727     nvme0n1   W 1196223704 4096      0.01
5.747279    md0_raid10     727     nvme0n1   W 24         4096      0.01
5.747279    md0_raid10     727     nvme1n1   W 24         4096      0.02
5.747284    md0_raid10     727     nvme3n1   W 24         4096      0.02
5.747291    md0_raid10     727     nvme2n1   W 24         4096      0.02
5.747314    md0_raid10     727     nvme2n1   W 2234636712 4096      0.01
5.747317    md0_raid10     727     nvme1n1   W 2234636712 4096      0.02

Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
---
v1->v2: Add more information to commit message

 drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index e7cc6ba1b657..569297ea9b99 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
        rdev = NULL;
        while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
                int size = PAGE_SIZE;
+               int optimal_size = PAGE_SIZE;
                loff_t offset = mddev->bitmap_info.offset;

                bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
@@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
                        int last_page_size = store->bytes & (PAGE_SIZE-1);
                        if (last_page_size == 0)
                                last_page_size = PAGE_SIZE;
-                       size = roundup(last_page_size,
-                                      bdev_logical_block_size(bdev));
+                       size = roundup(last_page_size, bdev_logical_block_size(bdev));
+                       if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
+                               optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
+                       else
+                               optimal_size = size;
                }
+
+
                /* Just make sure we aren't corrupting data or
                 * metadata
                 */
@@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
                                goto bad_alignment;
                } else if (offset < 0) {
                        /* DATA  BITMAP METADATA  */
-                       if (offset
-                           + (long)(page->index * (PAGE_SIZE/512))
-                           + size/512 > 0)
+                       loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
+                       if (size != optimal_size &&
+                           off + optimal_size/512 <= 0)
+                               size = optimal_size;
+                       else if (off + size/512 > 0)
                                /* bitmap runs in to metadata */
                                goto bad_alignment;
                        if (rdev->data_offset + mddev->dev_sectors
@@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
                                goto bad_alignment;
                } else if (rdev->sb_start < rdev->data_offset) {
                        /* METADATA BITMAP DATA */
-                       if (rdev->sb_start
-                           + offset
-                           + page->index*(PAGE_SIZE/512) + size/512
-                           > rdev->data_offset)
+                       loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
+                       if (size != optimal_size &&
+                           off + optimal_size/512 <= rdev->data_offset)
+                               size = optimal_size;
+                       else if (off + size/512 > rdev->data_offset)
                                /* bitmap runs in to data */
                                goto bad_alignment;
                } else {
--
2.27.0
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help