Re: [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload
From: Nikos Tsironis <hidden>
Date: 2022-03-08 20:48:27
Also in:
dm-devel, linux-fsdevel, linux-nvme, linux-scsi
On 3/1/22 23:32, Chaitanya Kulkarni wrote:
> Nikos,
>
>>> [8] https://kernel.dk/io_uring.pdf
>>
>> I would like to participate in the discussion too.
>>
>> The dm-clone target would also benefit from copy offload, as it heavily
>> employs dm-kcopyd. I have been exploring redesigning kcopyd in order to
>> achieve increased IOPS in dm-clone and dm-snapshot for small copies over
>> NVMe devices, but copy offload sounds even more promising, especially
>> for larger copies happening in the background (as is the case with
>> dm-clone's background hydration).
>>
>> Thanks,
>> Nikos
>
> If you can document your findings here it will be great for me to add it
> to the agenda.

My work focuses mainly on improving the IOPS and latency of the dm-snapshot target, in order to bring the performance of short-lived snapshots as close as possible to bare-metal performance.

My initial performance evaluation of dm-snapshot revealed a big performance drop while the snapshot is active; a drop which is not justified by COW alone.

Using fio with blktrace, I noticed that the per-CPU I/O distribution was uneven. Although many threads were doing I/O, only a couple of the CPUs ended up submitting I/O requests to the underlying device.

The same issue also affects dm-clone, when doing I/O with sizes smaller than the target's region size, where kcopyd is used for COW.

The bottleneck here is kcopyd serializing all I/O. Users of kcopyd, such as dm-snapshot and dm-clone, cannot take advantage of the increased I/O parallelism that comes with using blk-mq in modern multi-core systems, because I/Os are issued only by a single CPU at a time, the one on which kcopyd's thread happens to be running.

So, I experimented with redesigning kcopyd to prevent I/O serialization by respecting thread locality for I/Os and their completions. This made the distribution of I/O processing uniform across CPUs.

My measurements showed that scaling kcopyd, in combination with scaling dm-snapshot itself [1] [2], can lead to a ~300% increase in sustained throughput and a ~80% decrease in I/O latency for transient snapshots, over the null_blk device.

The work for scaling dm-snapshot has been merged [1], but, unfortunately, I haven't been able to send my kcopyd work upstream yet, because I have been really busy with other things over the last couple of years.

I haven't looked into the details of copy offload yet, but it would be really interesting to see how it affects the performance of random and sequential workloads, and to check whether, and how, scaling kcopyd affects performance in combination with copy offload.

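For illustration, the per-CPU distribution check mentioned above (fio plus blktrace) can be approximated with a few lines of Python. This is only a sketch and not the tooling used for the measurements in this mail. It assumes the default blkparse text layout (device, CPU, sequence, timestamp, PID, action, RWBS, ...) and counts "D" (issued to driver) events per CPU; the function names are made up for this example.

```python
from collections import Counter


def issues_per_cpu(blkparse_lines):
    """Count 'D' (issue) events per CPU from default-format blkparse lines.

    Assumed layout per event line (blkparse default):
        maj,min cpu seq timestamp pid action rwbs sector + size [process]
    """
    counts = Counter()
    for line in blkparse_lines:
        fields = line.split()
        # Skip summary lines and anything that does not look like an event.
        if len(fields) < 7 or "," not in fields[0] or not fields[1].isdigit():
            continue
        if fields[5] == "D":
            counts[int(fields[1])] += 1
    return counts


def share_per_cpu(counts):
    """Return each CPU's fraction of all issued requests, ordered by CPU id."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cpu: n / total for cpu, n in sorted(counts.items())}
```

Feeding it the lines of a saved blkparse dump, e.g. `share_per_cpu(issues_per_cpu(open("trace.txt")))` (hypothetical file name), gives one fraction per CPU. A serialized kcopyd would be expected to show most of the share concentrated on one or two CPUs, regardless of how many fio threads are running.
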
Nikos

[1] https://lore.kernel.org/dm-devel/20190317122258.21760-1-ntsironis@arrikto.com/
[2] https://lore.kernel.org/dm-devel/425d7efe-ab3f-67be-264e-9c3b6db229bc@arrikto.com/