Re: [bug report] shared tags causes IO hang and performance drop
From: John Garry <hidden>
Date: 2021-04-15 15:44:00
Also in:
linux-scsi
On 15/04/2021 13:18, Ming Lei wrote:
> On Thu, Apr 15, 2021 at 11:41:52AM +0100, John Garry wrote:
>> Hi Ming,
>>
>> I'll have a look.
>>
>> BTW, are you intentionally using scsi_debug over null_blk? null_blk
>> supports shared sbitmap as well, and performance figures there are
>> generally higher than scsi_debug for similar fio settings.
>
> I use both, but scsi_debug can cover scsi stack test.

Hi Ming,

I can't seem to recreate your same issue. Are you using the mainline
defconfig, or a special distro config?

What I am seeing is that throughput is fixed at ~32K IOPS for scsi_debug
with both modprobe configs and with both the none and mq-deadline IO
schedulers. CPU util seems a bit higher for hosttags with none.

When I tried null_blk, the performance diff between hosttags and non
hosttags was noticeable with the none IO scheduler, but not with
mq-deadline:

1) randread test with deadline

                | IOPS  | FIO CPU util
------------------------------------------------
hosttags*       | 325K  | usr=1.34%, sys=76.49%
------------------------------------------------
non hosttags**  | 325K  | usr=1.36%, sys=76.25%
------------------------------------------------

2) randread test with none

                | IOPS  | FIO CPU util
------------------------------------------------
hosttags*       | 6421K | usr=23.84%, sys=76.06%
------------------------------------------------
non hosttags**  | 6893K | usr=25.57%, sys=74.33%
------------------------------------------------

*  insmod null_blk.ko submit_queues=32 shared_tag_bitmap=1
** insmod null_blk.ko submit_queues=32

However, I don't think that the null_blk test is a good like-for-like
comparison, as setting shared_tag_bitmap just means sharing the same
tagset over all hctx, while still having the same count of hctx. Just
setting submit_queues=1 gives a big drop in performance, as would be
expected.

Thanks,
John
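
To put a number on the null_blk difference in the tables above, the relative IOPS drop can be worked out with a small helper. This is only an illustrative sketch: the function name iops_drop_pct is made up here, and the inputs are the IOPS figures from the tables, in thousands.

```shell
#!/bin/bash
# Relative IOPS drop of hosttags versus non hosttags, in percent.
# $1: IOPS without hosttags, $2: IOPS with hosttags (same unit for both).
# iops_drop_pct is a hypothetical helper name, not part of any tool.
iops_drop_pct() {
    awk -v base="$1" -v shared="$2" \
        'BEGIN { printf "%.1f\n", (base - shared) / base * 100 }'
}

iops_drop_pct 6893 6421   # null_blk, none scheduler
iops_drop_pct 325 325     # null_blk, mq-deadline scheduler
```

With the figures above this gives a drop of roughly 6.8% for none and 0.0% for mq-deadline.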

EOM

>>                                         IOPs
>> mq-deadline   usr=21.72%, sys=44.18%   423K
>> none          usr=23.15%, sys=74.01%   450K
>
> Today I re-ran the scsi_debug test on two servers (32 cores, dual NUMA
> nodes), and the CPU utilization issue can be reproduced. The test
> results follow:
>
> 1) randread test on ibm-x3850x6[*] with deadline
>
>               | IOPS | FIO CPU util
> ------------------------------------------------
> hosttags      | 94k  | usr=1.13%, sys=14.75%
> ------------------------------------------------
> non hosttags  | 124k | usr=1.12%, sys=10.65%
> ------------------------------------------------
>
> 2) randread test on ibm-x3850x6[*] with none
>
>               | IOPS | FIO CPU util
> ------------------------------------------------
> hosttags      | 120k | usr=0.89%, sys=6.55%
> ------------------------------------------------
> non hosttags  | 121k | usr=1.07%, sys=7.35%
> ------------------------------------------------
>
> *:
>     - that is the machine on which Yanhui reported VM cpu utilization
>       increased by 20%
>     - kernel: latest linus tree (v5.12-rc7, commit: 7f75285ca57)
>     - also ran the same test on another 32-core machine; an IOPS drop
>       isn't observed there, but CPU utilization is obviously increased
>
> 3) test script
>
> #!/bin/bash
>
> run_fio() {
>     RTIME=$1
>     JOBS=$2
>     DEVS=$3
>     BS=$4
>
>     QD=64
>     BATCH=16
>
>     fio --bs=$BS --ioengine=libaio \
>         --iodepth=$QD \
>         --iodepth_batch_submit=$BATCH \
>         --iodepth_batch_complete_min=$BATCH \
>         --filename=$DEVS \
>         --direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \
>         --name=test --group_reporting
> }
>
> SCHED=$1
>
> NRQS=`getconf _NPROCESSORS_ONLN`
>
> # hosttags case (host_max_queue set)
> rmmod scsi_debug
> modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256
> sleep 2
> DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`
> echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler
> echo 128 >/sys/block/`basename $DEV`/device/queue_depth
> run_fio 20 16 $DEV 8K
>
> # non hosttags case
> rmmod scsi_debug
> modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256
> sleep 2
> DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`
> echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler
> echo 128 >/sys/block/`basename $DEV`/device/queue_depth
> run_fio 20 16 $DEV 8k
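
For anyone wanting to repeat the null_blk comparison from the reply, a rough sketch in the same shape as the scsi_debug script above might look like the following. Several things here are assumptions and not taken from the thread: the fio options are simply copied from the scsi_debug script (the reply does not state which fio settings were used for null_blk), /dev/nullb0 is assumed to be the device node, modprobe is used in place of insmod, and DRY_RUN is a hypothetical switch that only prints the commands by default. Only the module parameters (submit_queues=32, shared_tag_bitmap=1) come from the reply.

```shell
#!/bin/bash
# Sketch only: null_blk hosttags vs non hosttags comparison.
# DRY_RUN=1 (the default here) just prints the commands.
# Set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN=${DRY_RUN:-1}
SCHED=${1:-none}
DEV=/dev/nullb0    # assumption: first null_blk device node

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Select the IO scheduler for $DEV (needs a redirect, so handled separately).
set_sched() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "echo $SCHED > /sys/block/$(basename "$DEV")/queue/scheduler"
    else
        echo "$SCHED" > "/sys/block/$(basename "$DEV")/queue/scheduler"
    fi
}

# "$@" holds the null_blk module parameters for this case.
run_case() {
    run modprobe null_blk "$@"
    set_sched
    # fio options assumed to match the scsi_debug script quoted above
    run fio --bs=8k --ioengine=libaio --iodepth=64 \
        --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
        --filename="$DEV" --direct=1 --runtime=20 --numjobs=16 \
        --rw=randread --name=test --group_reporting
    run rmmod null_blk
}

run_case submit_queues=32 shared_tag_bitmap=1    # hosttags
run_case submit_queues=32                        # non hosttags
```

Running it without arguments prints the command sequence for the none scheduler; passing mq-deadline as the first argument selects that scheduler.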