Re: [PATCH v2 2/2] libata-core: do not set dev->max_sectors for LBA48 devices
From: Tom Yan <hidden>
Date: 2016-08-11 09:31:00
Also in:
linux-ide, linux-scsi
On 11 August 2016 at 11:37, Martin K. Petersen [off-list ref] wrote:
quoted
quoted
quoted
quoted
quoted
"Tom" == Tom Yan [off-list ref] writes:I don't agree with conflating the optimal transfer size and the maximum supported ditto. Submitting the largest possible I/O to a device does not guarantee that you get the best overall performance. - max_hw_sectors is gated by controller DMA constraints. - max_dev_sectors is set for devices that explicitly report a transfer length limit. - max_sectors, the soft limit for filesystem read/write requests, should be left at BLK_DEF_MAX_SECTORS unless the device explicitly requests transfers to be aligned multiples of a different value (typically the internal stripe size in large arrays).
Shouldn't we use Maximum Transfer Length to derive max_sectors (and get rid of the almost useless max_dev_sectors)? Honestly it looks pretty non-sensical to me that the SCSI disk driver uses Optimal Transfer Length for max_sectors. Also in libata's case, this make setting the effective max_sectors (e.g. see ATA_HORKAGE_MAX_SEC_LBA48) impossible if we do not want to touch io_opt. It would look to me that our block layer simply have a flawed design if we really need to derive both io_opt and max_sectors from the same field.
The point of BLK_DEF_MAX_SECTORS is to offer a reasonable default for common workloads unless otherwise instructed by the storage device. We can have a discussion about what the right value for BLK_DEF_MAX_SECTORS should be. It has gone up over time but it used to be the case that permitting large transfers significantly impacted interactive I/O performance. And finding a sweet spot that works for a wide variety of hardware, interconnects and workloads is obviously non-trivial.
If BLK_DEF_MAX_SECTORS is supposed to be used as a fallback, then it should be a safe value, especially when max_sectors_kb can be adjusted through sysfs. But the biggest problem isn't on bumping it, but the value picked is totally irrational for a general default. I mean, given that it was 1024 (512k), try to double it? Fine. Try to quadruple it? Alright. We'll need to deal with some alignment / boundary issue (like the typical 65535 vs 65536 case)? Okay let's do it. But what's the sense in picking a random RAID configuartion as the base to decide the default? Also, if max_sectors need to concern about the number of disks used and chunk sizes in a RAID configuartion, it should be calculated in the device-mapper layer or a specific driver or so. Changing a block layer default won't help anyway. Say 2560 will accomodate a 10-disk 128k-chunk RAID. What about a 12-disk 128k-chunk RAID then? Why not just decide the value base on an 8-disk 128k-chunk RAID, which HAPPENED to be a double of 1024 as well? It does not make sense that the SCSI disk driver uses it as the fallback either. SCSI host templates that does not have max_sectors set (as well as some specific driver) will use SCSI_DEFAULT_MAX_SECTORS as the fallback, for such hosts max_hw_sectors will be 1024, where the current BLK_DEF_MAX_SECTORS cannot apply as max_sectors anyway. So we should use also SCSI_DEFAULT_MAX_SECTORS in the SCSI disk driver as fallback for max_sectors. If the value is considered to low even as a safe fallback, then it should be bumped appropriately. (Or we might want to replace it with BLK_DEF_MAX_SECTORS everywhere in the SCSI layer, that said, after the value is fixed.)
-- Martin K. Petersen Oracle Linux Engineering