Re: [PATCH v6 3/4] SCT Write Same / DSM Trim

From: Tom Yan <hidden>
Date: 2016-08-22 18:52:33
Also in: lkml

In that case I see no reason that my suggestion should not be adopted.
Currently speaking (as I mentioned in the commit message) it is
reasonable to decrease it per logical sector size in the TRIM-only
sense as well because of the block layer / bio size limit.

FWIW, as of ACS-4, RANGE LENGTH field in DSM(XL)/TRIM range entry
should be "number of logical sectors". Therefore, if there will be any
4Kn SSDs, _bytes_ TRIM'd per command is expected to be increased with
the sector size as well, just like SCT Write Same.

For the "payload block size" that is "always" 512-byte as per the same
spec, I don't think we need to concern about it. I think it only
matters if we want to enable multi-block TRIM payload according to the
reported limit in IDENTIFY DEVICE data. Probably it merely means that
the "each of the blocks" in the reported limit will always mean 64
TRIM entries, even on 4Kn drives, instead of 512.

On 23 August 2016 at 02:00, Shaun Tancheff [off-list ref] wrote:

On Mon, Aug 22, 2016 at 12:02 PM, Tom Yan [off-list ref] wrote:

quoted

On 22 August 2016 at 15:04, Shaun Tancheff [off-list ref] wrote:

quoted

On Mon, Aug 22, 2016 at 3:33 AM, Tom Yan [off-list ref] wrote:

quoted

On 22 August 2016 at 08:31, Tom Yan [off-list ref] wrote:

quoted

Since I have no experience with SCT Write Same at all, and neither do
I own any spinning HDD at all, I cannot firmly suggest what to do. All
I can suggest is: should we decrease it per sector size? Or would 2G
per command still be too large to avoid timeout?

Timeout for WS is 120 seconds so we should be fine there.

The number to look for is the:
   Max. Sustained Transfer Rate OD (MB/s): 190 8TB (180 5TB)

Which means the above drives should complete a 2G write in
about 10 to 11 seconds.

If these were 4Kn drives and we allowed a 16G max then it
would be 80-90 seconds, assuming the write speed didn't
get any better.

So holding the maximum to around 2G is probably the best
overall, in my opinion.

--
Shaun Tancheff

On Mon, Aug 22, 2016 at 12:02 PM, Tom Yan [off-list ref] wrote:

quoted

On 22 August 2016 at 15:04, Shaun Tancheff [off-list ref] wrote:

quoted

On Mon, Aug 22, 2016 at 3:33 AM, Tom Yan [off-list ref] wrote:

quoted

On 22 August 2016 at 08:31, Tom Yan [off-list ref] wrote:

quoted

As mentioned before, as of the latest draft of ACS-4, nothing about a
larger payload size is mentioned. Conservatively speaking, it sort of

*payload block size

quoted

means that we are allowing four 512-byte block payload on 4Kn device

*eight 512-byte-block payload

quoted

regardless of the reported limit in the IDENTIFY DEVICE data. I am
really not sure if it's a good thing to do. Doesn't seem necessary
anyway, especially when our block layer does not support such a large
bio size (well, yet), so each request will end up using a payload of
two 512-byte blocks at max anyway.

Also, it's IMHO better to do it in a seperate patch (series) after the
SCT Write Same support has entered libata's repo too, because this has
nothing to with it but TRIM translation. In case the future ACS
standards has clearer/better instruction on this, it will be easier
for us to revert/follow up too.

I am certainly fine with dropping this patch as it is not critical to
the reset of the series.

Nothing will break if we stick with the 512 byte fixed limit. This
is at most a prep patch for handling increased limits should
they be reported.

All it really is doing is acknowledging that any write same
must have a payload of sector_size which can be something
larger than 512 bytes.

Actually I am not sure if we should hard code the limit
ata_format_dsm_trim_descr() / ata_set_lba_range_entries() at all. The
current implementation (with or without your patch) seems redundant
and unnecessary to me.

All we need to do should be: making sure that the block limits VPD
advertises a safe Maximum Write Same Length, and reject Write Same
(16) commands that have "number of blocks" that exceeds the limit
(which is what I introduced in commit 5c79097a28c2, "libata-scsi:
reject WRITE SAME (16) with n_block that exceeds limit").

In that case, we don't need to hard code the limit in the
while-condition again; instead we should just make it end with the
request size, since the accepted request could never be larger than
the limit we advertise.

quoted

And you'll need to fix the Block Limits VPD simulation
(ata_scsiop_inq_b0) too, so that it will advertise the Maximum Write
Same Length dynamically as per the logical sector size, otherwise your
effort will be completely in vain, even if our block layer is
overhauled in the future.

Martin had earlier suggested that I leave the write same defaults
as is due to concerns with misbehaving hardware.

It doesn't really apply in libata's anyway. SD_MAX_WS10_BLOCKS means
nothing to ATA drives, except from coincidentally being the same value
as ATA_MAX_SECTORS_LBA48 (which technically should have been 65536
instead).

quoted

I think your patch adjusting the reported limits is reasonable
enough. It seems to me we should have the hardware
report it's actual limits, for example, report what the spec
allows.

As you mentioned yourself before, technically SCT Write Same does not
have a limit. The only practical limit is the timeout in the SCSI
layer, so the actual bytes being (over)written is probably our only
concern.

For the case of TRIM, devices do report a limit in their IDENTIFY
DEVICE data. However, as Martin always said, it is not an always-safe
piece of data for us to refer to, that's why we have been statically
allowing only 1-block payload.

Therefore, it seems convenient (and consistent) that we make SCT Write
Same always use the same limit as TRIM, no matter if it is supported
on a certain device. And to make sure the actual bytes being written /
time required per command does not increase enormously as per the
sector size, we decrease the limit accordingly. Certainly that's not
necessary if 16G per command is fine on most devices.

Also, does SCT Write Same commands that write 32M/256M per command
make any sense? I mean would we benefit from such small SCT Write Same
commands at all?

quoted

Of course there are lots of reasons to limit the absolute
maximums.

So in this case we are just enabling the limit to be
increased but not changing the current black-listing
that distrusts DSM Trim. Once we have 4Kn devices
to test then we can start white-listing and see if there
is an overall increase in performance.

quoted

Please be noted that, since your haven't touched ata_scsiop_inq_b0 at
all, the reported Maximum Write Same Length will be:

On device with TRIM support:
- 4194240 LOGICAL sector per request split / command
-- ~=2G on non-4Kn drives
-- ~=16G on non-4Kn drives

On device without TRIM support:
- 0 --> SD_MAX_WS10_BLOCKS (65535) per request split / command
-- ~= 32M on non-4Kn drives
-- ~=256M on non-4Kn drives

Even if we ignore the upper limit(s) of the block layer, do we want
such inconsistencies?

Hmm. Overall I think it is still okay if a bit confusing.
It is possible that for devices which support SCT Write Same
and DSM TRIM will still Trim faster than they can Write Same,
However the internal implementation is opaque so I can't
say if Write Same is often implemented in terms of TRIM
or not. I mean that's how _I_ do it [Write 1 block and map
N blocks to it], But not every FTL will have come to the
same conclusion.

Why would SCT Write Same be implemented in terms of TRIM? Neither
would we need to care about that anyway. Considering we will unlikely
allow multi-block payload TRIM, and we probably have no reason to
touch the SCSI Write Same timeout, the only thing we need to consider
is whether we want to decrease the advertised limit base on the
typical SCT Write Same speed on traditional HDDs and the timeout,
especially in the 4Kn case.

Since I have no experience with SCT Write Same at all, and neither do
I own any spinning HDD at all, I cannot firmly suggest what to do. All
I can suggest is: should we decrease it per sector size? Or would 2G
per command still be too large to avoid timeout?

quoted

I also suspect that given the choice for most use casess that
TRIM is preferred over WS when TRIM supports returning
zeroed blocks.

Well, for devices with discard_zeroes_data = 1, the block layer will
not issue write same requests (see blkdev_issue_zeroout in
block/blk-lib.c). However, libata only consider the RZAT support bit
from a white list of devices (see ata_scsiop_read_cap in libata-scsi
and the white list in libata-core).

quoted

--
Shaun Tancheff



--
Shaun Tancheff

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help