[LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os | linux-nvme

From: Matias Bjørling <hidden>
Date: 2017-01-06 12:51:31
Also in: linux-block, linux-fsdevel

On 01/06/2017 02:11 AM, Theodore Ts'o wrote:

On Thu, Jan 05, 2017 at 10:58:57PM +0000, Slava Dubeyko wrote:

quoted

The next point is read disturbance. If the BER of a physical page/block reaches some threshold, then
we need to move data from one page/block to another. Which subsystem will be
responsible for this activity? The drive-managed case expects the device's GC to manage
the read disturbance issue. But what about the host-aware or host-managed case? If the host side
has no information about BER, then the host's software is unable to manage this issue. In the end,
it sounds like we will have a GC subsystem on the file system side as well as on the device side. As a result,
it means possible unpredictable performance degradation and decreased device lifetime.
Let's imagine that the host-aware case could be unaware of read disturbance management.
But how can the host-managed case manage this issue?

One of the ways this could be done in the ZBC specification (assuming
that erase blocks == zones) would be to set the "reset" bit in the zone
descriptor which is returned by the REPORT ZONES EXT command.  This is
a hint that a reset write pointer should be sent to the zone in
question, and it could be set when you start seeing soft ECC errors or
the flash management layer has decided that the zone should be
rewritten in the near future.  A simple way to do this is to ask the
Host OS to copy the data to another zone and then send a reset write
pointer command for the zone.

This is an interesting approach. Currently, the OCSSD interface uses
the soft ECC mark to tell the host to rewrite, and it also has an
explicit method to make the host rewrite the data, e.g., in the case
where read scrubbing on the device requires the host to move data due
to durability.

Adding the information to "Report zones" is a good idea. It enables
the device to keep a list of "zones" that should be refreshed by the
host but have not yet been. I will add that to the specification.
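
The host-side refresh flow discussed above (the device flags a zone, the host copies the data elsewhere and then resets the flagged zone) could look roughly like the following. This is only a minimal sketch: the `Zone` class, the `reset_recommended` field, and the `find_free_zone` and `refresh_zones` helpers are hypothetical names for illustration and do not correspond to any real ZBC or OCSSD structure or API.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """Host-side view of one zone (hypothetical model, not a real descriptor)."""
    zone_id: int
    data: list = field(default_factory=list)  # blocks written so far, in order
    reset_recommended: bool = False           # hint reported by the device


def find_free_zone(zones, exclude):
    """Return the first empty, unflagged zone other than `exclude`, or None."""
    for zone in zones:
        if zone is not exclude and not zone.data and not zone.reset_recommended:
            return zone
    return None


def refresh_zones(zones):
    """Copy every flagged zone into a free zone, then reset the flagged zone.

    Returns a dict mapping source zone id -> destination zone id.
    """
    moved = {}
    for src in zones:
        if not src.reset_recommended:
            continue
        if src.data:
            dst = find_free_zone(zones, exclude=src)
            if dst is None:
                break  # no free zone left; remaining flags wait for a later pass
            dst.data = list(src.data)  # sequential copy into the new zone
            moved[src.zone_id] = dst.zone_id
        # Stands in for sending a reset write pointer to the flagged zone.
        src.data.clear()
        src.reset_recommended = False
    return moved
```

In this model a flagged zone that cannot be relocated simply keeps its flag, which mirrors the idea of the device keeping a list of zones that still await a refresh.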

So I think it very much could be done, and done within the framework
of the ZBC model --- although whether SSD manufacturers will choose to
do this, and/or choose to engage the T10/T13 standards committees to
add the necessary extensions to the ZBC specification is a question
that we probably can't answer in this venue or by the participants on
this thread.

quoted

Wear leveling... The device will be responsible for managing wear-leveling in the device-managed
and host-aware models. It looks like the host side should be responsible for managing wear-leveling
in the host-managed case. But that means the host should manage bad blocks and have direct
access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by the device's indirection
layer and wear-leveling management will be unavailable on the host side. As a result, the device will have
internal GC and the traditional issues (possible unpredictable performance degradation and decreased
device lifetime).

So I can imagine a setup where the flash translation layer manages the
mapping between zone numbers and the physical erase blocks, such that
when the host OS issues a "reset write pointer", it immediately gets
a new erase block assigned to the specific zone in question.  The
original erase block would then get erased in the background, when the
flash chip in question is available for maintenance activities.
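
The remapping idea above can be sketched in a few lines. This is an illustration under stated assumptions, not how any actual FTL is implemented: the `ZoneMapper` class, its method names, and the simple FIFO choice of the next free block are all hypothetical, and a real device would presumably also factor in wear and bad blocks.

```python
from collections import deque


class ZoneMapper:
    """Toy zone -> erase block mapping with deferred erases (hypothetical)."""

    def __init__(self, num_zones, num_blocks):
        if num_blocks <= num_zones:
            raise ValueError("need at least one spare erase block")
        self.zone_to_block = {z: z for z in range(num_zones)}
        self.free_blocks = deque(range(num_zones, num_blocks))
        self.pending_erase = deque()
        self.erase_counts = [0] * num_blocks

    def reset_write_pointer(self, zone):
        """Point the zone at a fresh erase block; queue the old one for erase."""
        if not self.free_blocks:
            # No spare block ready: fall back to erasing one synchronously.
            self.background_erase(max_blocks=1)
        old_block = self.zone_to_block[zone]
        self.zone_to_block[zone] = self.free_blocks.popleft()
        self.pending_erase.append(old_block)
        return self.zone_to_block[zone]

    def background_erase(self, max_blocks=1):
        """Erase up to max_blocks queued blocks, e.g. while the chip is idle."""
        erased = 0
        while self.pending_erase and erased < max_blocks:
            block = self.pending_erase.popleft()
            self.erase_counts[block] += 1
            self.free_blocks.append(block)
            erased += 1
        return erased
```

As long as a spare block is available, the reset only updates the mapping and the expensive erase happens later; the reset only pays for an erase when the pool of spare blocks is exhausted.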

I think you've been thinking about a model where *either* the host has
complete control over all aspects of the flash management, or the FTL
has complete control --- and it may be that there are more clever ways
that the work could be split between flash device and the host OS.

quoted

Another interesting question... Let's imagine that we create a file system volume for one device
geometry. It means that geometry details will be stored in the file system metadata during volume
creation in the host-aware or host-managed case. Then we back up this volume and restore
it on a device with a completely different geometry. So, what will we have in such a case?
Performance degradation? Or will we kill the device?

This is why I suspect that exposing the full details of
the Flash layout via LUNS is a bad, bad, BAD idea.  It's much better
to use an abstraction such as Zones, and then have an abstraction
layer that hides the low-level details of the hardware from the OS.
The trick is picking an abstraction that exposes the _right_ set of
details so that the division of labor between the Host OS and the
storage device is at a better place.  Hence my suggestion of perhaps
providing a virtual mapping layer between "Zone number" and the
low-level physical erase block.

Agreed. The first approach was taken in the first iteration of the
specification. After release, we began to understand the chaos we had
just brought upon ourselves, so we moved to the zone/chunk approach in
the second iteration to simplify the interface.

quoted

I would like to have access to channels/LUNs/zones at the file system level.
If, for example, a LUN is associated with a partition, then it means
that we will need to aggregate several partitions inside one volume.
First of all, not every file system is ready to aggregate several
partitions inside one volume. Secondly, what about aggregating
several physical devices inside one volume? It looks slightly
tricky to distinguish partitions of the same device from different devices
at the file system level, doesn't it?

Yes, this is why using LUN's is a BAD idea.  There's too much code
--- in file systems, in the block layer in terms of how we expose
block devices, etc., that assumes that different LUN's are used for
different logical containers of storage.  There have been decades of
usage of this concept by enterprise storage arrays.  Trying to
appropriate LUN's for another use case is stupid.  And maybe we can't
stop OCSSD folks if they have gone down that questionable design path,
but there's nothing that says we have to expose it as a SCSI LUN
inside of Linux!

Heh, yes, really bad idea. The naming of "LUNs" for OCSSDs could have
been chosen better. Going forward, it is being renamed to "parallel
unit". For OCSSDs, all the device's parallel units are exposed through
the same block device "LUN", which then has to be managed by the layers
above.

quoted

OK. But I assume that an SMR zone "reset" is significantly cheaper than
a NAND flash block erase operation. You can fill your SMR zone with
data, then "reset" it and fill it again with data without significant penalty.

If you have a virtual mapping layer between zones and erase blocks, a
reset write pointer could be fast for SSD's as well.  And that allows
the implementation of your suggestion below:

quoted

Also, TRIM and zone "reset" are different, I suppose, because TRIM looks
like a hint for the SSD controller. If the SSD controller receives a TRIM for some
erase block, it doesn't mean that the erase operation will be done
immediately. Usually, it should be done in the background because a real
erase operation is expensive.

Cheers,

					- Ted
