[LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
From: Matias Bjørling <hidden>
Date: 2017-01-06 12:51:31
Also in:
linux-block, linux-fsdevel
On 01/06/2017 02:11 AM, Theodore Ts'o wrote:
On Thu, Jan 05, 2017 at 10:58:57PM +0000, Slava Dubeyko wrote:
quoted
Next point is read disturbance. If the BER of a physical page/block reaches some threshold, then we need to move data from one page/block into another one. What subsystem will be responsible for this activity? The drive-managed case expects that the device's GC will manage the read disturbance issue. But what about the host-aware or host-managed case? If the host side has no information about BER, then the host's software is unable to manage this issue. Finally, it sounds like we will have a GC subsystem on the file system side as well as on the device side. As a result, it means possible unpredictable performance degradation and decreased device lifetime. Let's imagine that the host-aware case could be unaware of read disturbance management. But how can the host-managed case manage this issue?

One of the ways this could be done in the ZBC specification (assuming that erase blocks == zones) would be to set the "reset" bit in the zone descriptor which is returned by the REPORT ZONES EXT command. This is a hint that a reset write pointer should be sent to the zone in question, and it could be set when you start seeing soft ECC errors or the flash management layer has decided that the zone should be rewritten in the near future. A simple way to do this is to ask the Host OS to copy the data to another zone and then send a reset write pointer command for the zone.
This is an interesting approach. Currently, the OCSSD interface uses the soft ECC mark to tell the host to rewrite, and it also has an explicit method to make the host rewrite the data, e.g., in the case where read scrubbing on the device requires the host to move data for durability. Adding the information to "Report zones" is a good idea. It enables the device to keep a list of "zones" that should be refreshed by the host but have not yet been. I will add that to the specification.
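The host-side flow discussed above (scan the zone report, copy hinted zones elsewhere, then reset them) might look roughly like the sketch below. It is only a toy in-memory model: the `Zone` class, the `reset_hint` field, and `refresh_hinted_zones` are hypothetical names invented for illustration and do not correspond to actual ZBC or OCSSD structures or commands.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """Toy in-memory zone; field names are hypothetical, not ZBC fields."""
    zone_id: int
    data: list = field(default_factory=list)  # blocks written so far
    reset_hint: bool = False                   # device asks the host to rewrite

    @property
    def write_pointer(self) -> int:
        return len(self.data)


def refresh_hinted_zones(zones):
    """Copy every hinted zone into an empty zone, then reset the source.

    Returns a dict mapping old zone_id -> new zone_id for the moved data.
    """
    moved = {}
    for src in zones:
        if not src.reset_hint or src.write_pointer == 0:
            continue
        # Pick any empty zone that is not itself flagged as the destination.
        dst = next((z for z in zones
                    if z.write_pointer == 0 and not z.reset_hint), None)
        if dst is None:
            break  # no free zone; the remaining hints stay pending
        dst.data.extend(src.data)  # sequential copy into the new zone
        src.data.clear()           # models "reset write pointer"
        src.reset_hint = False     # assumed: the device clears the hint on reset
        moved[src.zone_id] = dst.zone_id
    return moved
```

In this model the device only raises the hint; the host decides when to act on it, so pending hints simply stay set until a free zone is available.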
So I think it very much could be done, and done within the framework of the ZBC model --- although whether SSD manufacturers will choose to do this, and/or choose to engage the T10/T13 standards committees to add the necessary extensions to the ZBC specification is a question that we probably can't answer in this venue or by the participants on this thread.
quoted
Wear leveling... The device will be responsible for managing wear leveling in the device-managed and host-aware models. It looks like the host side should be responsible for managing wear leveling in the host-managed case. But that means the host should manage bad blocks and have direct access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by the device's indirection layer and wear-leveling management will be unavailable on the host side. As a result, the device will have internal GC and the traditional issues (possible unpredictable performance degradation and decreased device lifetime).

So I can imagine a setup where the flash translation layer manages the mapping between zone numbers and the physical erase blocks, such that when the host OS issues a "reset write pointer", it immediately gets a new erase block assigned to the specific zone in question. The original erase block would then get erased in the background, when the flash chip in question is available for maintenance activities. I think you've been thinking about a model where *either* the host has complete control over all aspects of the flash management, or the FTL has complete control --- and it may be that there are more clever ways that the work could be split between flash device and the host OS.
quoted
Another interesting question... Let's imagine that we create a file system volume for one device geometry. It means that geometry details will be stored in the file system metadata during volume creation for the host-aware or host-managed case. Then we back up this volume and restore it on a device with a completely different geometry. So, what will we have in such a case? Performance degradation? Or will we kill the device?

This is why I suspect that exposing the full details of the Flash layout via LUNS is a bad, bad, BAD idea. It's much better to use an abstraction such as Zones, and then have an abstraction layer that hides the low-level details of the hardware from the OS. The trick is picking an abstraction that exposes the _right_ set of details so that the division of labor between the Host OS and the storage device is at a better place. Hence my suggestion of perhaps providing a virtual mapping layer between "Zone number" and the low-level physical erase block.
Agree. The first approach was taken in the first iteration of the specification. After release, we began to understand the chaos we had brought upon ourselves, so we moved to the zone/chunk approach in the second iteration to simplify the interface.
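The virtual mapping layer between zone numbers and physical erase blocks suggested above can be sketched as follows. This is a minimal illustration under stated assumptions, not how any real FTL or the OCSSD specification works: `ZoneMapper`, its methods, and the `budget` parameter are all hypothetical. The point it tries to show is that a "reset write pointer" becomes a cheap remap to a pre-erased block, while the actual erase is deferred.

```python
from collections import deque


class ZoneMapper:
    """Toy FTL mapping zone numbers to physical erase blocks (hypothetical)."""

    def __init__(self, num_zones, num_blocks):
        if num_blocks <= num_zones:
            raise ValueError("need spare erase blocks beyond the zone count")
        self.zone_to_block = {z: z for z in range(num_zones)}
        self.free_blocks = deque(range(num_zones, num_blocks))  # already erased
        self.pending_erase = deque()                             # awaiting erase

    def reset_write_pointer(self, zone):
        """Remap the zone to a pre-erased block; defer erasing the old one."""
        if not self.free_blocks:
            self.background_erase()  # no spare left: fall back to erasing now
        old = self.zone_to_block[zone]
        self.zone_to_block[zone] = self.free_blocks.popleft()
        self.pending_erase.append(old)
        return self.zone_to_block[zone]

    def background_erase(self, budget=1):
        """Erase up to `budget` pending blocks, e.g. when the chip is idle."""
        done = 0
        while self.pending_erase and done < budget:
            self.free_blocks.append(self.pending_erase.popleft())
            done += 1
        return done
```

A real device would presumably choose the replacement block by erase count rather than FIFO order, which is where wear leveling could hook in; the FIFO queue here is only the simplest choice for the sketch.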
quoted
I would like to have access to channels/LUNs/zones at the file system level. If, for example, a LUN is associated with a partition, then it means that several partitions will need to be aggregated inside one volume. First of all, not every file system is ready for the aggregation of several partitions inside one volume. Secondly, what about the aggregation of several physical devices inside one volume? It looks slightly tricky to distinguish partitions of the same device from different devices at the file system level. Isn't it?

Yes, this is why using LUNs is a BAD idea. There's too much code --- in file systems, in the block layer in terms of how we expose block devices, etc., that assumes that different LUNs are used for different logical containers of storage. There have been decades of usage of this concept by enterprise storage arrays. Trying to appropriate LUNs for another use case is stupid. And maybe we can't stop OCSSD folks if they have gone down that questionable design path, but there's nothing that says we have to expose it as a SCSI LUN inside of Linux!
Heh, yes, really bad idea. The naming of "LUNs" for OCSSDs could have been chosen better. It is being renamed to "parallel unit". For OCSSDs, all the device's parallel units are exposed through the same block device "LUN", which then has to be managed by the layers above.
quoted
OK. But I assume that an SMR zone "reset" is significantly cheaper than a NAND flash block erase operation. And you can fill your SMR zone with data, then "reset" it and fill it again with data without significant penalty.

If you have a virtual mapping layer between zones and erase blocks, a reset write pointer could be fast for SSDs as well. And that allows the implementation of your suggestion below:
quoted
Also, TRIM and zone "reset" are different, I suppose, because TRIM looks like a hint for the SSD controller. If the SSD controller receives TRIM for some erase block, it doesn't mean that the erase operation will be done immediately. Usually, it should be done in the background because a real erase operation is expensive.

Cheers,

- Ted