Re: [PATCH v4 6/6] io_uring: add support for zone-append
From: Kanchan Joshi <hidden>
Date: 2020-09-28 18:58:43
Also in:
io-uring, linux-block, linux-fsdevel, lkml
On Fri, Sep 25, 2020 at 8:22 AM Damien Le Moal [off-list ref] wrote:
> On 2020/09/25 2:20, Kanchan Joshi wrote:
>> On Tue, Sep 8, 2020 at 8:48 PM hch@infradead.org [off-list ref] wrote:
>>> On Mon, Sep 07, 2020 at 12:31:42PM +0530, Kanchan Joshi wrote:
>>>> But there are use-cases which benefit from supporting zone-append on the raw block-dev path. Certain user-space log-structured/CoW FS/DBs will use the device that way. Aerospike is one example. Pass-through is synchronous, and we lose the ability to use io-uring.
>>>
>>> So use zonefs, which is designed exactly for that use case.
>>
>> Not specific to zone-append, but in general it may not be good to lock new features/interfaces to ZoneFS alone, given that the direct-block interface has its own merits. Mapping one file to one zone is good for some use-cases, but limiting for others. Some user-space FS/DBs would be more efficient (less metadata, less indirection) with the freedom to decide file-to-zone mapping/placement.
>
> There is no metadata in zonefs. One file == one zone, and the mapping between zonefs files and zones is static, determined at mount time simply using report zones. Zonefs files cannot be renamed or deleted in any way. Choosing a zonefs file *is* the same as choosing a zone. Zonefs is *not* a POSIX file system doing dynamic block allocation to files. The backing storage of files in zonefs is static and fixed to the zone they represent. The difference between zonefs and a raw zoned block device is the API that has to be used by the application, that is, a file descriptor representing the entire disk for the raw disk vs a file descriptor representing one zone in zonefs. Note that the latter has *a lot* of advantages over the former: it enables O_APPEND use, protects against bugs caused by user write offset mistakes, adds consistency of cached data against zone resets, and more.
>> - RocksDB and those LSM-style DBs would map an SSTable to a zone, but an SSTable file may be too small (initially) and may become too large (after compaction) for a zone.
>
> You are contradicting yourself here. If an SSTable is mapped to a zone, then its size cannot exceed the zone capacity, regardless of the interface used to access the zones. And except for L0 tables, which can be smaller (and are in memory anyway), tables at all levels have the same maximum size, which for zoned drives must be the zone capacity. In any case, solving any problem in this area does not depend in any way on the zonefs vs raw disk interface. The implementation will differ depending on the chosen interface, but what needs to be done to map SSTables to zones is the same in both cases.
>> - The internal parallelism of a single zone is a design choice, and depends on the drive. Writing multiple zones in parallel (striped/RAID way) can give better performance than writing to one. In that case one would want a file that seamlessly combines multiple zones in a striped fashion.
>
> Then write a FS for that... Or have a library do it in user space. For the library case, the implementation will differ for zonefs vs raw disk due to the different API (regular file vs block device file), but the principles to follow for striping zones into a single storage object remain the same.
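
The striping principle mentioned above can be sketched as a plain address mapping that is independent of whether the zones are reached through zonefs files or a raw block device. This is only a minimal illustration in Python; the function name `stripe_map` and its parameters are hypothetical and do not come from any library or from the patch series.

```python
def stripe_map(logical_off, stripe_unit, n_zones):
    """Map a logical byte offset to (zone_index, offset_in_zone), RAID-0 style.

    Hypothetical helper for illustration only. stripe_unit is the number of
    bytes written to one zone before moving on to the next zone.
    """
    if logical_off < 0 or stripe_unit <= 0 or n_zones <= 0:
        raise ValueError("invalid striping parameters")
    # Which stripe unit the offset falls in, and where inside that unit.
    stripe_no, within = divmod(logical_off, stripe_unit)
    # Stripe units are dealt round-robin across the zones.
    row, zone_index = divmod(stripe_no, n_zones)
    return zone_index, row * stripe_unit + within


# Sequential logical offsets produce sequential offsets inside each zone,
# which is what a sequential-write-required zone needs.
for off in range(0, 24, 4):
    print(off, stripe_map(off, stripe_unit=4, n_zones=3))
```

With zonefs the `zone_index` would select one of several open zone files; with a raw block device it would select a zone start offset on the single device file descriptor.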
ZoneFS is better when it is about dealing at single-zone granularity, and direct-block seems better when it is about grouping zones (in various ways including striping). The latter case (i.e. grouping zones) requires more involved mapping, and I agree that it can be left to the application (for both ZoneFS and raw-block backends). But when an application tries that on ZoneFS, apart from mapping there would be the additional cost of indirection/fd-management (due to file-on-files). And if new features (zone-append for now) are available only on ZoneFS, it forces the application to use something that may not be optimal for its needs.

Coming to the original problem of plumbing append: I think the divergence started because RWF_APPEND did not have any meaning for a block device. Did I miss any other reason? How about write-anywhere semantics (an RWF_RELAXED_WRITE or RWF_ANONYMOUS_WRITE flag) on block-dev? Zone-append works a lot like write-anywhere on block-dev (or on any other file that combines multiple zones in a non-sequential fashion).
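
To make the "write-anywhere" comparison concrete, here is a toy in-memory model of a single sequential zone in Python. It is a sketch under stated assumptions: the `Zone` class and its methods are hypothetical names for illustration, not the kernel, NVMe, or io_uring interface. A regular write must name the current write pointer, while an append names no offset and learns the assigned one on completion.

```python
class Zone:
    """Toy model of one sequential-write-required zone (illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.wp = 0  # write pointer, relative to the zone start

    def _advance(self, length):
        if length <= 0 or self.wp + length > self.capacity:
            raise ValueError("write does not fit in the zone")
        self.wp += length

    def write(self, offset, length):
        """Regular write: the caller must target the write pointer exactly."""
        if offset != self.wp:
            raise ValueError("unaligned write: offset != write pointer")
        self._advance(length)

    def append(self, length):
        """Append: the zone picks the location and reports it back."""
        assigned = self.wp
        self._advance(length)
        return assigned


zone = Zone(capacity=8)
first = zone.append(3)   # caller did not choose an offset
second = zone.append(2)
print(first, second, zone.wp)
```

Because the caller of `append()` never supplies an offset, several submitters can target the same zone without coordinating on the write pointer, which is the property the proposed write-anywhere flags would expose on a block device.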
>> Also it seems difficult (compared to block dev) to fit the simple-copy TP in ZoneFS. The new command needs: one NVMe drive, a list of source LBAs and one destination LBA. In ZoneFS, we would deal with N+1 file descriptors (N source zone files, and one destination zone file) for that. While with the block interface, we do not need more than one file descriptor representing the entire device. With more zone files, we face open/close overhead too.
>
> Are you expecting simple-copy to allow requests that are not zone aligned? I do not think that will ever happen. Otherwise, the gotcha cases for it would be far too numerous. Simple-copy is essentially an optimized regular write command. Similarly to that command, it will not allow copies over zone boundaries and will need the destination LBA to be aligned to the destination zone WP. I have not checked the TP though, and given the NVMe NDA, I will stop the discussion here.
TP is ratified, if that is the problem you are referring to.
> sendfile() could be used as the interface for simple-copy. Implementing that in zonefs would not be that hard. What is your plan for the simple-copy interface for a raw block device? An ioctl? sendfile() too? As with any other user-level API, we should not be restricted to a particular device type if we can avoid it, so in-kernel emulation of the feature is needed for devices that do not have simple-copy or SCSI extended copy. sendfile() seems to me like the best choice since all of that is already implemented there.
At this moment, ioctl as sync and io-uring for async. sendfile() and copy_file_range() take two fds; with that we can represent a copy from one source zone to another zone. But that does not fit a larger copy (from N source zones to one destination zone). Not sure if I am clear; perhaps sending an RFC would be better for the discussion on simple-copy.
> As for the open()/close() overhead for zonefs, maybe some use cases will suffer from it, but my tests with LevelDB+zonefs did not show any significant difference. zonefs open()/close() operations are way faster than for a regular file system since there is no metadata and all inodes always exist in memory. And zonefs now supports MAR/MOR limits for O_WRONLY open(). That can simplify things for the user.
>
> --
> Damien Le Moal
> Western Digital Research
-- Joshi