Re: [PATCH V12 2/3] nvmet: add ZBD over ZNS backend support | linux-nvme

quoted

On 3/11/21 23:26, Damien Le Moal wrote:
On 2021/03/12 15:29, Chaitanya Kulkarni wrote:
[...]
+void nvmet_bdev_execute_zone_mgmt_recv(struct nvmet_req *req)
+{
+	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
+	u32 bufsize = (le32_to_cpu(req->cmd->zmr.numd) + 1) << 2;
+	struct nvmet_report_zone_data data = { .ns = req->ns };
+	unsigned int nr_zones;
+	int reported_zones;
+	u16 status;
+
+	status = nvmet_bdev_zns_checks(req);
+	if (status)
+		goto out;
+
+	data.rz = __vmalloc(bufsize, GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO);
+	if (!data.rz) {
+		status = NVME_SC_INTERNAL;
+		goto out;
+	}
+
+	nr_zones = (bufsize - sizeof(struct nvme_zone_report)) /
+			sizeof(struct nvme_zone_descriptor);
+	if (!nr_zones) {
+		status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+		goto out_free_report_zones;
+	}
+
+	reported_zones = blkdev_report_zones(req->ns->bdev, sect, nr_zones,
+					     nvmet_bdev_report_zone_cb, &data);
+	if (reported_zones < 0) {
+		status = NVME_SC_INTERNAL;
+		goto out_free_report_zones;
+	}
There is a problem here: the code as is ignores the request reporting option
field which can lead to an invalid zone report being returned. I think you need
to modify nvmet_bdev_report_zone_cb() to look at the reporting option field
passed by the initiator and filter the zone report since blkdev_report_zones()
does not handle that argument.
The reporting options are set by the host statistically in
nvme_ns_report_zones()
arefrom:-  nvme_ns_report_zones()
         c.zmr.zra = NVME_ZRA_ZONE_REPORT;
         c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
         c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;

All the above values are validated in the nvmet_bdev_zns_checks() helper
called from nvmet_bdev_execute_zone_mgmt_recv() before we allocate the
report zone buffer.

1. c.zmr.zra indicates the action which Reports zone descriptor entries
   through the Report Zones data structure.

   We validate this value is been set to NVME_ZRA_ZONE_REPORT in the
   nvmet_bdev_zns_chceks(). We are calling report zone after checking
   zone receive action it NVME_ZRA_ZONE_REPORT so not filtering is needed
   in the nvmet_bdev_report_zone_cb().

2. c.zmr.zrasf indicates the action specific field which is set to
   NVME_ZRASF_ZONE_REPORT_ALL.

   We validate this value is been set to NVME_ZRASF_ZONE_REPORT_ALL in the
   nvmet_bdev_zns_chceks(). Since host wants all the zones we don't need to
   filter any zone states in the nvmet_bdev_report_zone_cb().

3. c.zmr.pr is set to NVME_REPORT_ZONE_PARTIAL which value = 1 i.e value in
   the Report Zone data structure Number of Zones field indicates the
number of
   fully transferred zone descriptors in the data buffer, which we set from
   return value of the blkdev_report_zones() :-

   reported_zones = blkdev_report_zones(req->ns->bdev, sect, nr_zones,
					     nvmet_bdev_report_zone_cb, &data);
<snip>   data.rz->nr_zones = cpu_to_le64(reported_zones);

   So no filtering is needed in nvmet_bdev_report_zone_cb() for c.zmr.pr.

Can you please explain what filtering is missing in the current code ?

Maybe I'm looking into an old spec.
report zones command has the reporting options (ro) field (bits 15:08 of dword
13) where the user can specify the following values:

Value Description
0h List all zones.
1h List the zones in the ZSE:Empty state.
2h List the zones in the ZSIO:Implicitly Opened state.
3h List the zones in the ZSEO:Explicitly Opened state.
4h List the zones in the ZSC:Closed state.
5h List the zones in the ZSF:Full state.
6h List the zones in the ZSRO:Read Only state.
7h List the zones in the ZSO:Offline state.

to filter the zone report based on zone condition. blkdev_report_zones() will
always to a "list all zones", that is, ro == 0h.

But on the initiator side, if the client issue a report zones command through an
ioctl (passthrough/direct access not suing the block layer BLKREPORTZONES
ioctl), it may specify a different value for the ro field. Processing that
command using blkdev_report_zones() like you are doing here without any
additional filtering will give an incorrect report. Filtering based on the user
specified ro field needs to be added in nvmet_bdev_report_zone_cb().

The current code here is fine of the initiator/client side uses the block layer
and execute all report zones through blkdev_report_zones(). But things will
break if the client starts doing passthrough commands using nvme ioctl. No ?
Okay, so I'm using the right spec. With your explanation it needs a
filtering,
will add it to the next version.

+
+	data.rz->nr_zones = cpu_to_le64(reported_zones);
+
+	status = nvmet_copy_to_sgl(req, 0, data.rz, bufsize);
+
+out_free_report_zones:
+	kvfree(data.rz);
+out:
+	nvmet_req_complete(req, status);
+}
+
+void nvmet_bdev_execute_zone_mgmt_send(struct nvmet_req *req)
+{
+	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->zms.slba);
+	sector_t nr_sect = bdev_zone_sectors(req->ns->bdev);
+	u16 status = NVME_SC_SUCCESS;
+	u8 zsa = req->cmd->zms.zsa;
+	enum req_opf op;
+	int ret;
+	const unsigned int zsa_to_op[] = {
+		[NVME_ZONE_OPEN]	= REQ_OP_ZONE_OPEN,
+		[NVME_ZONE_CLOSE]	= REQ_OP_ZONE_CLOSE,
+		[NVME_ZONE_FINISH]	= REQ_OP_ZONE_FINISH,
+		[NVME_ZONE_RESET]	= REQ_OP_ZONE_RESET,
+	};
+
+	if (zsa > ARRAY_SIZE(zsa_to_op) || !zsa_to_op[zsa]) {
What is the point of the "!zsa_to_op[zsa]I see, I'll add the async I/O interface." here ? All the REQ_OP_ZONE_XXX are
non 0, always...
Well this is just making sure that we receive the right action since sparse
array will return 0 for any other values than listed above having
!zsa_to_op[zsa] check we can return an error.
But zsa is unsigned and you check it against the array size. So it can only be
within 0 and array size - 1. That is enough... I really do not see the point of
clutering the condition with something that is always true...
Okay.

[...]
+
+	ret = blkdev_zone_mgmt(req->ns->bdev, op, sect, nr_sect, GFP_KERNEL);
+	if (ret)
+		status = NVME_SC_INTERNAL;
+out:
+	nvmet_req_complete(req, status);
+}
+
+void nvmet_bdev_execute_zone_append(struct nvmet_req *req)
+{
+	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->rw.slba);
+	u16 status = NVME_SC_SUCCESS;
+	unsigned int total_len = 0;
+	struct scatterlist *sg;
+	int ret = 0, sg_cnt;
+	struct bio *bio;
+
+	if (!nvmet_check_transfer_len(req, nvmet_rw_data_len(req)))
+		return;
No nvmet_req_complete() call ? Is that done in nvmet_check_transfer_len() ?
Yes it does, you had the same comment on earlier version, it can be
confusing
that is why I proposed a helper for check transfer len and !req->sg_cnt
check,
but we don't want that helper.
Just add a comment saying that nvmet_check_transfer_len() calls
nvmet_req_complete(). That will help people like me with a short memory span :)
Okay.

+
+	if (!req->sg_cnt) {
+		nvmet_req_complete(req, 0);
+		return;
+	}
+
+	if (req->transfer_len <= NVMET_MAX_INLINE_DATA_LEN) {
+		bio = &req->b.inline_bio;
+		bio_init(bio, req->inline_bvec, ARRAY_SIZE(req->inline_bvec));
+	} else {
+		bio = bio_alloc(GFP_KERNEL, req->sg_cnt);
+	}
+
+	bio_set_dev(bio, req->ns->bdev);
+	bio->bi_iter.bi_sector = sect;
+	bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
+	if (req->cmd->rw.control & cpu_to_le16(NVME_RW_FUA))
+		bio->bi_opf |= REQ_FUA;
+
+	for_each_sg(req->sg, sg, req->sg_cnt, sg_cnt) {
+		struct page *p = sg_page(sg);
+		unsigned int l = sg->length;
+		unsigned int o = sg->offset;
+
+		ret = bio_add_zone_append_page(bio, p, l, o);
+		if (ret != sg->length) {
+			status = NVME_SC_INTERNAL;
+			goto out_bio_put;
+		}
+
+		total_len += sg->length;
+	}
+
+	if (total_len != nvmet_rw_data_len(req)) {
+		status = NVME_SC_INTERNAL | NVME_SC_DNR;
+		goto out_bio_put;
+	}
+
+	ret = submit_bio_wait(bio);
submit_bio_wait() ? Why blocking here ? That would be bad for performance. Is it
mandatory to block here ? The result handling could be done in the bio_end
callback no ?
I did initially, but zonefs uses sync I/O, I'm not sure about the btrfs,
if it does
please let me know I'll make it async.

If there is no async caller in the kernel for REQ_OP_ZONE_APPEND
shouldwe make this async ?
This should not depend on what the user does, at all.
Yes, for now zonefs only uses zone append for sync write() calls. But I intend
to have zone append used for async writes too. And in btrfs, append writes are
used for all data IOs, sync or async. That is on the initiator side anyway. The
target side should not assume any special behavior of the initiator. So if there
is no technical reasons to prevent async append writes execution, I would rather
have all of them processed with async BIOs execution, similar to regular write BIOs.
Thanks for the explanation, I was not aware about the async interface.
I'll make it async.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help