Thread (24 messages) 24 messages, 8 authors, 2021-07-12

RE: [Patch v2 2/3] Drivers: hv: add Azure Blob driver

From: Long Li <longli@microsoft.com>
Date: 2021-07-09 19:44:17
Also in: linux-hyperv, lkml

Subject: RE: [Patch v2 2/3] Drivers: hv: add Azure Blob driver

From: Long Li <longli@microsoft.com> Sent: Friday, July 2, 2021 4:59 PM
quoted
quoted
PM

[snip]
quoted
quoted
quoted
+static void az_blob_remove_device(struct az_blob_device *dev) {
+	wait_event(dev->file_wait, list_empty(&dev->file_list));
+	misc_deregister(&az_blob_misc_device);
+#ifdef CONFIG_DEBUG_FS
+	debugfs_remove_recursive(az_blob_debugfs_root);
+#endif
+	/* At this point, we won't get any requests from user-mode
+*/ }
+
+static int az_blob_create_device(struct az_blob_device *dev) {
+	int rc;
+	struct dentry *d;
+
+	rc = misc_register(&az_blob_misc_device);
+	if (rc) {
+		az_blob_err("misc_register failed rc %d\n", rc);
+		return rc;
+	}
+
+#ifdef CONFIG_DEBUG_FS
+	az_blob_debugfs_root = debugfs_create_dir("az_blob",
NULL);
quoted
quoted
quoted
quoted
quoted
+	if (!IS_ERR_OR_NULL(az_blob_debugfs_root)) {
+		d = debugfs_create_file("pending_requests", 0400,
+			az_blob_debugfs_root, NULL,
+			&az_blob_debugfs_fops);
+		if (IS_ERR_OR_NULL(d)) {
+			az_blob_warn("failed to create debugfs
file\n");
quoted
quoted
quoted
quoted
quoted
+
	debugfs_remove_recursive(az_blob_debugfs_root);
quoted
quoted
quoted
quoted
quoted
+			az_blob_debugfs_root = NULL;
+		}
+	} else
+		az_blob_warn("failed to create debugfs root\n");
#endif
quoted
quoted
quoted
quoted
quoted
+
+	return 0;
+}
+
+static int az_blob_connect_to_vsp(struct hv_device *device,
+u32
+ring_size) {
+	int ret;
+
+	spin_lock_init(&az_blob_dev.file_lock);
I'd argue that the spin lock should not be re-initialized here.
Here's the sequence where things go wrong:

1) The driver is unbound, so az_blob_remove() is called.
2) az_blob_remove() sets the "removing" flag to true, and calls
az_blob_remove_device().
3) az_blob_remove_device() waits for the file_list to become empty.
4) After the file_list becomes empty, but before
misc_deregister() is called, a separate thread opens the device again.
5) In the separate thread, az_blob_fop_open() obtains the
file_lock spin
lock.
quoted
quoted
6) Before az_blob_fop_open() releases the spin lock,
az_blob_remove_device() completes, along with az_blob_remove().
7) Then the device gets rebound, and az_blob_connect_to_vsp()
gets called, all while az_blob_fop_open() still holds the spin
lock.  So the spin lock get re- initialized while it is held.

This is admittedly a far-fetched scenario, but stranger things
have happened. :-)  The issue is that you are counting on the
az_blob_dev structure to persist and have a valid file_lock,
even while the device is unbound.  So any initialization should
only happen in
az_blob_drv_init().
quoted
I'm not sure if az_blob_probe() and az_blob_remove() can be called
at the same time, as az_blob_remove_vmbus() is called the last in
az_blob_remove().
quoted
Is it possible for vmbus asking the driver to probe a new channel
before the old channel is closed? I expect the vmbus provide
guarantee that those calls are made in sequence.
In my scenario above, az_blob_remove_vmbus() and az_blob_remove()
run to completion in Step #6, all while some other thread is still
in the middle of an
open() call and holding the file_lock spin lock.  Then in Step #7
az_blob_probe() runs.  So az_blob_remove() and az_blob_probe()
execute sequentially, not at the same time.

Michael
I think it's a valid scenario.  The return value of
devtmpfs_delete_node() is not checked in misc_deregister(). It decreases
the refcount on inodes but it's not guaranteed that someone else is still using
it (in the middle of opening a file).
quoted
However, this works fine for "rmmod" that causes device to be removed.
Before file is opened the refcount on the module is increased so it can't be
removed when file is being opened. The scenario you described can't
happen.
quoted
But during VMBUS rescind, it can happen. It's possible that the driver
is using the spinlock that has been re-initialized, when the next VMBUS
offer on the same channel comes before all the attempting open file calls
exit.

I was thinking about the rescind scenario.  vmbus_onoffer_rescind() will run
on the global workqueue.  If it eventually calls az_blob_remove() and then
az_blob_remove_device(), it will wait until the file_list is empty, which
essentially means waiting until user space processes decide to close the
instances they have open.  This seems like a problem that could block the
global workqueue for a long time and
thereby hang the kernel.   Is my reasoning valid?  If so, I haven't
thought about what the solution might be.  It seems like we do need to wait
until any in-progress requests to Hyper-V are complete because Hyper-V has
references to guest physical memory.  But waiting for all open instances to
be closed seems to be problematic.
My tests showed that misc_deregister() caused all opened files to be released if there are no pending I/O waiting in the driver.

If there are pending I/O, we must wait as the VSP owns the memory of the I/O. The correct VSP behavior is to return all the pending I/O along with rescind. This is the same to what storvsc does for rescind.

It looks to me waiting for opened files after the call to misc_deregister(), but before removing the vmbus channel is a safe approach.

If the VSP is behaving correctly, the rescind process should not block for too long. If we want to deal with a buggy VSP that takes forever to release a resource, we want to create a work queue for rescind handling. 
Michael
quoted
This is a very rare. I agree things happen that we should make sure the
driver can handle this. I'll update the driver.
quoted
Long
quoted
quoted
quoted
quoted
+	INIT_LIST_HEAD(&az_blob_dev.file_list);
+	init_waitqueue_head(&az_blob_dev.file_wait);
+
+	az_blob_dev.removing = false;
+
+	az_blob_dev.device = device;
+	device->channel->rqstor_size = device_queue_depth;
+
+	ret = vmbus_open(device->channel, ring_size, ring_size,
NULL, 0,
quoted
quoted
quoted
quoted
quoted
+			az_blob_on_channel_callback, device-
channel);
quoted
quoted
quoted
quoted
+
+	if (ret) {
+		az_blob_err("failed to connect to VSP ret %d\n", ret);
+		return ret;
+	}
+
+	hv_set_drvdata(device, &az_blob_dev);
+
+	return ret;
+}
+
+static void az_blob_remove_vmbus(struct hv_device *device) {
+	/* At this point, no VSC/VSP traffic is possible over vmbus */
+	hv_set_drvdata(device, NULL);
+	vmbus_close(device->channel); }
+
+static int az_blob_probe(struct hv_device *device,
+			const struct hv_vmbus_device_id *dev_id) {
+	int rc;
+
+	az_blob_dbg("probing device\n");
+
+	rc = az_blob_connect_to_vsp(device,
az_blob_ringbuffer_size);
quoted
quoted
quoted
quoted
quoted
+	if (rc) {
+		az_blob_err("error connecting to VSP rc %d\n", rc);
+		return rc;
+	}
+
+	// create user-mode client library facing device
+	rc = az_blob_create_device(&az_blob_dev);
+	if (rc) {
+		az_blob_remove_vmbus(device);
+		return rc;
+	}
+
+	az_blob_dbg("successfully probed device\n");
+	return 0;
+}
+
+static int az_blob_remove(struct hv_device *dev) {
+	struct az_blob_device *device = hv_get_drvdata(dev);
+	unsigned long flags;
+
+	spin_lock_irqsave(&device->file_lock, flags);
+	device->removing = true;
+	spin_unlock_irqrestore(&device->file_lock, flags);
+
+	az_blob_remove_device(device);
+	az_blob_remove_vmbus(dev);
+	return 0;
+}
+
+static struct hv_driver az_blob_drv = {
+	.name = KBUILD_MODNAME,
+	.id_table = id_table,
+	.probe = az_blob_probe,
+	.remove = az_blob_remove,
+	.driver = {
+		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
+	},
+};
+
+static int __init az_blob_drv_init(void) {
+	int ret;
+
+	ret = vmbus_driver_register(&az_blob_drv);
+	return ret;
+}
+
+static void __exit az_blob_drv_exit(void) {
+	vmbus_driver_unregister(&az_blob_drv);
+}
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help