RE: [Patch v2 2/3] Drivers: hv: add Azure Blob driver
From: Long Li <longli@microsoft.com>
Date: 2021-07-09 19:44:17
Also in:
linux-hyperv, lkml
Subject: RE: [Patch v2 2/3] Drivers: hv: add Azure Blob driver From: Long Li <longli@microsoft.com> Sent: Friday, July 2, 2021 4:59 PMquoted
quoted
PM [snip]quoted
quoted
quoted
+static void az_blob_remove_device(struct az_blob_device *dev) { + wait_event(dev->file_wait, list_empty(&dev->file_list)); + misc_deregister(&az_blob_misc_device); +#ifdef CONFIG_DEBUG_FS + debugfs_remove_recursive(az_blob_debugfs_root); +#endif + /* At this point, we won't get any requests from user-mode +*/ } + +static int az_blob_create_device(struct az_blob_device *dev) { + int rc; + struct dentry *d; + + rc = misc_register(&az_blob_misc_device); + if (rc) { + az_blob_err("misc_register failed rc %d\n", rc); + return rc; + } + +#ifdef CONFIG_DEBUG_FS + az_blob_debugfs_root = debugfs_create_dir("az_blob",NULL);quoted
quoted
quoted
quoted
quoted
+ if (!IS_ERR_OR_NULL(az_blob_debugfs_root)) { + d = debugfs_create_file("pending_requests", 0400, + az_blob_debugfs_root, NULL, + &az_blob_debugfs_fops); + if (IS_ERR_OR_NULL(d)) { + az_blob_warn("failed to create debugfsfile\n");quoted
quoted
quoted
quoted
quoted
+debugfs_remove_recursive(az_blob_debugfs_root);quoted
quoted
quoted
quoted
quoted
+ az_blob_debugfs_root = NULL; + } + } else + az_blob_warn("failed to create debugfs root\n");#endifquoted
quoted
quoted
quoted
quoted
+ + return 0; +} + +static int az_blob_connect_to_vsp(struct hv_device *device, +u32 +ring_size) { + int ret; + + spin_lock_init(&az_blob_dev.file_lock);I'd argue that the spin lock should not be re-initialized here. Here's the sequence where things go wrong: 1) The driver is unbound, so az_blob_remove() is called. 2) az_blob_remove() sets the "removing" flag to true, and calls az_blob_remove_device(). 3) az_blob_remove_device() waits for the file_list to become empty. 4) After the file_list becomes empty, but before misc_deregister() is called, a separate thread opens the device again. 5) In the separate thread, az_blob_fop_open() obtains the file_lock spinlock.quoted
quoted
6) Before az_blob_fop_open() releases the spin lock, az_blob_remove_device() completes, along with az_blob_remove(). 7) Then the device gets rebound, and az_blob_connect_to_vsp() gets called, all while az_blob_fop_open() still holds the spin lock. So the spin lock get re- initialized while it is held. This is admittedly a far-fetched scenario, but stranger things have happened. :-) The issue is that you are counting on the az_blob_dev structure to persist and have a valid file_lock, even while the device is unbound. So any initialization should only happen inaz_blob_drv_init().quoted
I'm not sure if az_blob_probe() and az_blob_remove() can be called at the same time, as az_blob_remove_vmbus() is called the last inaz_blob_remove().quoted
Is it possible for vmbus asking the driver to probe a new channel before the old channel is closed? I expect the vmbus provide guarantee that those calls are made in sequence.In my scenario above, az_blob_remove_vmbus() and az_blob_remove() run to completion in Step #6, all while some other thread is still in the middle of an open() call and holding the file_lock spin lock. Then in Step #7 az_blob_probe() runs. So az_blob_remove() and az_blob_probe() execute sequentially, not at the same time. MichaelI think it's a valid scenario. The return value of devtmpfs_delete_node() is not checked in misc_deregister(). It decreasesthe refcount on inodes but it's not guaranteed that someone else is still using it (in the middle of opening a file).quoted
However, this works fine for "rmmod" that causes device to be removed. Before file is opened the refcount on the module is increased so it can't beremoved when file is being opened. The scenario you described can't happen.quoted
But during VMBUS rescind, it can happen. It's possible that the driver is using the spinlock that has been re-initialized, when the next VMBUSoffer on the same channel comes before all the attempting open file calls exit. I was thinking about the rescind scenario. vmbus_onoffer_rescind() will run on the global workqueue. If it eventually calls az_blob_remove() and then az_blob_remove_device(), it will wait until the file_list is empty, which essentially means waiting until user space processes decide to close the instances they have open. This seems like a problem that could block the global workqueue for a long time and thereby hang the kernel. Is my reasoning valid? If so, I haven't thought about what the solution might be. It seems like we do need to wait until any in-progress requests to Hyper-V are complete because Hyper-V has references to guest physical memory. But waiting for all open instances to be closed seems to be problematic.
My tests showed that misc_deregister() caused all opened files to be released if there are no pending I/O waiting in the driver. If there are pending I/O, we must wait as the VSP owns the memory of the I/O. The correct VSP behavior is to return all the pending I/O along with rescind. This is the same to what storvsc does for rescind. It looks to me waiting for opened files after the call to misc_deregister(), but before removing the vmbus channel is a safe approach. If the VSP is behaving correctly, the rescind process should not block for too long. If we want to deal with a buggy VSP that takes forever to release a resource, we want to create a work queue for rescind handling.
Michaelquoted
This is a very rare. I agree things happen that we should make sure thedriver can handle this. I'll update the driver.quoted
Longquoted
quoted
quoted
quoted
+ INIT_LIST_HEAD(&az_blob_dev.file_list); + init_waitqueue_head(&az_blob_dev.file_wait); + + az_blob_dev.removing = false; + + az_blob_dev.device = device; + device->channel->rqstor_size = device_queue_depth; + + ret = vmbus_open(device->channel, ring_size, ring_size,NULL, 0,quoted
quoted
quoted
quoted
quoted
+ az_blob_on_channel_callback, device-channel);quoted
quoted
quoted
quoted
+ + if (ret) { + az_blob_err("failed to connect to VSP ret %d\n", ret); + return ret; + } + + hv_set_drvdata(device, &az_blob_dev); + + return ret; +} + +static void az_blob_remove_vmbus(struct hv_device *device) { + /* At this point, no VSC/VSP traffic is possible over vmbus */ + hv_set_drvdata(device, NULL); + vmbus_close(device->channel); } + +static int az_blob_probe(struct hv_device *device, + const struct hv_vmbus_device_id *dev_id) { + int rc; + + az_blob_dbg("probing device\n"); + + rc = az_blob_connect_to_vsp(device,az_blob_ringbuffer_size);quoted
quoted
quoted
quoted
quoted
+ if (rc) { + az_blob_err("error connecting to VSP rc %d\n", rc); + return rc; + } + + // create user-mode client library facing device + rc = az_blob_create_device(&az_blob_dev); + if (rc) { + az_blob_remove_vmbus(device); + return rc; + } + + az_blob_dbg("successfully probed device\n"); + return 0; +} + +static int az_blob_remove(struct hv_device *dev) { + struct az_blob_device *device = hv_get_drvdata(dev); + unsigned long flags; + + spin_lock_irqsave(&device->file_lock, flags); + device->removing = true; + spin_unlock_irqrestore(&device->file_lock, flags); + + az_blob_remove_device(device); + az_blob_remove_vmbus(dev); + return 0; +} + +static struct hv_driver az_blob_drv = { + .name = KBUILD_MODNAME, + .id_table = id_table, + .probe = az_blob_probe, + .remove = az_blob_remove, + .driver = { + .probe_type = PROBE_PREFER_ASYNCHRONOUS, + }, +}; + +static int __init az_blob_drv_init(void) { + int ret; + + ret = vmbus_driver_register(&az_blob_drv); + return ret; +} + +static void __exit az_blob_drv_exit(void) { + vmbus_driver_unregister(&az_blob_drv); +}