Re: Recovering from transaction errors [was: Re: [syzbot] INFO: rcu detected stall in tx]
From: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
Date: 2021-05-20 21:24:03
Alan Stern wrote:
On Thu, May 20, 2021 at 08:30:01PM +0000, Thinh Nguyen wrote:quoted
Hm... looks like we have a couple of issues in the uas storage class driver and the xhci driver. We may need to fix that in the uas storage driver because it doesn't seem to handle it. (check uas_data_cmplt() in uas.c).Hmmm. I see that if there is an error, the driver assumes no data was transferred. It shouldn't make that assumption; it should always use urb->actual_length. It also doesn't indicate to the SCSI layer when a command gets an error and it doesn't try to do any recovery. Of course, the SCSI error handler may initiate some recovery actions.
Right.
quoted
As for the xhci driver, there maybe a case where the stream URB never gets to complete because the transaction err_count is not properly updated. The err_count for transaction error is stored in ep_ring, but the xhci driver may not be able to lookup the correct ep_ring based on TRB address for streams. There are cases for streams where the event TRBs have their TRB pointer field cleared to '0' (xhci spec section 4.12.2). If the xhci driver doesn't see ep_ring for transaction error, it automatically does a soft-retry. This is seen from one of our testings that the driver was repeatedly doing soft-retry until the class driver timed out. Hi Mathias, maybe you have some comment on this? Thanks.quoted
The fact is that only a small percentage of -EPROTO errors are recoverable. Some of them can be handled by a port reset, which can be pretty awkward to perform but does occasionally work. A lot of them occur because the USB cable has been unplugged; obviously there's no way to recover from that. With only a few exceptions, the best and simplest approach is not to try to recover at all.If the cable is unplugged, then we should get a connection change event and the driver can handle it properly.Yes -- unless the driver is in such a tight retry loop that the rest of the system never gets a chance to process the connection change event. I've seen bug reports where that happened.
I see. I'll keep that in mind, but it sounds like HW issue? The driver handles retry base on events generated from the HW and the HW should properly generate connection event and not get stuck in some loop.
quoted
Yes, it's probably simplest to do a port reset and let the transfer be incomplete/corrupted. However, I think we should give ClearFeature(ep_halt) some more thoughts as I think it can be a recovery mechanism for storage class driver, even though that it may not be foolproof.When you say storage class driver, which one are you talking about, usb-storage or uas? usb-storage already has a pretty robust recovery mechanism.
I mean for both.
quoted
quoted
For the case in question (the syzbot bug report that started this thread), the class driver doesn't try to perform any recovery. It just resubmits the URB, getting into a tight retry loop which consumes too much CPU time. Simply giving up would be preferable. Alan SternI see. By giving up, you mean doing port reset right? Otherwise it needs some other mechanism to synchronize with the device side.No, I mean the driver should just stop communicating with the device. That's an appropriate action for lots of drivers. If the user wants to re-synchronize with the device, he can unplug the USB cable and plug it back in again. Alan Stern
Ok. Would it be more difficult to automate this if it requires user intervention? I assume syzbot doesn't want the user to do that. Thanks, Thinh