Thread (17 messages) 17 messages, 5 authors, 2021-05-24

Re: Recovering from transaction errors [was: Re: [syzbot] INFO: rcu detected stall in tx]

From: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
Date: 2021-05-20 21:24:03

Alan Stern wrote:
On Thu, May 20, 2021 at 08:30:01PM +0000, Thinh Nguyen wrote:
quoted
Hm... looks like we have a couple of issues in the uas storage class
driver and the xhci driver.

We may need to fix that in the uas storage driver because it doesn't
seem to handle it. (check uas_data_cmplt() in uas.c).
Hmmm.  I see that if there is an error, the driver assumes no data was 
transferred.  It shouldn't make that assumption; it should always use 
urb->actual_length.

It also doesn't indicate to the SCSI layer when a command gets an error 
and it doesn't try to do any recovery.  Of course, the SCSI error 
handler may initiate some recovery actions.
Right.
quoted
As for the xhci driver, there maybe a case where the stream URB never
gets to complete because the transaction err_count is not properly
updated. The err_count for transaction error is stored in ep_ring, but
the xhci driver may not be able to lookup the correct ep_ring based on
TRB address for streams. There are cases for streams where the event
TRBs have their TRB pointer field cleared to '0' (xhci spec section
4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
it automatically does a soft-retry. This is seen from one of our
testings that the driver was repeatedly doing soft-retry until the class
driver timed out.

Hi Mathias, maybe you have some comment on this? Thanks.
quoted
The fact is that only a small percentage of -EPROTO errors are 
recoverable.  Some of them can be handled by a port reset, which can be 
pretty awkward to perform but does occasionally work.  A lot of them 
occur because the USB cable has been unplugged; obviously there's no way 
to recover from that.  With only a few exceptions, the best and simplest 
approach is not to try to recover at all.
If the cable is unplugged, then we should get a connection change event
and the driver can handle it properly.
Yes -- unless the driver is in such a tight retry loop that the rest of 
the system never gets a chance to process the connection change event.  
I've seen bug reports where that happened.
I see. I'll keep that in mind, but it sounds like HW issue? The driver
handles retry base on events generated from the HW and the HW should
properly generate connection event and not get stuck in some loop.
quoted
Yes, it's probably simplest to do a port reset and let the transfer be
incomplete/corrupted. However, I think we should give
ClearFeature(ep_halt) some more thoughts as I think it can be a recovery
mechanism for storage class driver, even though that it may not be
foolproof.
When you say storage class driver, which one are you talking about,
usb-storage or uas?  usb-storage already has a pretty robust recovery 
mechanism.
I mean for both.
quoted
quoted
For the case in question (the syzbot bug report that started this 
thread), the class driver doesn't try to perform any recovery.  It just 
resubmits the URB, getting into a tight retry loop which consumes too 
much CPU time.  Simply giving up would be preferable.

Alan Stern
I see. By giving up, you mean doing port reset right? Otherwise it needs
some other mechanism to synchronize with the device side.
No, I mean the driver should just stop communicating with the device.  
That's an appropriate action for lots of drivers.  If the user wants to 
re-synchronize with the device, he can unplug the USB cable and plug it 
back in again.

Alan Stern
Ok. Would it be more difficult to automate this if it requires user
intervention? I assume syzbot doesn't want the user to do that.

Thanks,
Thinh
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help