Re: [Bug 214147] New: ISCSI broken in last release

From: michael.christie@oracle.com
Date: 2021-09-01 23:48:37

On 8/23/21 6:08 AM, bugzilla-daemon@bugzilla.kernel.org wrote:

https://bugzilla.kernel.org/show_bug.cgi?id=214147

            Bug ID: 214147
           Summary: ISCSI broken in last release
           Product: IO/Storage
           Version: 2.5
    Kernel Version: 5.13.12
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: SCSI
          Assignee: linux-scsi@vger.kernel.org
          Reporter: slavon.net@gmail.com
        Regression: Yes

Created attachment 298441
  --> https://bugzilla.kernel.org/attachment.cgi?id=298441&action=edit
dmesg log

After some time iSCSI breaks and only a reboot helps

What are you doing when you hit the issue?

What does your target setup look like? What are you using for the
backing store?

Are you able to build your own kernels?

The only major changes between 5.12 and 5.13 are some target patches
to batch cmds. However, it looks like you start to hit a problem
earlier than when that code comes into play. We first see you hit
a data out timeout, so we don't even have all the data for the
cmd, so the target changes in 5.13 don't come into play yet.

[10931.107057] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus iqn.1991-05.com.microsoft:vhost11.dev.obs.group,i,0x400001370002,iqn.2003-01.org.linux-iscsi.vm2.x8664:sn.b07943625401,t,0x01


However, we do see some cmds have made it to the core target layer
because we can see the target layer is waiting on cmds to complete
for part of the lun reset handling:

[19906.593285] INFO: task kworker/4:1:3770999 blocked for more than 122 seconds.
[19906.603670]       Tainted: P           O      5.13.12-1.el8.elrepo.x86_64 #1
[19906.613975] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19906.624208] task:kworker/4:1     state:D stack:    0 pid:3770999 ppid:     2 flags:0x00004000
[19906.624212] Workqueue: events target_tmr_work [target_core_mod]
[19906.624247] Call Trace:
[19906.624249]  __schedule+0x396/0x8a0
[19906.624252]  schedule+0x3c/0xa0
[19906.624255]  schedule_timeout+0x215/0x2b0
[19906.624258]  ? kasprintf+0x4e/0x70
[19906.624261]  wait_for_completion+0x9e/0x100
[19906.624264]  target_put_cmd_and_wait+0x55/0x80 [target_core_mod]
[19906.624279]  core_tmr_lun_reset+0x38b/0x660 [target_core_mod]
[19906.624294]  target_tmr_work+0xb4/0x110 [target_core_mod]
[19906.624309]  process_one_work+0x230/0x3d0
[19906.624312]  worker_thread+0x2d/0x3e0
[19906.624314]  ? process_one_work+0x3d0/0x3d0
[19906.624316]  kthread+0x118/0x140
[19906.624318]  ? set_kthread_struct+0x40/0x40
[19906.624320]  ret_from_fork+0x1f/0x30

and we can see the iscsi layer is not able to relogin because of
outstanding cmds/tmfs.
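
To check the ordering of the two events quoted above in a full dmesg
log, a small script along these lines could help. It is only a sketch:
the marker labels and the function name are made up for illustration,
the sample lines are shortened copies of the log lines in this mail, and
it compares the times the messages were printed (the hung task warning
is printed only after the task has already been blocked for a while).

```python
import re

# Matches the "[seconds.micros]" prefix that dmesg puts on each line.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Hypothetical labels mapped to substrings of the log lines quoted above.
MARKERS = {
    "dataout_timeout": "Unable to recover from DataOut timeout",
    "hung_task": "blocked for more than",
}

def first_seen(lines):
    """Return {label: timestamp} for the first line matching each marker."""
    seen = {}
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # skip lines without a dmesg timestamp
        stamp, text = float(m.group(1)), m.group(2)
        for label, needle in MARKERS.items():
            if needle in text and label not in seen:
                seen[label] = stamp
    return seen

# Shortened copies of the two log lines from this mail.
sample = [
    "[10931.107057] Unable to recover from DataOut timeout while in ERL=0",
    "[19906.593285] INFO: task kworker/4:1:3770999 blocked for more than 122 seconds.",
]

events = first_seen(sample)
gap = events["hung_task"] - events["dataout_timeout"]
print(f"hung task reported {gap:.0f}s after the first DataOut timeout")
```

Feeding it the lines of the attached dmesg log instead of `sample` would
show whether a DataOut timeout always precedes the first hung task
report.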

I can send you a patch that reverts the core target patches. If we can
rule them out then it would help narrow things down.

Or, because it sounds like this is easy to reproduce, we can turn on
some extra LIO debugging.
