Thread (31 messages) 31 messages, 6 authors, 2019-07-23

Re: INFO: rcu detected stall in ext4_write_checks

From: Dmitry Vyukov <dvyukov@google.com>
Date: 2019-07-05 13:18:20
Also in: linux-ext4, lkml

On Wed, Jun 26, 2019 at 8:43 PM Theodore Ts'o [off-list ref] wrote:
On Wed, Jun 26, 2019 at 10:27:08AM -0700, syzbot wrote:
quoted
Hello,

syzbot found the following crash on:

HEAD commit:    abf02e29 Merge tag 'pm-5.2-rc6' of git://git.kernel.org/pu..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1435aaf6a00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=e5c77f8090a3b96b
dashboard link: https://syzkaller.appspot.com/bug?extid=4bfbbf28a2e50ab07368
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11234c41a00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15d7f026a00000

The bug was bisected to:

commit 0c81ea5db25986fb2a704105db454a790c59709c
Author: Elad Raz [off-list ref]
Date:   Fri Oct 28 19:35:58 2016 +0000

    mlxsw: core: Add port type (Eth/IB) set API
Um, so this doesn't pass the laugh test.
quoted
bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10393a89a00000
It looks like the automated bisection machinery got confused by two
failures getting triggered by the same repro; the symptoms changed
over time.  Initially, the failure was:

crashed: INFO: rcu detected stall in {sys_sendfile64,ext4_file_write_iter}

Later, the failure changed to something completely different, and much
earlier (before the test was even started):

run #5: basic kernel testing failed: failed to copy test binary to VM: failed to run ["scp" "-P" "22" "-F" "/dev/null" "-o" "UserKnownHostsFile=/dev/null" "-o" "BatchMode=yes" "-o" "IdentitiesOnly=yes" "-o" "StrictHostKeyChecking=no" "-o" "ConnectTimeout=10" "-i" "/syzkaller/jobs/linux/workdir/image/key" "/tmp/syz-executor216456474" "root@10.128.15.205:./syz-executor216456474"]: exit status 1
Connection timed out during banner exchange
lost connection

Looks like an opportunity to improve the bisection engine?
Hi Ted,

Yes, these infrastructure errors plague bisections episodically.
That's https://github.com/google/syzkaller/issues/1250

It did not confuse bisection explicitly as it understands that these
are infrastructure failures rather then a kernel crash, e.g. here you
may that it correctly identified that this run was OK and started
bisection in v4.10 v4.9 range besides 2 scp failures:

testing release v4.9
testing commit 69973b830859bc6529a7a0468ba0d80ee5117826 with gcc (GCC) 5.5.0
run #0: basic kernel testing failed: failed to copy test binary to VM:
failed to run ["scp" ...]: exit status 1
Connection timed out during banner exchange
run #1: basic kernel testing failed: failed to copy test binary to VM:
failed to run ["scp" ....]: exit status 1
Connection timed out during banner exchange
run #2: OK
run #3: OK
run #4: OK
run #5: OK
run #6: OK
run #7: OK
run #8: OK
run #9: OK
# git bisect start v4.10 v4.9

Though, of course, it may confuse bisection indirectly by reducing
number of tests per commit.

So far I wasn't able to gather any significant info about these
failures. We gather console logs, but on these runs they are empty.
It's easy to blame everything onto GCE but I don't have any bit of
information that would point either way. These failures just appear
randomly in production and usually in batches...
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help