Re: INFO: rcu detected stall in ext4_write_checks
From: Dmitry Vyukov <dvyukov@google.com>
Date: 2019-07-05 13:18:20
Also in:
linux-ext4, lkml
On Wed, Jun 26, 2019 at 8:43 PM Theodore Ts'o [off-list ref] wrote:
On Wed, Jun 26, 2019 at 10:27:08AM -0700, syzbot wrote:quoted
Hello, syzbot found the following crash on: HEAD commit: abf02e29 Merge tag 'pm-5.2-rc6' of git://git.kernel.org/pu.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=1435aaf6a00000 kernel config: https://syzkaller.appspot.com/x/.config?x=e5c77f8090a3b96b dashboard link: https://syzkaller.appspot.com/bug?extid=4bfbbf28a2e50ab07368 compiler: gcc (GCC) 9.0.0 20181231 (experimental) syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11234c41a00000 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15d7f026a00000 The bug was bisected to: commit 0c81ea5db25986fb2a704105db454a790c59709c Author: Elad Raz [off-list ref] Date: Fri Oct 28 19:35:58 2016 +0000 mlxsw: core: Add port type (Eth/IB) set APIUm, so this doesn't pass the laugh test.quoted
bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=10393a89a00000It looks like the automated bisection machinery got confused by two failures getting triggered by the same repro; the symptoms changed over time. Initially, the failure was: crashed: INFO: rcu detected stall in {sys_sendfile64,ext4_file_write_iter} Later, the failure changed to something completely different, and much earlier (before the test was even started): run #5: basic kernel testing failed: failed to copy test binary to VM: failed to run ["scp" "-P" "22" "-F" "/dev/null" "-o" "UserKnownHostsFile=/dev/null" "-o" "BatchMode=yes" "-o" "IdentitiesOnly=yes" "-o" "StrictHostKeyChecking=no" "-o" "ConnectTimeout=10" "-i" "/syzkaller/jobs/linux/workdir/image/key" "/tmp/syz-executor216456474" "root@10.128.15.205:./syz-executor216456474"]: exit status 1 Connection timed out during banner exchange lost connection Looks like an opportunity to improve the bisection engine?
Hi Ted, Yes, these infrastructure errors plague bisections episodically. That's https://github.com/google/syzkaller/issues/1250 It did not confuse bisection explicitly as it understands that these are infrastructure failures rather then a kernel crash, e.g. here you may that it correctly identified that this run was OK and started bisection in v4.10 v4.9 range besides 2 scp failures: testing release v4.9 testing commit 69973b830859bc6529a7a0468ba0d80ee5117826 with gcc (GCC) 5.5.0 run #0: basic kernel testing failed: failed to copy test binary to VM: failed to run ["scp" ...]: exit status 1 Connection timed out during banner exchange run #1: basic kernel testing failed: failed to copy test binary to VM: failed to run ["scp" ....]: exit status 1 Connection timed out during banner exchange run #2: OK run #3: OK run #4: OK run #5: OK run #6: OK run #7: OK run #8: OK run #9: OK # git bisect start v4.10 v4.9 Though, of course, it may confuse bisection indirectly by reducing number of tests per commit. So far I wasn't able to gather any significant info about these failures. We gather console logs, but on these runs they are empty. It's easy to blame everything onto GCE but I don't have any bit of information that would point either way. These failures just appear randomly in production and usually in batches...