Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
From: Sudeep Holla <hidden>
Date: 2015-03-16 17:47:46
Hi Russell,

On 16/03/15 13:04, Russell King - ARM Linux wrote:
On Mon, Mar 16, 2015 at 09:35:53AM +0000, Russell King - ARM Linux wrote:
On Mon, Mar 16, 2015 at 12:42:39AM +0000, Russell King - ARM Linux wrote:
On Mon, Mar 16, 2015 at 12:04:38AM +0000, Russell King - ARM Linux wrote:
On Sun, Mar 15, 2015 at 09:33:30PM +0000, Russell King - ARM Linux wrote:
I'm going to try a few other kernels to try and track down what's going on - whether something from arm-soc or my tree is responsible for this really weird behaviour.

Okay, this is weird - it seems that it's caused by the FIQ oops dumping code/FIQ changes which I've carried for many months unchanged in my tree.

More weirdness. Progressing forwards through my development code showed that when I merged the patch I mentioned in the previous mail, things started to fail. As I also mentioned, I'd drop that branch (two patches, one adding the IPI backtrace stuff and the second one updating the GIC to allow it to raise FIQs on suitably equipped platforms.) I would have expected that to have worked, but it just failed after four boot iterations.

So either it's not the FIQ, or it is the FIQ code _and_ also something else. Or it has something to do with the placement of functions in the kernel.

I'll try more stuff tomorrow, working from where I presently am (which is basically last night's code minus the FIQ changes) by removing other changes to see what brings us back to a working system.

As I've already said - this is really weird because all of these changes were also tested against -rc1... those which weren't are:

  mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
  mm: split ET_DYN ASLR from mmap ASLR
  mm: move randomize_et_dyn into ELF_ET_DYN_BASE
  mm: expose arch_mmap_rnd when available
  arm: factor out mmap ASLR into mmap_rnd

and a number of clkdev rework patches (to make it use clk_hw internally.) Neither of these should be affecting it, but that's something I will be testing tomorrow.

Okay, reverting the ASLR changes and the clkdev changes annoyingly still results in random failure.

After ruling out ASLR and clkdev, I started progressively reverting other stuff in the build tree. Eventually, I got down to reverting the L2C change I've been carrying since the L2C cleanups.

With that lot reverted, which is slightly more than the previously known good test, it booted five times without issue. So, I thought I'd add that L2C change to the list of bad commits, and try omitting _just_ the L2C and FIQ changes... and it still fails - on the first test boot iteration.

I think I'm going to declare that this is all down to some obscure hardware problem with Versatile Express, which is tickled by the layout of the kernel against the cache, and take it out of the nightly system (it's pointless having unstable hardware being tested; random failures are completely meaningless.)

I was able to see the exact same behaviour on my VExpress setup with a
CA9X4 core-tile. A few observations from my side:
1. This issue can be reproduced even on v3.19.
2. As you suspected L2C, I tried disabling L2C and that seems to solve
the issue.
3. Since it's very random and enabling LL_DEBUG made it difficult to
reproduce the issue, I tried to dump the stack using the DS5 debugger.
4. The stack is always exactly the same, on both v4.0-rc* and v3.19
kernels, and across multiple runs.
5. Connecting the h/w debugger, then stopping and restarting the CPUs,
solves the issue. It somehow helps the CPUs get out of
__radix_tree_lookup.
Stacktrace
==========
(sorry, it looks different from the std. Linux backtrace as this one is
a dump from DS5)
CPU0
----
#0 __radix_tree_lookup( root = <Value currently has no location>, index
= 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at
radix-tree.c:517
#1 generic_handle_irq( irq = 16 ) at irqdesc.c:349
#2 __handle_domain_irq( domain = (struct irq_domain*) 0xBF004400, hwirq
= 16, lookup = <Value currently has no location>, regs = <Value
currently has no location> ) at irqdesc.c:391
#3 __raw_readl( addr = <Value optimised away by compiler> ) at io.h:121
#4 gic_handle_irq( regs = (struct pt_regs*) 0x805F1F40 ) at irq-gic.c:277
#5 [__irq_svc+0x40]
CPU1
----
#0 __radix_tree_lookup( root = <Value currently has no location>, index
= 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at
radix-tree.c:517
#1 __irq_get_desc_lock( irq = <Value currently has no location>, flags =
(long unsigned int*) 0xBF08BF94, bus = false, check = 3 ) at irqdesc.c:544
#2 enable_percpu_irq( irq = 16, type = 0 ) at manage.c:1583
#3 twd_timer_cpu_notify( self = <Value not available : Undefined value
in stack frame for register R0>, action = <Value currently has no
location>, hcpu = <Value not available : Undefined value in stack frame
for register R2> ) at smp_twd.c:322
#4 notifier_call_chain( nl = <Value currently has no location>, val =
<Value not available : Undefined value in stack frame for register R1>,
v = <Value not available : Undefined value in stack frame for register
R2>, nr_to_call = <Value not available : Undefined value in stack frame
for register R3>, nr_calls = (int*) 0x0 ) at notifier.c:95
#5 notifier_to_errno( ret = <Value currently has no location> ) at
notifier.h:179
#6 cpu_notify( val = <Value currently has no location>, v = <Value
currently has no location> ) at cpu.c:234
#7 secondary_start_kernel() at smp.c:367
CPU2 & CPU3
-----------
Not booted yet, still waiting in bootloader
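
Since the failure is random, a handful of clean boots says little about
whether a given revert or config change really fixed anything. A minimal
sketch of the arithmetic, assuming each boot fails independently with a
fixed probability (the 25% figure in the usage note is purely
illustrative, not a rate measured in this thread, and both function
names are made up for the example):

```python
import math


def clean_boot_probability(p_fail: float, boots: int) -> float:
    """Probability that `boots` consecutive boots all succeed.

    Assumes every boot fails independently with probability p_fail,
    which is a simplification for an intermittent hardware issue.
    """
    return (1.0 - p_fail) ** boots


def boots_needed(p_fail: float, confidence: float) -> int:
    """Smallest number of boots after which a still-present bug would
    have shown up at least once with probability >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # Hypothetical failure rate, chosen only to illustrate the point.
    p = 0.25
    print(clean_boot_probability(p, 5))
    print(boots_needed(p, 0.95))
```

With that assumed 25% failure rate, five clean boots in a row would still
happen roughly 24% of the time with the bug present, and about 11 boots
would be needed to be 95% sure a change made a difference.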