Re: sun x4500 soft lockup during raid creation
From: Jody McIntyre <hidden>
Date: 2009-01-29 22:54:09
On Wed, Jan 28, 2009 at 10:30:33PM +0200, Vladimir Ivashchenko wrote:
CentOS 5.2, 2.6.18-92.1.22.el5PAE, sata_mv. Two dual-core Opterons @ 2.8 GHz, 16 GB RAM.
You should really be running the EL 5.3 kernel - sata_mv in EL 5.2 has known issues according to the x4500 team but they are happy with the version in EL 5.3.
Any stability assurances or workarounds are highly appreciated. :)
It's just a lockup, not a crash. The system will be fine. We've seen a lot of these, and there's a workaround patch attached to this bug:

https://bugzilla.lustre.org/show_bug.cgi?id=17084

It's probably the same bug seen here, as pointed out by Richard Scobie:

http://marc.info/?l=linux-raid&m=123264525708803&w=2

The problem is not specific to the x4500 - I've seen it with many configurations, including on non-Sun hardware, generally when lots of disks are involved in a rebuild. I have not seen it with any mainline kernel in the past 6 months (they are much more recent than EL 5) but it may still exist.

As a complete side note, you'll likely see better performance if you stagger disks across controllers (the x4500 has 6) rather than creating arrays with most disks from 3 controllers.

Note: I don't work for Sun support or the x4500 product team and nothing in this message is necessarily an official Sun position.

Cheers,
Jody
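To illustrate the staggering suggestion, here is a minimal Python sketch that interleaves disks round-robin across controllers before splitting them into arrays. The layout (6 controllers with 8 disks each) and the `c<N>d<M>` disk labels are assumptions made up for the example; the real mapping of /dev/sdX names to controllers has to be read from the actual system.

```python
from collections import Counter
from itertools import zip_longest


def interleave(controllers):
    """Return disks ordered so consecutive entries come from different controllers."""
    order = []
    # zip_longest walks "rows": the first disk of every controller, then the second, ...
    for row in zip_longest(*controllers):
        order.extend(disk for disk in row if disk is not None)
    return order


# Hypothetical layout: 6 controllers x 8 disks, labelled c<controller>d<disk>.
controllers = [[f"c{c}d{d}" for d in range(8)] for c in range(6)]

order = interleave(controllers)
array_a = order[:24]    # 24 members, same count as md3 in the report
array_b = order[24:46]  # 22 members, same count as md4 in the report

# Count how many members of each array sit on each controller.
spread_a = Counter(name.split("d")[0] for name in array_a)
spread_b = Counter(name.split("d")[0] for name in array_b)
print(dict(spread_a))
print(dict(spread_b))
```

With this ordering each array draws from all six controllers instead of concentrating on three; the resulting member lists would then be passed to whatever command creates the arrays.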
Jan 28 21:31:32 SunSTG kernel: BUG: soft lockup - CPU#0 stuck for 10s!
[md3_raid5:5672]
Jan 28 21:31:32 SunSTG kernel:
Jan 28 21:31:32 SunSTG kernel: Pid: 5672, comm: md3_raid5
Jan 28 21:31:32 SunSTG kernel: EIP: 0060:[<f8d68162>] CPU: 0
Jan 28 21:31:32 SunSTG kernel: EIP is at raid6_sse22_gen_syndrome
+0x10a/0x1b6 [raid456]
Jan 28 21:31:32 SunSTG kernel: EFLAGS: 00000202 Not tainted
(2.6.18-92.1.22.el5PAE #1)
Jan 28 21:31:32 SunSTG kernel: EAX: ea0774e0 EBX: 000004e0 ECX: ead0ad30
EDX: ea077000
Jan 28 21:31:32 SunSTG kernel: ESI: ead0ade0 EDI: 00000004 EBP: ead0add0
DS: 007b ES: 007b
Jan 28 21:31:32 SunSTG kernel: CR0: 80050033 CR2: 0806e000 CR3: 373239e0
CR4: 000006f0
Jan 28 21:31:32 SunSTG kernel: [<f8d63562>] compute_parity6+0x21c/0x28a
[raid456]
Jan 28 21:31:32 SunSTG kernel: [<f8d6452e>] handle_stripe+0xc8b/0x215e
[raid456]
Jan 28 21:31:32 SunSTG kernel: [<c041fdb3>] enqueue_task+0x29/0x39
Jan 28 21:31:32 SunSTG kernel: [<c0420629>] try_to_wake_up+0x371/0x37b
Jan 28 21:31:32 SunSTG kernel: [<c041edec>] __wake_up_common+0x2f/0x53
Jan 28 21:31:32 SunSTG kernel: [<c041fbe6>] __wake_up+0x2a/0x3d
Jan 28 21:31:32 SunSTG kernel: [<f8d61744>] release_stripe+0x21/0x2e
[raid456]
Jan 28 21:31:33 SunSTG kernel: [<f8d65b0c>] raid5d+0x10b/0x130
[raid456]
Jan 28 21:31:33 SunSTG kernel: [<c059aca8>] md_thread+0xdf/0xf5
Jan 28 21:31:33 SunSTG kernel: [<c0436347>] autoremove_wake_function
+0x0/0x2d
Jan 28 21:31:33 SunSTG kernel: [<c059abc9>] md_thread+0x0/0xf5
Jan 28 21:31:33 SunSTG kernel: [<c0436285>] kthread+0xc0/0xeb
Jan 28 21:31:33 SunSTG kernel: [<c04361c5>] kthread+0x0/0xeb
Jan 28 21:31:33 SunSTG kernel: [<c0405c3b>] kernel_thread_helper
+0x7/0x10
Jan 28 21:31:33 SunSTG kernel: =======================
Jan 28 21:32:26 SunSTG kernel: BUG: soft lockup - CPU#2 stuck for 10s!
[md3_raid5:5672]
Jan 28 21:32:26 SunSTG kernel:
Jan 28 21:32:26 SunSTG kernel: Pid: 5672, comm: md3_raid5
Jan 28 21:32:26 SunSTG kernel: EIP: 0060:[<f8d68170>] CPU: 2
Jan 28 21:32:26 SunSTG kernel: EIP is at raid6_sse22_gen_syndrome
+0x118/0x1b6 [raid456]
Jan 28 21:32:26 SunSTG kernel: EFLAGS: 00000202 Not tainted
(2.6.18-92.1.22.el5PAE #1)
Jan 28 21:32:26 SunSTG kernel: EAX: ea784040 EBX: 00000040 ECX: ead0ad30
EDX: ea784000
Jan 28 21:32:26 SunSTG kernel: ESI: ead0adf0 EDI: 00000008 EBP: ead0add0
DS: 007b ES: 007b
Jan 28 21:32:26 SunSTG kernel: CR0: 80050033 CR2: b7f6f000 CR3: 3714e920
CR4: 000006f0
Jan 28 21:32:26 SunSTG kernel: [<f8d63562>] compute_parity6+0x21c/0x28a
[raid456]
Jan 28 21:32:26 SunSTG kernel: [<f8d6452e>] handle_stripe+0xc8b/0x215e
[raid456]
Jan 28 21:32:26 SunSTG kernel: [<c041f34b>] find_busiest_group
+0x177/0x462
Jan 28 21:32:26 SunSTG kernel: [<c041fc53>] task_rq_lock+0x31/0x58
Jan 28 21:32:26 SunSTG kernel: [<c0420629>] try_to_wake_up+0x371/0x37b
Jan 28 21:32:26 SunSTG kernel: [<f8d6171e>] __release_stripe+0xfc/0x101
[raid456]
Jan 28 21:32:26 SunSTG kernel: [<f8d61744>] release_stripe+0x21/0x2e
[raid456]
Jan 28 21:32:26 SunSTG kernel: [<f8d65b0c>] raid5d+0x10b/0x130
[raid456]
Jan 28 21:32:26 SunSTG kernel: [<c059aca8>] md_thread+0xdf/0xf5
Jan 28 21:32:26 SunSTG kernel: [<c0436347>] autoremove_wake_function
+0x0/0x2d
Jan 28 21:32:26 SunSTG kernel: [<c059abc9>] md_thread+0x0/0xf5
Jan 28 21:32:26 SunSTG kernel: [<c0436285>] kthread+0xc0/0xeb
Jan 28 21:32:26 SunSTG kernel: [<c04361c5>] kthread+0x0/0xeb
Jan 28 21:32:26 SunSTG kernel: [<c0405c3b>] kernel_thread_helper
+0x7/0x10
Jan 28 21:32:26 SunSTG kernel: =======================
<somewhere here I issue commands to create md4>
Jan 28 21:32:43 SunSTG kernel: md: syncing RAID array md4
Jan 28 21:32:43 SunSTG kernel: md: minimum _guaranteed_ reconstruction
speed: 1000 KB/sec/disc.
Jan 28 21:32:43 SunSTG kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for reconstruction.
Jan 28 21:32:43 SunSTG kernel: md: using 128k window, over a total of
244195200 blocks.
Jan 28 21:33:20 SunSTG kernel: BUG: soft lockup - CPU#3 stuck for 10s!
[md4_raid5:5694]
Jan 28 21:33:20 SunSTG kernel:
Jan 28 21:33:20 SunSTG kernel: Pid: 5694, comm: md4_raid5
Jan 28 21:33:20 SunSTG kernel: EIP: 0060:[<f8d63aff>] CPU: 3
Jan 28 21:33:20 SunSTG kernel: EIP is at handle_stripe+0x25c/0x215e
[raid456]
Jan 28 21:33:20 SunSTG kernel: EFLAGS: 00000282 Not tainted
(2.6.18-92.1.22.el5PAE #1)
Jan 28 21:33:20 SunSTG kernel: EAX: f6a2b404 EBX: 00000001 ECX: f53d17c0
EDX: e8c532c0
Jan 28 21:33:20 SunSTG kernel: ESI: e8c532c4 EDI: 00000016 EBP: e8c52b64
DS: 007b ES: 007b
Jan 28 21:33:20 SunSTG kernel: CR0: 8005003b CR2: b7cfc000 CR3: 3714ef00
CR4: 000006f0
Jan 28 21:33:20 SunSTG kernel: [<c041f34b>] find_busiest_group
+0x177/0x462
Jan 28 21:33:20 SunSTG kernel: [<c041fc53>] task_rq_lock+0x31/0x58
Jan 28 21:33:20 SunSTG kernel: [<c041fdb3>] enqueue_task+0x29/0x39
Jan 28 21:33:20 SunSTG kernel: [<c0420629>] try_to_wake_up+0x371/0x37b
Jan 28 21:33:20 SunSTG kernel: [<c041edec>] __wake_up_common+0x2f/0x53
Jan 28 21:33:20 SunSTG kernel: [<c041fbe6>] __wake_up+0x2a/0x3d
Jan 28 21:33:20 SunSTG kernel: [<f8d61744>] release_stripe+0x21/0x2e
[raid456]
Jan 28 21:33:20 SunSTG kernel: [<f8d65b0c>] raid5d+0x10b/0x130
[raid456]
Jan 28 21:33:20 SunSTG kernel: [<c059aca8>] md_thread+0xdf/0xf5
Jan 28 21:33:20 SunSTG kernel: [<c0436347>] autoremove_wake_function
+0x0/0x2d
Jan 28 21:33:20 SunSTG kernel: [<c059abc9>] md_thread+0x0/0xf5
Jan 28 21:33:21 SunSTG kernel: [<c0436285>] kthread+0xc0/0xeb
Jan 28 21:33:21 SunSTG kernel: [<c04361c5>] kthread+0x0/0xeb
Jan 28 21:33:21 SunSTG kernel: [<c0405c3b>] kernel_thread_helper
+0x7/0x10
Jan 28 21:33:21 SunSTG kernel: =======================
Jan 28 21:33:50 SunSTG kernel: BUG: soft lockup - CPU#3 stuck for 10s!
[md4_raid5:5694]
Jan 28 21:33:50 SunSTG kernel:
Jan 28 21:33:50 SunSTG kernel: Pid: 5694, comm: md4_raid5
Jan 28 21:33:50 SunSTG kernel: EIP: 0060:[<f8bf9813>] CPU: 3
Jan 28 21:33:50 SunSTG kernel: EIP is at xor_sse_5+0xa0/0x3b5 [xor]
Jan 28 21:33:50 SunSTG kernel: EFLAGS: 00000202 Not tainted
(2.6.18-92.1.22.el5PAE #1)
Jan 28 21:33:50 SunSTG kernel: EAX: 0000000b EBX: e8e66500 ECX: e8e69500
EDX: e8e6e500
Jan 28 21:33:50 SunSTG kernel: ESI: e8e67500 EDI: e8e68500 EBP: e96b5dd4
DS: 007b ES: 007b
Jan 28 21:33:50 SunSTG kernel: CR0: 80050033 CR2: b7cfc000 CR3: 3714ef00
CR4: 000006f0
Jan 28 21:33:50 SunSTG kernel: [<f8bfa200>] xor_block+0x74/0x7d [xor]
Jan 28 21:33:50 SunSTG kernel: [<f8d636b3>] compute_block_1+0xe3/0x13a
[raid456]
Jan 28 21:33:50 SunSTG kernel: [<f8d644ba>] handle_stripe+0xc17/0x215e
[raid456]
Jan 28 21:33:50 SunSTG kernel: [<c041f34b>] find_busiest_group
+0x177/0x462
Jan 28 21:33:50 SunSTG kernel: [<c041fdb3>] enqueue_task+0x29/0x39
Jan 28 21:33:50 SunSTG kernel: [<c0420629>] try_to_wake_up+0x371/0x37b
Jan 28 21:33:50 SunSTG kernel: [<c041edec>] __wake_up_common+0x2f/0x53
Jan 28 21:33:50 SunSTG kernel: [<c041fbe6>] __wake_up+0x2a/0x3d
Jan 28 21:33:50 SunSTG kernel: [<f8d61744>] release_stripe+0x21/0x2e
[raid456]
Jan 28 21:33:50 SunSTG kernel: [<f8d65b0c>] raid5d+0x10b/0x130
[raid456]
Jan 28 21:33:50 SunSTG kernel: [<c059aca8>] md_thread+0xdf/0xf5
Jan 28 21:33:50 SunSTG kernel: [<c0436347>] autoremove_wake_function
+0x0/0x2d
Jan 28 21:33:50 SunSTG kernel: [<c059abc9>] md_thread+0x0/0xf5
Jan 28 21:33:51 SunSTG kernel: [<c0436285>] kthread+0xc0/0xeb
Jan 28 21:33:51 SunSTG kernel: [<c04361c5>] kthread+0x0/0xeb
Jan 28 21:33:51 SunSTG kernel: [<c0405c3b>] kernel_thread_helper
+0x7/0x10
Jan 28 21:33:51 SunSTG kernel: =======================
... and it goes on complaining about md4_raid5:5694.
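When a log contains many of these reports, a quick tally by CPU and by md thread can show whether they are confined to one array. The following is a rough Python sketch, assuming the syslog line format seen above; the function name `summarize` and the sample lines are just for illustration.

```python
import re
from collections import Counter

# Patterns matching the message formats seen in the log above.
LOCKUP_RE = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!")
THREAD_RE = re.compile(r"Pid: \d+, comm: (\S+)")


def summarize(lines):
    """Count soft-lockup reports per CPU and per reported thread name."""
    cpus = Counter()
    threads = Counter()
    for line in lines:
        lockup = LOCKUP_RE.search(line)
        if lockup:
            cpus[int(lockup.group(1))] += 1
        thread = THREAD_RE.search(line)
        if thread:
            threads[thread.group(1)] += 1
    return cpus, threads


# A few lines copied from the log above.
sample = [
    "Jan 28 21:31:32 SunSTG kernel: BUG: soft lockup - CPU#0 stuck for 10s!",
    "Jan 28 21:31:32 SunSTG kernel: Pid: 5672, comm: md3_raid5",
    "Jan 28 21:33:20 SunSTG kernel: BUG: soft lockup - CPU#3 stuck for 10s!",
    "Jan 28 21:33:20 SunSTG kernel: Pid: 5694, comm: md4_raid5",
    "Jan 28 21:33:50 SunSTG kernel: BUG: soft lockup - CPU#3 stuck for 10s!",
    "Jan 28 21:33:50 SunSTG kernel: Pid: 5694, comm: md4_raid5",
]

cpus, threads = summarize(sample)
print(dict(cpus))
print(dict(threads))
```

To run it against a real log, pass an open file object instead of `sample`; the patterns only look at single lines, so the wrapped continuation lines in the trace are simply ignored.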
[root@SunSTG ~]# mdadm --detail /dev/md3
/dev/md3:
Version : 00.90.03
Creation Time : Wed Jan 28 21:30:50 2009
Raid Level : raid6
Array Size : 5372294400 (5123.42 GiB 5501.23 GB)
Used Dev Size : 244195200 (232.88 GiB 250.06 GB)
Raid Devices : 24
Total Devices : 24
Preferred Minor : 3
Persistence : Superblock is persistent
Update Time : Wed Jan 28 21:30:50 2009
State : clean, resyncing
Active Devices : 24
Working Devices : 24
Failed Devices : 0
Spare Devices : 0
Chunk Size : 64K
Rebuild Status : 15% complete
UUID : d8c2b5ce:576a117b:f2494cd1:626a774c
Events : 0.1
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/sda
1 65 160 1 active sync /dev/sdaa
2 65 176 2 active sync /dev/sdab
3 65 208 3 active sync /dev/sdad
4 65 224 4 active sync /dev/sdae
5 65 240 5 active sync /dev/sdaf
6 66 0 6 active sync /dev/sdag
7 66 16 7 active sync /dev/sdah
8 66 32 8 active sync /dev/sdai
9 66 48 9 active sync /dev/sdaj
10 66 64 10 active sync /dev/sdak
11 66 80 11 active sync /dev/sdal
12 66 96 12 active sync /dev/sdam
13 66 112 13 active sync /dev/sdan
14 66 128 14 active sync /dev/sdao
15 66 144 15 active sync /dev/sdap
16 66 160 16 active sync /dev/sdaq
17 66 176 17 active sync /dev/sdar
18 66 192 18 active sync /dev/sdas
19 66 208 19 active sync /dev/sdat
20 66 224 20 active sync /dev/sdau
21 66 240 21 active sync /dev/sdav
22 8 16 22 active sync /dev/sdb
23 8 32 23 active sync /dev/sdc
[root@SunSTG ~]# mdadm --detail /dev/md4
/dev/md4:
Version : 00.90.03
Creation Time : Wed Jan 28 21:32:39 2009
Raid Level : raid6
Array Size : 4883904000 (4657.65 GiB 5001.12 GB)
Used Dev Size : 244195200 (232.88 GiB 250.06 GB)
Raid Devices : 22
Total Devices : 22
Preferred Minor : 4
Persistence : Superblock is persistent
Update Time : Wed Jan 28 21:32:39 2009
State : clean, resyncing
Active Devices : 22
Working Devices : 22
Failed Devices : 0
Spare Devices : 0
Chunk Size : 64K
Rebuild Status : 17% complete
UUID : 7e2c7f35:f51c9047:40130c15:63a7cfa6
Events : 0.1
Number Major Minor RaidDevice State
0 8 48 0 active sync /dev/sdd
1 8 64 1 active sync /dev/sde
2 8 80 2 active sync /dev/sdf
3 8 96 3 active sync /dev/sdg
4 8 112 4 active sync /dev/sdh
5 8 128 5 active sync /dev/sdi
6 8 144 6 active sync /dev/sdj
7 8 160 7 active sync /dev/sdk
8 8 176 8 active sync /dev/sdl
9 8 192 9 active sync /dev/sdm
10 8 208 10 active sync /dev/sdn
11 8 224 11 active sync /dev/sdo
12 8 240 12 active sync /dev/sdp
13 65 0 13 active sync /dev/sdq
14 65 16 14 active sync /dev/sdr
15 65 32 15 active sync /dev/sds
16 65 48 16 active sync /dev/sdt
17 65 64 17 active sync /dev/sdu
18 65 80 18 active sync /dev/sdv
19 65 96 19 active sync /dev/sdw
20 65 112 20 active sync /dev/sdx
21 65 144 21 active sync /dev/sdz
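As a sanity check on the two listings, the reported Array Size values match what RAID6 should give: two devices' worth of space goes to parity, so usable capacity is (devices - 2) times the Used Dev Size. A small Python sketch of that arithmetic follows; the helper name is made up for the example, and it treats the sizes mdadm prints as 1 KiB blocks.

```python
def raid6_capacity_kib(devices, used_dev_size_kib):
    """Usable RAID6 capacity: two devices' worth of space holds parity."""
    return (devices - 2) * used_dev_size_kib


USED_DEV_SIZE_KIB = 244195200  # "Used Dev Size" from both listings

md3_kib = raid6_capacity_kib(24, USED_DEV_SIZE_KIB)
md4_kib = raid6_capacity_kib(22, USED_DEV_SIZE_KIB)

for name, kib in (("md3", md3_kib), ("md4", md4_kib)):
    gib = kib / 1024**2        # KiB -> GiB
    gb = kib * 1024 / 1000**3  # KiB -> bytes -> GB
    print(f"{name}: {kib} KiB = {gib:.2f} GiB = {gb:.2f} GB")
```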
--
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211