Re: RAID5 hard freeze
From: NeilBrown <hidden>
Date: 2014-02-25 02:58:09
On Tue, 25 Feb 2014 00:01:42 +0200 Denis Golovan [off-list ref] wrote:
> Hi all,
>
> I am struggling to diagnose a strange freeze of a software RAID5 array.
> My RAID5 consists of 4 Toshiba SATA drives and has an ext4 filesystem on
> top of it. It works fine unless I start several processes writing
> intensively to it. At first it looks like the system is under high
> pressure, then the system starts lagging a lot, and a hard freeze always
> follows after several minutes.
>
> No errors in the system log, nothing is emitted to the console. Just a
> hard freeze with the HDD light always on. I tried enabling kernel network
> logging to another machine and again got no information when it hangs.
> After reboot, my array starts reconstruction and finishes without errors.
>
> I tried disabling quotas and barriers for ext4. After disabling barriers
> it almost seemed to work, but after some time the same hard freeze
> happens.
>
> I tested the same hardware configuration under Linux v3.10, 3.11, 3.12
> and now 3.13.5 (all x86 arch); all behave the same way. The issue can be
> reproduced easily.
>
> I have now tried everything Google suggests on the matter. Could you give
> a hint on how to debug this issue?
The most useful thing for debugging a hard freeze is the alt-sysrq-T output
when it is frozen. Typing that magic sequence should always produce some
output unless it is hard-frozen with interrupts disabled.

So make sure you can produce the output when the system is working properly
(to a log file via the network console would be ideal), then when it hangs,
produce the output again. You probably need to have a text console rather
than a graphical console for it to work.

If it is hard-hanging with interrupts disabled, then it gets tricky. I
thought there was some NMI-based lockup detector which would warn if that
happened, but I cannot find it just now.

NeilBrown
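
For the "make sure it works while the system is healthy" step, a minimal sketch along these lines might help. It assumes the standard /proc/sys/kernel/sysrq and /proc/sysrq-trigger interfaces; the function names are made up for illustration. It can only be used while the machine still responds, so once the machine is frozen the keyboard sequence is still what has to work.

```python
from pathlib import Path

# Standard kernel interfaces for magic SysRq.
SYSRQ_SETTING = Path("/proc/sys/kernel/sysrq")
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")

ENABLE_ALL = 1     # a value of exactly 1 enables every SysRq function
ENABLE_DUMP = 0x8  # bitmask bit for debugging dumps of processes


def keyboard_task_dump_allowed(setting: int) -> bool:
    """Return True if alt-sysrq-T from the keyboard is permitted.

    Hypothetical helper: 'setting' is the integer read from
    /proc/sys/kernel/sysrq.
    """
    return setting == ENABLE_ALL or bool(setting & ENABLE_DUMP)


def read_sysrq_setting(path: Path = SYSRQ_SETTING) -> int:
    """Read the current SysRq setting as an integer."""
    return int(path.read_text().strip())


def request_task_dump(trigger: Path = SYSRQ_TRIGGER) -> None:
    """Ask the kernel for a task dump, like alt-sysrq-T.

    Needs root when pointed at the real trigger file. The dump goes to
    the kernel log, so check dmesg or the netconsole receiver afterwards.
    """
    trigger.write_text("t")
```

Run as root, calling read_sysrq_setting() and then request_task_dump() on the healthy system should leave a task dump in the kernel log and on the remote log. If keyboard_task_dump_allowed() returns False for the current setting, the keyboard sequence would be ignored at the moment it is needed, so the setting has to be changed first.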