Re: Reproduceable SATA lockup on 3.7.8 with SSD

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date: 2013-02-26 01:02:38
Also in: lkml

* Marc MERLIN (marc@merlins.org) wrote:
> Howdy,
>
> I seem to have the same problem (or similar) as Mathieu Desnoyers in
> https://lkml.org/lkml/2013/2/22/437
>
> I can reliably get my SSD to drop from the SATA bus given the right workload
> on linux.
>
> How can I tell if it's linux's fault or the drive's fault?

Here is a pseudo-git-blame checklist that might be useful for accurate
finger-pointing when a drive fails:

- try diagnostic tools from your drive vendor; if they report your drive
  as bad, then it might just be your drive failing,
- try to run a SMART test from smartmontools,
- try to reproduce your issue with a simple test-case (trying my test
  program might help) that clearly fails quickly, and all the time, on
  your problematic hardware,
- find out if there are known firmware upgrades for your drive provided
  by your vendor, try them out,
- find out if there are known BIOS upgrades for your machine provided by
  your vendor, try them out,
- try test-case on various kernel versions,
- try test-case on various distributions (just in case),
- try test-case with power management disabled in your machine's BIOS,
- try test-case with other SSD drives of the exact same model as
  yours, so you can see if it's just your own drive failing,
- try moving your drive to a different machine (same model, different
  model), and see if the test-case still fails,
- try with other SSD drives (from other vendors) on your machine,
- check whether your partition's mount options enable TRIM, and try to
  disable TRIM explicitly (see mount(8), discard/nodiscard option),
- try using a different filesystem (just in case),
- try using a different block I/O scheduler,
- try using your drive vendor's SSD eraser to reinitialize your entire
  disk (yes, you will lose all your data). This might be useful if
  TRIM handling has changed after a firmware upgrade, for instance.
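
Two of the checklist items above (the TRIM mount option and the block I/O
scheduler) can be inspected without changing anything on the system. The
following is a minimal Python sketch, not a tested diagnostic tool. The
function names are made up for illustration, and the device name "sda" in
the sysfs path is an assumption that has to be adapted to the drive under
test.

```python
def mounts_with_discard(mounts_text):
    """Return (device, mountpoint) pairs whose options enable discard (TRIM).

    mounts_text is the content of /proc/mounts: one mount per line,
    whitespace-separated fields, options in the fourth field.
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mountpoint, options = fields[0], fields[1], fields[3]
        for opt in options.split(","):
            # Match "discard" or "discard=<mode>", but not "nodiscard".
            if opt == "discard" or opt.startswith("discard="):
                hits.append((device, mountpoint))
                break
    return hits


def active_scheduler(scheduler_text):
    """Return the scheduler shown in brackets, e.g. 'noop [cfq]' -> 'cfq'."""
    for word in scheduler_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            for device, mountpoint in mounts_with_discard(f.read()):
                print("discard enabled:", device, "on", mountpoint)
        # "sda" is an assumption; substitute the device being tested.
        with open("/sys/block/sda/queue/scheduler") as f:
            print("active scheduler:", active_scheduler(f.read()))
    except OSError as err:
        print("could not read system files:", err)
```

Recording both values next to each test-case run makes it easier to compare
results across kernels, filesystems and machines.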

With all those results in hand, it should become easier to identify the
cause of your problem. My personal research currently indicates that the
Intel SSDSC2BW180A3L drives found in Lenovo x230 laptops I have tested
so far (4 different laptops) all fail after a couple of minutes with my
simple random-access-write workload. Moving the drives into a different
laptop (x200) does not help (it still fails).
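
For readers who want something to start from, here is a rough Python sketch
of what a random-access-write workload can look like. It is not the test
program mentioned above, only an illustration of the general idea: the
function name, the default sizes and the fsync interval are all assumptions.
Point it at a scratch file on the suspect drive, never at a file you care
about.

```python
import os
import random


def random_write_workload(path, file_size=64 * 1024 * 1024, block_size=4096,
                          n_writes=10000, sync_every=256, seed=None):
    """Write random blocks at random block-aligned offsets within one file.

    Returns the total number of bytes written. All defaults are
    illustrative assumptions, not values taken from the original report.
    """
    rng = random.Random(seed)
    n_blocks = file_size // block_size
    written = 0
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, file_size)  # size the file once, up front
        for i in range(n_writes):
            offset = rng.randrange(n_blocks) * block_size
            written += os.pwrite(fd, os.urandom(block_size), offset)
            if (i + 1) % sync_every == 0:
                os.fsync(fd)  # push the writes down to the device
        os.fsync(fd)
    finally:
        os.close(fd)
    return written
```

Run it in a loop while watching the kernel log in another terminal. If the
drive behaves like the ones described here, expect it to drop off the bus,
so do not run this on a machine holding unsaved work.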

Good luck!

Mathieu

> Thanks,
> Marc

----- Forwarded message from Marc MERLIN [off-list ref] -----

From: Marc MERLIN <redacted>
To: linux-ide@vger.kernel.org

Hopefully this is the right list. I know that IDE!=SATA, but I can't find
a SATA list.
Please redirect me if needed.

Hardware:
Lenovo T530, 64bit kernel and userland.
Hardware is shown below, but there are 2 drives: one SSD (OCZ-VERTEX4) and one HD (Hitachi HTS54101).

The SSD will lock up reliably if I run a specific mencoder command that reads MP4
files and rewrites them to another file in the same directory.

The log of what happens is shown below; the drive is eventually taken off the bus.
Once I reboot, it is back, as if nothing happened.
If I run the same command on the HD, it works, but of course timings will be different
since the HD is slower.

How can I tell if it's the SSD's firmware's fault, or the linux SATA/AHCI code
that is buggy?

Thanks,
Marc

Failure log:
ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/00:00:00:38:13/04:00:33:00:00/40 tag 0 ncq 524288 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/00:08:00:3c:13/04:00:33:00:00/40 tag 1 ncq 524288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
(snipped)
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/00:e8:00:30:13/04:00:33:00:00/40 tag 29 ncq 524288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/00:f0:00:34:13/04:00:33:00:00/40 tag 30 ncq 524288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: link is slow to respond, please be patient (ready=0)
ata1: COMRESET failed (errno=-16)
ata1: hard resetting link
ata1: link is slow to respond, please be patient (ready=0)
ata1: COMRESET failed (errno=-16)
ata1: hard resetting link
ata1: link is slow to respond, please be patient (ready=0)
ata1: COMRESET failed (errno=-16)
ata1: limiting SATA link speed to 3.0 Gbps
ata1: hard resetting link
ata1: COMRESET failed (errno=-16)
ata1: reset failed, giving up
ata1.00: disabled
ata1.00: device reported invalid CHS sector 0
(...)
ata1.00: device reported invalid CHS sector 0
ata1: EH complete
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda]  
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:0:0:0: [sda] CDB: 
Write(10): 2a 00 33 13 34 00 00 04 00 00
end_request: I/O error, dev sda, sector 856896512
sd 0:0:0:0: [sda] Unhandled error code


Boot shows:
ahci 0000:00:1f.2: version 3.0
ahci 0000:00:1f.2: irq 42 for MSI/MSI-X
ahci: SSS flag set, parallel bus scan disabled
ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x13 impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq ilck stag pm led clo pio slum part ems sxs apst 
ahci 0000:00:1f.2: setting latency timer to 64
scsi0 : ahci
scsi1 : ahci
scsi2 : ahci
scsi3 : ahci
scsi4 : ahci
scsi5 : ahci
ata1: SATA max UDMA/133 abar m2048@0xf2538000 port 0xf2538100 irq 42
ata2: SATA max UDMA/133 abar m2048@0xf2538000 port 0xf2538180 irq 42
ata3: DUMMY
ata4: DUMMY
ata5: SATA max UDMA/133 abar m2048@0xf2538000 port 0xf2538300 irq 42
ata6: DUMMY
scsi6 : pata_legacy
ata7: PATA max PIO4 cmd 0x1f0 ctl 0x3f6 irq 14
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
ata1.00: ATA-9: OCZ-VERTEX4, 1.5, max UDMA/133
ata1.00: 1000215216 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
ata1.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access     ATA      OCZ-VERTEX4      1.5  PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 1000215216 512-byte logical blocks: (512 GB/476 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3 sda4
sd 0:0:0:0: [sda] Attached SCSI disk
ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata2.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
ata2.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
ata2.00: ATA-8: Hitachi HTS541010A9E680, JA0OA480, max UDMA/133
ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
ata2.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
ata2.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
ata2.00: configured for UDMA/133
scsi 1:0:0:0: Direct-Access     ATA      Hitachi HTS54101 JA0O PQ: 0 ANSI: 5
sd 1:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sd 1:0:0:0: [sdb] 4096-byte physical blocks
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ACPI: Invalid Power Resource to register!
ACPI: Invalid Power Resource to register!
tsc: Refined TSC clocksource calibration: 2893.427 MHz
Switching to clocksource tsc
 sdb: sdb1 sdb2 sdb3 sdb4
sd 1:0:0:0: [sdb] Attached SCSI disk
ata5: SATA link down (SStatus 0 SControl 300)
scsi7 : pata_legacy
ata8: PATA max PIO4 cmd 0x170 ctl 0x376 irq 15

----- End forwarded message -----
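
When comparing failure logs like the one in the forwarded message, it can
help to decode the numbers rather than read them as opaque hex. The Python
sketch below decodes the SCSI Write(10) CDB and the libata "cmd" field. The
Write(10) layout is standard. The taskfile field order
(command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device)
is my reading of how libata prints it and should be treated as an
assumption, as should the function names.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as hex bytes; return (lba, blocks)."""
    cdb = bytes.fromhex(cdb_hex)  # spaces between hex bytes are accepted
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, blocks


def decode_ncq_taskfile(cmd_field):
    """Decode a libata NCQ 'cmd' field; return (tag, lba, sectors).

    Assumes the field order described in the lead-in. For FPDMA commands
    the sector count is carried in the feature registers and the tag in
    bits 7:3 of the sector-count register.
    """
    command, low, high, _device = cmd_field.split("/")
    if int(command, 16) not in (0x60, 0x61):  # READ/WRITE FPDMA QUEUED
        raise ValueError("not an NCQ read/write command")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":"))
    tag = nsect >> 3
    sectors = (hob_feature << 8) | feature
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return tag, lba, sectors


if __name__ == "__main__":
    print(decode_write10("2a 00 33 13 34 00 00 04 00 00"))
    print(decode_ncq_taskfile("61/00:f0:00:34:13/04:00:33:00:00/40"))
```

Applied to the log above, the CDB decodes to LBA 856896512 with 1024 blocks,
which matches the "I/O error ... sector 856896512" line, and 1024 sectors of
512 bytes is the 524288 shown after "ncq". The tag 30 taskfile decodes to
the same LBA, so the SCSI-level error corresponds to one of the NCQ writes
that timed out.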

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com