Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4 driver for... | linux-arm-kernel

[cc:ing honeycomb-users, didn't think of that earlier]

On Mon, Feb 10, 2020 at 5:16 PM Russell King - ARM Linux admin
[off-list ref] wrote:
On Mon, Feb 10, 2020 at 04:28:23PM +0100, Olof Johansson wrote:
On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin
[off-list ref] wrote:
On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou [off-list ref] wrote:
Hi Olof,

Thanks a lot for your comments!
And sorry for my delayed response!
Actually, they apply with only minor conflicts on top of current -next.

Bjorn, any chance we can get you to pick these up pretty soon? They
enable full use of a promising ARM developer system, the SolidRun
HoneyComb, and would be quite valuable for me and others to be able to
use with mainline or -next without any additional patches applied --
which this patchset achieves.

I know there are pending revisions based on feedback. I'll leave it up
to you and others to determine if that can be done with incremental
patches on top, or if it should be fixed before the initial patchset
is applied. But all in all, it's holding up adoption by me and surely
others of a very interesting platform -- I'm looking to replace my
aging MacchiatoBin with one of these and would need PCIe/NVMe to work
before I do.
If you're going to be using NVMe, make sure you use a power-fail safe
version; I've already had one instance where ext4 failed to mount
because of a corrupted journal using an XPG SX8200 after the Honeycomb
SError'd, and then I powered it down after a few hours before later
booting it back up.

EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
EXT4-fs (nvme0n1p2): write access will be enabled during recovery
JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
EXT4-fs (nvme0n1p2): error loading journal
Hmm, using btrfs on mine, not sure if the exposure is similar or not.
As I understand the problem, it isn't a filesystem issue.  It's a data
integrity issue with the NVMe over power fail, how they cache the data,
and ultimately write it to the nand flash.

Have a read of:

https://www.kingston.com/en/solutions/servers-data-centers/ssd-power-loss-protection

NVMe drives and SATA SSDs are basically the same underlying technology
(only the host interface differs). Given the issues I've heard about,
and have now experienced with my NVMe, I think the above is a good
pointer to the problems of flash mass storage.

As I understand it, the problem occurs when the mapping table has not
been written back to flash, power is lost without the Standby Immediate
command being sent, and there is no way for the firmware to quickly
save the table.  On subsequent power up, the firmware has to
reconstruct the mapping table, and depending on how that is done,
incorrect (old?) data may be returned for some blocks.

That can happen to any blocks on the drive, which means any data can
be at risk from a power loss event, whether that is a power failure
or after a crash.
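
A toy model of the failure mode described above may make it concrete.
This is a minimal sketch under loud assumptions: the class
ToyFlashDrive, its methods, and the naive "reload the last saved map"
recovery are all hypothetical and do not describe any real drive's
firmware.

```python
class ToyFlashDrive:
    """Toy flash translation layer; hypothetical, not real firmware."""

    def __init__(self):
        self.pages = []      # append-only physical pages (stand-in for NAND)
        self.ram_map = {}    # logical block -> page index, held in volatile RAM
        self.saved_map = {}  # last copy of the map written back to flash

    def write(self, lba, data):
        # New data always lands on a fresh page; only the RAM map is updated.
        self.pages.append(data)
        self.ram_map[lba] = len(self.pages) - 1

    def flush_map(self):
        # What a clean shutdown would give the firmware time to do.
        self.saved_map = dict(self.ram_map)

    def power_loss(self):
        # RAM contents vanish; this naive recovery just reloads the saved map.
        self.ram_map = dict(self.saved_map)

    def read(self, lba):
        idx = self.ram_map.get(lba)
        return None if idx is None else self.pages[idx]


def demo():
    drive = ToyFlashDrive()
    drive.write(7, "journal v1")
    drive.flush_map()
    drive.write(7, "journal v2")  # this map update lives only in RAM
    before = drive.read(7)
    drive.power_loss()            # power drops before the map is flushed
    after = drive.read(7)
    return before, after
```

In this model demo() returns ("journal v2", "journal v1"): the newer
data is still sitting on flash, but the stale map points the logical
block back at the old page, which is the "incorrect (old?) data"
case.
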
Makes me wonder whether there's some board-level power/reset sequencing
issue, or if there's a problem with one card going down disabling
others. I haven't read the specs enough to know what's expected
behavior but I've seen similar issues on other platforms so take it
with a grain of salt.

Do you know if the SError was due to a known issue and/or if it's
something that's fixed in production silicon?
The SError is triggered by something on the PCIe side of things; if I
leave the Mellanox PCIe card out, then I don't get them.  The errata
patches I have merged into my tree help a bit, turning the code from
being unable to boot without a SError with the card plugged in, to
being able to boot and last a while - but the SErrors still eventually
come, maybe taking a few days... and that's without the Mellanox
ethernet interface being up.

(I still can't enable SMMU since across a warm reboot it fails
*completely*, with nothing coming up and working. NXP folks, you
listening? :)
Is it just a warm reboot?  I thought I saw SMMU activity on a cold
boot as well, implying that there were devices active that Linux
did not know about.
Yeah, 100% reproducible on warm reboot -- every single time. Not on
cold boot though (100% success rate as far as I remember). I boot with
kernel on NVMe on PCIe, native 1GbE for networking. u-boot from SD
card.

This is with the SolidRun u-boot from GitHub.

-Olof

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
