Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
From: Shawn Anastasio <hidden>
Date: 2019-06-18 07:04:42
On 6/18/19 1:39 AM, Alexey Kardashevskiy wrote:
> On 18/06/2019 14:26, Shawn Anastasio wrote:
>> On 6/12/19 2:15 PM, Shawn Anastasio wrote:
>>> On 6/12/19 2:07 AM, Alexey Kardashevskiy wrote:
>>>> On 12/06/2019 15:05, Shawn Anastasio wrote:
>>>>> On 6/5/19 11:11 PM, Shawn Anastasio wrote:
>>>>>> On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
>>>>>>> This is an attempt to allow DMA masks between 32..59 which are not
>>>>>>> large enough to use either a PHB3 bypass mode or a sketchy bypass.
>>>>>>> Depending on the max order, up to 40 is usually available.
>>>>>>>
>>>>>>> This is based on v5.2-rc2.
>>>>>>>
>>>>>>> Please comment. Thanks.
>>>>>>
>>>>>> I have tested this patch set with an AMD GPU that's limited to <64bit
>>>>>> DMA (I believe it's 40 or 42 bit). It successfully allows the card to
>>>>>> operate without falling back to 32-bit DMA mode as it does without
>>>>>> the patches.
>>>>>>
>>>>>> Relevant kernel log message:
>>>>>> [ 0.311211] pci 0033:01 : [PE# 00] Enabling 64-bit DMA bypass
>>>>>>
>>>>>> Tested-by: Shawn Anastasio <redacted>
>>>>>
>>>>> After a few days of further testing, I've started to run into
>>>>> stability issues with the patch applied and used with an AMD GPU.
>>>>> Specifically, the system sometimes spontaneously crashes. Not just
>>>>> EEH errors either, the whole system shuts down in what looks like a
>>>>> checkstop.
>>>>>
>>>>> Perhaps some subtle corruption is occurring?
>>>>
>>>> Have you tried this?
>>>>
>>>> https://patchwork.ozlabs.org/patch/1113506/
>>>
>>> I have not. I'll give it a shot and try it out for a few days to see
>>> if I'm able to reproduce the crashes.
>>
>> A few days later and I was able to reproduce the checkstop while
>> watching a video in mpv. At this point the system had ~4 day uptime
>> and this wasn't the first video I watched during that time.
>>
>> This is with https://patchwork.ozlabs.org/patch/1113506/ applied, too.
>
> Any logs left? What was the reason for the checkstop and what is the
> hardware? "lscpu" and "lspci -vv" for the starter would help. Thanks,
The machine is a Talos II with 2x 8 core DD2.2 Sforza modules. I've added the output of lscpu and lspci below. As for logs, it doesn't seem there are any kernel logs of the event. The opal-gard utility shows some error records which I have also included below.

opal-gard:
$ sudo ./opal-gard show 1
Record ID: 0x00000001
========================
Error ID: 0x9000000b
Error Type: Fatal (0xe3)
Path Type: physical
>Sys, Instance #0
>Node, Instance #0
>Proc, Instance #1
>EQ, Instance #0
>EX, Instance #0
$ sudo ./opal-gard show 2
Record ID: 0x00000002
========================
Error ID: 0x90000021
Error Type: Fatal (0xe3)
Path Type: physical
>Sys, Instance #0
>Node, Instance #0
>Proc, Instance #1
>EQ, Instance #2
>EX, Instance #1
lscpu:
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 52
On-line CPU(s) list: 0-3,8-31,36-47,52-63
Thread(s) per core: 4
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Model: 2.2 (pvr 004e 1202)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2154.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-3,8-31
NUMA node8 CPU(s): 36-47,52-63
lspci -vv: Output at: https://upaste.anastas.io/IwVQzt
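
As a quick cross-check of the lscpu output above, the on-line CPU list can be expanded and compared against the reported "CPU(s): 52". This is only a sketch: the helper name parse_cpu_list is made up for illustration, and grouping threads into cores by dividing the CPU number by 4 assumes contiguous thread numbering from 0 to 63 (2 sockets x 8 cores x 4 threads), which the output suggests but does not confirm.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "0-3,8-31" into a set of CPU numbers.

    Hypothetical helper, written for this example only.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


# Value copied from the "On-line CPU(s) list" line of the lscpu output.
online = parse_cpu_list("0-3,8-31,36-47,52-63")

# Assumption: 4 hardware threads per core, numbered contiguously.
threads_per_core = 4

# CPU numbers missing from the on-line list, up to the highest on-line CPU.
offline = sorted(set(range(max(online) + 1)) - online)

# Core indices (CPU number // 4) that have no on-line thread.
offline_cores = sorted({cpu // threads_per_core for cpu in offline})

print(len(online))     # number of on-line CPUs
print(offline)         # CPU numbers not on-line
print(offline_cores)   # core indices under the numbering assumption
```

Under those assumptions the list expands to 52 CPUs, matching the lscpu summary, and the gaps are CPUs 4-7, 32-35 and 48-51, i.e. three whole cores' worth of threads.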