On Mon, Feb 05, 2024 at 05:12:31PM -0600, Bjorn Helgaas wrote:
On Thu, Jan 25, 2024 at 02:27:59PM +0800, Wang, Qingshun wrote:
When Advisory Non-Fatal errors are raised, both correctable and
uncorrectable error statuses will be set. The current kernel code cannot
store both statuses at the same time, thus failing to handle ANFE properly.
In addition, to avoid accidentally clearing UEs that are not ANFE, UE
severity and Device Status also need to be recorded: a fatal UE cannot
be ANFE, and if Fatal/Non-Fatal Error Detected is set in Device Status, do
not make any assumptions and let the UE handler clear the UE status.
Store status and mask of both correctable and uncorrectable errors in
aer_err_info. The severity of UEs and the values of the Device Status
register are also recorded, which will be used to determine UEs that should
be handled by the ANFE handler. Refactor the rest of the code to use
cor/uncor_status and cor/uncor_mask fields instead of status and mask
fields.
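
For illustration only, here is a minimal user-space sketch of the
bookkeeping the description above implies. It is not the patch's code:
the struct name, the helper name, and the choice to filter by the
uncorrectable mask are assumptions made for this example. The bit values
mirror PCI_ERR_COR_ADV_NFAT, PCI_EXP_DEVSTA_NFED and PCI_EXP_DEVSTA_FED
from pci_regs.h.

```c
#include <stdint.h>

/* Advisory Non-Fatal Error bit in the Correctable Error Status register */
#define COR_ADV_NFAT  0x00002000u
/* Non-Fatal / Fatal Error Detected bits in the Device Status register */
#define DEVSTA_NFED   0x0002u
#define DEVSTA_FED    0x0004u

/* Hypothetical stand-in for the extended aer_err_info fields. */
struct err_info_sketch {
	uint32_t cor_status;      /* Correctable Error Status */
	uint32_t cor_mask;        /* Correctable Error Mask */
	uint32_t uncor_status;    /* Uncorrectable Error Status */
	uint32_t uncor_mask;      /* Uncorrectable Error Mask */
	uint32_t uncor_severity;  /* Uncorrectable Error Severity (1 = fatal) */
	uint16_t device_status;   /* PCIe Device Status */
};

/*
 * Return the uncorrectable status bits that may be treated as ANFE.
 * Returns 0 when no such bits can be identified safely, in which case
 * the UE status is left for the UE handler.
 */
static uint32_t anfe_uncor_bits(const struct err_info_sketch *info)
{
	/* ANFE is signalled through the correctable Advisory Non-Fatal bit. */
	if (!(info->cor_status & ~info->cor_mask & COR_ADV_NFAT))
		return 0;

	/* Fatal/Non-Fatal Error Detected set: make no assumptions. */
	if (info->device_status & (DEVSTA_NFED | DEVSTA_FED))
		return 0;

	/*
	 * A fatal UE cannot be ANFE, so drop bits whose severity is fatal.
	 * Dropping masked bits as well is an assumption of this sketch.
	 */
	return info->uncor_status & ~info->uncor_mask & ~info->uncor_severity;
}
```

Keeping both status words side by side is what makes this check
possible: with a single status field, the correctable ANFE bit and the
uncorrectable bits it refers to cannot be examined together.
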
There's a lot going on in this patch. Could it possibly be split up a
bit, e.g., first tease apart aer_err_info.status/.mask into
.cor_status/mask and .uncor_status/mask, then add .uncor_severity,
then add the device_status bit separately? If it could be split up, I
think the ANFE case would be easier to see.
Thanks a lot for working on this area!
Bjorn
Thanks for the feedback! Will split it up into two patches in the next
version.
--
Best regards,
Wang, Qingshun