Re: [PATCH v3 3/3] EDAC/amd64: Enumerate memory on noncpu nodes
From: Yazen Ghannam <yazen.ghannam@amd.com>
Date: 2021-09-01 18:42:39
Also in:
lkml
On Fri, Aug 27, 2021 at 01:30:57PM +0200, Borislav Petkov wrote:
On Tue, Aug 24, 2021 at 12:24:37AM +0530, Naveen Krishna Chatradhi wrote:
...
quoted
@@ -1335,6 +1392,11 @@ static void determine_memory_type(struct amd64_pvt *pvt) u32 dram_ctrl, dcsm; if (pvt->umc) { + if (pvt->is_noncpu) { + pvt->dram_type = MEM_HBM2; + return; + }I don't like this sprinkling of "if (pvt->is_noncpu)" everywhere, at all. Please define a separate read_mc_regs_df() or so which contains only the needed functionalty which you can carve out from read_mc_regs().
I like this idea.
quoted
+ if ((pvt->umc[0].dimm_cfg | pvt->umc[1].dimm_cfg) & BIT(5)) pvt->dram_type = MEM_LRDDR4; else if ((pvt->umc[0].dimm_cfg | pvt->umc[1].dimm_cfg) & BIT(4))@@ -1724,7 +1786,10 @@ static int f17_early_channel_count(struct amd64_pvt *pvt) /* SDP Control bit 31 (SdpInit) is clear for unused UMC channels */ for_each_umc(i) - channels += !!(pvt->umc[i].sdp_ctrl & UMC_SDP_INIT); + if (pvt->is_noncpu) + channels += pvt->csels[i].b_cnt; + else + channels += !!(pvt->umc[i].sdp_ctrl & UMC_SDP_INIT); amd64_info("MCT channel count: %d\n", channels);No, a separate gpu_early_channel_count() is needed here. There's a reason for those function pointers getting assigned depending on family.
Good point. ...
quoted
+/* + * The CPUs have one channel per UMC, So a UMC number is equivalent to a + * channel number. The NONCPUs have 8 channels per UMC, so the UMC number no + * longer works as a channel number. + * The channel number within a NONCPU UMC is given in MCA_IPID[15:12]. + * However, the IDs are split such that two UMC values go to one UMC, and + * the channel numbers are split in two groups of four. + * + * Refer comment on get_noncpu_umc_base() from amd64_edac.h + * + * For example, + * UMC0 CH[3:0] = 0x0005[3:0]000 + * UMC0 CH[7:4] = 0x0015[3:0]000 + * UMC1 CH[3:0] = 0x0025[3:0]000 + * UMC1 CH[7:4] = 0x0035[3:0]000 + */ +static int find_umc_channel_noncpu(struct mce *m) +{ + u8 umc = find_umc_channel(m); + u8 ch = ((m->ipid >> 12) & 0xf); + + return umc % 2 ? (ch + 4) : ch; +} + static void decode_umc_error(int node_id, struct mce *m) { u8 ecc_type = (m->status >> 45) & 0x3;@@ -2897,6 +3003,7 @@ static void decode_umc_error(int node_id, struct mce *m) struct amd64_pvt *pvt; struct err_info err; u64 sys_addr; + u8 df_inst_id;You don't need that variable and can work with err.channel just fine.quoted
mci = edac_mc_find(node_id); if (!mci)@@ -2909,7 +3016,22 @@ static void decode_umc_error(int node_id, struct mce *m) if (m->status & MCI_STATUS_DEFERRED) ecc_type = 3; - err.channel = find_umc_channel(m); + if (pvt->is_noncpu) { + /* + * The NONCPUs have one Chip Select per UMC, so the UMC number + * can used as the Chip Select number. However, the UMC number + * is split in the ID value so it's necessary to divide by 2. + */ + err.csrow = find_umc_channel(m) / 2; + err.channel = find_umc_channel_noncpu(m); + /* On NONCPUs, instance id is calculated as below. */ + df_inst_id = err.csrow * 8 + err.channel;err.channel += err.csrow * 8; tadaaa!
err.channel still needs to be used in error_address_to_page_and_offset() below. So changing it here messes up what's reported to EDAC. ...
quoted
@@ -3804,6 +3963,9 @@ static int probe_one_instance(unsigned int nid) struct ecc_settings *s; int ret; + if (!F3) + return 0; + ret = -ENOMEM; s = kzalloc(sizeof(struct ecc_settings), GFP_KERNEL); if (!s)@@ -3815,6 +3977,9 @@ static int probe_one_instance(unsigned int nid) if (!pvt) goto err_settings; + if (nid >= NONCPU_NODE_INDEX) + pvt->is_noncpu = true;This is silly and error-prone. Proper detection should happen in per_family_init() and there you should read out from the hardware whether this is a GPU or a CPU node. Then, you should put an enum type in amd64_family_type which has { FAM_TYPE_CPU, FAM_TYPE_GPU, ... } etc and the places where you need to check whether it is CPU or a GPU, test those types.
This is a good idea. But we have a global *fam_type, so this should be moved into struct amd64_pvt, if possible. Then each node can have its own fam_type. ..
quoted
@@ -389,6 +392,9 @@ struct amd64_pvt { enum mem_type dram_type; struct amd64_umc *umc; /* UMC registers */ + char buf[20];A 20 char buffer in every pvt structure just so that you can sprintf into it when it is a GPU? Err, I don't think so. You can do the same thing as with the CPUs - the same string for every pvt instance.
Fair point. I like the idea of having unique names though. Is this possible with the current EDAC framework? Or is it not worth it? Thanks, Yazen