Thread (67 messages) 67 messages, 6 authors, 2021-10-21

Re: [PATCH v3 3/3] EDAC/amd64: Enumerate memory on noncpu nodes

From: Yazen Ghannam <yazen.ghannam@amd.com>
Date: 2021-09-01 18:42:39
Also in: lkml

On Fri, Aug 27, 2021 at 01:30:57PM +0200, Borislav Petkov wrote:
On Tue, Aug 24, 2021 at 12:24:37AM +0530, Naveen Krishna Chatradhi wrote:
...
quoted
@@ -1335,6 +1392,11 @@ static void determine_memory_type(struct amd64_pvt *pvt)
 	u32 dram_ctrl, dcsm;
 
 	if (pvt->umc) {
+		if (pvt->is_noncpu) {
+			pvt->dram_type = MEM_HBM2;
+			return;
+		}
I don't like this sprinkling of "if (pvt->is_noncpu)" everywhere,
at all. Please define a separate read_mc_regs_df() or so which
contains only the needed functionalty which you can carve out from
read_mc_regs().
I like this idea.
 
quoted
+
 		if ((pvt->umc[0].dimm_cfg | pvt->umc[1].dimm_cfg) & BIT(5))
 			pvt->dram_type = MEM_LRDDR4;
 		else if ((pvt->umc[0].dimm_cfg | pvt->umc[1].dimm_cfg) & BIT(4))
@@ -1724,7 +1786,10 @@ static int f17_early_channel_count(struct amd64_pvt *pvt)
 
 	/* SDP Control bit 31 (SdpInit) is clear for unused UMC channels */
 	for_each_umc(i)
-		channels += !!(pvt->umc[i].sdp_ctrl & UMC_SDP_INIT);
+		if (pvt->is_noncpu)
+			channels += pvt->csels[i].b_cnt;
+		else
+			channels += !!(pvt->umc[i].sdp_ctrl & UMC_SDP_INIT);
 
 	amd64_info("MCT channel count: %d\n", channels);
 
No, a separate gpu_early_channel_count() is needed here. There's a
reason for those function pointers getting assigned depending on family.
Good point.
 
...
quoted
+/*
+ * The CPUs have one channel per UMC, So a UMC number is equivalent to a
+ * channel number. The NONCPUs have 8 channels per UMC, so the UMC number no
+ * longer works as a channel number.
+ * The channel number within a NONCPU UMC is given in MCA_IPID[15:12].
+ * However, the IDs are split such that two UMC values go to one UMC, and
+ * the channel numbers are split in two groups of four.
+ *
+ * Refer comment on get_noncpu_umc_base() from amd64_edac.h
+ *
+ * For example,
+ * UMC0 CH[3:0] = 0x0005[3:0]000
+ * UMC0 CH[7:4] = 0x0015[3:0]000
+ * UMC1 CH[3:0] = 0x0025[3:0]000
+ * UMC1 CH[7:4] = 0x0035[3:0]000
+ */
+static int find_umc_channel_noncpu(struct mce *m)
+{
+	u8 umc = find_umc_channel(m);
+	u8 ch = ((m->ipid >> 12) & 0xf);
+
+	return umc % 2 ? (ch + 4) : ch;
+}
+
 static void decode_umc_error(int node_id, struct mce *m)
 {
 	u8 ecc_type = (m->status >> 45) & 0x3;
@@ -2897,6 +3003,7 @@ static void decode_umc_error(int node_id, struct mce *m)
 	struct amd64_pvt *pvt;
 	struct err_info err;
 	u64 sys_addr;
+	u8 df_inst_id;
You don't need that variable and can work with err.channel just fine.
quoted
 	mci = edac_mc_find(node_id);
 	if (!mci)
@@ -2909,7 +3016,22 @@ static void decode_umc_error(int node_id, struct mce *m)
 	if (m->status & MCI_STATUS_DEFERRED)
 		ecc_type = 3;
 
-	err.channel = find_umc_channel(m);
+	if (pvt->is_noncpu) {
+		/*
+		 * The NONCPUs have one Chip Select per UMC, so the UMC number
+		 * can used as the Chip Select number. However, the UMC number
+		 * is split in the ID value so it's necessary to divide by 2.
+		 */
+		err.csrow = find_umc_channel(m) / 2;
+		err.channel = find_umc_channel_noncpu(m);
+		/* On NONCPUs, instance id is calculated as below. */
+		df_inst_id = err.csrow * 8 + err.channel;
		err.channel += err.csrow * 8;

tadaaa!
err.channel still needs to be used in error_address_to_page_and_offset()
below. So changing it here messes up what's reported to EDAC.
 
...
quoted
@@ -3804,6 +3963,9 @@ static int probe_one_instance(unsigned int nid)
 	struct ecc_settings *s;
 	int ret;
 
+	if (!F3)
+		return 0;
+
 	ret = -ENOMEM;
 	s = kzalloc(sizeof(struct ecc_settings), GFP_KERNEL);
 	if (!s)
@@ -3815,6 +3977,9 @@ static int probe_one_instance(unsigned int nid)
 	if (!pvt)
 		goto err_settings;
 
+	if (nid >= NONCPU_NODE_INDEX)
+		pvt->is_noncpu = true;
This is silly and error-prone. Proper detection should happen in
per_family_init() and there you should read out from the hardware
whether this is a GPU or a CPU node.

Then, you should put an enum type in amd64_family_type which has

 { FAM_TYPE_CPU, FAM_TYPE_GPU, ... }

etc and the places where you need to check whether it is CPU or a GPU,
test those types.
This is a good idea. But we have a global *fam_type, so this should be moved
into struct amd64_pvt, if possible. Then each node can have its own fam_type.

..
quoted
@@ -389,6 +392,9 @@ struct amd64_pvt {
 	enum mem_type dram_type;
 
 	struct amd64_umc *umc;	/* UMC registers */
+	char buf[20];
A 20 char buffer in every pvt structure just so that you can sprintf
into it when it is a GPU? Err, I don't think so.

You can do the same thing as with the CPUs - the same string for every
pvt instance.
Fair point. I like the idea of having unique names though. Is this possible
with the current EDAC framework? Or is it not worth it?

Thanks,
Yazen
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help