Re: [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches
From: Sai Prakash Ranjan <hidden>
Date: 2020-01-13 05:44:46
Also in:
linux-arm-kernel, linux-arm-msm, linux-edac, lkml
Hi Boris, Thanks for the review comments. On 2019-12-30 17:20, Borislav Petkov wrote:
On Thu, Dec 05, 2019 at 09:53:18AM +0000, Sai Prakash Ranjan wrote:quoted
Kryo{3,4}XX CPU cores implement RAS extensions to support Error Correcting Code(ECC). Currently all Kryo{3,4}XX CPU cores (gold/silver a.k.a big/LITTLE) support ECC via RAS.via RAS what? ARM64_RAS_EXTN? In any case, this needs James to look at and especially if there's some ARM-generic functionality in there which should be shared, of course.
Yes it is ARM64_RAS_EXTN and I have been hoping if James can provide the feedback, it has been some time now since I posted this out.
quoted
This adds an interrupt based driver for those CPUs ands/This adds/Add/
Will correct.
quoted
+ +config EDAC_QCOM_KRYO_POLL + depends on EDAC_QCOM_KRYO + bool "Poll on Kryo ECC registers" + help + This option chooses whether or not you want to poll on the Kryo ECC + registers. When this is enabled, the polling rate can be set as a + module parameter. By default, it will call the polling function every + second.Why is this a separate option and why should people use that? Can the polling/irq be switched automatically?
No it cannot be switched automatically. It is used in case some SoCs do not support an irq based mechanism for EDAC. But I am contradicting myself because I am telling that atleast one interrupt should be specified in bindings, so it is best if I drop this polling option for now.
quoted
+ config EDAC_ASPEED tristate "Aspeed AST 2500 SoC" depends on MACH_ASPEED_G5diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile index d77200c9680b..29edcfa6ec0e 100644 --- a/drivers/edac/Makefile +++ b/drivers/edac/Makefile@@ -85,5 +85,6 @@ obj-$(CONFIG_EDAC_SYNOPSYS) += synopsys_edac.o obj-$(CONFIG_EDAC_XGENE) += xgene_edac.o obj-$(CONFIG_EDAC_TI) += ti_edac.o obj-$(CONFIG_EDAC_QCOM) += qcom_edac.o +obj-$(CONFIG_EDAC_QCOM_KRYO) += qcom_kryo_edac.oWhat is the difference between this new driver and the qcom_edac one? Can functionality be shared? Should this new one be called simply kryo_edac instead?
qcom_edac driver is for QCOM system cache(last level cache), it should be renamed to qcom_llcc_edac.c. This new driver is for QCOM Kryo CPU core caches(L1,L2,L3). Functionality cannot be shared as these two are different IP blocks and best kept separate.
quoted
+ +#define DRV_NAME "qcom_kryo_edac" + +/* + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.3Chapter? Where? URL?
I chose this because these TRMs are openly available and if you search for these above terms like "Cortex-A76 TRM Chapter B3.3" in google, then the first search result will be the TRM pdf, otherwise I would have to specify the long URL for the pdf and we do not know how long that URL link will be active.
quoted
+ +static const struct error_type err_type[] = { + { edac_device_handle_ce, "Kryo L1 Corrected Error" }, + { edac_device_handle_ue, "Kryo L1 Uncorrected Error" }, + { edac_device_handle_ue, "Kryo L1 Deferred Error" }, + { edac_device_handle_ce, "Kryo L2 Corrected Error" }, + { edac_device_handle_ue, "Kryo L2 Uncorrected Error" }, + { edac_device_handle_ue, "Kryo L2 Deferred Error" }, + { edac_device_handle_ce, "L3 Corrected Error" }, + { edac_device_handle_ue, "L3 Uncorrected Error" }, + { edac_device_handle_ue, "L3 Deferred Error" }, +}; +All that is not really needed - you can put the whole error type detection and dumping in kryo_check_err_type() in nicely readable switch-case statement. No need for the function pointers and special structs.
How is this not easily readable? If I put this in kryo_check_err_type, then there will be nested switch which I think is not so great in terms of readability since it will not fit in one screen and is just more lines of code.
quoted
+static struct edac_device_ctl_info __percpu *edac_dev; +static struct edac_device_ctl_info *drv_edev_ctl; + +static const char *get_error_msg(u64 errxstatus) +{ + const struct error_record *rec; + u32 errxstatus_serr; + + errxstatus_serr = FIELD_GET(KRYO_ERRXSTATUS_SERR, errxstatus); + + for (rec = serror_record; rec->error_code; rec++) { + if (errxstatus_serr == rec->error_code) + return rec->error_msg; + } + + return NULL; +} + +static void dump_syndrome_reg(int error_type, int level, + u64 errxstatus, u64 errxmisc, + struct edac_device_ctl_info *edev_ctl) +{ + char msg[KRYO_EDAC_MSG_MAX]; + const char *error_msg; + int cpu; + + cpu = raw_smp_processor_id();Why raw_?
Because we will be calling smp_processor_id in preemptible context and if we enable CONFIG_DEBUG_PREEMPT, we would get a nice backtrace. [ 3.747468] BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1 [ 3.755527] caller is qcom_kryo_edac_probe+0x138/0x2b8 [ 3.760819] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G S 5.4.0-rc7-next-20191113-00009-g8666855d6a5b-dirty #107 [ 3.772323] Hardware name: Qualcomm Technologies, Inc. SM8150 MTP (DT) [ 3.779030] Call trace: [ 3.781556] dump_backtrace+0x0/0x158 [ 3.785331] show_stack+0x14/0x20 [ 3.788741] dump_stack+0xb0/0xf4 [ 3.792164] debug_smp_processor_id+0xd8/0xe0 [ 3.796639] qcom_kryo_edac_probe+0x138/0x2b8 [ 3.801116] platform_drv_probe+0x50/0xa8 [ 3.805236] really_probe+0x108/0x360 [ 3.808999] driver_probe_device+0x58/0x100 [ 3.813304] device_driver_attach+0x6c/0x78 [ 3.817606] __driver_attach+0xb0/0xf0 [ 3.821459] bus_for_each_dev+0x68/0xc8 [ 3.825407] driver_attach+0x20/0x28 [ 3.829083] bus_add_driver+0x160/0x1f0 [ 3.833030] driver_register+0x60/0x110 [ 3.836976] __platform_driver_register+0x40/0x48 [ 3.841813] qcom_kryo_edac_driver_init+0x18/0x20 [ 3.846645] do_one_initcall+0x58/0x1a0 [ 3.850596] kernel_init_freeable+0x19c/0x240 [ 3.855075] kernel_init+0x10/0x108 [ 3.858665] ret_from_fork+0x10/0x1c
quoted
+static int kryo_l1_l2_setup_irq(struct platform_device *pdev, + struct edac_device_ctl_info *edev_ctl) +{ + int cpu, errirq, faultirq, ret; + + edac_dev = devm_alloc_percpu(&pdev->dev, *edac_dev); + if (!edac_dev) + return -ENOMEM; + + for_each_possible_cpu(cpu) { + preempt_disable(); + per_cpu(edac_dev, cpu) = edev_ctl; + preempt_enable(); + }That sillyness doesn't belong here, if at all.
Sorry but I do not understand the sillyness here. Could you please explain?
...quoted
+static void kryo_poll_cache_error(struct edac_device_ctl_info *edev_ctl) +{ + if (!edev_ctl) + edev_ctl = drv_edev_ctl;That's silly.
Actually its not silly. In case, polling is enabled and on PM exit edev_ctl could be NULL.
quoted
+ + on_each_cpu(kryo_check_l1_l2_ecc, edev_ctl, 1); + kryo_check_l3_scu_ecc(edev_ctl); +}...quoted
+static int qcom_kryo_edac_probe(struct platform_device *pdev) +{ + struct edac_device_ctl_info *edev_ctl; + struct device *dev = &pdev->dev; + int ret; + + qcom_kryo_edac_setup();This function needs to have a return value saying whether it did setup the hw properly or not and the probe function needs to return here if not.
Ok will add a return check. Thanks, Sai -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation