RE: [EXT] Re: [PATCH v6 1/2] drivers: perf: Add LLC-TAD perf counter support

From: Bhaskara Budiredla <hidden>
Date: 2021-10-26 18:04:42
Also in: linux-arm-kernel, lkml

-----Original Message-----
From: Will Deacon <will@kernel.org>
Sent: Tuesday, October 26, 2021 3:14 PM
To: Bhaskara Budiredla <redacted>
Cc: mark.rutland@arm.com; robh+dt@kernel.org; Bharat Bhushan
[off-list ref]; Sunil Kovvuri Goutham
[off-list ref]; linux-arm-kernel@lists.infradead.org;
devicetree@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: [EXT] Re: [PATCH v6 1/2] drivers: perf: Add LLC-TAD perf counter
support

External Email

----------------------------------------------------------------------
On Mon, Oct 18, 2021 at 09:00:56PM +0530, Bhaskara Budiredla wrote:

quoted

This driver adds support for Last-level cache tag-and-data unit
(LLC-TAD) PMU that is featured in some of the Marvell's CN10K
infrastructure silicons.

The LLC is divided into 2N slices distributed across N Mesh tiles in a
single-socket configuration. The driver always configures the same
counter for all of the TADs. The user would end up effectively
reserving one of eight counters in every TAD to look across all TADs.
The occurrences of events are aggregated and presented to the user at
the end of an application run. The driver does not provide a way for
the user to partition TADs so that different TADs are used for
different applications.

Is that something you will want to do in the future? If you go with your current
approach of exposing a single "tad" unit to userspace, then you won't be able
to change that.

There is no intension to do partitioning of the TADs. I have thrown some light on it as it is a point to be stressed upon.

For the L3 PMUs (including on TX2). we expose per-node PMUs so why
shouldn't we do something similar here and expose each TAD region
separately? Even if userspace drives them all together, it gives you more
flexibility in the future if you _do_ want to be partition them up.

Marvell has no plans of providing per-node pmu statistics on CN10k platforms.

quoted

diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile index

5260b116c7da..2db5418d5b0a 100644

--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile

@@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) +=

thunderx2_pmu.o

quoted

 obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
 obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
 obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
+obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) +=

marvell_cn10k_tad_pmu.o

quoted

diff --git a/drivers/perf/marvell_cn10k_tad_pmu.c

b/drivers/perf/marvell_cn10k_tad_pmu.c
new file mode 100644
index 000000000000..aebb1a0028dc

--- /dev/null
+++ b/drivers/perf/marvell_cn10k_tad_pmu.c

@@ -0,0 +1,429 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Marvell CN10K LLC-TAD perf driver
+ *
+ * Copyright (C) 2021 Marvell
+ *
+ * This program is free software; you can redistribute it and/or
+modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#define pr_fmt(fmt) "tad_pmu: " fmt
+
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/of_address.h>
+#include <linux/of_device.h>
+#include <linux/cpuhotplug.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+
+#define TAD_PFC_OFFSET		0x0
+#define TAD_PFC(counter)	(TAD_PFC_OFFSET | (counter << 3))
+#define TAD_PRF_OFFSET		0x100
+#define TAD_PRF(counter)	(TAD_PRF_OFFSET | (counter << 3))
+#define TAD_PRF_CNTSEL_MASK	0xFF
+#define TAD_MAX_COUNTERS	8
+
+#define to_tad_pmu(p) (container_of(p, struct tad_pmu, pmu))
+
+struct tad_region {
+	void __iomem	*base;
+};
+
+struct tad_pmu {
+	struct pmu pmu;
+	struct tad_region *regions;
+	u32 region_cnt;
+	unsigned int cpu;
+	struct hlist_node node;
+	struct perf_event *events[TAD_MAX_COUNTERS];
+	DECLARE_BITMAP(counters_map, TAD_MAX_COUNTERS); };
+
+static int tad_pmu_cpuhp_state;
+
+static void tad_pmu_event_counter_read(struct perf_event *event) {
+	struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	u32 counter_idx = hwc->idx;
+	u64 delta, prev, new;
+	int i;
+
+	do {
+		prev = local64_read(&hwc->prev_count);
+		for (i = 0, new = 0; i < tad_pmu->region_cnt; i++)
+			new += readq(tad_pmu->regions[i].base +
+				     TAD_PFC(counter_idx));
+	} while (local64_cmpxchg(&hwc->prev_count, prev, new) != prev);

If we expose each TAD individually, then this won't matter, but I'd be inclined
to move the counter summation outside of the cmpxchg() loop given that
readq (why not _relaxed?) is probably quite slow.

As clarified above partitioning of TADs is ruled out and situation of exposing individual TADs inappropriate.

quoted

+
+	delta = (new - prev) & GENMASK_ULL(63, 0);

This mask doesn't do anything.

Agreed, and will delete it.

quoted

+	local64_add(delta, &event->count);
+}
+
+static void tad_pmu_event_counter_stop(struct perf_event *event, int
+flags) {
+	struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	u32 counter_idx = hwc->idx;
+	int i;
+
+	/* TAD()_PFC() stop counting on the write
+	 * which sets TAD()_PRF()[CNTSEL] == 0
+	 */
+	for (i = 0; i < tad_pmu->region_cnt; i++)
+		writeq_relaxed(0, tad_pmu->regions[i].base +
+			       TAD_PRF(counter_idx));

Please use braces around a multi-line conditional statement.

Taken.

Will


Thanks,
Bhaskara

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help