Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver
From: Ganapatrao Kulkarni <hidden>
Date: 2018-05-15 10:33:24
Also in:
linux-arm-kernel, lkml
Hi Mark, On Sat, May 5, 2018 at 12:16 AM, Ganapatrao Kulkarni [off-list ref] wrote:
Hi Mark, On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland [off-list ref] wrote:quoted
Hi, On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:quoted
+ +/* L3c and DMC has 16 and 8 channels per socket respectively. + * Each Channel supports UNCORE PMU device and consists of + * 4 independent programmable counters. Counters are 32 bit + * and does not support overflow interrupt, they needs to be + * sampled before overflow(i.e, at every 2 seconds). + */ + +#define UNCORE_MAX_COUNTERS 4 +#define UNCORE_L3_MAX_TILES 16 +#define UNCORE_DMC_MAX_CHANNELS 8 + +#define UNCORE_HRTIMER_INTERVAL (2 * NSEC_PER_SEC)How was a period of two seconds chosen?It has been suggested from hw team to sample before 3 to 4 seconds.quoted
What's the maximum clock speed for the L3C and DMC?L3C at 1.5GHz and DMC at 1.2GHzquoted
Given that all channels compete for access to the muxed register interface, I suspect we need to try more often than once every 2 seconds...2 seconds seems to be sufficient. So far testing looks good.quoted
[...]quoted
+struct active_timer { + struct perf_event *event; + struct hrtimer hrtimer; +}; + +/* + * pmu on each socket has 2 uncore devices(dmc and l3), + * each uncore device has up to 16 channels, each channel can sample + * events independently with counters up to 4. + * + * struct thunderx2_pmu_uncore_channel created per channel. + * struct thunderx2_pmu_uncore_dev per uncore device. + */ +struct thunderx2_pmu_uncore_channel { + struct thunderx2_pmu_uncore_dev *uncore_dev; + struct pmu pmu;Can we put the pmu first in the struct, please?okquoted
quoted
+ int counter;AFAICT, this counter field is never used.hmm ok, will remove.quoted
quoted
+ int channel; + DECLARE_BITMAP(counter_mask, UNCORE_MAX_COUNTERS); + struct active_timer *active_timers;You should only need a single timer per channel, rather than one per event. I think you can get rid of the active_timer structure, and have: struct perf_event *events[UNCORE_MAX_COUNTERS]; struct hrtimer timer;thanks, will change as suggested.quoted
quoted
+ /* to sync counter alloc/release */ + raw_spinlock_t lock; +}; + +struct thunderx2_pmu_uncore_dev { + char *name; + struct device *dev; + enum thunderx2_uncore_type type; + unsigned long base;This should be: void __iomem *base;okquoted
[...]quoted
+static ssize_t cpumask_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct cpumask cpu_mask; + struct thunderx2_pmu_uncore_channel *pmu_uncore = + pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev)); + + /* Pick first online cpu from the node */ + cpumask_clear(&cpu_mask); + cpumask_set_cpu(cpumask_first( + cpumask_of_node(pmu_uncore->uncore_dev->node)), + &cpu_mask); + + return cpumap_print_to_pagebuf(true, buf, &cpu_mask); +}AFAICT cpumask_of_node() returns a mask that can include offline CPUs. Regardless, I don't think that you can keep track of the management CPU this way. Please keep track of the CPU the PMU should be managed by, either with a cpumask or int embedded within the PMU structure.thanks, will add hotplug callbacks.quoted
At hotplug time, you'll need to update the management CPU, calling perf_pmu_migrate_context() when it is offlined.okquoted
[...]quoted
+static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore) +{ + int counter; + + raw_spin_lock(&pmu_uncore->lock); + counter = find_first_zero_bit(pmu_uncore->counter_mask, + pmu_uncore->uncore_dev->max_counters); + if (counter == pmu_uncore->uncore_dev->max_counters) { + raw_spin_unlock(&pmu_uncore->lock); + return -ENOSPC; + } + set_bit(counter, pmu_uncore->counter_mask); + raw_spin_unlock(&pmu_uncore->lock); + return counter; +} + +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore, + int counter) +{ + raw_spin_lock(&pmu_uncore->lock); + clear_bit(counter, pmu_uncore->counter_mask); + raw_spin_unlock(&pmu_uncore->lock); +}I don't believe that locking is required in either of these, as the perf core serializes pmu::add() and pmu::del(), where these get called.
without this locking, i am seeing "BUG: scheduling while atomic" when i run perf with more events together than the maximum counters supported
thanks, it seems to be not required.quoted
quoted
+ +/* + * DMC and L3 counter interface is muxed across all channels. + * hence we need to select the channel before accessing counter + * data/control registers.Are there separate interfaces for all-dmc-channels and all-l3c-channels?separate interface for DMC and L3c. channels within DMC/L3C are muxed.quoted
... or is a single interface used by all-dmc-and-l3c-channels?quoted
+ * + * L3 Tile and DMC channel selection is through SMC call + * SMC call arguments, + * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id) + * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel) + * x2 = Node idHow do we map Linux node IDs to the firmware's view of node IDs? I don't believe the two are necessarily the same -- Linux's node IDs are a Linux-specific construct.both are same, it is numa node id from ACPI/firmware.quoted
It would be much nicer if we could pass something based on the MPIDR, which is a known HW construct, or if this implicitly affected the current node.IMO, node id is sufficient.quoted
It would be vastly more sane for this to not be muxed at all. :/i am helpless due to crappy hw design!quoted
quoted
+ * x3 = DMC(1)/L3C(0) + * x4 = channel Id + */ +static void uncore_select_channel(struct perf_event *event) +{ + struct arm_smccc_res res; + struct thunderx2_pmu_uncore_channel *pmu_uncore = + pmu_to_thunderx2_pmu_uncore(event->pmu); + + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL, + pmu_uncore->uncore_dev->node, + pmu_uncore->uncore_dev->type, + pmu_uncore->channel, 0, 0, 0, &res); + +}Why aren't we checking the return value of the SMC call?thanks, i will add.quoted
The muxing and SMC sound quite scary. :/i too agree!, however i have no other option since the muxing register is a secure register.quoted
quoted
+ +static void uncore_start_event_l3c(struct perf_event *event, int flags) +{ + u32 val; + struct hw_perf_event *hwc = &event->hw; + + /* event id encoded in bits [07:03] */ + val = GET_EVENTID(event) << 3; + reg_writel(val, hwc->config_base); + + if (flags & PERF_EF_RELOAD) { + u64 prev_raw_count = + local64_read(&event->hw.prev_count); + reg_writel(prev_raw_count, hwc->event_base); + } + local64_set(&event->hw.prev_count, + reg_readl(hwc->event_base));It would be simpler to ignore PERF_EF_RELOAD, and always reprogram the prev_count and HW to zero.ok, will changequoted
quoted
+} + +static void uncore_start_event_dmc(struct perf_event *event, int flags) +{ + u32 val, event_shift = 8; + struct hw_perf_event *hwc = &event->hw; + + /* enable and start counters and read current value in prev_count */ + val = reg_readl(hwc->config_base); + + /* 8 bits for each counter, + * bits [05:01] of a counter to set event type. + */ + reg_writel((val & ~(0x1f << (((GET_COUNTERID(event)) * + event_shift) + 1))) | + (GET_EVENTID(event) << + (((GET_COUNTERID(event)) * event_shift) + 1)), + hwc->config_base);This would be far more legible if val were constructed before the reg_writel(), especially if there were a helper for the event field shifting, e.g. #define DMC_EVENT_CFG(idx, val) ((val) << (((idx) * 8) + 1)) int id = GET_COUNTERID(event); int event = GET_EVENTID(event); val = reg_readl(hwc->config_base); val &= ~DMC_EVENT_CFG(id, 0x1f); val |= DMCC_EVENT_CFG(id, event); reg_writel(val, hwc->config_base)); What are bits 7:6 and 1 used for?not used/reserved bits.quoted
quoted
+ + if (flags & PERF_EF_RELOAD) { + u64 prev_raw_count = + local64_read(&event->hw.prev_count); + reg_writel(prev_raw_count, hwc->event_base); + } + local64_set(&event->hw.prev_count, + reg_readl(hwc->event_base));As with the L3C events, it would be simpler to always reprogram the prev_count and HW to zero.okquoted
quoted
+} + +static void uncore_stop_event_l3c(struct perf_event *event) +{ + reg_writel(0, event->hw.config_base); +} + +static void uncore_stop_event_dmc(struct perf_event *event) +{ + u32 val, event_shift = 8; + struct hw_perf_event *hwc = &event->hw; + + val = reg_readl(hwc->config_base); + reg_writel((val & ~(0xff << ((GET_COUNTERID(event)) * event_shift))), + hwc->config_base);This could be simplified with the helper proposed above.okquoted
quoted
+} + +static void init_cntr_base_l3c(struct perf_event *event, + struct thunderx2_pmu_uncore_dev *uncore_dev) +{ + + struct hw_perf_event *hwc = &event->hw; + + /* counter ctrl/data reg offset at 8 */Offset 8, or stride 8?stride 8quoted
What does the register layout look like?quoted
+ hwc->config_base = uncore_dev->base + + L3C_COUNTER_CTL + (8 * GET_COUNTERID(event)); + hwc->event_base = uncore_dev->base + + L3C_COUNTER_DATA + (8 * GET_COUNTERID(event)); +} + +static void init_cntr_base_dmc(struct perf_event *event, + struct thunderx2_pmu_uncore_dev *uncore_dev) +{ + + struct hw_perf_event *hwc = &event->hw; + + hwc->config_base = uncore_dev->base + + DMC_COUNTER_CTL; + /* counter data reg offset at 0xc */A stride of 0xc seems unusual. What does the register layout look like?the offsets are at 0x640, 0x64c, 0x658, 0x664,quoted
quoted
+ hwc->event_base = uncore_dev->base + + DMC_COUNTER_DATA + (0xc * GET_COUNTERID(event)); +} + +static void thunderx2_uncore_update(struct perf_event *event) +{ + s64 prev, new = 0; + u64 delta; + struct hw_perf_event *hwc = &event->hw; + struct thunderx2_pmu_uncore_channel *pmu_uncore; + enum thunderx2_uncore_type type; + + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu); + type = pmu_uncore->uncore_dev->type;AFAICT this variable is not used below.thanks.quoted
quoted
+ + if (pmu_uncore->uncore_dev->select_channel) + pmu_uncore->uncore_dev->select_channel(event);This should always be non-NULL, right?yes, unwanted check at the moment, will remove.quoted
[...]quoted
+static bool thunderx2_uncore_validate_event_group(struct perf_event *event) +{ + struct pmu *pmu = event->pmu; + struct perf_event *leader = event->group_leader; + struct perf_event *sibling; + int counters = 0; + + if (leader->pmu != event->pmu && !is_software_event(leader)) + return false; + + counters++;I don't think this is right when event != leader and the leader is a SW event. In that case, the leader doesn't take a HW counter.I think this check to avoid the grouping of multiple hw PMUs , however it is allowed to group sw events along with hw PMUs.quoted
quoted
+ + for_each_sibling_event(sibling, event->group_leader) { + if (is_software_event(sibling)) + continue; + if (sibling->pmu != pmu) + return false; + counters++; + } + + /* + * If the group requires more counters than the HW has, + * it cannot ever be scheduled. + */ + return counters <= UNCORE_MAX_COUNTERS; +}[...]quoted
+static int thunderx2_uncore_event_init(struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + struct thunderx2_pmu_uncore_channel *pmu_uncore;quoted
+ pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu); + + if (!pmu_uncore) + return -ENODEV;This cannot happen, given pmu_to_thunderx2_pmu_uncore() is a wrapper around container_of().thanks, unnecessary check!quoted
quoted
+ + /* Pick first online cpu from the node */ + event->cpu = cpumask_first( + cpumask_of_node(pmu_uncore->uncore_dev->node));I don't believe this is safe. You must keep track of which CPU is managing the PMU, with hotplug callbacks.ok, will add callbacks.quoted
[...]quoted
+static void thunderx2_uncore_start(struct perf_event *event, int flags) +{ + struct hw_perf_event *hwc = &event->hw; + struct thunderx2_pmu_uncore_channel *pmu_uncore; + struct thunderx2_pmu_uncore_dev *uncore_dev; + unsigned long irqflags; + struct active_timer *active_timer; + + hwc->state = 0; + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu); + uncore_dev = pmu_uncore->uncore_dev; + + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags); + + if (uncore_dev->select_channel) + uncore_dev->select_channel(event); + uncore_dev->start_event(event, flags); + raw_spin_unlock_irqrestore(&uncore_dev->lock, irqflags); + + perf_event_update_userpage(event); + active_timer = &pmu_uncore->active_timers[GET_COUNTERID(event)]; + active_timer->event = event; + + hrtimer_start(&active_timer->hrtimer, + ns_to_ktime(uncore_dev->hrtimer_interval), + HRTIMER_MODE_REL_PINNED); +}Please use a single hrtimer, and update *all* of the events when it fires.surequoted
I *think* that can be done in the pmu::pmu_enable() and pmu::pmu_disable() callbacks.okquoted
Are there control bits to enable/disable all counters, or can that only be done through the event configuration registers?only through config register.quoted
quoted
+static void thunderx2_uncore_stop(struct perf_event *event, int flags) +{ + struct hw_perf_event *hwc = &event->hw; + struct thunderx2_pmu_uncore_channel *pmu_uncore; + struct thunderx2_pmu_uncore_dev *uncore_dev; + unsigned long irqflags; + + if (hwc->state & PERF_HES_UPTODATE) + return; + + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu); + uncore_dev = pmu_uncore->uncore_dev; + + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags); + + if (uncore_dev->select_channel) + uncore_dev->select_channel(event);AFAICT this cannot be NULL.ok.quoted
[...]quoted
+static int thunderx2_pmu_uncore_register( + struct thunderx2_pmu_uncore_channel *pmu_uncore) +{ + struct device *dev = pmu_uncore->uncore_dev->dev; + char *name = pmu_uncore->uncore_dev->name; + int channel = pmu_uncore->channel; + + /* Perf event registration */ + pmu_uncore->pmu = (struct pmu) { + .attr_groups = pmu_uncore->uncore_dev->attr_groups, + .task_ctx_nr = perf_invalid_context, + .event_init = thunderx2_uncore_event_init, + .add = thunderx2_uncore_add, + .del = thunderx2_uncore_del, + .start = thunderx2_uncore_start, + .stop = thunderx2_uncore_stop, + .read = thunderx2_uncore_read, + }; + + pmu_uncore->pmu.name = devm_kasprintf(dev, GFP_KERNEL, + "%s_%d", name, channel);Does the channel idx take the NUMA node into account?name already has node id suffix.quoted
[...]quoted
+static int thunderx2_pmu_uncore_add(struct thunderx2_pmu_uncore_dev *uncore_dev, + int channel) +{quoted
+ /* we can run up to (max_counters * max_channels) events + * simultaneously, allocate hrtimers per channel. + */ + pmu_uncore->active_timers = devm_kzalloc(uncore_dev->dev, + sizeof(struct active_timer) * uncore_dev->max_counters, + GFP_KERNEL);Please just fold a single hrtimer into the thunderx2_pmu_uncore_channel structure, and you can get rid of this allocation...sure, i will rewrite code to have timer per channel and all active counters are updated when timer fires.quoted
quoted
+ + for (counter = 0; counter < uncore_dev->max_counters; counter++) { + hrtimer_init(&pmu_uncore->active_timers[counter].hrtimer, + CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + pmu_uncore->active_timers[counter].hrtimer.function = + thunderx2_uncore_hrtimer_callback; + }... and simplify this initialization.okquoted
[...]quoted
+static struct thunderx2_pmu_uncore_dev *init_pmu_uncore_dev( + struct device *dev, acpi_handle handle, + struct acpi_device *adev, u32 type) +{ + struct thunderx2_pmu_uncore_dev *uncore_dev; + unsigned long base;quoted
+ base = (unsigned long)devm_ioremap_resource(dev, &res); + if (IS_ERR((void *)base)) { + dev_err(dev, "PMU type %d: Fail to map resource\n", type); + return NULL; + }Please treat this as a void __iomem *base.okquoted
[...]quoted
+static int thunderx2_uncore_probe(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct arm_smccc_res res; + + set_dev_node(dev, acpi_get_node(ACPI_HANDLE(dev))); + + /* Make sure firmware supports DMC/L3C set channel smc call */ + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL, + dev_to_node(dev), 0, 0, 0, 0, 0, &res); + if (res.a0) { + dev_err(dev, "No Firmware support for PMU UNCORE(%d)\n", + dev_to_node(dev)); + return -ENODEV; + }Please re-use the uncore_select_channel() wrapper rather than open-coding this.ok.quoted
Which FW supports this interface?it is through vendor specific ATF runtime services.quoted
Thanks, Mark.thanks, Ganapat
thanks Ganapat -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html