Re: [RFC PATCH 1/1] psi: Introduce in-kernel PSI auto monitor feature
From: Pintu Kumar Agarwal <hidden>
Date: 2026-07-03 15:32:55
Also in:
lkml
Hi Prateek,

Thank you so much for your review feedback and comments.
Please find my response below.

On Fri, Jul 3, 2026 at 1:21 AM K Prateek Nayak [off-list ref] wrote:
> Hello Pintu,
>
> On 7/2/2026 10:46 PM, Pintu Kumar Agarwal wrote:
> > diff --git a/kernel/sched/build_utility.c b/kernel/sched/build_utility.c
> > index e2cf3b08d4e9..30e9800ce947 100644
> > --- a/kernel/sched/build_utility.c
> > +++ b/kernel/sched/build_utility.c
> > @@ -104,3 +104,7 @@
> >  #ifdef CONFIG_SCHED_AUTOGROUP
> >  # include "autogroup.c"
> >  #endif
> > +
> > +#ifdef CONFIG_PSI_AUTO_MONITOR
> > +# include "psi_monitor.c"
> > +#endif
>
> Isn't this a module? Why is this being included as a scheduler file?
> Based on a quick glance, nothing in this module needs scheduler internal
> APIs (nor should it), so tools/sched/ would probably be a better place
> to put it in if there is interest for this feature.
The scheduler placement was chosen because the feature currently operates on PSI internals and was developed as an extension to kernel/sched/psi.c. I am open to alternative placement if another location is more appropriate.
> > diff --git a/kernel/sched/psi_monitor.c b/kernel/sched/psi_monitor.c
> > new file mode 100644
> > index 000000000000..e929a0c05494
> > --- /dev/null
> > +++ b/kernel/sched/psi_monitor.c
> > @@ -0,0 +1,307 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * PSI Automatic Monitor with Weighted Task Ranking + Tracepoints
> > + *
> > + * Periodically samples system PSI (CPU, memory, IO) and, when any
> > + * configured threshold is exceeded, ranks tasks using a composite
> > + * score based on RSS, I/O activity and CPU time, then logs the
> > + * top-N tasks via printk and a tracepoint.
> > + *
> > + * Sysfs interface:
> > + *   /sys/kernel/psi_monitor/cpu_thresh (percentage)
> > + *   /sys/kernel/psi_monitor/mem_thresh (percentage)
> > + *   /sys/kernel/psi_monitor/io_thresh (percentage)
> > + *   /sys/kernel/psi_monitor/monitor_interval_ms (milliseconds)
> > + *   /sys/kernel/psi_monitor/rss_weight
> > + *   /sys/kernel/psi_monitor/io_weight
> > + *   /sys/kernel/psi_monitor/cpu_weight
> > + *
> > + * Author: Pintu Kumar Agarwal
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/sched.h>
> > +#include <linux/sched/signal.h>
> > +#include <linux/sched/loadavg.h>
> > +#include <linux/mm.h>
> > +#include <linux/delay.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/psi_types.h>
> > +#include <linux/kobject.h>
> > +#include <linux/sort.h>
> > +#include <linux/jiffies.h>
> > +#include <linux/time64.h>
> > +#include <linux/sched/cputime.h>
> > +
> > +/* Create tracepoints defined in include/trace/events/psi_monitor.h */
> > +#define CREATE_TRACE_POINTS
> > +#include <linux/psi.h>
> > +#include <trace/events/psi_monitor.h>
> > +
> > +
> > +/* Sysfs tunables */
> > +static unsigned int cpu_thresh = 80; /* in percent */
> > +static unsigned int mem_thresh = 80; /* in percent */
> > +static unsigned int io_thresh = 80; /* in percent */
> > +static unsigned int monitor_interval_ms = 10000;
> > +
> > +/* scoring weights */
> > +static unsigned int rss_weight = 2;
> > +static unsigned int io_weight = 1;
> > +static unsigned int cpu_weight = 5;
>
> Insanely configurable but what makes it easy for developers to know the
> right configurations under severe pressure as you put it?
This is one of the goals of the RFC: to decide which parameters should remain configurable and what their default values should be. The feature is currently at an experimental stage and I am gathering feedback. In my experiments (2 cores, < 1GB RAM), 80% works well as a default: it triggers only under extreme pressure and otherwise stays silent. Users can tune it for their scenario and workload: run scenario => check workload => configure => rerun. The scoring weights are optional and depend on which kind of load we want to prioritize. They are open for discussion and are needed only by the sorting logic.
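To make the effect of the weights concrete, here is a small userspace sketch of the weighted sum that the quoted log_top_tasks() computes. It is not part of the patch: struct psi_weights and psi_score() are hypothetical names used only for illustration, and the sample task numbers in the note below are made up.

```c
#include <stdint.h>

/* Hypothetical container for the three sysfs weights (names mirror the patch). */
struct psi_weights {
	unsigned int rss_weight;
	unsigned int io_weight;
	unsigned int cpu_weight;
};

/*
 * Same linear combination as in the quoted patch: rss is in pages,
 * io_kb in kB and cpu_ms in milliseconds. Each term is widened to
 * 64 bits before multiplying, as the patch does with its u64 casts.
 */
static uint64_t psi_score(const struct psi_weights *w, unsigned long rss,
			  unsigned long io_kb, unsigned long cpu_ms)
{
	return (uint64_t)w->rss_weight * rss +
	       (uint64_t)w->io_weight * io_kb +
	       (uint64_t)w->cpu_weight * cpu_ms;
}
```

With the default weights (2, 1, 5), a task with rss=1000 pages, io=200 kB and cpu=300 ms scores 2*1000 + 1*200 + 5*300 = 3700, while a CPU-heavy task with rss=100, io=0 and cpu=2000 ms scores 10200 and ranks first. Setting cpu_weight to 0 reverses that order (2200 vs 200), which is the kind of prioritization the weights are meant to allow. Note that the three terms use different units, so the weights also act as implicit unit conversions.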
> > +
> > +static struct delayed_work psi_work;
> > +static struct kobject *psi_kobj;
> > +
> > +#define TOP_N 20
> > +
> > +struct task_info {
> > +	struct task_struct *task;
> > +	unsigned long rss; /* pages */
> > +	unsigned long io_kb; /* kB */
> > +	unsigned long cpu_ms; /* ms */
>
> Isn't the suffix self-explanatory? Do you really need the comments?
Oh yes, these can be removed if not needed.
> > +	u64 score;
> > +};
> > +
> > +/*
> > + * psi_avg10_percent() - derive a rough integer percentage from avg10
> > + * for a given PSI state (e.g. PSI_CPU_SOME, PSI_MEM_SOME, PSI_IO_SOME).
> > + *
> > + * psi_group.avg[state][0] is the avg10 window in fixed-point notation.
> > + * The conversion here is approximate but monotonic, which is sufficient
> > + * for thresholding and ranking in this internal monitor.
> > + */
> > +static unsigned long psi_avg10_percent(int state)
> > +{
> > +	u64 avg10;
> > +
> > +	if (state < 0 || state >= NR_PSI_STATES)
> > +		return 0;
> > +
> > +	avg10 = READ_ONCE(psi_system.avg[state][0]);
> > +	if (!avg10)
> > +		return 0;
> > +
> > +	/* Convert back from loadavg-style fixed-point to an approximate % */
> > +	/* Just consider the integer value and ignore fraction */
>
> Why two single line comments?
OK, I will merge them in the next version.
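For readers unfamiliar with the loadavg-style fixed-point format that these two comments describe, the following userspace sketch shows what taking only the integer part means for thresholding. The macros are local copies of the ones in include/linux/sched/loadavg.h; pct_to_fixed() and breaches() are hypothetical helpers added here for illustration and do not exist in the patch or the kernel.

```c
#include <stdint.h>

/* Local copies of the kernel's loadavg fixed-point helpers: 11 fractional bits. */
#define FSHIFT		11
#define FIXED_1		(1 << FSHIFT)
#define LOAD_INT(x)	((x) >> FSHIFT)
#define LOAD_FRAC(x)	LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

/* Hypothetical helper: encode "whole.hundredths" percent as fixed-point. */
static uint64_t pct_to_fixed(unsigned int whole, unsigned int hundredths)
{
	return (uint64_t)whole * FIXED_1 + (uint64_t)hundredths * FIXED_1 / 100;
}

/*
 * Hypothetical helper mirroring the ">= thresh" comparison in the quoted
 * psi_monitor_fn(): only the integer part of avg10 takes part in it.
 */
static int breaches(uint64_t avg10, unsigned int thresh)
{
	return LOAD_INT(avg10) >= thresh;
}
```

Because LOAD_INT() truncates, an avg10 of 79.99% is treated as 79 and does not breach a threshold of 80, while exactly 80.00% does. The conversion is therefore monotonic, as the patch comment states, but it can under-report by almost one percentage point.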
> > +	return LOAD_INT(avg10);
> > +}
> > +
> > +static int compare_score_desc(const void *a, const void *b)
> > +{
> > +	const struct task_info *ta = a;
> > +	const struct task_info *tb = b;
> > +
> > +	if (tb->score > ta->score)
> > +		return 1;
> > +	if (tb->score < ta->score)
> > +		return -1;
> > +	return 0;
> > +}
> > +
> > +static void log_top_tasks(void)
> > +{
> > +	struct task_info tasks[TOP_N];
> > +	struct task_struct *p, *t;
> > +	int count = 0;
> > +	int i;
> > +
> > +	rcu_read_lock();
> > +	for_each_process_thread(p, t) {
>
> That's a ton of work every 10s.
This happens only when the threshold is breached and the system is already under pressure. Based on the feedback we can rate-limit this.
> > +		struct mm_struct *mm;
> > +		unsigned long rss = 0;
> > +		unsigned long io_kb = 0;
> > +		unsigned long cpu_ms = 0;
> > +		u64 score;
> > +
> > +		/* Ignore tasks that are not on run queue or idle */
> > +		if (!t->on_rq && !is_idle_task(t))
>
> Condition doesn't match the comment. Tasks off rq that aren't idle will
> still go through.
Oh yes, good catch. I will fix the comment in the next version.
> > +			continue;
> > +
> > +		mm = get_task_mm(t);
> > +
> > +		/* mm could be NULL for kernel threads */
> > +		if (mm) {
> > +			rss = mm ? get_mm_rss(mm) : 0;
> > +			mmput_async(mm);
> > +		}
> > +
> > +		/*
> > +		 * Approximate I/O activity: sum of read + write bytes.
> > +		 * This uses the task_io_accounting fields in task_struct.
> > +		 * Values are best-effort and need not be perfectly accurate
> > +		 * for our ranking purpose.
> > +		 */
> > +		io_kb = (t->ioac.read_bytes + t->ioac.write_bytes) >> 10;
> > +
> > +		/*
> > +		 * Approximate CPU usage via task_sched_runtime(), converted
> > +		 * to milliseconds. This is cumulative since task start, but
> > +		 * is still useful for comparing hotspots at a given point.
> > +		 */
> > +		cpu_ms = (unsigned long)(task_sched_runtime(t) / NSEC_PER_MSEC);
> > +
> > +		score = (u64)rss_weight * (u64)rss +
> > +			(u64)io_weight * (u64)io_kb +
> > +			(u64)cpu_weight * (u64)cpu_ms;
> > +
> > +		if (count < TOP_N) {
> > +			tasks[count].task = t;
> > +			tasks[count].rss = rss;
> > +			tasks[count].io_kb = io_kb;
> > +			tasks[count].cpu_ms = cpu_ms;
> > +			tasks[count].score = score;
> > +			count++;
> > +		} else {
> > +			/* Maintain a simple streaming top-N: replace smallest */
> > +			int min_idx = 0;
> > +			int j;
> > +
> > +			for (j = 1; j < TOP_N; j++) {
> > +				if (tasks[j].score < tasks[min_idx].score)
> > +					min_idx = j;
> > +			}
>
> Can't you just cache the min_idx and re-compute it when it changes instead
> of taking a O(20) iteration for every task?
OK, I will think about it and come back. IMO the constant value may not affect the order. At the RFC stage I wanted to keep things simple.
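For reference, the caching idea suggested above could look roughly like the following userspace sketch. It is not taken from the patch: struct top_n, find_min() and top_n_offer() are hypothetical names, and only the score is tracked here, whereas the patch also stores the task pointer and per-task stats.

```c
#include <stddef.h>
#include <stdint.h>

#define TOP_N 20

/* Hypothetical top-N buffer that remembers where its smallest entry is. */
struct top_n {
	uint64_t score[TOP_N];
	size_t count;
	size_t min_idx;		/* valid only once count == TOP_N */
};

/* Linear scan for the smallest score among the stored entries. */
static size_t find_min(const struct top_n *t)
{
	size_t i, min = 0;

	for (i = 1; i < t->count; i++)
		if (t->score[i] < t->score[min])
			min = i;
	return min;
}

static void top_n_offer(struct top_n *t, uint64_t score)
{
	if (t->count < TOP_N) {
		t->score[t->count++] = score;
		if (t->count == TOP_N)
			t->min_idx = find_min(t);
		return;
	}

	/* Common case once the buffer is full: one comparison, no scan. */
	if (score <= t->score[t->min_idx])
		return;

	/* Replace the smallest entry and rescan only after a replacement. */
	t->score[t->min_idx] = score;
	t->min_idx = find_min(t);
}
```

The worst case (scores arriving in ascending order) still rescans on every task, so the asymptotic bound is unchanged. The saving is in the typical case, where most tasks score below the current minimum and are rejected after a single comparison.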
> > +
> > +			if (score > tasks[min_idx].score) {
> > +				tasks[min_idx].task = t;
> > +				tasks[min_idx].rss = rss;
> > +				tasks[min_idx].io_kb = io_kb;
> > +				tasks[min_idx].cpu_ms = cpu_ms;
> > +				tasks[min_idx].score = score;
> > +			}
> > +		}
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	sort(tasks, count, sizeof(struct task_info), compare_score_desc, NULL);
> > +
> > +	pr_info("psi_monitor: logging top %d tasks under pressure:\n", count);
> > +
> > +	for (i = 0; i < count; i++) {
> > +		struct task_struct *ts = tasks[i].task;
> > +		unsigned long rss_kb = tasks[i].rss << (PAGE_SHIFT - 10);
> > +		char name[128] = {0,};
> > +
> > +		if (ts->flags & PF_WQ_WORKER)
> > +			wq_worker_comm(name, sizeof(name), ts);
> > +		else
> > +			scnprintf(name, sizeof(name) - 1, ts->comm);
> > +
> > +		trace_psi_monitor_top_task(ts->pid, name,
> > +					   tasks[i].cpu_ms,
> > +					   rss_kb,
> > +					   tasks[i].io_kb,
> > +					   tasks[i].score);
> > +
> > +		pr_info("psi_monitor: pid=%d comm=%s psi_flag=%d oncpu=%d cputime(ms)=%lu rss(kB)=%lu io(kB)=%lu score=%llu\n",
> > +			ts->pid, name, ts->psi_flags, task_cpu(ts),
> > +			tasks[i].cpu_ms, rss_kb, tasks[i].io_kb,
> > +			(unsigned long long)tasks[i].score);
>
> This will unnecessarily dump to dmesg even if you have tracevent enabled.
> Why?
This is also one of the discussion points for the RFC. For now I have kept both options available and am open to suggestions. The idea is to dump it like an OOM message, only during pressure and on a threshold breach, when we really need it. Once the pressure is released, this stops automatically. We can also make it pr_debug, rate-limit it, or even put it under another CONFIG. The idea is to get the information into the logs automatically, without user intervention.
> > +	}
> > +}
> > +
> > +static void psi_monitor_fn(struct work_struct *work)
> > +{
> > +	unsigned long cpu_pct, mem_pct, io_pct;
> > +	bool trigger = false;
> > +
> > +	cpu_pct = psi_avg10_percent(PSI_CPU_SOME);
> > +	mem_pct = psi_avg10_percent(PSI_MEM_SOME);
> > +	io_pct = psi_avg10_percent(PSI_IO_SOME);
> > +
> > +	if (cpu_pct >= cpu_thresh || mem_pct >= mem_thresh ||
> > +	    io_pct >= io_thresh)
> > +		trigger = true;
> > +
> > +	if (trigger) {
> > +		pr_info("psi_monitor: pressure high: cpu=%lu%% mem=%lu%% io=%lu%% (thresh cpu=%u mem=%u io=%u)\n",
> > +			cpu_pct, mem_pct, io_pct,
> > +			cpu_thresh, mem_thresh, io_thresh);
> > +		log_top_tasks();
> > +	}
> > +
> > +	queue_delayed_work(system_wq, &psi_work,
> > +			   msecs_to_jiffies(monitor_interval_ms));
>
> If I set monitor_interval_ms to 6 hours, and then change it back to 10s,
> it'll only take effect after this callback has fired 6 hours later.
Oh yes, good catch. I will fix this in the next version so that a newly written interval overrides the pending one.
> > +}
> > +
> > +/* Sysfs helpers */
> > +#define PSI_ATTR_RW(_name) \
> > +static ssize_t _name##_show(struct kobject *kobj, \
> > +			    struct kobj_attribute *attr, char *buf) \
> > +{ \
> > +	return sysfs_emit(buf, "%u\n", _name); \
> > +} \
> > +static ssize_t _name##_store(struct kobject *kobj, \
> > +			     struct kobj_attribute *attr, \
> > +			     const char *buf, size_t count) \
> > +{ \
> > +	unsigned int val; \
> > +	if (kstrtouint(buf, 10, &val)) \
> > +		return -EINVAL; \
> > +	_name = val; \
> > +	return count; \
> > +} \
> > +static struct kobj_attribute _name##_attr = __ATTR_RW(_name)
> > +
> > +PSI_ATTR_RW(cpu_thresh);
> > +PSI_ATTR_RW(mem_thresh);
> > +PSI_ATTR_RW(io_thresh);
> > +PSI_ATTR_RW(monitor_interval_ms);
> > +PSI_ATTR_RW(rss_weight);
> > +PSI_ATTR_RW(io_weight);
> > +PSI_ATTR_RW(cpu_weight);
> > +
> > +static struct attribute *psi_attrs[] = {
> > +	&cpu_thresh_attr.attr,
> > +	&mem_thresh_attr.attr,
> > +	&io_thresh_attr.attr,
> > +	&monitor_interval_ms_attr.attr,
> > +	&rss_weight_attr.attr,
> > +	&io_weight_attr.attr,
> > +	&cpu_weight_attr.attr,
> > +	NULL,
> > +};
> > +
> > +static const struct attribute_group psi_attr_group = {
> > +	.attrs = psi_attrs,
> > +};
> > +
> > +static int __init psi_monitor_init(void)
> > +{
> > +	int ret;
> > +
> > +	INIT_DELAYED_WORK(&psi_work, psi_monitor_fn);
> > +	queue_delayed_work(system_wq, &psi_work,
> > +			   msecs_to_jiffies(monitor_interval_ms));
> > +
> > +	psi_kobj = kobject_create_and_add("psi_monitor", kernel_kobj);
> > +	if (!psi_kobj)
> > +		return -ENOMEM;
> > +
> > +	ret = sysfs_create_group(psi_kobj, &psi_attr_group);
> > +	if (ret) {
> > +		kobject_put(psi_kobj);
> > +		cancel_delayed_work_sync(&psi_work);
> > +		return ret;
> > +	}
> > +
> > +	pr_info("psi_monitor: in-kernel PSI auto monitor (weighted + tracepoints) loaded\n");
> > +	return 0;
> > +}
> > +
> > +static void __exit psi_monitor_exit(void)
> > +{
> > +	cancel_delayed_work_sync(&psi_work);
> > +	if (psi_kobj)
> > +		kobject_put(psi_kobj);
> > +	pr_info("psi_monitor: unloaded\n");
> > +}
> > +
> > +module_init(psi_monitor_init);
> > +module_exit(psi_monitor_exit);
>
> There is nothing here that warrants putting this in kernel/sched.
The feature depends on sched/psi, so I decided to keep it close to it. But I am open to any location.
> Also this gets included by default when config is enabled and starts
> dumping a bunch of stats to dmesg without anyone asking. No?
This is included as a dependent feature of PSI. If someone enables this CONFIG as part of PSI, it indicates that they are interested in getting auto-monitor stats. Also, the dump happens only if a threshold is breached, and the default thresholds are high. However, for the RFC stage I wanted to keep things simple. Later, we can add an enable/disable flag on the cmdline, just like PSI.
> Afaict, almost all of the detail used here is also available from procfs
> and people can easily put together a userspace tool if they need it. Why
> do we need an in-kernel module?
This is the most fundamental aspect of the auto-monitor feature, and it is already described in the cover letter. Let me restate it:
- Get kernel stats early during boot, before userspace comes up.
  -> Set a slightly lower threshold and get boot stats (helps in analysing boot time).
- No user intervention, continuous polling or daemons needed (just enable the config and start auto monitoring).
- Avoids userspace scheduling delays under high pressure.
- Avoids the risk of missing short-lived spikes.
- Captures details as soon as pressure hits, at the same timestamp.
- Useful for analysing real-time latency workloads.
- Useful for minimal environments like initramfs or busybox.

The motivation is not to replace existing PSI interfaces or the ability to build userspace monitoring tools. The goal is attribution at the moment pressure thresholds are crossed. A userspace implementation observes the system after being scheduled, whereas the in-kernel implementation captures contributors at the point where pressure is detected.

During LPC-2024 I made significant changes to the core psi module to implement similar logic. But the feedback was not to disturb the core psi interface and instead to develop a separate interface and make it configurable. So, I came up with this auto-monitor idea.

For more details, please have a look at my OSS paper with data:
https://hosted-files.sched.co/ossindia2026/19/OSS-IND-26-PSI-Auto-Monitor.pdf
The reference data is here:
https://github.com/pintuk/KERNEL/tree/master/PSI_WORK

I am also looking for someone who can test this on a larger workload and capture data. This will help us gather insights into how the feature behaves.
> > +
> > +MODULE_LICENSE("GPL");
> > +MODULE_AUTHOR("Pintu Kumar Agarwal");
> > +MODULE_DESCRIPTION("In-kernel PSI automatic monitor with sysfs, weighted scoring and tracepoints");
> > --
> > 2.34.1
>
> --
> Thanks and Regards,
> Prateek
Thanks,
Pintu