[RFC v2] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot

From: David Woodhouse <dwmw2@infradead.org>
Date: 2026-06-21 22:30:43
Also in: lkml
Subsystem: networking drivers, ptp hardware clock support, ptp vmclock support, the rest, timekeeping, clocksource core, ntp, alarmtimer · Maintainers: Andrew Lunn, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Richard Cochran, David Woodhouse, Linus Torvalds, John Stultz, Thomas Gleixner

Since the series at
https://lore.kernel.org/all/20260621220051.1030462-1-dwmw2@infradead.org/ (local)
we know *precisely* how much the xtime reported as CLOCK_REALTIME
differs from what the kernel *intends* to report (but doesn't, to
ensure smoothness and monotonicity).

Some consumers of time would be very much better off with the "correct"
time, not the sanitized version — especially in a tickless kernel,
where the sanitized version can diverge *far* further from the ideal.

I believe we should be able to enable PPS on tickless kernel, if we
just give it a way to know the *true* time — which is basically as
simple as reporting the existing known ntp_error (plus... details).

This RFC patch adds an 'ntp_error' field to the system_time_snapshot
and a snapshot_ntp_error() helper function to calculate it.

There's a temporary ptp_vmclock debug hack that prints the offset and
applies it to the returned timestamp, for validating the field against
the host vmclock reference under QEMU.

Without this patch, my vmclock reference test reports errors of ±15ns
on a tickless kernel. With the patch, those errors are corrected and
the results are clamped back to the ±1ns that I see with a periodic
tick.

Open question: *how* should this be exposed? It's all very well putting
it into ktime_get_snapshot_id() like this, and we could easily make an
argument that pps_get_ts() should just add it unconditionally, because
*not* doing so makes no sense.

But what about PTP? Should we just start 'correcting' the clock values
we report in ptp_read_system_p{ost,re}ts() too? It's strictly an ABI
change?

And what about userspace which wants to discipline the kernel's
timekeeping? Surely NTP dæmons running on a tickless kernel want *this*
time rather than the approximation that CLOCK_REALTIME will give them?

Could we add it to adjtimex? Do we want to make a new clockid e.g.
CLOCK_REALTIME_PRECISE for it (we'd need it for each of the AUX clocks
too...).

I can't see a perfect answer. I'm inclined to use it unconditionally in
PPS, and probably in PTP too?

Signed-off-by: David Woodhouse <redacted>
Assisted-by: Kiro:claude-opus-4.8
---
 drivers/ptp/ptp_vmclock.c           |  2 +
 include/linux/timekeeper_internal.h |  6 +++
 include/linux/timekeeping.h         | 10 +++++
 kernel/time/timekeeping.c           | 67 ++++++++++++++++++++++++++++-
 4 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
index c09ae06d7f68..37a9c8390055 100644
--- a/drivers/ptp/ptp_vmclock.c
+++ b/drivers/ptp/ptp_vmclock.c

@@ -140,7 +140,9 @@ static int vmclock_get_crosststamp(struct vmclock_state *st,
 			ptp_read_system_prets(sts);
 			if (sts->pre_sts.cs_id == st->cs_id) {
 				cycle = sts->pre_sts.cycles;
+				sts->pre_sts.systime += sts->pre_sts.ntp_error;
 				sts->post_sts = sts->pre_sts;
+				pr_info("vmclock pre error %lld\n", sts->pre_sts.ntp_error);
 			} else if (sts->pre_sts.hw_csid == st->cs_id &&
 				   sts->pre_sts.hw_cycles) {
 				cycle = sts->pre_sts.hw_cycles;

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 5dc7f8bf2740..b487e7d925fe 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h

@@ -97,6 +97,11 @@ struct tk_read_base {
  * @ntp_error_shift:		Shift conversion between clock shifted nano seconds and
  *				ntp shifted nano seconds.
  * @ntp_err_mult:		Multiplication factor for scaled math conversion
+ * @ntp_err_frac:		Fractional part of the per-cycle NTP-ideal mult that the
+ *				integer @mult truncates, as a fraction of 2^32 in
+ *				clock-shifted nanoseconds per cycle. Used to
+ *				extrapolate @ntp_error to an arbitrary cycle count in
+ *				the lockless snapshot readers (ktime_get_snapshot_id).
  * @cs_tick_adj:		Per-second adjustment handed to NTP via ntp_clear()
  *				accounting for the difference between the nominal
  *				NTP interval and the real time taken by the

@@ -187,6 +192,7 @@ struct timekeeper {
 	s64			ntp_error;
 	u32			ntp_error_shift;
 	u32			ntp_err_mult;
+	u64			ntp_err_frac;
 	s64			cs_tick_adj;
 	u32			skip_second_overflow;
 	s64			skew_delta;

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 984a866d293b..6df32d5ead2d 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h

@@ -283,6 +283,15 @@ static inline bool ktime_get_aux_ts64(clockid_t id, struct timespec64 *kt) { ret
  *			which @cycles was derived
  * @systime:		The system time of the selected CLOCK ID
  * @monoraw:		Monotonic raw system time
+ * @ntp_error:		Signed nanosecond offset of @systime from the ideal
+ *			NTP-disciplined time at @cycles. Extrapolated to @cycles
+ *			(so it is exact even when many cycles have elapsed since the
+ *			last timekeeping update, e.g. on a NO_HZ kernel) and includes
+ *			the sub-nanosecond fraction dropped when @systime was
+ *			truncated to whole ns. A consumer lands on the ideal line by
+ *			adding @ntp_error directly to @systime. Zero for
+ *			CLOCK_MONOTONIC_RAW; valid for all NTP-disciplined clock ids
+ *			(CLOCK_REALTIME/MONOTONIC/BOOTTIME and the AUX clocks).
  * @cs_id:		Clocksource ID
  * @hw_csid:		Clocksource ID of the underlying hardware counter for derived
  *			clocksources which implement the read_snapshot() callback.

@@ -295,6 +304,7 @@ struct system_time_snapshot {
 	u64			hw_cycles;
 	ktime_t			systime;
 	ktime_t			monoraw;
+	s64			ntp_error;
 	enum clocksource_ids	cs_id;
 	enum clocksource_ids	hw_csid;
 	unsigned int		clock_was_set_seq;

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index de07ef65da32..a2cb588b5be8 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c

@@ -422,6 +422,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	tk->tkr_mono.mult = clock->mult;
 	tk->tkr_raw.mult = clock->mult;
 	tk->ntp_err_mult = 0;
+	tk->ntp_err_frac = 0;
 	tk->skip_second_overflow = 0;
 	tk->skew_delta = 0;

@@ -1226,6 +1227,54 @@ static inline u64 tk_clock_read_snapshot(const struct tk_read_base *tkr,
 	return clock->read(clock);
 }
 
+/*
+ * snapshot_ntp_error - record how far a snapshot's ::systime is from the
+ * ideal NTP-disciplined time at @now, in signed nanoseconds, so a caller
+ * can land exactly on the ideal line by adding it to ::systime.
+ *
+ * The value is summed in ns << NTP_SCALE_SHIFT from four parts:
+ *
+ *  - tk->ntp_error, the deviation accumulated as of the last timekeeping
+ *    update (tkr_mono.cycle_last);
+ *  - (cycle_delta * ntp_err_frac), the fractional-mult drift accrued over
+ *    the cycles read since then -- at most a tick on a tickful kernel, but
+ *    potentially many ticks' worth under NO_HZ;
+ *  - (cycle_delta * ntp_err_mult), subtracting the applied +1 mult dither
+ *    over the same span;
+ *  - the sub-nanosecond fraction that ::systime dropped when the read was
+ *    truncated to whole ns (the low @shift bits, exact even though the
+ *    multiply overflows).
+ *
+ * CLOCK_MONOTONIC_RAW is not NTP-disciplined and carries no error. Every
+ * other clock id uses its own timekeeper @tk -- including the AUX clocks,
+ * which each have their own NTP instance.
+ */
+static void snapshot_ntp_error(struct system_time_snapshot *snap,
+			       const struct timekeeper *tk, clockid_t clock_id,
+			       u64 now)
+{
+	u64 cycle_delta;
+	u32 nes;
+	s64 tmp, err;
+
+	if (clock_id == CLOCK_MONOTONIC_RAW) {
+		snap->ntp_error = 0;
+		return;
+	}
+
+	cycle_delta = (now - tk->tkr_mono.cycle_last) & tk->tkr_mono.mask;
+	nes = tk->ntp_error_shift;
+
+	err = tk->ntp_error;
+	err += ((s64)mul_u64_u64_shr(cycle_delta, tk->ntp_err_frac, 32) -
+		(s64)(cycle_delta * tk->ntp_err_mult)) << nes;
+
+	tmp = (s64)(cycle_delta * tk->tkr_mono.mult + tk->tkr_mono.xtime_nsec);
+	tmp &= (1ULL << tk->tkr_mono.shift) - 1;
+	err += tmp << nes;
+
+	snap->ntp_error = (err + (1LL << (NTP_SCALE_SHIFT - 1))) >> NTP_SCALE_SHIFT;
+}
 
 /**
  * ktime_get_snapshot_id -  Simultaneously snapshot a given clock ID with

@@ -1300,6 +1349,8 @@ void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *syst
 
 		nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
 		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
+
+		snapshot_ntp_error(systime_snapshot, tk, clock_id, now);
 	} while (read_seqcount_retry(&tkd->seq, seq));
 
 	systime_snapshot->cycles = now;

@@ -2447,6 +2498,7 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 {
 	u64 ntp_tl = ntp_tick_length(tk->id);
 	s64 skew = ntp_get_skew_delta(tk->id);
+	u64 dividend;
 	u32 mult;
 
 	/*

@@ -2467,8 +2519,19 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 		 * scale it back up to the full per-tick rate for the mult bias.
 		 */
 		skew *= NTP_INTERVAL_FREQ;
-		mult = div64_u64((tk->ntp_tick + skew) >> tk->ntp_error_shift,
-				 tk->cycle_interval);
+		dividend = (tk->ntp_tick + skew) >> tk->ntp_error_shift;
+		mult = div64_u64(dividend, tk->cycle_interval);
+		/*
+		 * Stash the fractional part of the per-cycle ideal mult that
+		 * the integer @mult discards, scaled by 2^32, in clock-shifted
+		 * ns per cycle. The lockless snapshot readers use it to
+		 * extrapolate @ntp_error forward over the cycles accumulated
+		 * since the last tick (which on a NO_HZ kernel may be many
+		 * ticks' worth).
+		 */
+		tk->ntp_err_frac = div64_u64((dividend - (u64)mult *
+					      tk->cycle_interval) << 32,
+					     tk->cycle_interval);
 	}
 
 	/*

-- 
2.43.0

Attachments

smime.p7s [application/pkcs7-signature] 5069 bytes

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help